ImputeAutoencoder
Overview
ImputeAutoencoder implements a standard encoder-decoder autoencoder for
genotype imputation. The model maps genotype vectors into a low-dimensional
latent space and reconstructs per-locus genotype logits. It uses masked focal
cross-entropy to ignore missing entries and handle class imbalance.
Model formulation
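Let \(X \in \mathbb{R}^{N \times L}\) be the genotype matrix for \(N\) samples and \(L\) loci, encoded as 0/1/2 (missing = -1). The autoencoder learns an encoder \(f_{\phi}\) and decoder \(f_{\theta}\):
\[ z_i = f_{\phi}(x_i), \qquad \hat{x}_i = f_{\theta}(z_i) \]
where \(x_i\) is the genotype vector of sample \(i\), \(z_i\) is its latent representation, and \(\hat{x}_i\) holds the reconstructed per-locus genotype logits.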
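As a rough illustration of the encoder-decoder mapping, the sketch below runs one sample through a single-layer encoder and decoder in plain Python. It is not the library's network: the one-hot input encoding, the ReLU activation, the single-layer depth, and every function name here are assumptions made for the example.

```python
import random

def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def one_hot(genotypes, n_classes=3):
    """Flatten 0/1/2 calls into one-hot features; missing (-1) becomes all zeros."""
    features = []
    for g in genotypes:
        features.extend(1.0 if g == k else 0.0 for k in range(n_classes))
    return features

def forward(genotypes, encoder, decoder, n_classes=3):
    """Encode one sample to a latent vector, then decode to per-locus logits."""
    z = relu(linear(one_hot(genotypes, n_classes), *encoder))
    flat = linear(z, *decoder)
    # Regroup the flat decoder output into one logit triple per locus.
    logits = [flat[i:i + n_classes] for i in range(0, len(flat), n_classes)]
    return z, logits

rng = random.Random(0)
n_loci, latent_dim, n_classes = 5, 2, 3
encoder = init_layer(n_loci * n_classes, latent_dim, rng)  # plays the role of f_phi
decoder = init_layer(latent_dim, n_loci * n_classes, rng)  # plays the role of f_theta
z, logits = forward([0, 1, 2, -1, 1], encoder, decoder)
```

Here `z` has `latent_dim` entries and `logits` has one three-class row per locus, which is the shape the loss in the next paragraph operates on.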
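Training minimizes a masked focal cross-entropy loss over observed entries, with optional class weights and L1 regularization. Written in the standard focal form, without the optional class-weight and L1 terms, the loss is:
\[ \mathcal{L} = -\frac{1}{|M|} \sum_{(i,j) \in M} (1 - p_{ij})^{\gamma} \log p_{ij} \]
where \(M\) indexes non-missing entries, \(p_{ij}\) is the probability assigned to the true genotype class, and \(\gamma\) is the focal-loss parameter.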
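A minimal sketch of a masked focal cross-entropy follows, assuming the standard focal-loss form averaged over observed entries. The function name, the nested-list layout, and the mean reduction are assumptions for illustration; the library's own loss may differ in reduction and weighting details, and the L1 term is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one locus's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_focal_loss(logits, targets, gamma=2.0, class_weights=None, missing=-1):
    """Mean focal cross-entropy over non-missing entries.

    logits:  [N][L][C] nested lists of raw scores
    targets: [N][L] integer classes, with `missing` marking unobserved calls
    """
    total, n_observed = 0.0, 0
    for sample_logits, sample_targets in zip(logits, targets):
        for locus_logits, y in zip(sample_logits, sample_targets):
            if y == missing:
                continue  # masked out: contributes nothing to the loss
            p = softmax(locus_logits)[y]
            w = 1.0 if class_weights is None else class_weights[y]
            # (1 - p)^gamma down-weights entries the model already predicts well.
            total += -w * (1.0 - p) ** gamma * math.log(max(p, 1e-12))
            n_observed += 1
    return total / n_observed if n_observed else 0.0

example_logits = [[[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]]
example_targets = [[0, -1]]  # second locus is missing and is ignored
loss_ce = masked_focal_loss(example_logits, example_targets, gamma=0.0)
loss_focal = masked_focal_loss(example_logits, example_targets, gamma=2.0)
```

With `gamma=0.0` the expression reduces to ordinary masked cross-entropy; larger values shrink the contribution of confidently correct entries, which is how the focal term addresses class imbalance.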
Algorithm summary
1. Encode genotypes to 0/1/2, simulate missingness once on the full matrix, and build masks for original and simulated missingness (reused across splits).
2. Train the encoder-decoder network on observed entries using masked focal loss with class weighting; optional gamma scheduling is supported.
3. Optimize with AdamW and a warmup-to-cosine learning rate schedule, while monitoring validation loss for early stopping; metrics are scored on simulated-missing entries only.
4. transform() predicts genotype logits, fills only originally missing entries, and decodes to IUPAC outputs.
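The mask bookkeeping in the first and last steps can be sketched as below. This assumes uniform random masking and biallelic SNPs with known reference and alternate alleles; the helper names are hypothetical, and the library's own simulation strategies and decoder are not shown here.

```python
import random

# IUPAC ambiguity codes for heterozygous biallelic calls.
IUPAC_HET = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def simulate_missing(genotypes, prop, rng, missing=-1):
    """Return (original_mask, simulated_mask); True marks a masked entry.

    Only observed entries are eligible for simulated masking, so the two
    masks never overlap.
    """
    original = [[g == missing for g in row] for row in genotypes]
    simulated = [[(not o) and rng.random() < prop for o in row] for row in original]
    return original, simulated

def fill_originally_missing(genotypes, predicted, original_mask):
    """Keep observed calls; take predictions only where data were originally missing."""
    return [
        [p if m else g for g, p, m in zip(g_row, p_row, m_row)]
        for g_row, p_row, m_row in zip(genotypes, predicted, original_mask)
    ]

def decode_iupac(call, ref, alt):
    """Map a 0/1/2 call to an IUPAC character; anything else becomes N."""
    if call == 0:
        return ref
    if call == 2:
        return alt
    if call == 1:
        return IUPAC_HET[frozenset((ref, alt))]
    return "N"

genotypes = [[0, 1, -1], [2, -1, 1]]
original, simulated = simulate_missing(genotypes, 0.5, random.Random(42))
predicted = [[1, 1, 2], [0, 0, 0]]  # stand-in for the argmax of model logits
filled = fill_originally_missing(genotypes, predicted, original)
refs, alts = ["A", "C", "G"], ["G", "T", "T"]
iupac = [[decode_iupac(c, r, a) for c, r, a in zip(row, refs, alts)] for row in filled]
```

Keeping the two masks separate is what lets evaluation use only simulated-missing entries, where the truth is known, while the final output changes only entries that were missing in the input.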
Configuration highlights
ImputeAutoencoder uses pgsui.data_processing.containers.AutoencoderConfig with the standard io, model, train, tune, plot, and sim sections.
- model.latent_dim and model.layer_schedule control the architecture.
- train.gamma and train.weights_* control the focal loss and class weights.
- train.gamma_schedule optionally anneals the focal-loss gamma during training.
- train.early_stop_gen / train.min_epochs gate early stopping.
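The training-time controls named on this page (the warmup-to-cosine learning rate, gamma annealing, and the early-stopping gate) can be sketched as follows. The function and class names are hypothetical, the linear shape of the gamma schedule is an assumption, and the mapping of `patience` and `min_epochs` onto the train.early_stop_gen and train.min_epochs settings is a guess at their meaning rather than the library's implementation.

```python
import math

def warmup_cosine_lr(epoch, total_epochs, warmup_epochs, base_lr):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def annealed_gamma(epoch, total_epochs, gamma_start, gamma_end):
    """Linearly interpolate the focal-loss gamma across training (assumed shape)."""
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    return gamma_start + frac * (gamma_end - gamma_start)

class EarlyStopper:
    """Stop after `patience` epochs without improvement, but never before `min_epochs`."""

    def __init__(self, patience, min_epochs):
        self.patience = patience
        self.min_epochs = min_epochs
        self.best = math.inf
        self.stale = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
        return epoch + 1 >= self.min_epochs and self.stale >= self.patience

stopper = EarlyStopper(patience=2, min_epochs=5)
val_losses = [1.0, 1.1, 1.2, 1.3, 1.4]
stops = [stopper.should_stop(epoch, loss) for epoch, loss in enumerate(val_losses)]
```

In this run the validation loss stops improving after the first epoch, yet the stopper holds off until the fifth epoch because of the minimum-epoch gate.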
See Optuna Hyperparameter Tuning for Optuna-driven tuning details.
Usage
from snpio import VCFReader
from pgsui import ImputeAutoencoder
from pgsui.data_processing.containers import AutoencoderConfig

# Load the VCF and its population map.
gdata = VCFReader("cohort.vcf.gz", popmapfile="pops.popmap")

# Start from the "balanced" preset, then override the latent dimension.
cfg = AutoencoderConfig.from_preset("balanced")
cfg.model.latent_dim = 12

# Fit the model, then impute and decode to IUPAC genotypes.
model = ImputeAutoencoder(genotype_data=gdata, config=cfg)
model.fit()
genotypes_iupac = model.transform()
References
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507.