Unsupervised Imputers
Overview
PG-SUI ships four neural imputers—NLPCA, UBP, Autoencoder, and VAE—that share the pgsui.impute.unsupervised.base.BaseNNImputer scaffolding. They operate on 0/1/2 encodings generated by snpio.analysis.genotype_encoder.GenotypeEncoder, learn to reconstruct genotypes with missing calls masked as -1, and expose a transform() method that returns IUPAC strings.
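To make the encoding round trip concrete, the sketch below decodes 0/1/2 codes (with -1 for missing) into single-character IUPAC calls for one biallelic locus. The helper names `decode_call` and `decode_locus` are hypothetical and written only for illustration; they are not part of SNPio or PG-SUI, whose encoder handles many cases this toy ignores.

```python
# Hypothetical helpers for illustration; not the SNPio or PG-SUI API.
MISSING = -1

# Standard IUPAC ambiguity codes for unordered pairs of distinct bases.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def decode_call(code: int, ref: str, alt: str) -> str:
    """Map a 0/1/2 code (or -1) back to a single IUPAC character."""
    if code == 0:
        return ref  # homozygous reference
    if code == 2:
        return alt  # homozygous alternate
    if code == 1:
        return IUPAC_HET[frozenset((ref, alt))]  # heterozygote
    return "N"  # missing or unrecognised code


def decode_locus(codes, ref, alt):
    """Decode every sample's code at one biallelic locus."""
    return [decode_call(c, ref, alt) for c in codes]


calls = decode_locus([0, 1, 2, MISSING], ref="A", alt="G")
```

Here `calls` holds one character per sample, with `N` standing in for a call that is still missing.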
Workflow Highlights
- Instantiate with a GenotypeData object and a typed *Config (dataclass instance, nested dict, or YAML path). Dot-key overrides ({"model.latent_dim": 12}) are applied last.
- fit() standardises the data split, computes class weights from observed zygosity, optionally tunes hyperparameters with Optuna, then trains the model while caching metrics/plots under {prefix}_output/Unsupervised/.
- transform() reuses the trained model to impute the full dataset, decodes to IUPAC, and writes genotype-distribution plots for original vs. imputed calls.
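One way dot-key overrides can be resolved against nested dataclasses is sketched below. `ToyConfig`, `ToyModel`, `ToyIO`, and `set_by_path` are hypothetical stand-ins invented for this example; PG-SUI's own implementation may differ, and the only behaviour taken from the documentation is that overrides use dot-separated keys and yield a new configuration instance.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ToyModel:  # hypothetical stand-in for a model section
    latent_dim: int = 8
    dropout: float = 0.2


@dataclass(frozen=True)
class ToyIO:  # hypothetical stand-in for an io section
    prefix: str = "run"


@dataclass(frozen=True)
class ToyConfig:
    io: ToyIO = field(default_factory=ToyIO)
    model: ToyModel = field(default_factory=ToyModel)


def set_by_path(obj: Any, path: str, value: Any) -> Any:
    """Return a copy of ``obj`` with the dotted ``path`` set to ``value``."""
    head, _, rest = path.partition(".")
    if not hasattr(obj, head):
        raise KeyError(f"unknown config key: {head!r}")
    if rest:
        # Rebuild the nested section first, then swap it into the parent.
        value = set_by_path(getattr(obj, head), rest, value)
    return replace(obj, **{head: value})


def apply_overrides(cfg: ToyConfig, overrides: Optional[Dict[str, Any]]) -> ToyConfig:
    """Apply flat dot-key overrides in order, leaving ``cfg`` untouched."""
    for key, value in (overrides or {}).items():
        cfg = set_by_path(cfg, key, value)
    return cfg


base = ToyConfig()
tuned = apply_overrides(base, {"model.latent_dim": 12, "io.prefix": "demo"})
```

Because each step returns a copy, `base` keeps its defaults while `tuned` carries the overrides, which matches the "new configuration instance" wording used for apply_overrides.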
Configuration
Each imputer exposes the same top-level sections: io (run identity + logging), model (architecture), train (optimizer schedule, class weighting, early stopping), tune (Optuna surface), plot (figure styling), and sim (simulated missingness). Model-specific extras add vae (VAE KL controls) or projection blocks (nlpca / ubp). Presets (fast, balanced, thorough) trade accuracy for runtime; select them via --preset or *.from_preset(...) and then overlay YAML/CLI overrides. Heavy class imbalance can be tempered by train.weights_power / train.weights_inverse / train.weights_max_ratio; layer widths are governed by model.layer_schedule and model.layer_scaling_factor.
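The interaction between the three class-weight controls can be illustrated with a small sketch. The semantics assumed here (inverse frequency raised to a power, then capped relative to the smallest weight) are a guess at what train.weights_inverse, train.weights_power, and train.weights_max_ratio do; the function `class_weights` is hypothetical and is not PG-SUI code.

```python
from collections import Counter


def class_weights(labels, power=1.0, inverse=True, max_ratio=None):
    """Toy class weights from observed genotype codes (missing = -1 is ignored)."""
    counts = Counter(c for c in labels if c >= 0)
    total = sum(counts.values())
    weights = {}
    for cls, n in counts.items():
        freq = n / total
        base = 1.0 / freq if inverse else freq
        weights[cls] = base ** power  # power < 1 softens the rebalancing
    if max_ratio is not None:
        # Assumed meaning of the cap: no weight may exceed max_ratio times the smallest.
        ceiling = min(weights.values()) * max_ratio
        weights = {c: min(w, ceiling) for c, w in weights.items()}
    return weights


# 80 homozygous-ref, 15 het, 5 homozygous-alt, 10 missing calls.
labels = [0] * 80 + [1] * 15 + [2] * 5 + [-1] * 10
raw = class_weights(labels)
capped = class_weights(labels, max_ratio=4.0)
```

In this toy panel the rarest class is weighted 16 times the commonest one before capping and 4 times after, which is the kind of tempering the cap is meant to provide.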
Tip
The neural imputers honour cfg.io.prefix by creating models/, plots/, metrics/, optimize/, and parameters/ subdirectories, mirroring the supervised toolkit.
See Optuna Hyperparameter Tuning for details on automated hyperparameter optimization.
Config Dataclasses
Standard Autoencoder (Config)
A standard autoencoder config capturing architecture and training settings for pgsui.impute.unsupervised.imputers.autoencoder.ImputeAutoencoder.
- class pgsui.data_processing.containers.AutoencoderConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>)[source]
Top-level configuration for ImputeAutoencoder.
This configuration class encapsulates all settings required for the ImputeAutoencoder model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') AutoencoderConfig[source]
Build an AutoencoderConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
AutoencoderConfig
- apply_overrides(overrides: Dict[str, Any] | None) AutoencoderConfig[source]
Apply flat dot-key overrides.
- Parameters:
overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
- Returns:
New configuration instance with overrides applied.
- Return type:
AutoencoderConfig
Variational Autoencoder (Config)
Adds a vae section (kl_beta) to the autoencoder defaults, controlling the KL weight.
- class pgsui.data_processing.containers.VAEConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, vae: VAEExtraConfig = <factory>, sim: SimConfig = <factory>)[source]
Top-level configuration for ImputeVAE (AE-parity + VAE extras).
Mirrors AutoencoderConfig sections and adds a vae block with KL-beta controls for the VAE loss.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- vae
VAE-specific configuration.
- Type:
VAEExtraConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') VAEConfig[source]
Build a VAEConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
VAEConfig
- apply_overrides(overrides: Dict[str, Any] | None) VAEConfig[source]
Apply flat dot-key overrides.
Non-linear PCA (Config)
Adds an nlpca section with projection controls for latent refinement.
- class pgsui.data_processing.containers.NLPCAConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, nlpca: NLPCAExtraConfig = <factory>)[source]
Top-level configuration for ImputeNLPCA.
This configuration class encapsulates all settings required for the ImputeNLPCA model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- nlpca
NLPCA-specific configuration.
- Type:
NLPCAExtraConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') NLPCAConfig[source]
Build an NLPCAConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
NLPCAConfig
- apply_overrides(overrides: Dict[str, Any] | None) NLPCAConfig[source]
Apply flat dot-key overrides.
- Parameters:
overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
- Returns:
New configuration instance with overrides applied.
- Return type:
NLPCAConfig
Unsupervised Backpropagation (Config)
Adds a ubp section with projection controls for latent refinement.
- class pgsui.data_processing.containers.UBPConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, ubp: UBPExtraConfig = <factory>)[source]
Top-level configuration for ImputeUBP.
This configuration class encapsulates all settings required for the ImputeUBP model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- ubp
UBP-specific configuration.
- Type:
UBPExtraConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') UBPConfig[source]
Build a UBPConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
UBPConfig
- apply_overrides(overrides: Dict[str, Any] | None) UBPConfig[source]
Apply flat dot-key overrides.
- Parameters:
overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
- Returns:
New configuration instance with overrides applied.
- Return type:
UBPConfig
Model Summaries
For algorithm details, see ImputeAutoencoder, ImputeVAE, ImputeNLPCA, and ImputeUBP.
Non-linear PCA (ImputeNLPCA)
- Encodes genotypes to 0/1/2 (missing → -1) and detects haploid panels, collapsing ALT/HET where appropriate.
- Initializes per-sample latent embeddings with PCA and optimizes them directly (no encoder network).
- Trains with focal cross-entropy (train.gamma) while jointly optimizing decoder weights and latent vectors, using class-weight controls from train.weights_*.
- Input refinement updates originally missing entries in the working matrix after selected epochs while keeping simulated-missing positions masked.
- Projection-based evaluation refines latents with the decoder frozen, controlled by nlpca.projection_lr / nlpca.projection_epochs.
- Optional Optuna tuning uses the tune envelope (trial count, metric selection, and optional patience settings) before the final training run.
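The focal cross-entropy mentioned above can be written out for a handful of genotype calls. This is a minimal pure-Python sketch of the standard focal loss, with `gamma` playing the role of train.gamma; the function names, the handling of -1 targets, and the clamping constant are assumptions made for this example and do not describe PG-SUI's actual loss code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def focal_cross_entropy(logits_rows, targets, gamma=2.0, class_weights=None, missing=-1):
    """Mean focal loss over observed targets; entries equal to `missing` are skipped."""
    total, n = 0.0, 0
    for logits, y in zip(logits_rows, targets):
        if y == missing:
            continue  # masked calls contribute nothing to the loss
        p_t = max(softmax(logits)[y], 1e-12)  # probability of the true class
        w = 1.0 if class_weights is None else class_weights[y]
        # (1 - p_t) ** gamma down-weights calls the model already gets right.
        total += -w * (1.0 - p_t) ** gamma * math.log(p_t)
        n += 1
    return total / n if n else 0.0


logits = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 0, -1]
plain_ce = focal_cross_entropy(logits, targets, gamma=0.0)
focal = focal_cross_entropy(logits, targets, gamma=2.0)
```

With `gamma=0.0` the expression reduces to ordinary weighted cross-entropy; raising `gamma` shrinks the contribution of confidently correct calls, so the focal value is never larger than the plain one.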
Unsupervised Backpropagation (ImputeUBP)
- Uses a phased UBP schedule: PCA initialization, decoder-only refinement, then joint optimization of embeddings and decoder weights.
- Trains with focal cross-entropy using train.weights_* settings and optional gamma scheduling.
- Class weighting can be capped by train.weights_max_ratio to prevent extreme rebalancing on sparse genotypes.
- Projection-based evaluation and inference refine embeddings with the decoder frozen (ubp.projection_lr / ubp.projection_epochs).
- Tunable through Optuna with the same tuning surface as NLPCA (tune.* fields).
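The idea of refining an embedding while the decoder stays frozen can be shown on a deliberately tiny problem. The sketch below swaps in a linear decoder and squared error so that the gradient fits in a few lines; PG-SUI uses a neural decoder and a focal cross-entropy loss, so treat `project_latent`, `decode`, and their defaults as hypothetical. The `lr` and `epochs` arguments stand in for ubp.projection_lr and ubp.projection_epochs.

```python
def decode(decoder_w, z):
    """Linear toy decoder: one output per row of the weight matrix."""
    return [sum(w * zk for w, zk in zip(w_row, z)) for w_row in decoder_w]


def project_latent(decoder_w, observed, z_init, lr=0.05, epochs=200, missing=-1):
    """Gradient descent on the latent vector only; decoder_w is never modified."""
    z = list(z_init)
    for _ in range(epochs):
        grad = [0.0] * len(z)
        for w_row, x in zip(decoder_w, observed):
            if x == missing:
                continue  # missing calls give no gradient signal
            resid = sum(w * zk for w, zk in zip(w_row, z)) - x
            for k, w in enumerate(w_row):
                grad[k] += 2.0 * resid * w  # d/dz_k of the squared residual
        z = [zk - lr * g for zk, g in zip(z, grad)]
    return z


# Three loci, two latent dimensions; the third locus is missing for this sample.
decoder_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
observed = [2, 0, -1]
z = project_latent(decoder_w, observed, z_init=[0.0, 0.0])
reconstruction = decode(decoder_w, z)
```

Only the two observed loci drive the update, yet the refined embedding also yields a prediction for the missing third locus through the shared decoder, which is the mechanism projection-based inference relies on.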
Autoencoder (ImputeAutoencoder)
- Feed-forward encoder/decoder with symmetric hidden layers (model.layer_schedule) and dropout; no latent refinement is performed during evaluation.
- Uses focal cross-entropy with class weighting from train.weights_* and early stopping controlled by train.early_stop_gen / train.min_epochs.
- Optional Optuna tuning searches over architecture and loss/optimizer controls (latent_dim, layer schedule, dropout, learning rate, L1, weights, gamma, gamma_schedule) using tune.metrics as the objective.
- transform() applies the trained model to reconstructed logits, fills only previously missing calls, decodes to IUPAC, and plots original vs. imputed counts.
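The early-stopping controls can be pictured with a patience counter. The sketch assumes that train.early_stop_gen is the number of consecutive epochs without improvement that triggers a stop and that train.min_epochs is a floor below which training never stops; both readings are assumptions, and `early_stop_epoch` is a hypothetical helper rather than PG-SUI code.

```python
def early_stop_epoch(val_losses, patience, min_epochs, min_delta=0.0):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
        # Stop only once the minimum epoch count has been reached.
        if epoch >= min_epochs and stale >= patience:
            return epoch
    return len(val_losses)  # ran out of epochs without triggering a stop


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
stopped = early_stop_epoch(losses, patience=3, min_epochs=1)
kept_going = early_stop_epoch(losses, patience=3, min_epochs=7)
```

With a patience of 3 the run halts at epoch 6 and never sees the later improvement, whereas a minimum of 7 epochs lets training reach it; this trade-off is what the two settings balance.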
Variational Autoencoder (ImputeVAE)
- Extends the autoencoder with a KL divergence term weighted by vae.kl_beta (or tuned via Optuna).
- Uses focal cross-entropy reconstruction with train.gamma plus optional train.gamma_schedule and KL scheduling via vae.kl_beta_schedule.
- Hyperparameter tuning extends the autoencoder search surface with vae.kl_beta and scheduling flags (vae.kl_beta_schedule, train.gamma_schedule).
- transform() predicts class probabilities across genotypes, fills masked entries with MAP labels, and emits IUPAC arrays with paired distribution plots.
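How the KL weight enters the objective can be sketched with the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The linear warm-up in `linear_beta` is one common scheduling choice picked here as an assumption; the documentation does not say which schedule vae.kl_beta_schedule enables, and all three function names are hypothetical.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def vae_loss(recon_loss, mu, logvar, kl_beta=1.0):
    """Reconstruction loss plus the beta-weighted KL penalty."""
    return recon_loss + kl_beta * gaussian_kl(mu, logvar)


def linear_beta(epoch, warmup_epochs, kl_beta):
    """Assumed schedule: ramp beta linearly from 0 to kl_beta over the warm-up."""
    if warmup_epochs <= 0:
        return kl_beta
    return kl_beta * min(1.0, epoch / warmup_epochs)


# One latent dimension with mean 1 and unit variance gives a KL of 0.5.
loss = vae_loss(recon_loss=0.7, mu=[1.0], logvar=[0.0], kl_beta=1.5)
beta_mid_warmup = linear_beta(epoch=5, warmup_epochs=10, kl_beta=1.5)
```

A larger `kl_beta` pulls the latent distribution harder towards the prior at the expense of reconstruction accuracy, which is why it is a natural target for tuning.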
Usage Examples
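from snpio import VCFReader
from pgsui import ImputeVAE
from pgsui.data_processing.containers import VAEConfig

# Load genotypes and population assignments.
gdata = VCFReader("cohort.vcf.gz", popmapfile="pops.popmap")

# Start from the "balanced" preset, then adjust individual fields.
cfg = VAEConfig.from_preset("balanced")
cfg.io.prefix = "vae_demo"  # run identity used for output paths
cfg.vae.kl_beta = 1.5  # weight on the KL term
cfg.tune.enabled = False  # skip Optuna and accept preset defaults

imputer = ImputeVAE(genotype_data=gdata, config=cfg)
imputer.fit()
genotypes_iupac = imputer.transform()  # imputed genotypes as IUPAC strings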
CLI usage mirrors the Python API:
pg-sui \
--input cohort.vcf.gz \
--models ImputeNLPCA ImputeUBP \
--preset thorough \
--set io.prefix=nlpca_vs_ubp \
--config configs/nlpca.yaml \
--sim-strategy random_weighted \
--sim-prop 0.20
--sim-strategy and --sim-prop apply to every selected neural model. Each imputer simulates missingness independently; set sim.sim_kwargs.seed if you need identical masks across runs. Unsupervised models always simulate missingness during training and evaluation; sim.simulate_missing is currently ignored for these models, so adjust sim.sim_prop and sim.sim_strategy instead. See Overview for how each strategy behaves.
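The effect of fixing a seed on the simulated mask can be demonstrated with a uniform random strategy. This sketch is not the random_weighted strategy and does not call PG-SUI; `simulate_missing` is a hypothetical helper whose `prop` and `seed` arguments merely stand in for sim.sim_prop and sim.sim_kwargs.seed.

```python
import random


def simulate_missing(matrix, prop, seed=None, missing=-1):
    """Mask a proportion of observed calls; return (masked copy, masked positions)."""
    rng = random.Random(seed)
    # Only calls that are currently observed are eligible for masking.
    observed = [
        (i, j)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v != missing
    ]
    n_mask = int(round(prop * len(observed)))
    chosen = set(rng.sample(observed, n_mask))
    masked = [
        [missing if (i, j) in chosen else v for j, v in enumerate(row)]
        for i, row in enumerate(matrix)
    ]
    return masked, chosen


genotypes = [
    [0, 1, 2, -1, 0],
    [1, 1, 0, 2, -1],
    [0, 0, 0, 1, 2],
    [2, -1, 1, 0, 0],
]
masked_a, held_out_a = simulate_missing(genotypes, prop=0.20, seed=42)
masked_b, held_out_b = simulate_missing(genotypes, prop=0.20, seed=42)
```

Reusing the seed reproduces the same held-out positions, so accuracy measured on them is comparable between runs; without a seed each call draws a fresh mask.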