Unsupervised Imputers

Overview

PG-SUI ships four neural imputers (NLPCA, UBP, Autoencoder, and VAE) that share the pgsui.impute.unsupervised.base.BaseNNImputer scaffolding. They operate on 0/1/2 genotype encodings generated by snpio.analysis.genotype_encoder.GenotypeEncoder, with missing calls masked as -1, learn to reconstruct the observed genotypes, and expose a transform() method that returns the imputed genotypes as IUPAC strings.
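To make the encoding concrete, the sketch below decodes a small 0/1/2 matrix (with -1 for missing) back to IUPAC characters. It is an illustration only and does not call GenotypeEncoder: the function name decode_to_iupac, the ref/alt arguments, and the restriction to biallelic SNPs are assumptions made for this example.

```python
import numpy as np

# IUPAC ambiguity codes for unordered pairs of distinct bases.
IUPAC_HET = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def decode_to_iupac(encoded, ref, alt):
    """Map 0/1/2 codes (-1 = missing) to IUPAC characters, locus by locus.

    Hypothetical helper: assumes biallelic SNPs with known ref/alt alleles.
    """
    encoded = np.asarray(encoded)
    out = np.full(encoded.shape, "N", dtype="<U1")   # missing calls stay "N"
    for j, (r, a) in enumerate(zip(ref, alt)):
        col = encoded[:, j]
        out[col == 0, j] = r                              # homozygous reference
        out[col == 1, j] = IUPAC_HET[frozenset((r, a))]   # heterozygote
        out[col == 2, j] = a                              # homozygous alternate
    return out

encoded = np.array([[0, 1, -1],
                    [2, 0, 1]])
decoded = decode_to_iupac(encoded, ref=["A", "C", "G"], alt=["G", "T", "T"])
print(decoded)
```

Heterozygotes map to the ambiguity code for the ref/alt pair, so an A/G site with code 1 becomes R, while any call left at -1 is written as N.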

Workflow Highlights

Instantiate with a GenotypeData object and a typed *Config (dataclass instance, nested dict, or YAML path). Dot-key overrides ({"model.latent_dim": 12}) are applied last.
fit() standardises the data split, computes class weights from observed zygosity, optionally tunes hyperparameters with Optuna, then trains the model while caching metrics/plots under {prefix}_output/Unsupervised/.
transform() reuses the trained model to impute the full dataset, decodes to IUPAC, and writes genotype-distribution plots for original vs. imputed calls.

Configuration

Each imputer exposes the same top-level sections: io (run identity + logging), model (architecture), train (optimizer schedule, class weighting, early stopping), tune (Optuna surface), plot (figure styling), and sim (simulated missingness). Model-specific extras add vae (VAE KL controls) or projection blocks (nlpca / ubp). Presets (fast, balanced, thorough) trade accuracy for runtime; select them via --preset or *.from_preset(...) and then overlay YAML/CLI overrides. Heavy class imbalance can be tempered by train.weights_power / train.weights_inverse / train.weights_max_ratio; layer widths are governed by model.layer_schedule and model.layer_scaling_factor.
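The class-weighting controls can be pictured with a small sketch. The page does not spell out how train.weights_power, train.weights_inverse, and train.weights_max_ratio combine internally, so the function below is an assumption: inverse-frequency weights raised to a power, rescaled so the most common class has weight 1, then capped at a maximum ratio. The names class_weights, power, and max_ratio are illustrative.

```python
import numpy as np

def class_weights(encoded, n_classes=3, power=1.0, max_ratio=None):
    """Inverse-frequency weights from observed calls (missing = -1 is ignored).

    Illustrative only; not the PG-SUI implementation.
    """
    observed = encoded[encoded >= 0]
    counts = np.bincount(observed, minlength=n_classes).astype(float)
    counts = np.maximum(counts, 1.0)            # guard against absent classes
    weights = (counts.sum() / counts) ** power  # rarer class -> larger weight
    weights /= weights.min()                    # most common class gets weight 1
    if max_ratio is not None:
        weights = np.minimum(weights, max_ratio)  # cap extreme rebalancing
    return weights

encoded = np.array([[0, 0, 0, 0],
                    [0, 0, 1, -1],
                    [0, 2, 1, 0]])
print(class_weights(encoded))                 # uncapped
print(class_weights(encoded, max_ratio=5.0))  # rare-class weight capped at 5
```

With 8 homozygous-reference, 2 heterozygous, and 1 homozygous-alternate call, the uncapped weights are 1, 4, and 8; the cap limits the rarest class to 5 so a handful of calls cannot dominate the loss.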

Tip

The neural imputers honour cfg.io.prefix by creating models/, plots/, metrics/, optimize/, and parameters/ subdirectories, mirroring the supervised toolkit.

See Optuna Hyperparameter Tuning for details on automated hyperparameter optimization.

Config Dataclasses

Standard Autoencoder (Config)

A standard autoencoder config capturing architecture and training settings for pgsui.impute.unsupervised.imputers.autoencoder.ImputeAutoencoder.

class pgsui.data_processing.containers.AutoencoderConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>)

Top-level configuration for ImputeAutoencoder.

This configuration class encapsulates all settings required for the ImputeAutoencoder model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.

io

I/O configuration.

Type:: IOConfig

model

Model architecture configuration.

Type:: ModelConfig

train

Training procedure configuration.

Type:: TrainConfig

tune

Hyperparameter tuning configuration.

Type:: TuneConfig

plot

Plotting configuration.

Type:: PlotConfig

sim

Simulated-missing configuration.

Type:: SimConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → AutoencoderConfig

Build an AutoencoderConfig from a named preset.

Parameters:: preset (Literal["fast", "balanced", "thorough"]) – Preset name.
Returns:: Configuration instance corresponding to the preset.
Return type:: AutoencoderConfig

apply_overrides(overrides: Dict[str, Any] | None) → AutoencoderConfig

Apply flat dot-key overrides.

Parameters:: overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
Returns:: New configuration instance with overrides applied.
Return type:: AutoencoderConfig
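The documented contract of apply_overrides (dot-separated keys in, a new configuration instance out) can be sketched with stand-in dataclasses. Everything prefixed with Toy below is hypothetical and exists only so the example runs without PG-SUI installed; the default values are placeholders, and the real classes may validate keys differently.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class ToyIOConfig:            # hypothetical stand-in for IOConfig
    prefix: str = "pgsui"     # placeholder default

@dataclass
class ToyModelConfig:         # hypothetical stand-in for ModelConfig
    latent_dim: int = 8       # placeholder default

@dataclass
class ToyConfig:              # hypothetical stand-in for AutoencoderConfig
    io: ToyIOConfig = field(default_factory=ToyIOConfig)
    model: ToyModelConfig = field(default_factory=ToyModelConfig)

    def apply_overrides(self, overrides: Optional[Dict[str, Any]]) -> "ToyConfig":
        new = deepcopy(self)                     # leave the original untouched
        for dotted, value in (overrides or {}).items():
            *parents, leaf = dotted.split(".")
            target = new
            for name in parents:
                target = getattr(target, name)   # walk down to the owning section
            if not hasattr(target, leaf):
                raise KeyError(f"Unknown config key: {dotted}")
            setattr(target, leaf, value)
        return new

base = ToyConfig()
tuned = base.apply_overrides({"model.latent_dim": 12, "io.prefix": "ae_demo"})
print(base.model.latent_dim, tuned.model.latent_dim)
```

Because the method returns a copy, a preset can be kept as a baseline while several overridden variants are derived from it.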

Variational Autoencoder (Config)

Adds a vae section (kl_beta) to the autoencoder defaults, controlling the KL weight.

class pgsui.data_processing.containers.VAEConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, vae: VAEExtraConfig = <factory>, sim: SimConfig = <factory>)

Top-level configuration for ImputeVAE (AE-parity + VAE extras).

Mirrors AutoencoderConfig sections and adds a vae block with KL-beta controls for the VAE loss.

io

I/O configuration.

Type:: IOConfig

model

Model architecture configuration.

Type:: ModelConfig

train

Training procedure configuration.

Type:: TrainConfig

tune

Hyperparameter tuning configuration.

Type:: TuneConfig

plot

Plotting configuration.

Type:: PlotConfig

vae

VAE-specific configuration.

Type:: VAEExtraConfig

sim

Simulated-missing configuration.

Type:: SimConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → VAEConfig

Build a VAEConfig from a named preset.

Parameters:: preset (Literal["fast", "balanced", "thorough"]) – Preset name.
Returns:: Configuration instance corresponding to the preset.
Return type:: VAEConfig

apply_overrides(overrides: Dict[str, Any] | None) → VAEConfig

Apply flat dot-key overrides.

Non-linear PCA (Config)

Adds an nlpca section with projection controls for latent refinement.

class pgsui.data_processing.containers.NLPCAConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, nlpca: NLPCAExtraConfig = <factory>)

Top-level configuration for ImputeNLPCA.

This configuration class encapsulates all settings required for the ImputeNLPCA model, including I/O, model architecture, training, hyperparameter tuning, plotting, simulated-missing configuration, and NLPCA-specific projection settings.

io

I/O configuration.

Type:: IOConfig

model

Model architecture configuration.

Type:: ModelConfig

train

Training procedure configuration.

Type:: TrainConfig

tune

Hyperparameter tuning configuration.

Type:: TuneConfig

plot

Plotting configuration.

Type:: PlotConfig

sim

Simulated-missing configuration.

Type:: SimConfig

nlpca

NLPCA-specific configuration.

Type:: NLPCAExtraConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → NLPCAConfig

Build an NLPCAConfig from a named preset.

Parameters:: preset (Literal["fast", "balanced", "thorough"]) – Preset name.
Returns:: Configuration instance corresponding to the preset.
Return type:: NLPCAConfig

apply_overrides(overrides: Dict[str, Any] | None) → NLPCAConfig

Apply flat dot-key overrides.

Parameters:: overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
Returns:: New configuration instance with overrides applied.
Return type:: NLPCAConfig

Unsupervised Backpropagation (Config)

Adds a ubp section with projection controls for latent refinement.

class pgsui.data_processing.containers.UBPConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, ubp: UBPExtraConfig = <factory>)

Top-level configuration for ImputeUBP.

This configuration class encapsulates all settings required for the ImputeUBP model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.

io

I/O configuration.

Type:: IOConfig

model

Model architecture configuration.

Type:: ModelConfig

train

Training procedure configuration.

Type:: TrainConfig

tune

Hyperparameter tuning configuration.

Type:: TuneConfig

plot

Plotting configuration.

Type:: PlotConfig

sim

Simulated-missing configuration.

Type:: SimConfig

ubp

UBP-specific configuration.

Type:: UBPExtraConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → UBPConfig

Build a UBPConfig from a named preset.

Parameters:: preset (Literal["fast", "balanced", "thorough"]) – Preset name.
Returns:: Configuration instance corresponding to the preset.
Return type:: UBPConfig

apply_overrides(overrides: Dict[str, Any] | None) → UBPConfig

Apply flat dot-key overrides.

Parameters:: overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
Returns:: New configuration instance with overrides applied.
Return type:: UBPConfig

Model Summaries

For algorithm details, see ImputeAutoencoder, ImputeVAE, ImputeNLPCA, and ImputeUBP.

Non-linear PCA (ImputeNLPCA)

Encodes genotypes to 0/1/2 (missing → -1) and detects haploid panels, collapsing ALT/HET where appropriate.
Initializes per-sample latent embeddings with PCA and optimizes them directly (no encoder network).
Trains with focal cross-entropy (train.gamma) while jointly optimizing decoder weights and latent vectors, using class-weight controls from train.weights_*.
Input refinement updates originally missing entries in the working matrix after selected epochs while keeping simulated-missing positions masked.
Projection-based evaluation refines latents with the decoder frozen, controlled by nlpca.projection_lr / nlpca.projection_epochs.
Optional Optuna tuning uses the tune envelope (trial count, metric selection, and optional patience settings) before the final training run.
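The focal cross-entropy mentioned above can be sketched in NumPy. This is a generic formulation of focal loss with per-class weights, evaluated only on observed entries; it is not taken from the PG-SUI source, and details such as the averaging (a plain mean over observed entries) are assumptions of this example.

```python
import numpy as np

def focal_cross_entropy(logits, targets, gamma=2.0, class_weights=None):
    """Mean focal loss over observed entries; targets equal to -1 are skipped.

    logits: (n_samples, n_loci, n_classes); targets: (n_samples, n_loci).
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    observed = targets >= 0
    z = logits[observed]                           # (n_observed, n_classes)
    y = targets[observed]
    z = z - z.max(axis=1, keepdims=True)           # stabilise the softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    log_pt = log_p[np.arange(y.size), y]           # log-prob of the true class
    pt = np.exp(log_pt)
    loss = -((1.0 - pt) ** gamma) * log_pt         # down-weight easy examples
    if class_weights is not None:
        loss = loss * np.asarray(class_weights)[y]
    return float(loss.mean())

logits = np.zeros((2, 2, 3))                       # uniform predictions
targets = np.array([[0, -1],
                    [2, 1]])
print(focal_cross_entropy(logits, targets, gamma=0.0))  # plain cross-entropy
print(focal_cross_entropy(logits, targets, gamma=2.0))  # focal down-weighting
```

With gamma set to 0 the expression reduces to ordinary cross-entropy; larger gamma values shrink the contribution of entries the model already predicts confidently.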

Unsupervised Backpropagation (ImputeUBP)

Uses a phased UBP schedule: PCA initialization, decoder-only refinement, then joint optimization of embeddings and decoder weights.
Trains with focal cross-entropy using train.weights_* settings and optional gamma scheduling.
Class weighting can be capped by train.weights_max_ratio to prevent extreme rebalancing on sparse genotypes.
Projection-based evaluation and inference refine embeddings with the decoder frozen (ubp.projection_lr / ubp.projection_epochs).
Tunable through Optuna with the same tuning surface as NLPCA (tune.* fields).
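Projection with a frozen decoder can be illustrated on a toy problem. The sketch below deliberately swaps in a linear decoder and a masked squared-error loss in place of the neural decoder and focal loss, so it shows only the idea of updating latents while decoder weights stay fixed. The lr and epochs arguments are stand-ins for the projection_lr / projection_epochs settings, and all names are illustrative.

```python
import numpy as np

def refine_latents(Z, W, X, observed, lr=0.05, epochs=200):
    """Gradient descent on latents Z only; decoder weights W are never updated."""
    Z = Z.copy()
    for _ in range(epochs):
        residual = (Z @ W - X) * observed          # ignore masked / missing entries
        grad_Z = 2.0 * residual @ W.T / observed.sum()
        Z -= lr * grad_Z
    return Z

def masked_mse(Z, W, X, observed):
    """Squared reconstruction error averaged over observed entries."""
    return float((((Z @ W - X) * observed) ** 2).sum() / observed.sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6))                        # frozen toy decoder
W_before = W.copy()
X = rng.normal(size=(4, 2)) @ W                    # data the decoder can represent
observed = rng.random(X.shape) > 0.2               # True where a value is available
Z0 = np.zeros((4, 2))                              # starting latents for the toy run
Z1 = refine_latents(Z0, W, X, observed)
print(masked_mse(Z0, W, X, observed), masked_mse(Z1, W, X, observed))
```

Only Z changes between the two error values printed at the end, which is the property that makes projection usable at evaluation and inference time without retraining the decoder.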

Autoencoder (ImputeAutoencoder)

Feed-forward encoder/decoder with symmetric hidden layers (model.layer_schedule) and dropout; no latent refinement is performed during evaluation.
Uses focal cross-entropy with class weighting from train.weights_* and early stopping controlled by train.early_stop_gen / train.min_epochs.
Optional Optuna tuning searches over architecture and loss/optimizer controls (latent_dim, layer schedule, dropout, learning rate, L1, weights, gamma, gamma_schedule) using tune.metrics as the objective.
transform() runs the trained model to obtain reconstruction logits, fills only the previously missing calls with the predicted classes, decodes the result to IUPAC, and plots original vs. imputed genotype counts.
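The "fill only previously missing calls" step amounts to a masked replacement, sketched below under the assumption that predictions are taken as the arg-max over class logits. The function name fill_missing is illustrative.

```python
import numpy as np

def fill_missing(encoded, logits):
    """Replace only the -1 entries with the arg-max class from the logits."""
    predicted = logits.argmax(axis=-1)
    return np.where(encoded < 0, predicted, encoded)

encoded = np.array([[0, -1, 2],
                    [-1, 1, 0]])
logits = np.array([
    [[0.1, 0.2, 3.0], [2.0, 0.5, 0.1], [1.0, 0.0, 0.0]],
    [[0.0, 4.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 5.0]],
])
imputed = fill_missing(encoded, logits)
print(imputed)
```

Observed calls pass through unchanged even where the model disagrees with them (the first entry stays 0 although the logits favour class 2), so imputation never overwrites real data.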

Variational Autoencoder (ImputeVAE)

Extends the autoencoder with a KL divergence term weighted by vae.kl_beta (or tuned via Optuna).
Uses focal cross-entropy reconstruction with train.gamma plus optional train.gamma_schedule and KL scheduling via vae.kl_beta_schedule.
Hyperparameter tuning extends the autoencoder search surface with vae.kl_beta and scheduling flags (vae.kl_beta_schedule, train.gamma_schedule).
transform() predicts class probabilities across genotypes, fills masked entries with MAP labels, and emits IUPAC arrays with paired distribution plots.
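The role of vae.kl_beta can be seen in the standard VAE objective: a reconstruction term plus a weighted KL divergence between the approximate posterior and a standard normal prior. The sketch below uses the usual closed form for a diagonal Gaussian; the function names are illustrative, and whether PG-SUI sums or averages the KL term is not stated on this page.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over samples."""
    per_sample = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)
    return float(per_sample.mean())

def vae_loss(reconstruction_loss, mu, logvar, kl_beta=1.0):
    """Total objective: reconstruction term plus the beta-weighted KL penalty."""
    return reconstruction_loss + kl_beta * kl_to_standard_normal(mu, logvar)

mu = np.ones((2, 3))          # posterior means for 2 samples, 3 latent dims
logvar = np.zeros((2, 3))     # unit variance
print(kl_to_standard_normal(mu, logvar))
print(vae_loss(0.7, mu, logvar, kl_beta=2.0))
```

Raising kl_beta pulls the latent distribution harder toward the prior at the cost of reconstruction accuracy, which is why it is exposed both as a config field and as a tunable parameter.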

Usage Examples

from snpio import VCFReader
from pgsui import ImputeVAE
from pgsui.data_processing.containers import VAEConfig

gdata = VCFReader("cohort.vcf.gz", popmapfile="pops.popmap")

cfg = VAEConfig.from_preset("balanced")
cfg.io.prefix = "vae_demo"
cfg.vae.kl_beta = 1.5
cfg.tune.enabled = False  # accept preset defaults

imputer = ImputeVAE(genotype_data=gdata, config=cfg)
imputer.fit()
genotypes_iupac = imputer.transform()

CLI usage mirrors the Python API:

pg-sui \
   --input cohort.vcf.gz \
   --models ImputeNLPCA ImputeUBP \
   --preset thorough \
   --set io.prefix=nlpca_vs_ubp \
   --config configs/nlpca.yaml \
   --sim-strategy random_weighted \
   --sim-prop 0.20

--sim-strategy and --sim-prop apply to every selected neural model. Each imputer simulates missingness independently; set sim.sim_kwargs.seed if you need identical masks across runs. Unsupervised models always simulate missingness during training and evaluation; sim.simulate_missing is currently ignored for these models, so adjust sim.sim_prop and sim.sim_strategy instead. See Overview for how each strategy behaves.
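The effect of a fixed seed on simulated missingness can be sketched as follows. This example masks observed calls uniformly at random, which is a simplification and not the random_weighted strategy used in the command above; the function simulate_missing and its arguments are illustrative rather than part of the PG-SUI API.

```python
import numpy as np

def simulate_missing(encoded, prop=0.2, seed=None):
    """Mask a random fraction of the observed calls; return masked copy and mask."""
    rng = np.random.default_rng(seed)
    observed = np.flatnonzero(encoded.ravel() >= 0)   # skip truly missing calls
    n_mask = int(round(prop * observed.size))
    chosen = rng.choice(observed, size=n_mask, replace=False)
    sim_mask = np.zeros(encoded.size, dtype=bool)
    sim_mask[chosen] = True
    sim_mask = sim_mask.reshape(encoded.shape)
    masked = encoded.copy()
    masked[sim_mask] = -1                             # hide the selected calls
    return masked, sim_mask

encoded = np.array([[0, 1, 2, -1, 0],
                    [2, 0, -1, 1, 1]])
masked_a, mask_a = simulate_missing(encoded, prop=0.25, seed=42)
masked_b, mask_b = simulate_missing(encoded, prop=0.25, seed=42)
print(mask_a.sum(), np.array_equal(mask_a, mask_b))
```

Reusing the seed reproduces the same mask, so held-out positions line up across runs and the resulting metrics are directly comparable.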