Unsupervised Imputers

Overview

PG-SUI ships four neural imputers—NLPCA, UBP, Autoencoder, and VAE—that share the pgsui.impute.unsupervised.base.BaseNNImputer scaffolding. They operate on 0/1/2 encodings generated by snpio.analysis.genotype_encoder.GenotypeEncoder, learn to reconstruct genotypes with missing calls masked as -1, and expose a transform() method that returns IUPAC strings.

Workflow Highlights

  • Instantiate with a GenotypeData object and a typed *Config (dataclass instance, nested dict, or YAML path). Dot-key overrides ({"model.latent_dim": 12}) are applied last.

  • fit() standardises the data split, computes class weights from observed zygosity, optionally tunes hyperparameters with Optuna, then trains the model while caching metrics/plots under {prefix}_output/Unsupervised/.

  • transform() reuses the trained model to impute the full dataset, decodes to IUPAC, and writes genotype-distribution plots for original vs. imputed calls.
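The dot-key override step can be pictured with a small, self-contained sketch. The `Toy*` dataclasses and `apply_dot_overrides` below are hypothetical stand-ins written for illustration, not PG-SUI classes; they only show one way a flat key such as "model.latent_dim" might be resolved against nested dataclasses while leaving the original config untouched.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict


@dataclass
class ToyModelConfig:  # hypothetical stand-in for a "model" section
    latent_dim: int = 8
    dropout_rate: float = 0.2


@dataclass
class ToyTrainConfig:  # hypothetical stand-in for a "train" section
    learning_rate: float = 1e-3


@dataclass
class ToyConfig:  # hypothetical top-level config
    model: ToyModelConfig = field(default_factory=ToyModelConfig)
    train: ToyTrainConfig = field(default_factory=ToyTrainConfig)


def apply_dot_overrides(cfg: Any, overrides: Dict[str, Any]) -> Any:
    """Return a copy of cfg with flat dot-key overrides applied."""
    for dotted_key, value in overrides.items():
        head, _, rest = dotted_key.partition(".")
        if not hasattr(cfg, head):
            raise KeyError(f"Unknown config key: {dotted_key!r}")
        if rest:
            # Recurse into the nested section and rebuild it with the override.
            value = apply_dot_overrides(getattr(cfg, head), {rest: value})
        # dataclasses.replace returns a new instance; cfg itself is not mutated.
        cfg = replace(cfg, **{head: value})
    return cfg


base = ToyConfig()
updated = apply_dot_overrides(
    base, {"model.latent_dim": 12, "train.learning_rate": 0.01}
)
print(updated.model.latent_dim, base.model.latent_dim)
```

Because every level is rebuilt with `dataclasses.replace`, the base config keeps its defaults and untouched fields (here `dropout_rate`) carry over unchanged.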

Configuration

Each imputer exposes the same top-level sections: io (run identity + logging), model (architecture), train (optimizer schedule, class weighting, early stopping), tune (Optuna surface), plot (figure styling), and sim (simulated missingness). Model-specific extras add vae (VAE KL controls) or projection blocks (nlpca / ubp). Presets (fast, balanced, thorough) trade accuracy for runtime; select them via --preset or *.from_preset(...) and then overlay YAML/CLI overrides. Heavy class imbalance can be tempered by train.weights_power / train.weights_inverse / train.weights_max_ratio; layer widths are governed by model.layer_schedule and model.layer_scaling_factor.
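To make the class-imbalance controls more concrete, here is a minimal sketch of how inverse-frequency weights could be tempered. The function name and the exact meaning given to `power`, `inverse`, and `max_ratio` are assumptions made for illustration; PG-SUI's own handling of train.weights_power, train.weights_inverse, and train.weights_max_ratio may differ.

```python
from collections import Counter


def class_weights(genotypes, power=1.0, inverse=True, max_ratio=None):
    """Toy class weights from observed 0/1/2 calls (missing calls are < 0).

    Assumed semantics (not taken from PG-SUI):
      inverse   - weight by 1/frequency instead of frequency
      power     - exponent applied to the raw weight (values < 1 soften it)
      max_ratio - cap on weight relative to the smallest class weight
    """
    counts = Counter(g for g in genotypes if g >= 0)
    total = sum(counts.values())
    weights = {}
    for cls, n in counts.items():
        freq = n / total
        raw = 1.0 / freq if inverse else freq
        weights[cls] = raw ** power
    # Rescale so the smallest weight is 1, which makes max_ratio a direct cap.
    smallest = min(weights.values())
    weights = {cls: w / smallest for cls, w in weights.items()}
    if max_ratio is not None:
        weights = {cls: min(w, max_ratio) for cls, w in weights.items()}
    return weights


# 80 homozygous-reference, 15 heterozygous, 5 homozygous-alternate, 10 missing.
calls = [0] * 80 + [1] * 15 + [2] * 5 + [-1] * 10
print(class_weights(calls))
print(class_weights(calls, max_ratio=5.0))
```

In this toy panel the rarest class would otherwise receive roughly 16 times the weight of the commonest one; the cap limits that ratio to 5.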

Tip

The neural imputers honour cfg.io.prefix by creating models/, plots/, metrics/, optimize/, and parameters/ subdirectories, mirroring the supervised toolkit.

See Optuna Hyperparameter Tuning for details on automated hyperparameter optimization.

Config Dataclasses

Standard Autoencoder (Config)

A standard autoencoder config capturing architecture and training settings for pgsui.impute.unsupervised.imputers.autoencoder.ImputeAutoencoder.

class pgsui.data_processing.containers.AutoencoderConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>)

Top-level configuration for ImputeAutoencoder.

This configuration class encapsulates all settings required for the ImputeAutoencoder model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

sim

Simulated-missing configuration.

Type:

SimConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → AutoencoderConfig

Build an AutoencoderConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

AutoencoderConfig

apply_overrides(overrides: Dict[str, Any] | None) → AutoencoderConfig

Apply flat dot-key overrides.

Parameters:

overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.

Returns:

New configuration instance with overrides applied.

Return type:

AutoencoderConfig

Variational Autoencoder (Config)

Adds a vae section (kl_beta) to the autoencoder defaults, controlling the KL weight.

class pgsui.data_processing.containers.VAEConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, vae: VAEExtraConfig = <factory>, sim: SimConfig = <factory>)

Top-level configuration for ImputeVAE (AE-parity + VAE extras).

Mirrors AutoencoderConfig sections and adds a vae block with KL-beta controls for the VAE loss.

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

vae

VAE-specific configuration.

Type:

VAEExtraConfig

sim

Simulated-missing configuration.

Type:

SimConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → VAEConfig

Build a VAEConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

VAEConfig

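apply_overrides(overrides: Dict[str, Any] | None) → VAEConfig

Apply flat dot-key overrides.

Parameters:

overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.

Returns:

New configuration instance with overrides applied.

Return type:

VAEConfig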

Non-linear PCA (Config)

Adds an nlpca section with projection controls for latent refinement.

class pgsui.data_processing.containers.NLPCAConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, nlpca: NLPCAExtraConfig = <factory>)

Top-level configuration for ImputeNLPCA.

This configuration class encapsulates all settings required for the ImputeNLPCA model, including I/O, model architecture, training, hyperparameter tuning, plotting, simulated-missing, and NLPCA-specific configuration.

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

sim

Simulated-missing configuration.

Type:

SimConfig

nlpca

NLPCA-specific configuration.

Type:

NLPCAExtraConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → NLPCAConfig

Build an NLPCAConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

NLPCAConfig

apply_overrides(overrides: Dict[str, Any] | None) → NLPCAConfig

Apply flat dot-key overrides.

Parameters:

overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.

Returns:

New configuration instance with overrides applied.

Return type:

NLPCAConfig

Unsupervised Backpropagation (Config)

Adds a ubp section with projection controls for latent refinement.

class pgsui.data_processing.containers.UBPConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, ubp: UBPExtraConfig = <factory>)

Top-level configuration for ImputeUBP.

This configuration class encapsulates all settings required for the ImputeUBP model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

sim

Simulated-missing configuration.

Type:

SimConfig

ubp

UBP-specific configuration.

Type:

UBPExtraConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → UBPConfig

Build a UBPConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

UBPConfig

apply_overrides(overrides: Dict[str, Any] | None) → UBPConfig

Apply flat dot-key overrides.

Parameters:

overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.

Returns:

New configuration instance with overrides applied.

Return type:

UBPConfig

Model Summaries

For algorithm details, see ImputeAutoencoder, ImputeVAE, ImputeNLPCA, and ImputeUBP.

Non-linear PCA (ImputeNLPCA)

  • Encodes genotypes to 0/1/2 (missing → -1) and detects haploid panels, collapsing ALT/HET where appropriate.

  • Initializes per-sample latent embeddings with PCA and optimizes them directly (no encoder network).

  • Trains with focal cross-entropy (train.gamma) while jointly optimizing decoder weights and latent vectors, using class-weight controls from train.weights_*.

  • Input refinement updates originally missing entries in the working matrix after selected epochs while keeping simulated-missing positions masked.

  • Projection-based evaluation refines latents with the decoder frozen, controlled by nlpca.projection_lr / nlpca.projection_epochs.

  • Optional Optuna tuning uses the tune envelope (trial count, metric selection, and optional patience settings) before the final training run.
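For readers unfamiliar with focal cross-entropy, the following sketch shows the standard per-call form of the loss, with `gamma` playing the role of train.gamma and `class_weight` standing in for a weight derived from the train.weights_* settings. It is a plain-Python illustration of the general technique, not PG-SUI's training code.

```python
import math


def focal_cross_entropy(probs, target, gamma=2.0, class_weight=1.0):
    """Focal loss for one genotype call.

    probs        - predicted probabilities over the genotype classes
    target       - index of the true class
    gamma        - focusing parameter; 0 recovers ordinary cross-entropy
    class_weight - multiplicative weight for the true class
    """
    p_t = probs[target]
    # (1 - p_t)**gamma shrinks the loss of calls the model already gets right.
    return -class_weight * (1.0 - p_t) ** gamma * math.log(p_t)


confident = [0.90, 0.07, 0.03]  # easy call: true class 0 predicted at 0.90
uncertain = [0.40, 0.35, 0.25]  # hard call: true class 0 predicted at 0.40

for probs in (confident, uncertain):
    ce = focal_cross_entropy(probs, 0, gamma=0.0)
    fl = focal_cross_entropy(probs, 0, gamma=2.0)
    print(f"cross-entropy={ce:.4f}  focal={fl:.4f}")
```

The down-weighting is much stronger for the confident call than for the uncertain one, which is why focal loss shifts training effort toward hard or rare genotypes.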

Unsupervised Backpropagation (ImputeUBP)

  • Uses a phased UBP schedule: PCA initialization, decoder-only refinement, then joint optimization of embeddings and decoder weights.

  • Trains with focal cross-entropy using train.weights_* settings and optional gamma scheduling.

  • Class weighting can be capped by train.weights_max_ratio to prevent extreme rebalancing on sparse genotypes.

  • Projection-based evaluation and inference refine embeddings with the decoder frozen (ubp.projection_lr / ubp.projection_epochs).

  • Tunable through Optuna with the same tuning surface as NLPCA (tune.* fields).
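Projection with a frozen decoder can be sketched in a few lines of plain Python. The toy linear decoder, its weights, and the function names below are hypothetical; the point is only that gradient steps update the sample's latent vector while the decoder parameters stay fixed, and that masked calls (-1) contribute nothing to the loss. The `projection_lr` and `projection_epochs` arguments mirror the ubp.projection_lr / ubp.projection_epochs (and nlpca.*) settings in name only.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_logits(z, weights):
    # weights[locus][cls] is a fixed weight vector; the decoder is never updated.
    return [[sum(w * zk for w, zk in zip(w_cls, z)) for w_cls in w_locus]
            for w_locus in weights]


def observed_loss(z, weights, genotypes):
    """Cross-entropy summed over observed calls only (missing calls are < 0)."""
    loss = 0.0
    for logits, g in zip(decoder_logits(z, weights), genotypes):
        if g >= 0:
            loss -= math.log(softmax(logits)[g])
    return loss


def refine_latent(z, weights, genotypes, projection_lr, projection_epochs):
    """Gradient descent on the latent vector with the decoder held fixed."""
    z = list(z)
    for _ in range(projection_epochs):
        grad = [0.0] * len(z)
        for w_locus, logits, g in zip(weights, decoder_logits(z, weights), genotypes):
            if g < 0:
                continue  # masked call: no gradient
            probs = softmax(logits)
            for cls, w_cls in enumerate(w_locus):
                err = probs[cls] - (1.0 if cls == g else 0.0)
                for k, w in enumerate(w_cls):
                    grad[k] += err * w
        z = [zk - projection_lr * gk for zk, gk in zip(z, grad)]
    return z


# Toy decoder: 3 loci x 3 genotype classes x 2 latent dimensions.
weights = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
    [[0.5, -0.5], [0.0, 0.0], [-0.5, 0.5]],
    [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
]
genotypes = [0, -1, 2]  # the middle call is missing
z0 = [0.0, 0.0]
z1 = refine_latent(z0, weights, genotypes, projection_lr=0.1, projection_epochs=50)
print(observed_loss(z0, weights, genotypes), observed_loss(z1, weights, genotypes))
```

After refinement, the decoder's output at the missing locus can be read off from `decoder_logits(z1, weights)[1]` to impute that call.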

Autoencoder (ImputeAutoencoder)

  • Feed-forward encoder/decoder with symmetric hidden layers (model.layer_schedule) and dropout; no latent refinement is performed during evaluation.

  • Uses focal cross-entropy with class weighting from train.weights_* and early stopping controlled by train.early_stop_gen / train.min_epochs.

  • Optional Optuna tuning searches over architecture and loss/optimizer controls (latent_dim, layer schedule, dropout, learning rate, L1, weights, gamma, gamma_schedule) using tune.metrics as the objective.

  • transform() runs the trained model to obtain reconstructed logits, fills only previously missing calls, decodes to IUPAC, and plots original vs. imputed counts.
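The "fill only missing calls, then decode" step can be illustrated as follows. This is a simplified sketch under stated assumptions: 0 is taken to mean homozygous reference, 1 heterozygous, and 2 homozygous alternate, each locus is biallelic, and the helper names are hypothetical rather than part of PG-SUI or SNPio.

```python
# IUPAC ambiguity codes for heterozygous biallelic calls.
IUPAC_HET = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def fill_missing(observed, predicted):
    """Keep observed calls; use the prediction only where the call is missing."""
    return [p if o < 0 else o for o, p in zip(observed, predicted)]


def decode_call(code, ref, alt):
    """Map a 0/1/2 code to an IUPAC character for a biallelic locus."""
    if code == 0:
        return ref
    if code == 2:
        return alt
    if code == 1:
        return IUPAC_HET[frozenset((ref, alt))]
    return "N"  # anything else is still missing


observed = [0, -1, 2, -1]    # one sample across four loci; -1 marks missing
predicted = [1, 1, 0, 2]     # model output for every locus
refs = ["A", "C", "G", "T"]
alts = ["G", "T", "A", "C"]

filled = fill_missing(observed, predicted)
iupac = [decode_call(c, r, a) for c, r, a in zip(filled, refs, alts)]
print(filled, iupac)
```

Note that the predictions at loci 0 and 2 disagree with the observed calls but are ignored, since only originally missing positions are replaced.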

Variational Autoencoder (ImputeVAE)

  • Extends the autoencoder with a KL divergence term weighted by vae.kl_beta (or tuned via Optuna).

  • Uses focal cross-entropy reconstruction with train.gamma plus optional train.gamma_schedule and KL scheduling via vae.kl_beta_schedule.

  • Hyperparameter tuning extends the autoencoder search surface with vae.kl_beta and scheduling flags (vae.kl_beta_schedule, train.gamma_schedule).

  • transform() predicts class probabilities across genotypes, fills masked entries with MAP labels, and emits IUPAC arrays with paired distribution plots.
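The role of the KL weight and of MAP decoding can be shown with a short sketch. It uses the standard closed-form KL divergence between a diagonal Gaussian posterior and a unit Gaussian prior; `kl_beta` mirrors vae.kl_beta in name, and the function names are illustrative assumptions rather than PG-SUI internals.

```python
import math


def kl_divergence(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def vae_loss(reconstruction_loss, mu, logvar, kl_beta=1.0):
    """Reconstruction term plus the beta-weighted KL penalty."""
    return reconstruction_loss + kl_beta * kl_divergence(mu, logvar)


def map_label(probs):
    """Index of the most probable genotype class (the MAP label)."""
    return max(range(len(probs)), key=probs.__getitem__)


mu, logvar = [1.0, 0.0], [0.0, 0.0]
print(kl_divergence(mu, logvar))
print(vae_loss(0.4, mu, logvar, kl_beta=1.5))
print(map_label([0.1, 0.7, 0.2]))
```

Raising `kl_beta` pulls the latent posterior toward the prior at the expense of reconstruction accuracy; a schedule that increases it gradually during training is a common way to soften that trade-off early on.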

Usage Examples

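from snpio import VCFReader
from pgsui import ImputeVAE
from pgsui.data_processing.containers import VAEConfig

# Load the genotypes and the population map.
gdata = VCFReader("cohort.vcf.gz", popmapfile="pops.popmap")

# Start from a preset, then adjust individual fields.
cfg = VAEConfig.from_preset("balanced")
cfg.io.prefix = "vae_demo"  # prefix for the output directory
cfg.vae.kl_beta = 1.5  # weight on the KL term
cfg.tune.enabled = False  # skip Optuna tuning; accept preset defaults

# Train the model, then impute the full dataset as IUPAC strings.
imputer = ImputeVAE(genotype_data=gdata, config=cfg)
imputer.fit()
genotypes_iupac = imputer.transform()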

CLI usage mirrors the Python API:

pg-sui \
   --input cohort.vcf.gz \
   --models ImputeNLPCA ImputeUBP \
   --preset thorough \
   --set io.prefix=nlpca_vs_ubp \
   --config configs/nlpca.yaml \
   --sim-strategy random_weighted \
   --sim-prop 0.20

--sim-strategy and --sim-prop apply to every selected neural model. Each imputer simulates missingness independently; set sim.sim_kwargs.seed if you need identical masks across runs. Unsupervised models always simulate missingness during training and evaluation; sim.simulate_missing is currently ignored for these models, so adjust sim.sim_prop and sim.sim_strategy instead. See Overview for how each strategy behaves.
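The effect of fixing a seed can be seen in a small stand-alone sketch. It masks a proportion of observed calls uniformly at random, which is a deliberate simplification and not the random_weighted strategy; `simulate_missing` and its arguments are hypothetical names that only echo sim.sim_prop and sim.sim_kwargs.seed.

```python
import random


def simulate_missing(genotypes, sim_prop, seed=None):
    """Mask a proportion of the observed calls (>= 0) with -1.

    Returns the masked matrix and the set of (row, column) positions that
    were masked. Calls that were already missing are left untouched.
    """
    rng = random.Random(seed)
    observed = [(i, j)
                for i, row in enumerate(genotypes)
                for j, g in enumerate(row) if g >= 0]
    n_mask = round(sim_prop * len(observed))
    positions = set(rng.sample(observed, n_mask))
    masked = [[-1 if (i, j) in positions else g for j, g in enumerate(row)]
              for i, row in enumerate(genotypes)]
    return masked, positions


genotypes = [
    [0, 1, 2, -1],
    [0, 0, 1, 2],
    [2, -1, 0, 1],
    [1, 1, 0, 0],
    [0, 2, 2, 1],
]

masked_a, pos_a = simulate_missing(genotypes, sim_prop=0.20, seed=42)
masked_b, pos_b = simulate_missing(genotypes, sim_prop=0.20, seed=42)
print(pos_a == pos_b)  # same seed, same mask
```

Only the masked positions have known ground truth, so evaluation metrics are computed on them; reusing the seed keeps that evaluation set fixed when comparing runs.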