pgsui.impute.unsupervised.imputers package

Submodules

pgsui.impute.unsupervised.imputers.autoencoder module

pgsui.impute.unsupervised.imputers.autoencoder.ensure_autoencoder_config(config: AutoencoderConfig | dict | str | None) → AutoencoderConfig[source]

Return a concrete AutoencoderConfig from dataclass, dict, YAML path, or None.

Notes

Supports top-level preset, or io.preset inside dict/YAML.
Does not mutate user-provided dict (deep-copies before processing).
Flattens nested dicts into dot-keys and applies them as overrides.

Parameters: config – AutoencoderConfig instance, dict, YAML path, or None.
Returns: Concrete AutoencoderConfig.
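The notes above (deep copy, preset lookup, dot-key flattening) can be sketched roughly as follows. This is a minimal illustration and not the library's code: the helper name flatten_to_dot_keys, the preset name "fast", and the model.latent_dim key are all hypothetical.

```python
from copy import deepcopy
from typing import Any


def flatten_to_dot_keys(nested: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten a nested dict into {"a.b.c": value} form (illustrative helper)."""
    flat: dict[str, Any] = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_to_dot_keys(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Hypothetical user config with the preset nested under io.preset.
user_cfg = {"io": {"preset": "fast"}, "model": {"latent_dim": 8}}

working = deepcopy(user_cfg)                         # never touch the caller's dict
preset = working.get("io", {}).pop("preset", None)   # pull the preset out first
overrides = flatten_to_dot_keys(working)             # remaining keys become dot-key overrides
```

Because the preset is popped from the deep copy, the caller's dict still contains io.preset afterwards, and the remaining keys can be applied on top of whatever the preset defines.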

class pgsui.impute.unsupervised.imputers.autoencoder.ImputeAutoencoder(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: 'AutoencoderConfig' | dict | str | None = None, overrides: dict | None = None, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] | None = None, sim_prop: float | None = None, sim_kwargs: dict | None = None)[source]

Bases: BaseNNImputer

Autoencoder imputer for 0/1/2 genotypes.

Trains a feedforward autoencoder on a genotype matrix encoded as 0/1/2 with missing values represented by any negative integer. Missingness is simulated once on the full matrix, then train/val/test splits reuse those masks. It supports haploid and diploid data, focal-CE reconstruction loss (optional scheduling), and Optuna-based hyperparameter tuning. Output is returned as IUPAC strings via decode_012.

Notes

Simulates missingness once on the full 0/1/2 matrix, then splits indices on clean ground truth.
Maintains clean targets and corrupted inputs per train/val/test, plus per-split masks.
Haploid harmonization happens after the single simulation (no re-simulation).
Training/validation loss is computed only where targets are known (~orig_mask_*).
Evaluation is computed only on simulated-missing sites (sim_mask_*).
transform() fills only originally missing sites and hard-errors if decoding yields “N”.
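As a rough illustration of how the masks in these notes relate to each other, the sketch below builds them for a toy matrix with NumPy. The variable names loss_mask and eval_mask are illustrative; only the orig_mask / sim_mask naming comes from the notes.

```python
import numpy as np

# Toy 0/1/2 matrix; -1 marks calls that were already missing in the input.
X = np.array([[0, 1, -1],
              [2, -1, 1],
              [1, 0, 2]])

orig_mask = X < 0                         # originally missing sites
sim_mask = np.zeros_like(orig_mask)       # boolean, all False
sim_mask[0, 1] = sim_mask[2, 2] = True    # sites hidden on purpose

X_corrupted = X.copy()
X_corrupted[sim_mask] = -1                # corrupted model input

loss_mask = ~orig_mask                    # loss only where a target is known
eval_mask = sim_mask & ~orig_mask         # scoring only on simulated-missing sites
```

The clean matrix X keeps the true value at every simulated-missing site, which is what makes scoring on eval_mask possible.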

Initialize the Autoencoder imputer with a unified config interface.

Parameters:

genotype_data (GenotypeData) – Backing genotype data object.
tree_parser (Optional[TreeParser]) – Optional SNPio tree parser for nonrandom simulated-missing modes.
config (Optional[Union[AutoencoderConfig, dict, str]]) – AutoencoderConfig, nested dict, YAML path, or None.
overrides (Optional[dict]) – Optional dot-key overrides with highest precedence.
sim_strategy (Optional[Literal["random", "random_weighted", "random_weighted_inv", "nonrandom", "nonrandom_weighted"]]) – Override sim strategy; if None, uses config default.
sim_prop (Optional[float]) – Override simulated missing proportion; if None, uses config default. Default is None.
sim_kwargs (Optional[dict]) – Override/extend simulated missing kwargs; if None, uses config default.

fit() → ImputeAutoencoder[source]

Fit the Autoencoder imputer model to the genotype data.

This method performs the following steps:

Validates the presence of SNP data in the genotype data.
Determines ploidy and sets up the number of classes accordingly.
Cleans the ground truth genotype matrix and simulates missingness.
Splits the data into training, validation, and test sets.
Prepares one-hot encoded inputs for the model.
Initializes plotting utilities and valid-class masks.
Sets up data loaders for training and validation.
Performs hyperparameter tuning if enabled, otherwise uses fixed hyperparameters.
Builds and trains the Autoencoder model.
Evaluates the trained model on the test set.
Returns the fitted ImputeAutoencoder instance.

Returns: The fitted ImputeAutoencoder instance.
Return type: ImputeAutoencoder
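The simulate-once, split, and one-hot steps listed for fit() might look roughly like the NumPy sketch below. It is not the library's implementation: the 60/20/20 split, the plain uniform "random" masking, and the choice to encode missing calls as all-zero vectors are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_clean = rng.integers(0, 3, size=(10, 6))   # toy clean 0/1/2 matrix
sim_prop = 0.2                                # hypothetical proportion

# Simulate missingness once on the full matrix (a plain "random" strategy).
sim_mask = rng.random(X_clean.shape) < sim_prop
X_corrupted = np.where(sim_mask, -1, X_clean)

# Split sample indices; each split reuses rows of the same mask.
perm = rng.permutation(X_clean.shape[0])
train_idx, val_idx, test_idx = perm[:6], perm[6:8], perm[8:]
sim_mask_train = sim_mask[train_idx]
X_train_corrupted = X_corrupted[train_idx]


def one_hot(X, num_classes):
    """One-hot encode genotypes; missing (negative) calls become all-zero vectors."""
    out = np.zeros(X.shape + (num_classes,), dtype=np.float32)
    rows, cols = np.nonzero(X >= 0)
    out[rows, cols, X[rows, cols]] = 1.0
    return out


num_classes = 3                               # diploid; 2 for haploid
X_train_onehot = one_hot(X_train_corrupted, num_classes)
```

Splitting by sample index after the single simulation keeps the train, validation, and test masks consistent with the full-matrix mask.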

transform() → ndarray[source]

Impute missing genotypes and return IUPAC strings.

This method performs the following steps:

Validates that the model has been fitted.
Uses the trained model to predict missing genotypes for the entire dataset.
Fills in the missing genotypes in the original dataset with the predicted values from the model.
Decodes the imputed genotype matrix from 0/1/2 encoding to IUPAC strings.
Checks for any remaining missing values or decoding issues, raising errors if found.
Optionally generates and displays plots comparing the original and imputed genotype distributions.
Returns the imputed IUPAC genotype matrix.

Returns:

IUPAC genotype matrix of shape (n_samples, n_loci).

Return type:

np.ndarray

Raises:

NotFittedError – If called before fit().
RuntimeError – If any missing values remain or decoding yields “N”.
RuntimeError – If loci contain ‘N’ after imputation due to missing REF/ALT metadata.
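A hedged sketch of the fill-then-decode behaviour of transform() is shown below. The function decode_012_sketch is a hypothetical stand-in and does not reproduce decode_012; it assumes biallelic SNPs with known REF/ALT alleles and uses the standard IUPAC ambiguity codes for heterozygotes.

```python
import numpy as np

# Standard IUPAC ambiguity codes for heterozygous biallelic SNPs.
IUPAC_HET = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
             frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M"}


def decode_012_sketch(X, ref, alt):
    """Map 0/1/2 calls to IUPAC letters per locus; anything else stays 'N'."""
    out = np.full(X.shape, "N", dtype="<U1")
    for j, (r, a) in enumerate(zip(ref, alt)):
        codes = {0: r, 1: IUPAC_HET.get(frozenset((r, a)), "N"), 2: a}
        for genotype, letter in codes.items():
            out[X[:, j] == genotype, j] = letter
    return out


X = np.array([[0, -1], [1, 2], [-1, 0]])     # -1 = originally missing
pred = np.array([[0, 1], [1, 2], [2, 0]])    # stand-in for model predictions
orig_mask = X < 0

X_imputed = np.where(orig_mask, pred, X)     # fill only originally missing sites
iupac = decode_012_sketch(X_imputed, ref=["A", "C"], alt=["G", "T"])
if (iupac == "N").any():
    raise RuntimeError("Decoding produced 'N' after imputation")
```

Observed calls pass through unchanged, so only the two originally missing cells take their value from the predictions.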

pgsui.impute.unsupervised.imputers.nlpca module

pgsui.impute.unsupervised.imputers.nlpca.ensure_nlpca_config(config: NLPCAConfig | dict | str | None) → NLPCAConfig[source]

Return a concrete config for NLPCA.

NLPCA reuses NLPCAConfig (latent_dim, hidden sizes, loss controls, projection controls, tuning spec).

Parameters: config – NLPCAConfig instance, dict, YAML path, or None.
Returns: Concrete configuration.
Return type: NLPCAConfig

Bases: BaseNNImputer

Non-linear PCA (NLPCA) Imputer for Genotype Data.

This is “UBP Phase 3 only” + explicit input refinement.

Key differences vs ImputeUBP:

No Phase 2 (decoder-only refinement).
Joint optimization only (V and W updated together).
EM-like updates of originally missing inputs during training: after each epoch, replace originally-missing values in the working matrix with the model’s current reconstructions. Simulated-missing values are NEVER filled during training (prevents leakage).
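The EM-like input refinement described in the last item can be sketched as a single per-epoch update. The function name refine_inputs and the toy arrays are illustrative only; recon stands in for the argmax of the model's current output.

```python
import numpy as np


def refine_inputs(X_work, recon, orig_mask, sim_mask):
    """Overwrite only originally missing, non-simulated cells with reconstructions."""
    fill = orig_mask & ~sim_mask
    X_next = X_work.copy()
    X_next[fill] = recon[fill]
    return X_next


X_work = np.array([[0, -1, 2],
                   [-1, 1, -1]])
orig_mask = np.array([[False, True, False],
                      [True, False, False]])
sim_mask = np.array([[False, False, False],
                     [False, False, True]])
recon = np.array([[0, 2, 2],
                  [1, 1, 0]])

# One end-of-epoch update of the working matrix.
X_work = refine_inputs(X_work, recon, orig_mask, sim_mask)
```

The simulated-missing cell at position (1, 2) stays at -1, so held-out values never leak into the training targets.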

model_

Trained decoder model with learnable embeddings.

Type: nn.Module

is_fit_

Whether fit() has been called successfully.

Type: bool

X_train_work_

Working training matrix used as targets during NLPCA.

Type: np.ndarray

_X_train_work_init_

Initial copy for per-trial reset during tuning.

Type: np.ndarray

Initialize the ImputeNLPCA model.

Parameters:

genotype_data (GenotypeData) – Genotype data object.
tree_parser (TreeParser) – Tree parser for nonrandom missingness simulation.
config (NLPCAConfig | dict | str | None) – Configuration (NLPCAConfig or compatible dict/YAML path).
overrides (dict | None) – Dot-notation overrides for config.
sim_strategy (str | None) – Missingness simulation strategy.
sim_prop (float | None) – Proportion to simulate as missing.
sim_kwargs (dict | None) – Additional simulation kwargs.

fit() → ImputeNLPCA[source]: Fit NLPCA model (joint refinement + input refinement).

transform() → ndarray[source]

Impute missing values via final projection and decoding.

This method fills in all missing values (original + simulated) in the genotype matrix. It first refines the embeddings for all samples using the trained model, then predicts the missing genotypes, and finally decodes the imputed genotypes back to their original representation.

Returns: Imputed genotype matrix with missing values filled.
Return type: np.ndarray

pgsui.impute.unsupervised.imputers.ubp module

pgsui.impute.unsupervised.imputers.ubp.ensure_ubp_config(config: UBPConfig | dict | str | None) → UBPConfig[source]: Return a concrete UBPConfig.

Bases: BaseNNImputer

Unsupervised Backpropagation (UBP) Imputer for Genotype Data.

This model performs missing value imputation by learning a low-dimensional continuous manifold (latent space) that best explains the observed genotype patterns. Unlike standard autoencoders that require an encoder network to map inputs to the latent space, UBP treats the latent embeddings \(V\) for each sample as learnable parameters that are optimized alongside the decoder weights \(W\).

Model Description

Imagine every individual in your dataset can be described by a small set of abstract coordinates (like “ancestry X”, “ancestry Y”, etc.). We don’t know these coordinates, so we guess them randomly (or using PCA). We then train a neural network (the decoder) to take these coordinates and reconstruct the person’s DNA. If the reconstruction is wrong, we adjust both the neural network and the person’s coordinates to make the prediction better. Once trained, we fill in missing DNA markers based on where the individual sits in this abstract space.

Mathematical Formulation

The objective is to minimize the reconstruction error between the observed genotypes \(X\) and the model output \(\hat{X}\).

\[\hat{X} = f(V; W)\]

The optimization minimizes the cost function \(J\):

\[J(V, W) = \mathcal{L}_{Focal}(X_{obs}, \hat{X}_{obs}) + \lambda \|W\|_1\]

Where:

\(V \in \mathbb{R}^{N \times K}\) are the latent embeddings.

\(W\) are the network weights.

\(\mathcal{L}_{Focal}\) is the Focal Cross-Entropy loss (handling class imbalance).

\(\lambda\) is the L1 regularization coefficient.
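To make the cost function concrete, here is a minimal NumPy sketch of a masked focal cross-entropy plus an L1 penalty. It assumes the common focal form \(-(1 - p_t)^{\gamma} \log p_t\) and omits class weights; the function names and default values are illustrative, not the library's.

```python
import numpy as np


def focal_ce(probs, targets, obs_mask, gamma=2.0):
    """Mean focal cross-entropy over observed entries.

    probs: (n, m, c) class probabilities; targets: (n, m) ints;
    obs_mask: (n, m) bool, True where the genotype is observed.
    """
    safe_t = np.where(obs_mask, targets, 0)          # dummy class at masked sites
    p_t = np.take_along_axis(probs, safe_t[..., None], axis=-1)[..., 0]
    per_site = -((1.0 - p_t) ** gamma) * np.log(np.clip(p_t, 1e-12, 1.0))
    return per_site[obs_mask].mean()                 # masked sites are dropped here


def cost(probs, targets, obs_mask, weights, lam=1e-3, gamma=2.0):
    """J = focal loss on observed entries + lambda * L1 norm of the weights."""
    l1 = sum(np.abs(W).sum() for W in weights)
    return focal_ce(probs, targets, obs_mask, gamma) + lam * l1
```

With gamma set to 0 the focal term reduces to ordinary cross-entropy, which is a convenient sanity check.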

Training Procedure

This implementation modifies the original Gashler et al. algorithm for genomics:

Initialization (Modified Phase 1): Instead of training random projections, we initialize \(V\) using Principal Component Analysis (PCA) on the observed data to provide a “warm start” for the manifold.

Decoder Refinement (Phase 2): We freeze \(V\) and optimize only the network weights \(W\) to map the PCA embeddings to the genotypes.

Joint Optimization (Phase 3): We unfreeze \(V\) and optimize both \(V\) and \(W\) simultaneously. This allows the embeddings to drift off the linear PCA plane into a non-linear manifold.
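The three-phase schedule can be illustrated with a deliberately simplified NumPy sketch. It swaps in a linear decoder and squared error in place of the neural decoder and focal loss, and uses plain gradient descent; the sizes, learning rate, and step counts are arbitrary assumptions chosen for the toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 30, 12, 2
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, M))  # toy low-rank data
obs = rng.random((N, M)) > 0.2                          # True where observed
n_obs = obs.sum()


def loss(V, W):
    return (((V @ W - X) ** 2)[obs]).mean()


def grads(V, W):
    R = np.where(obs, V @ W - X, 0.0)                  # residual on observed entries only
    return 2 * R @ W.T / n_obs, 2 * V.T @ R / n_obs    # dJ/dV, dJ/dW


# Phase 1 (modified): PCA warm start on column-mean-filled, centered data.
col_mean = np.nanmean(np.where(obs, X, np.nan), axis=0)
filled = np.where(obs, X, col_mean)
U, S, _ = np.linalg.svd(filled - col_mean, full_matrices=False)
V = U[:, :K] * S[:K]
W = 0.01 * rng.normal(size=(K, M))
loss_init = loss(V, W)

lr = 0.02
# Phase 2: V is frozen, only W is updated.
for _ in range(200):
    _, gW = grads(V, W)
    W -= lr * gW
loss_phase2 = loss(V, W)

# Phase 3: V and W are updated together.
for _ in range(200):
    gV, gW = grads(V, W)
    V -= lr * gV
    W -= lr * gW
loss_phase3 = loss(V, W)
```

In the real model the decoder is non-linear, which is what lets Phase 3 move the embeddings off the linear PCA plane; the linear stand-in only demonstrates which parameters are updated in each phase.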

model_: torch.nn.Module The trained PyTorch model (Decoder).

is_fit_: bool Whether the model has been successfully fitted.

model_tuned_: bool Whether the model hyperparameters were tuned via Optuna.

model_params: dict Dictionary defining the model architecture (layers, dimensions, activations).

best_params_: dict The optimal hyperparameters found during tuning (or loaded from config).

tuned_params_: dict The full set of parameters selected after the tuning process.

num_tuned_params_: int Number of hyperparameters that were subject to tuning.

total_samples_: int Total number of samples (individuals) in the genotype data.

num_features_: int Number of SNP features (loci columns) in the genotype data.

num_classes_: int Number of genotype classes (2 for haploid, 3 for diploid).

is_haploid_: bool True if the data is haploid, False if diploid.

v_init_: torch.Tensor The initial PCA-derived embeddings used to warm-start the manifold.

class_weights_: torch.Tensor Calculated weights for the Focal Loss to handle genotype imbalance.

ground_truth_: np.ndarray The complete, original genotype matrix (0/1/2 encoded) used for training and evaluation.

sim_mask_: np.ndarray Boolean mask representing simulated missingness for the full dataset.

orig_mask_: np.ndarray Boolean mask representing original missingness (already missing in input) for the full dataset.

train_idx_: np.ndarray Indices of samples used for training.

val_idx_: np.ndarray Indices of samples used for validation.

test_idx_: np.ndarray Indices of samples used for testing.

X_train_: np.ndarray Corrupted genotype matrix (inputs) for training. Alias for X_train_corrupted_.

y_train_: np.ndarray Clean genotype matrix (targets) for training. Alias for X_train_clean_.

X_val_: np.ndarray Corrupted genotype matrix (inputs) for validation.

y_val_: np.ndarray Clean genotype matrix (targets) for validation.

X_test_: np.ndarray Corrupted genotype matrix (inputs) for testing.

y_test_: np.ndarray Clean genotype matrix (targets) for testing.

X_train_clean_: np.ndarray Persisted clean training genotypes.

X_train_corrupted_: np.ndarray Persisted corrupted training genotypes.

X_val_clean_: np.ndarray Persisted clean validation genotypes.

X_val_corrupted_: np.ndarray Persisted corrupted validation genotypes.

X_test_clean_: np.ndarray Persisted clean test genotypes.

X_test_corrupted_: np.ndarray Persisted corrupted test genotypes.

sim_mask_train_: np.ndarray Simulated missingness mask for training data.

sim_mask_val_: np.ndarray Simulated missingness mask for validation data.

sim_mask_test_: np.ndarray Simulated missingness mask for test data.

orig_mask_train_: np.ndarray Original missingness mask for training data.

orig_mask_val_: np.ndarray Original missingness mask for validation data.

orig_mask_test_: np.ndarray Original missingness mask for test data.

eval_mask_train_: np.ndarray Evaluation mask for training data (intersection of simulated mask and observed data).

eval_mask_val_: np.ndarray Evaluation mask for validation data.

eval_mask_test_: np.ndarray Evaluation mask for test data.

train_loader_: torch.utils.data.DataLoader PyTorch DataLoader for iterating over training batches.

val_loader_: torch.utils.data.DataLoader PyTorch DataLoader for iterating over validation batches.

plotter_: PrettyPlotter Plotting utilities for visualizing training progress and results.

scorers_: Scorer Scoring functions for evaluating imputation performance.

References

Gashler, M.S., Smith, M.R., Morris, R., & Martinez, T.R. (2014). Missing Value Imputation with Unsupervised Backpropagation. Computational Intelligence, 32(2), 196-215. https://doi.org/10.1111/coin.12048

Initialize the ImputeUBP model.

Parameters:

genotype_data (GenotypeData) – Genotype data object.
tree_parser (Optional[TreeParser]) – Tree parser for nonrandom missingness simulation.
config (Optional[Union[UBPConfig, dict, str]]) – Configuration for UBP model.
overrides (Optional[dict]) – Dot-notation overrides for config.
sim_strategy (Optional[str]) – Missingness simulation strategy.
sim_prop (Optional[float]) – Proportion of data to simulate as missing.
sim_kwargs (Optional[dict]) – Additional kwargs for simulation.

fit() → ImputeUBP[source]

Fit the UBP model using the 3-phase algorithm.

This method coordinates the entire training pipeline:

Preprocessing: Validates ploidy, encodes genotypes (0/1/2), and simulates missingness for self-supervised validation.
Split: Partitions data into Train, Validation, and Test sets based on the simulated masks.
Initialization: Performs PCA on the training set to initialize the latent embeddings (Phase 1 variant).
Training: Executes Phase 2 (Decoder Refinement) and Phase 3 (Joint Refinement) loops, optionally using Optuna for hyperparameter tuning.

Returns:

The fitted instance.

Return type:

ImputeUBP

Raises:

AttributeError – If genotype_data does not contain loaded SNPs.
ValueError – If ploidy is not 1 (haploid) or 2 (diploid).
RuntimeError – If training fails to converge or produces non-finite loss.

transform() → ndarray[source]

Impute missing values by projecting samples onto the learned manifold.

Unlike simple prediction, this method iteratively refines the embeddings \(V\) for all samples to minimize the reconstruction error of the observed genotypes, given the fixed weights \(W\) learned during fit(). Once the embeddings settle, the decoder generates the missing values.

Returns:

The fully imputed genotype matrix in 0/1/2 encoding.

Return type:

np.ndarray

Raises:

NotFittedError – If the model has not been trained yet.
RuntimeError – If imputation results in ‘N’ (invalid) characters.
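The projection step in transform() can be sketched as gradient descent on the embeddings alone, with the decoder weights held fixed. As a simplifying assumption, this toy example uses a linear decoder and squared error; the learning rate and step count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 2, 8
W = rng.normal(size=(K, M))              # stands in for trained, frozen decoder weights
X = rng.normal(size=(5, K)) @ W          # toy targets lying on the decoder's manifold
obs = rng.random(X.shape) > 0.3          # True where observed
n_obs = obs.sum()


def recon_error(V):
    return (((V @ W - X) ** 2)[obs]).mean()


V = np.zeros((5, K))                     # embeddings to refine
err_before = recon_error(V)

lr = 0.2
for _ in range(500):
    R = np.where(obs, V @ W - X, 0.0)    # residual on observed entries only
    V -= lr * 2 * R @ W.T / n_obs        # update V; W is never touched

err_after = recon_error(V)
X_filled = np.where(obs, X, V @ W)       # decoder output fills the unobserved entries
```

Only the unobserved entries take their value from the decoder; observed entries are carried over unchanged.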

pgsui.impute.unsupervised.imputers.vae module

pgsui.impute.unsupervised.imputers.vae.ensure_vae_config(config: VAEConfig | dict | str | None) → VAEConfig[source]

Ensure a VAEConfig instance from various input types.

Parameters: config (VAEConfig | dict | str | None) – Configuration input.
Returns: The resulting VAEConfig instance.
Return type: VAEConfig
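The "ensure" pattern used by these config helpers amounts to dispatching on the input type. The sketch below uses a hypothetical ToyConfig dataclass in place of VAEConfig, with made-up fields, and leaves out the YAML-path branch entirely.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass
class ToyConfig:                      # hypothetical stand-in for VAEConfig
    latent_dim: int = 8
    lr: float = 1e-3


def ensure_toy_config(config: Union[ToyConfig, dict, None]) -> ToyConfig:
    """Return a concrete ToyConfig from None, an instance, or a flat dict."""
    if config is None:
        return ToyConfig()                       # all defaults
    if isinstance(config, ToyConfig):
        return config                            # already concrete
    if isinstance(config, dict):
        return replace(ToyConfig(), **config)    # dict values override defaults
    raise TypeError(f"Unsupported config type: {type(config).__name__}")
```

For example, ensure_toy_config({"latent_dim": 4}) keeps the default lr and changes only latent_dim.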

class pgsui.impute.unsupervised.imputers.vae.ImputeVAE(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: 'VAEConfig' | dict | str | None = None, overrides: dict | None = None, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float | None = None, sim_kwargs: dict | None = None)[source]

Bases: BaseNNImputer

Variational Autoencoder (VAE) imputer for 0/1/2 genotypes.

Trains a feedforward variational autoencoder on a genotype matrix encoded as 0/1/2 with missing values represented by any negative integer. Missingness is simulated once on the full matrix, then train/val/test splits reuse those masks. It supports haploid and diploid data, focal-CE reconstruction loss (optional scheduling), and Optuna-based hyperparameter tuning. Output is returned as IUPAC strings via decode_012.

Notes

Simulates missingness once on the full 0/1/2 matrix, then splits indices on clean ground truth.
Maintains clean targets and corrupted inputs per train/val/test, plus per-split masks.
Haploid harmonization happens after the single simulation (no re-simulation).
Training/validation loss is computed only where targets are known (~orig_mask_*).
Evaluation is computed only on simulated-missing sites (sim_mask_*).
transform() fills only originally missing sites and hard-errors if decoding yields “N”.

__init__(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: 'VAEConfig' | dict | str | None = None, overrides: dict | None = None, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float | None = None, sim_kwargs: dict | None = None) → None[source]

Initialize the ImputeVAE imputer.

Adds robustness checks for:

missing required genotype_data attributes
invalid sim_strategy / sim_prop
nonrandom strategies requiring tree_parser
misconfigured config fields used in logging/training

Parameters:

genotype_data (GenotypeData) – Genotype data for imputation.
tree_parser (Optional[TreeParser]) – Tree parser required for nonrandom strategies.
config (Optional[Union[VAEConfig, dict, str]]) – Config dataclass, nested dict, YAML path, or None.
overrides (Optional[dict]) – Dot-key overrides applied last with highest precedence.
sim_strategy – Missingness simulation strategy (overrides config).
sim_prop (Optional[float]) – Proportion of entries to simulate as missing (overrides config).
sim_kwargs (Optional[dict]) – Extra missingness kwargs merged into config.

Raises:

AttributeError – If genotype_data is missing required attributes.
ValueError – If nonrandom strategy without tree_parser, or invalid sim_prop.
TypeError – If config type invalid (via ensure_vae_config).
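The argument checks listed for the constructor could be written along these lines. This is an illustrative sketch: the helper name check_sim_args is hypothetical, and treating valid sim_prop values as the open interval (0, 1) is an assumption, since the exact bounds are not stated here.

```python
from typing import Optional

SIM_STRATEGIES = ("random", "random_weighted", "random_weighted_inv",
                  "nonrandom", "nonrandom_weighted")


def check_sim_args(sim_strategy: str, sim_prop: Optional[float],
                   tree_parser: object = None) -> None:
    """Raise ValueError on inconsistent simulation arguments."""
    if sim_strategy not in SIM_STRATEGIES:
        raise ValueError(f"Unknown sim_strategy: {sim_strategy!r}")
    if sim_strategy.startswith("nonrandom") and tree_parser is None:
        raise ValueError(f"sim_strategy={sim_strategy!r} requires a tree_parser")
    # Assumed bounds: strictly between 0 and 1.
    if sim_prop is not None and not 0.0 < sim_prop < 1.0:
        raise ValueError(f"sim_prop must be in (0, 1), got {sim_prop}")
```

Validating at construction time means a missing tree_parser is reported before any training work starts.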

fit() → ImputeVAE[source]

Fit the VAE imputer model to the genotype data.

Adds robustness checks for:

empty / degenerate genotype matrices
NaN/Inf in encoded matrices
empty train/val/test splits
empty evaluation masks (no sites to score)
device / tensor conversion issues
save-path failures

Returns: The fitted ImputeVAE instance.
Return type: ImputeVAE
Raises: AttributeError, ValueError, RuntimeError – On invalid inputs or failed training.
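A few of the pre-training checks listed above can be sketched with NumPy as follows. The helper name check_fit_inputs and the choice of ValueError for every failure are assumptions; which exception type the library raises for each condition is not specified here.

```python
import numpy as np


def check_fit_inputs(X, splits, eval_masks):
    """Validate the encoded matrix, split indices, and evaluation masks."""
    if X.ndim != 2 or X.size == 0:
        raise ValueError("Genotype matrix is empty or not 2-D")
    if not np.isfinite(X).all():
        raise ValueError("Encoded matrix contains NaN or Inf")
    for name, idx in splits.items():
        if len(idx) == 0:
            raise ValueError(f"Empty {name} split")
    for name, mask in eval_masks.items():
        if not mask.any():
            raise ValueError(f"No sites to score in the {name} evaluation mask")


X = np.array([[0.0, 1.0], [2.0, 1.0], [1.0, 0.0]])
splits = {"train": np.array([0]), "val": np.array([1]), "test": np.array([2])}
eval_masks = {name: np.array([[True, False]]) for name in splits}
check_fit_inputs(X, splits, eval_masks)   # passes silently on valid input
```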

transform() → ndarray[source]

Impute missing genotypes and return IUPAC strings.

Adds robustness checks for:

presence of fitted model + ground_truth_
prediction shape alignment
remaining missing values after filling
decode failures

Returns:

IUPAC genotype matrix of shape (n_samples, n_loci).

Return type:

np.ndarray

Raises:

NotFittedError – If called before fit().
RuntimeError – If imputation or decoding is inconsistent.
