pgsui.impute.unsupervised.imputers package
Submodules
pgsui.impute.unsupervised.imputers.autoencoder module
- pgsui.impute.unsupervised.imputers.autoencoder.ensure_autoencoder_config(config: AutoencoderConfig | dict | str | None) AutoencoderConfig[source]
Return a concrete AutoencoderConfig from dataclass, dict, YAML path, or None.
Notes
Supports top-level preset, or io.preset inside dict/YAML.
Does not mutate user-provided dict (deep-copies before processing).
Flattens nested dicts into dot-keys and applies them as overrides.
- Parameters:
config – AutoencoderConfig instance, dict, YAML path, or None.
- Returns:
Concrete AutoencoderConfig.
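The notes above (deep copy, preset lookup, dot-key flattening) can be illustrated with a minimal sketch. This is not the library's implementation: `flatten_to_dot_keys`, the preset name `"fast"`, and the `model.latent_dim` key are hypothetical and used only for illustration.

```python
import copy


def flatten_to_dot_keys(nested, prefix=""):
    """Flatten a nested dict into {'a.b.c': value} form (hypothetical helper)."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dicts, extending the dotted prefix.
            flat.update(flatten_to_dot_keys(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Hypothetical user config; the key names are assumptions, not documented fields.
user_cfg = {"io": {"preset": "fast"}, "model": {"latent_dim": 8}}

# Work on a deep copy so the caller's dict is never mutated.
work_cfg = copy.deepcopy(user_cfg)

# Pull the preset out of io.preset before flattening the remaining overrides.
preset = work_cfg.get("io", {}).pop("preset", None)
overrides = flatten_to_dot_keys(work_cfg)
```

Here `overrides` holds only the non-preset settings as dot-keys, and `user_cfg` is left exactly as the caller passed it.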
- class pgsui.impute.unsupervised.imputers.autoencoder.ImputeAutoencoder(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: 'AutoencoderConfig' | dict | str | None = None, overrides: dict | None = None, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] | None = None, sim_prop: float | None = None, sim_kwargs: dict | None = None)[source]
Bases: BaseNNImputer
Autoencoder imputer for 0/1/2 genotypes.
Trains a feedforward autoencoder on a genotype matrix encoded as 0/1/2 with missing values represented by any negative integer. Missingness is simulated once on the full matrix, then train/val/test splits reuse those masks. It supports haploid and diploid data, focal-CE reconstruction loss (optional scheduling), and Optuna-based hyperparameter tuning. Output is returned as IUPAC strings via decode_012.
Notes
Simulates missingness once on the full 0/1/2 matrix, then splits indices on clean ground truth.
Maintains clean targets and corrupted inputs per train/val/test, plus per-split masks.
Haploid harmonization happens after the single simulation (no re-simulation).
Training/validation loss is computed only where targets are known (~orig_mask_*).
Evaluation is computed only on simulated-missing sites (sim_mask_*).
transform() fills only originally missing sites and hard-errors if decoding yields “N”.
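The mask bookkeeping described in these notes can be sketched with NumPy. This is an illustrative toy, assuming a uniform random simulation strategy and a 30% simulated proportion. The variable names mirror the documented `orig_mask_*` / `sim_mask_*` attributes, but the code is not taken from the library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 0/1/2 matrix; any negative integer marks an originally missing call.
truth = np.array([[0, 1, -1, 2],
                  [2, -1, 1, 0],
                  [1, 0, 2, -1]])

# Sites that were already missing in the input.
orig_mask = truth < 0

# Simulate missingness once, only on sites whose true value is known.
sim_mask = (rng.random(truth.shape) < 0.3) & ~orig_mask

# Corrupted inputs hide the simulated sites; clean targets stay untouched.
corrupted = truth.copy()
corrupted[sim_mask] = -1

# Loss is computed where targets are known; evaluation only on simulated sites.
loss_mask = ~orig_mask
eval_mask = sim_mask
```

Because `sim_mask` never overlaps `orig_mask`, every evaluated site has a known ground-truth genotype.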
- __init__(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: 'AutoencoderConfig' | dict | str | None = None, overrides: dict | None = None, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] | None = None, sim_prop: float | None = None, sim_kwargs: dict | None = None) None[source]
Initialize the Autoencoder imputer with a unified config interface.
- Parameters:
genotype_data (GenotypeData) – Backing genotype data object.
tree_parser (Optional[TreeParser]) – Optional SNPio tree parser for nonrandom simulated-missing modes.
config (Optional[Union[AutoencoderConfig, dict, str]]) – AutoencoderConfig, nested dict, YAML path, or None.
overrides (Optional[dict]) – Optional dot-key overrides with highest precedence.
sim_strategy (Optional[Literal["random", "random_weighted", "random_weighted_inv", "nonrandom", "nonrandom_weighted"]]) – Override sim strategy; if None, uses config default.
sim_prop (Optional[float]) – Override simulated missing proportion; if None, uses config default. Default is None.
sim_kwargs (Optional[dict]) – Override/extend simulated missing kwargs; if None, uses config default.
- fit() ImputeAutoencoder[source]
Fit the Autoencoder imputer model to the genotype data.
- This method performs the following steps:
Validates the presence of SNP data in the genotype data.
Determines ploidy and sets up the number of classes accordingly.
Cleans the ground truth genotype matrix and simulates missingness.
Splits the data into training, validation, and test sets.
Prepares one-hot encoded inputs for the model.
Initializes plotting utilities and valid-class masks.
Sets up data loaders for training and validation.
Performs hyperparameter tuning if enabled, otherwise uses fixed hyperparameters.
Builds and trains the Autoencoder model.
Evaluates the trained model on the test set.
Returns the fitted ImputeAutoencoder instance.
- Returns:
The fitted ImputeAutoencoder instance.
- Return type:
ImputeAutoencoder
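Two of the steps listed for fit(), splitting samples and one-hot encoding the inputs, can be sketched as follows. This is a rough illustration under stated assumptions: `one_hot_012` and `split_indices` are hypothetical helpers, missing calls are encoded as all-zero vectors, and the 60/20/20 split is arbitrary. The library's actual encoding and split logic may differ.

```python
import numpy as np


def one_hot_012(X, num_classes=3):
    """One-hot encode a 0/1/2 matrix; negative (missing) calls become all zeros."""
    X = np.asarray(X)
    out = np.zeros(X.shape + (num_classes,), dtype=np.float32)
    rows, cols = np.nonzero(X >= 0)
    out[rows, cols, X[rows, cols]] = 1.0
    return out


def split_indices(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle sample indices and cut them into train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx


X = np.array([[0, 1, -1], [2, -1, 1], [1, 1, 0], [0, 2, 2], [2, 0, -1]])
train_idx, val_idx, test_idx = split_indices(len(X))
X_onehot = one_hot_012(X)  # use num_classes=2 for haploid data
```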
- transform() ndarray[source]
Impute missing genotypes and return IUPAC strings.
- This method performs the following steps:
Validates that the model has been fitted.
Uses the trained model to predict missing genotypes for the entire dataset.
Fills in the missing genotypes in the original dataset with the predicted values from the model.
Decodes the imputed genotype matrix from 0/1/2 encoding to IUPAC strings.
Checks for any remaining missing values or decoding issues, raising errors if found.
Optionally generates and displays plots comparing the original and imputed genotype distributions.
Returns the imputed IUPAC genotype matrix.
- Returns:
IUPAC genotype matrix of shape (n_samples, n_loci).
- Return type:
np.ndarray
- Raises:
NotFittedError – If called before fit().
RuntimeError – If any missing values remain or decoding yields “N”.
RuntimeError – If loci contain ‘N’ after imputation due to missing REF/ALT metadata.
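The fill-then-decode behaviour of transform() can be sketched as below. This is a toy stand-in, not the library's `decode_012`: `decode_locus`, the prediction matrix, and the REF/ALT lists are all hypothetical, and only the standard IUPAC two-base ambiguity codes are assumed.

```python
import numpy as np

# Standard IUPAC ambiguity codes for heterozygous two-base calls.
IUPAC_HET = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def decode_locus(code, ref, alt):
    """Map 0/1/2 to REF / heterozygote / ALT; 'N' if REF/ALT metadata is missing."""
    if ref is None or alt is None:
        return "N"
    if code == 0:
        return ref
    if code == 2:
        return alt
    return IUPAC_HET.get(frozenset((ref, alt)), "N")


X = np.array([[0, -1, 2], [1, 2, -9]])      # negatives are originally missing
pred = np.array([[0, 1, 1], [2, 2, 0]])     # stand-in for model predictions
refs, alts = ["A", "C", "G"], ["G", "T", "C"]

# Fill only the originally missing sites; observed calls are kept as-is.
imputed = np.where(X < 0, pred, X)
if (imputed < 0).any():
    raise RuntimeError("Missing values remain after imputation.")

decoded = np.array([[decode_locus(c, refs[j], alts[j]) for j, c in enumerate(row)]
                    for row in imputed])
if (decoded == "N").any():
    raise RuntimeError("Decoding produced 'N'; check REF/ALT metadata.")
```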
pgsui.impute.unsupervised.imputers.nlpca module
- pgsui.impute.unsupervised.imputers.nlpca.ensure_nlpca_config(config: NLPCAConfig | dict | str | None) NLPCAConfig[source]
Return a concrete config for NLPCA.
NLPCA reuses NLPCAConfig (latent_dim, hidden sizes, loss controls, projection controls, tuning spec).
- Parameters:
config – NLPCAConfig instance, dict, YAML path, or None.
- Returns:
Concrete configuration.
- Return type:
NLPCAConfig
- class pgsui.impute.unsupervised.imputers.nlpca.ImputeNLPCA(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: NLPCAConfig | dict | str | None = None, overrides: dict | None = None, sim_strategy: str | None = None, sim_prop: float | None = None, sim_kwargs: dict | None = None)[source]
Bases: BaseNNImputer
Non-linear PCA (NLPCA) Imputer for Genotype Data.
This is “UBP Phase 3 only” + explicit input refinement.
- Key differences vs ImputeUBP:
No Phase 2 (decoder-only refinement).
Joint optimization only (V and W updated together).
EM-like updates of originally missing inputs during training: after each epoch, replace originally-missing values in the working matrix with the model’s current reconstructions. Simulated-missing values are NEVER filled during training (prevents leakage).
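The EM-like input refinement can be sketched as a single per-epoch update. This is a minimal illustration with a hypothetical `refine_inputs` helper and a hard-coded reconstruction matrix standing in for the model output. The library's actual update may differ in detail.

```python
import numpy as np


def refine_inputs(X_work, recon, orig_mask, sim_mask):
    """Replace originally missing entries with reconstructions (hypothetical helper).

    Simulated-missing entries are deliberately left untouched to avoid leakage.
    """
    fill = orig_mask & ~sim_mask
    out = X_work.copy()
    out[fill] = recon[fill]
    return out


X_work = np.array([[0, -1, 2], [-1, 1, -1]])
orig_mask = np.array([[False, True, False], [False, False, True]])
sim_mask = np.array([[False, False, False], [True, False, False]])
recon = np.array([[0, 1, 2], [2, 1, 0]])  # stand-in for the model's current output

# After one "epoch": originally missing sites are filled, the simulated site stays -1.
X_next = refine_inputs(X_work, recon, orig_mask, sim_mask)
```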
- model_
Trained decoder model with learnable embeddings.
- Type:
nn.Module
- is_fit_
Whether fit() has been called successfully.
- Type:
bool
- X_train_work_
Working training matrix used as targets during NLPCA.
- Type:
np.ndarray
- _X_train_work_init_
Initial copy for per-trial reset during tuning.
- Type:
np.ndarray
- __init__(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: NLPCAConfig | dict | str | None = None, overrides: dict | None = None, sim_strategy: str | None = None, sim_prop: float | None = None, sim_kwargs: dict | None = None) None[source]
Initialize the ImputeNLPCA model.
- Parameters:
genotype_data (GenotypeData) – Genotype data object.
tree_parser (TreeParser) – Tree parser for nonrandom missingness simulation.
config (NLPCAConfig | dict | str | None) – Configuration (NLPCAConfig or compatible dict/YAML path).
overrides (dict | None) – Dot-notation overrides for config.
sim_strategy (str | None) – Missingness simulation strategy.
sim_prop (float | None) – Proportion to simulate as missing.
sim_kwargs (dict | None) – Additional simulation kwargs.
- fit() ImputeNLPCA[source]
Fit NLPCA model (joint refinement + input refinement).
- transform() ndarray[source]
Impute missing values via final projection and decoding.
This method fills in all missing values (original + simulated) in the genotype matrix. It first refines the embeddings for all samples using the trained model, then predicts the missing genotypes, and finally decodes the imputed genotypes back to their original representation.
- Returns:
Imputed genotype matrix with missing values filled.
- Return type:
np.ndarray
pgsui.impute.unsupervised.imputers.ubp module
- pgsui.impute.unsupervised.imputers.ubp.ensure_ubp_config(config: UBPConfig | dict | str | None) UBPConfig[source]
Return a concrete UBPConfig.
- class pgsui.impute.unsupervised.imputers.ubp.ImputeUBP(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: UBPConfig | dict | str | None = None, overrides: dict | None = None, sim_strategy: str | None = None, sim_prop: float | None = None, sim_kwargs: dict | None = None)[source]
Bases: BaseNNImputer
Unsupervised Backpropagation (UBP) Imputer for Genotype Data.
This model performs missing value imputation by learning a low-dimensional continuous manifold (latent space) that best explains the observed genotype patterns. Unlike standard autoencoders that require an encoder network to map inputs to the latent space, UBP treats the latent embeddings \(V\) for each sample as learnable parameters that are optimized alongside the decoder weights \(W\).
Model Description
Imagine every individual in your dataset can be described by a small set of abstract coordinates (like “ancestry X”, “ancestry Y”, etc.). We don’t know these coordinates, so we guess them randomly (or using PCA). We then train a neural network (the decoder) to take these coordinates and reconstruct the person’s DNA. If the reconstruction is wrong, we adjust both the neural network and the person’s coordinates to make the prediction better. Once trained, we fill in missing DNA markers based on where the individual sits in this abstract space.
Mathematical Formulation
The objective is to minimize the reconstruction error between the observed genotypes \(X\) and the model output \(\hat{X}\).
\[\hat{X} = f(V; W)\]
The optimization minimizes the cost function \(J\):
\[J(V, W) = \mathcal{L}_{Focal}(X_{obs}, \hat{X}_{obs}) + \lambda \|W\|_1\]
- Where:
\(V \in \mathbb{R}^{N \times K}\) are the latent embeddings.
\(W\) are the network weights.
\(\mathcal{L}_{Focal}\) is the Focal Cross-Entropy loss (handling class imbalance).
\(\lambda\) is the L1 regularization coefficient.
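The cost \(J\) can be written out numerically as a small NumPy sketch. It assumes the common focal form \(-(1 - p_t)^{\gamma} \log p_t\) averaged over observed sites, with no per-class weights. The focusing parameter `gamma`, the function names, and the toy values are assumptions, not the library's code.

```python
import numpy as np


def focal_ce(probs, targets, observed, gamma=2.0):
    """Mean focal cross-entropy over observed sites.

    probs: (N, L, C) class probabilities; targets: (N, L) ints; observed: (N, L) bool.
    """
    # Unobserved targets may be negative; point them at class 0, then mask them out.
    safe_targets = np.where(observed, targets, 0)
    p_t = np.take_along_axis(probs, safe_targets[..., None], axis=-1)[..., 0]
    per_site = -((1.0 - p_t) ** gamma) * np.log(np.clip(p_t, 1e-12, 1.0))
    return per_site[observed].mean()


def cost(probs, targets, observed, weights, lam=1e-3, gamma=2.0):
    """J = focal loss on observed sites + lam * L1 norm of all weight arrays."""
    l1 = sum(np.abs(W).sum() for W in weights)
    return focal_ce(probs, targets, observed, gamma) + lam * l1


probs = np.array([[[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]])
targets = np.array([[0, -1]])             # second site is missing
observed = targets >= 0
weights = [np.array([[1.0, -2.0]])]       # toy decoder weights

J = cost(probs, targets, observed, weights, lam=0.1)
```

Setting `gamma=0` recovers plain cross-entropy, which makes the down-weighting of confidently predicted sites easy to inspect.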
Training Procedure
This implementation modifies the original Gashler et al. algorithm for genomics:
Initialization (Modified Phase 1): Instead of training random projections, we initialize \(V\) using Principal Component Analysis (PCA) on the observed data to provide a “warm start” for the manifold.
Decoder Refinement (Phase 2): We freeze \(V\) and optimize only the network weights \(W\) to map the PCA embeddings to the genotypes.
Joint Optimization (Phase 3): We unfreeze \(V\) and optimize both \(V\) and \(W\) simultaneously. This allows the embeddings to drift off the linear PCA plane into a non-linear manifold.
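The PCA warm start in the modified Phase 1 can be sketched with a plain SVD. This is an illustration only: `pca_init` is a hypothetical helper, and filling missing calls with per-locus means before PCA is an assumption about how observed data might be prepared, not a documented detail.

```python
import numpy as np


def pca_init(X, n_components):
    """Return (N, K) PCA scores to warm-start the latent embeddings V."""
    X = np.asarray(X, dtype=float)
    observed = X >= 0
    # Per-locus mean over observed calls (assumed fill strategy for missing sites).
    col_means = np.where(observed, X, 0.0).sum(axis=0) / np.maximum(observed.sum(axis=0), 1)
    X_filled = np.where(observed, X, col_means)
    centered = X_filled - X_filled.mean(axis=0)
    # Scores on the leading principal components: U * S.
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


X = np.array([[0, 1, -1, 2],
              [2, -1, 1, 0],
              [1, 0, 2, -1],
              [0, 2, 2, 1],
              [2, 1, 0, 0]])
V_init = pca_init(X, n_components=2)
```

In Phase 2 these embeddings would be held fixed while only the decoder weights are trained; Phase 3 would then treat them as trainable parameters alongside the weights.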
- model_
torch.nn.Module The trained PyTorch model (Decoder).
- is_fit_
bool Whether the model has been successfully fitted.
- model_tuned_
bool Whether the model hyperparameters were tuned via Optuna.
- model_params
dict Dictionary defining the model architecture (layers, dimensions, activations).
- best_params_
dict The optimal hyperparameters found during tuning (or loaded from config).
- tuned_params_
dict The full set of parameters selected after the tuning process.
- num_tuned_params_
int Number of hyperparameters that were subject to tuning.
- total_samples_
int Total number of samples (individuals) in the genotype data.
- num_features_
int Number of SNP features (loci columns) in the genotype data.
- num_classes_
int Number of genotype classes (2 for haploid, 3 for diploid).
- is_haploid_
bool True if the data is haploid, False if diploid.
- v_init_
torch.Tensor The initial PCA-derived embeddings used to warm-start the manifold.
- class_weights_
torch.Tensor Calculated weights for the Focal Loss to handle genotype imbalance.
- ground_truth_
np.ndarray The complete, original genotype matrix (0/1/2 encoded) used for training and evaluation.
- sim_mask_
np.ndarray Boolean mask representing simulated missingness for the full dataset.
- orig_mask_
np.ndarray Boolean mask representing original missingness (already missing in input) for the full dataset.
- train_idx_
np.ndarray Indices of samples used for training.
- val_idx_
np.ndarray Indices of samples used for validation.
- test_idx_
np.ndarray Indices of samples used for testing.
- X_train_
np.ndarray Corrupted genotype matrix (inputs) for training. Alias for X_train_corrupted_.
- y_train_
np.ndarray Clean genotype matrix (targets) for training. Alias for X_train_clean_.
- X_val_
np.ndarray Corrupted genotype matrix (inputs) for validation.
- y_val_
np.ndarray Clean genotype matrix (targets) for validation.
- X_test_
np.ndarray Corrupted genotype matrix (inputs) for testing.
- y_test_
np.ndarray Clean genotype matrix (targets) for testing.
- X_train_clean_
np.ndarray Persisted clean training genotypes.
- X_train_corrupted_
np.ndarray Persisted corrupted training genotypes.
- X_val_clean_
np.ndarray Persisted clean validation genotypes.
- X_val_corrupted_
np.ndarray Persisted corrupted validation genotypes.
- X_test_clean_
np.ndarray Persisted clean test genotypes.
- X_test_corrupted_
np.ndarray Persisted corrupted test genotypes.
- sim_mask_train_
np.ndarray Simulated missingness mask for training data.
- sim_mask_val_
np.ndarray Simulated missingness mask for validation data.
- sim_mask_test_
np.ndarray Simulated missingness mask for test data.
- orig_mask_train_
np.ndarray Original missingness mask for training data.
- orig_mask_val_
np.ndarray Original missingness mask for validation data.
- orig_mask_test_
np.ndarray Original missingness mask for test data.
- eval_mask_train_
np.ndarray Evaluation mask for training data (intersection of simulated mask and observed data).
- eval_mask_val_
np.ndarray Evaluation mask for validation data.
- eval_mask_test_
np.ndarray Evaluation mask for test data.
- train_loader_
torch.utils.data.DataLoader PyTorch DataLoader for iterating over training batches.
- val_loader_
torch.utils.data.DataLoader PyTorch DataLoader for iterating over validation batches.
- plotter_
PrettyPlotter Plotting utilities for visualizing training progress and results.
- scorers_
Scorer Scoring functions for evaluating imputation performance.
References
Gashler, M.S., Smith, M.R., Morris, R., & Martinez, T.R. (2014). Missing Value Imputation with Unsupervised Backpropagation. Computational Intelligence, 32(2), 196-215. https://doi.org/10.1111/coin.12048
- __init__(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: UBPConfig | dict | str | None = None, overrides: dict | None = None, sim_strategy: str | None = None, sim_prop: float | None = None, sim_kwargs: dict | None = None) None[source]
Initialize the ImputeUBP model.
- Parameters:
genotype_data (GenotypeData) – Genotype data object.
tree_parser (Optional[TreeParser]) – Tree parser for nonrandom missingness simulation.
config (Optional[Union[UBPConfig, dict, str]]) – Configuration for UBP model.
overrides (Optional[dict]) – Dot-notation overrides for config.
sim_strategy (Optional[str]) – Missingness simulation strategy.
sim_prop (Optional[float]) – Proportion of data to simulate as missing.
sim_kwargs (Optional[dict]) – Additional kwargs for simulation.
- fit() ImputeUBP[source]
Fit the UBP model using the 3-phase algorithm.
- This method coordinates the entire training pipeline:
Preprocessing: Validates ploidy, encodes genotypes (0/1/2), and simulates missingness for self-supervised validation.
Split: Partitions data into Train, Validation, and Test sets based on the simulated masks.
Initialization: Performs PCA on the training set to initialize the latent embeddings (Phase 1 variant).
Training: Executes Phase 2 (Decoder Refinement) and Phase 3 (Joint Refinement) loops, optionally using Optuna for hyperparameter tuning.
- Returns:
The fitted instance.
- Return type:
ImputeUBP
- Raises:
AttributeError – If genotype_data does not contain loaded SNPs.
ValueError – If ploidy is not 1 (haploid) or 2 (diploid).
RuntimeError – If training fails to converge or produces non-finite loss.
- transform() ndarray[source]
Impute missing values by projecting samples onto the learned manifold.
Unlike simple prediction, this method iteratively refines the embeddings \(V\) for all samples to minimize the reconstruction error of the observed genotypes, given the fixed weights \(W\) learned during fit(). Once the embeddings settle, the decoder generates the missing values.
- Returns:
The fully imputed genotype matrix in 0/1/2 encoding.
- Return type:
np.ndarray
- Raises:
NotFittedError – If the model has not been trained yet.
RuntimeError – If imputation results in ‘N’ (invalid) characters.
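The projection step, refining \(V\) while \(W\) stays fixed, can be sketched on a toy problem. To keep the example short it swaps in a linear decoder and a squared-error loss in place of the neural decoder and focal cross-entropy, so it illustrates only the optimization pattern, not the library's model. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, K = 6, 5, 2

W = rng.normal(size=(K, L))               # frozen "decoder" weights (linear stand-in)
X = rng.normal(size=(N, K)) @ W           # toy data lying exactly on the manifold
observed = rng.random((N, L)) > 0.2       # True where a value is observed


def masked_error(V):
    """Squared reconstruction error on observed entries only."""
    resid = (V @ W - X) * observed
    return float((resid ** 2).sum())


V = np.zeros((N, K))                      # embeddings to refine; W is never updated
lr = 1.0 / np.linalg.norm(W, 2) ** 2      # step size bounded by the spectral norm
start = masked_error(V)
for _ in range(500):
    resid = (V @ W - X) * observed
    V -= lr * (resid @ W.T)               # gradient of 0.5 * sum(resid**2) w.r.t. V
end = masked_error(V)

# Keep observed values; take decoder output only where data were missing.
X_filled = np.where(observed, X, V @ W)
```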
pgsui.impute.unsupervised.imputers.vae module
- pgsui.impute.unsupervised.imputers.vae.ensure_vae_config(config: VAEConfig | dict | str | None) VAEConfig[source]
Ensure a VAEConfig instance from various input types.
- Parameters:
config (VAEConfig | dict | str | None) – Configuration input.
- Returns:
The resulting VAEConfig instance.
- Return type:
VAEConfig
- class pgsui.impute.unsupervised.imputers.vae.ImputeVAE(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: 'VAEConfig' | dict | str | None = None, overrides: dict | None = None, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float | None = None, sim_kwargs: dict | None = None)[source]
Bases: BaseNNImputer
Variational Autoencoder (VAE) imputer for 0/1/2 genotypes.
Trains a feedforward autoencoder on a genotype matrix encoded as 0/1/2 with missing values represented by any negative integer. Missingness is simulated once on the full matrix, then train/val/test splits reuse those masks. It supports haploid and diploid data, focal-CE reconstruction loss (optional scheduling), and Optuna-based hyperparameter tuning. Output is returned as IUPAC strings via decode_012.
Notes
Simulates missingness once on the full 0/1/2 matrix, then splits indices on clean ground truth.
Maintains clean targets and corrupted inputs per train/val/test, plus per-split masks.
Haploid harmonization happens after the single simulation (no re-simulation).
Training/validation loss is computed only where targets are known (~orig_mask_*).
Evaluation is computed only on simulated-missing sites (sim_mask_*).
transform() fills only originally missing sites and hard-errors if decoding yields “N”.
- __init__(genotype_data: GenotypeData, *, tree_parser: 'TreeParser' | None = None, config: 'VAEConfig' | dict | str | None = None, overrides: dict | None = None, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float | None = None, sim_kwargs: dict | None = None) None[source]
Initialize the ImputeVAE imputer.
- Adds robustness checks for:
missing required genotype_data attributes
invalid sim_strategy / sim_prop
nonrandom strategies requiring tree_parser
misconfigured config fields used in logging/training
- Parameters:
genotype_data (GenotypeData) – Genotype data for imputation.
tree_parser (Optional[TreeParser]) – Tree parser required for nonrandom strategies.
config (Optional[Union[VAEConfig, dict, str]]) – Config dataclass, nested dict, YAML path, or None.
overrides (Optional[dict]) – Dot-key overrides applied last with highest precedence.
sim_strategy – Missingness simulation strategy (overrides config).
sim_prop (Optional[float]) – Proportion of entries to simulate as missing (overrides config).
sim_kwargs (Optional[dict]) – Extra missingness kwargs merged into config.
- Raises:
AttributeError – If genotype_data is missing required attributes.
ValueError – If nonrandom strategy without tree_parser, or invalid sim_prop.
TypeError – If config type invalid (via ensure_vae_config).
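The argument checks listed for the constructor can be sketched as a standalone validator. This is a hedged illustration: `validate_sim_args` is hypothetical, the strategy names come from the signature above, and the open interval (0, 1) for sim_prop is an assumption rather than a documented bound.

```python
VALID_STRATEGIES = {
    "random", "random_weighted", "random_weighted_inv",
    "nonrandom", "nonrandom_weighted",
}


def validate_sim_args(sim_strategy="random", sim_prop=None, tree_parser=None):
    """Raise ValueError for inconsistent simulation arguments (hypothetical helper)."""
    if sim_strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown sim_strategy: {sim_strategy!r}")
    # Assumed bound: a proportion strictly between 0 and 1.
    if sim_prop is not None and not (0.0 < float(sim_prop) < 1.0):
        raise ValueError(f"sim_prop must be in (0, 1), got {sim_prop!r}")
    # Nonrandom strategies need a tree to place missingness on clades.
    if sim_strategy.startswith("nonrandom") and tree_parser is None:
        raise ValueError(f"{sim_strategy!r} requires a tree_parser")


# Collect the messages produced by three invalid argument combinations.
errors = []
for kwargs in ({"sim_strategy": "nonrandom"}, {"sim_prop": 1.5}, {"sim_strategy": "bogus"}):
    try:
        validate_sim_args(**kwargs)
    except ValueError as exc:
        errors.append(str(exc))
```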
- fit() ImputeVAE[source]
Fit the VAE imputer model to the genotype data.
- Adds robustness checks for:
empty / degenerate genotype matrices
NaN/Inf in encoded matrices
empty train/val/test splits
empty evaluation masks (no sites to score)
device / tensor conversion issues
save-path failures
- Returns:
The fitted ImputeVAE instance.
- Return type:
ImputeVAE
- Raises:
AttributeError, ValueError, RuntimeError – On invalid inputs or failed training.
- transform() ndarray[source]
Impute missing genotypes and return IUPAC strings.
- Adds robustness checks for:
presence of fitted model + ground_truth_
prediction shape alignment
remaining missing values after filling
decode failures
- Returns:
IUPAC genotype matrix of shape (n_samples, n_loci).
- Return type:
np.ndarray
- Raises:
NotFittedError – If called before fit().
RuntimeError – If imputation or decoding is inconsistent.