Simulated Missingness and Evaluation
Overview
PG-SUI evaluates imputers by masking a subset of observed genotypes and checking whether each model can recover those values. This is called “simulated missingness” and it is essential because true missing genotypes have no known ground truth. The same strategy can be applied across models so their evaluation masks are aligned.
When simulated missingness is used
Unsupervised models (VAE, Autoencoder, NLPCA, UBP): Simulated masking is required for training and evaluation. The masked entries define the evaluation set; original missing values are never scored because the models are trained to reconstruct the input from a corrupted version of itself.
Deterministic and supervised models (RF, HGB, Mode): Simulated masking is optional. If it is disabled, evaluation is performed on all originally observed entries within the test split.
Simulation process
Start from the encoded genotype matrix (0/1/2 with negative values for missing).
Build an “original missing” mask for any pre-existing missing entries.
Use
pgsui.data_processing.transformers.SimMissingTransformerto select a subset of observed cells to mask based onsim_strategyandsim_prop.Produce a masked matrix for model training/inference plus three boolean masks:
original_missing_mask_: missing in the input data.sim_missing_mask_: simulated missing on observed cells.all_missing_mask_: union of original and simulated missing.
The simulated mask is always applied to observed cells only, so model performance is measured against known truth.
Simulation strategies
random: uniform masking across eligible cells until the target proportion is reached.random_weighted: probability proportional to genotype frequency in each column (common genotypes are masked more often).random_weighted_inv: probability inversely proportional to genotype frequency in each column (rare genotypes are masked more often).nonrandom: phylogenetically clustered masking based on a genotype tree.nonrandom_weighted: likenonrandombut clades are sampled in proportion to branch length.
nonrandom and nonrandom_weighted require a tree parser; provide
--treefile, --qmatrix, and --siterates on the CLI.
Simulation Strategy Summary Table
Strategy |
Selection logic |
Biologically mimics |
Expected difficulty |
|---|---|---|---|
Random |
Uniform coin flip per cell |
Random sequencing errors; read-depth fluctuations |
Easy |
Random weighted |
Probability proportional to genotype frequency (masks common) |
Reference bias; over-representation of common alleles |
Moderate |
Random weighted inv |
Probability inversely proportional to genotype frequency (masks rare) |
Allelic dropout; minor-allele loss; ascertainment bias |
Hard |
Nonrandom |
Phylogenetically clustered masking |
Clade-specific dropout; sample batch effects |
Hard |
Nonrandom weighted |
Clustered masking weighted by branch length |
Divergence-linked dropout; lineage-specific failures |
Very hard |
Evaluation workflow
PG-SUI evaluates models only on cells with known truth:
Unsupervised models: the simulated mask defines the evaluation set. Metrics are computed by comparing predictions to the pre-mask ground truth values.
Deterministic and supervised models: a train/test split is created. If simulated masking is enabled, the simulated mask is restricted to the test rows; if disabled, all observed test cells are scored.
Outputs include zygosity (REF/HET/ALT) and 10-class IUPAC reports, confusion
matrices, PR curves, and summary metrics. These files are written under
{prefix}_output/<Family>/metrics/<Model>/ with associated plots under the
corresponding plots directory.
Configuration and CLI controls
YAML configuration (applies per model config):
sim:
simulate_missing: true
sim_strategy: random_weighted_inv
sim_prop: 0.30
sim_kwargs: {}
CLI overrides (apply to all selected models):
pg-sui \
--input data.vcf.gz \
--sim-strategy random_weighted_inv \
--sim-prop 0.30
Use --disable-simulate-missing to turn off simulated masking for
supervised/deterministic runs. Unsupervised models require simulated
masking and will error if it is disabled.
Notes and best practices
Keep
sim_propsmall enough to avoid removing too much signal; 0.1 to 0.3 is typical for evaluation.Use the same strategy and proportion across models when comparing performance.
For
nonrandomstrategies, ensure the tree inputs correspond to the dataset you are imputing; mismatched trees can bias the mask placement.