Deterministic (non-Machine Learning) Imputers
Overview
The deterministic imputers provide fast, interpretable baselines that mirror the fit/transform contract used across PG-SUI:
Instantiate with a GenotypeData and a dataclass config (or YAML path).
Call fit() with no arguments to set up evaluation (TRAIN/TEST split with simulated masking).
Call transform() with no arguments to impute and write plots/metrics.
Both imputers operate on SNPio’s 0/1/2 working encoding (with -1 or -9 as missing),
and produce the same evaluation artifacts as the deep models (zygosity reports,
IUPAC-10 reports, confusion matrices, and distribution plots).
What’s included
ImputeMostFrequent — per-locus mode imputation. Supports global modes and population-aware modes when a popmap is available.
ImputeRefAllele — replaces all missing values with REF genotype (0).
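To make the idea of per-locus mode imputation concrete, here is a minimal NumPy sketch. It is not PG-SUI's implementation: the function name `impute_global_mode`, the tie-breaking rule, and the fallback value for all-missing loci are assumptions chosen for illustration.

```python
import numpy as np

def impute_global_mode(X, default=0):
    """Fill negative (missing) cells with each locus's most frequent genotype."""
    X = np.asarray(X, dtype=int)
    out = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        observed = col[col >= 0]
        if observed.size:
            # Count 0/1/2 calls; argmax breaks ties toward the smaller code.
            mode = int(np.bincount(observed, minlength=3).argmax())
        else:
            # Assumption: fall back to a default when a locus has no calls.
            mode = default
        out[col < 0, j] = mode
    return out

X = np.array([
    [0, 2, -1],
    [0, -1, -1],
    [-1, 2, -1],
    [1, 1, -1],
])
X_imputed = impute_global_mode(X)
```

Replacing the computed mode with a constant 0 in this sketch gives the REF-allele behaviour described for ImputeRefAllele.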
Quick start (Python)
from snpio import VCFReader
from pgsui.data_processing.containers import MostFrequentConfig, RefAlleleConfig
from pgsui.impute.deterministic.imputers.mode import ImputeMostFrequent
from pgsui.impute.deterministic.imputers.ref_allele import ImputeRefAllele
gd = VCFReader(
    filename="data.vcf.gz",
    popmapfile="pops.popmap",  # optional but recommended
    prefix="demo",
)
# Most-frequent (global)
mf_cfg = MostFrequentConfig.from_preset("fast")
mf_cfg.io.prefix = "mf_demo"
mf = ImputeMostFrequent(genotype_data=gd, config=mf_cfg)
mf.fit()
X_mf = mf.transform() # IUPAC array (n_samples, n_loci)
# Most-frequent (population-aware)
mf_pop = MostFrequentConfig.from_preset("balanced")
mf_pop.io.prefix = "mf_perpop"
mf_pop.algo.by_populations = True
mf2 = ImputeMostFrequent(genotype_data=gd, config=mf_pop)
mf2.fit()
X_mf_perpop = mf2.transform()
# Reference-allele filler
ra_cfg = RefAlleleConfig.from_preset("fast")
ra_cfg.io.prefix = "ref_demo"
ra = ImputeRefAllele(genotype_data=gd, config=ra_cfg)
ra.fit()
X_ref = ra.transform()
YAML configuration
Both imputers accept a YAML file that overlays dataclass defaults (or a CLI-selected
preset) and support dot-path overrides in the CLI (and in Python via helpers). The
preset key is CLI-only, so select it with --preset or *.from_preset(...).
Minimal examples:
MostFrequent (``mostfrequent.yaml``)
io:
  prefix: "mf_yaml"
split:
  test_size: 0.2
algo:
  by_populations: true
  missing: -1
  default: 0
plot:
  fmt: "pdf"
  dpi: 300
  show: false
RefAllele (``refallele.yaml``)
io:
  prefix: "ref_yaml"
split:
  test_size: 0.25
algo:
  missing: -1
plot:
  fmt: "png"
  dpi: 150
  show: false
CLI usage (concept)
These deterministic models follow the same CLI precedence model as the rest of PG-SUI
(code defaults < preset < YAML < explicit flags < --set k=v). Example:
# Most-frequent, population-aware, YAML + a final override
pg-sui \
--input data.vcf.gz \
--popmap pops.popmap \
--models ImputeMostFrequent \
--preset balanced \
--config mostfrequent.yaml \
--set io.prefix=mf_cli \
--set algo.by_populations=true
# REF-allele baseline
pg-sui \
--input data.vcf.gz \
--models ImputeRefAllele \
--preset fast \
--set io.prefix=ref_cli
Even deterministic runs honour --sim-strategy/--sim-prop and --disable-simulate-missing; you can also set sim.simulate_missing=False via YAML or --set for cross-family comparisons.
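The precedence chain above can be pictured as successive overlays on a nested mapping. The sketch below uses plain dictionaries. The helpers `deep_merge` and `set_dot_path` are hypothetical, and every value is illustrative rather than a real PG-SUI default or preset.

```python
from copy import deepcopy

def deep_merge(base, overlay):
    """Return a copy of base with overlay applied, recursing into sections."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged

def set_dot_path(cfg, dotted_key, value):
    """Set cfg['a']['b'] = value for dotted_key 'a.b' (modifies cfg in place)."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value

# Illustrative layers only, listed from lowest to highest precedence.
code_defaults = {
    "io": {"prefix": "pgsui"},
    "split": {"test_size": 0.2},
    "algo": {"by_populations": False},
}
preset = {"split": {"test_size": 0.25}}
yaml_layer = {"io": {"prefix": "mf_yaml"}, "algo": {"by_populations": True}}
explicit_flags = {"io": {"prefix": "from_flag"}}
set_overrides = {"io.prefix": "mf_cli"}

resolved = code_defaults
for layer in (preset, yaml_layer, explicit_flags):
    resolved = deep_merge(resolved, layer)
for dotted_key, value in set_overrides.items():
    set_dot_path(resolved, dotted_key, value)
```

Because each layer only names the keys it changes, a later layer never erases unrelated settings from an earlier one.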
Configuration dataclasses
- class pgsui.data_processing.containers.MostFrequentConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: MostFrequentAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]
Bases:
object
Top-level configuration for ImputeMostFrequent.
Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeMostFrequent.
- io
I/O configuration.
- Type:
IOConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- split
Data splitting configuration.
- Type:
DeterministicSplitConfig
- algo
Algorithmic configuration.
- Type:
MostFrequentAlgoConfig
- sim
Simulation configuration.
- Type:
SimConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- train
Training configuration.
- Type:
TrainConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') MostFrequentConfig[source]
Construct a preset configuration.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
MostFrequentConfig
- apply_overrides(overrides: Dict[str, Any] | None) MostFrequentConfig[source]
Apply dot-key overrides.
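As a rough illustration of what dot-key overrides mean for a nested dataclass config, the sketch below walks attribute names split on ".". The classes `DemoAlgo`, `DemoSplit`, and `DemoConfig` are hypothetical stand-ins, not PG-SUI classes, and the error handling is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class DemoAlgo:
    by_populations: bool = False
    missing: int = -1

@dataclass
class DemoSplit:
    test_size: float = 0.2

@dataclass
class DemoConfig:
    algo: DemoAlgo = field(default_factory=DemoAlgo)
    split: DemoSplit = field(default_factory=DemoSplit)

    def apply_overrides(self, overrides: Optional[Dict[str, Any]]) -> "DemoConfig":
        """Apply {'section.key': value} pairs in place and return self."""
        for dotted_key, value in (overrides or {}).items():
            *parents, leaf = dotted_key.split(".")
            node = self
            for part in parents:
                node = getattr(node, part)  # AttributeError on unknown section
            if not hasattr(node, leaf):
                # Assumption: reject unknown keys instead of creating them.
                raise KeyError(f"Unknown config key: {dotted_key}")
            setattr(node, leaf, value)
        return self

cfg = DemoConfig().apply_overrides(
    {"algo.by_populations": True, "split.test_size": 0.3}
)
```

Rejecting unknown keys is a design choice in this sketch: it turns a typo such as `algo.by_population` into an immediate error rather than a silently ignored setting.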
- class pgsui.data_processing.containers.RefAlleleConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: RefAlleleAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]
Bases:
object
Top-level configuration for ImputeRefAllele.
Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeRefAllele.
- io
I/O configuration.
- Type:
IOConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- split
Data splitting configuration.
- Type:
DeterministicSplitConfig
- algo
Algorithmic configuration.
- Type:
RefAlleleAlgoConfig
- sim
Simulation configuration.
- Type:
SimConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- train
Training configuration.
- Type:
TrainConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') RefAlleleConfig[source]
Presets mainly keep parity with logging/IO and split test_size.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
RefAlleleConfig
- apply_overrides(overrides: Dict[str, Any] | None) RefAlleleConfig[source]
Apply dot-key overrides.
API reference
ImputeMostFrequent
- pgsui.impute.deterministic.imputers.mode.ensure_mostfrequent_config(config: MostFrequentConfig | dict | str | None) MostFrequentConfig[source]
Return a concrete MostFrequentConfig (dataclass, dict, YAML path, or None).
- Parameters:
config (Union[MostFrequentConfig, dict, str, None]) – The configuration to ensure is a MostFrequentConfig.
- Returns:
The ensured MostFrequentConfig.
- Return type:
MostFrequentConfig
- class pgsui.impute.deterministic.imputers.mode.ImputeMostFrequent(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: MostFrequentConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None)[source]
Bases:
object
Most-frequent (mode) deterministic imputer for 0/1/2 genotypes.
Computes the per-locus mode (globally or per population) from the training set and uses it to fill missing values. The evaluation protocol mirrors the DL imputers: train/test split with evaluation on either all observed test cells or a simulated-missing subset (depending on config), plus classification reports and plots. It handles both diploid and haploid data. Input genotypes are expected in 0/1/2 encoding with missing values represented by any negative integer. Output is returned as IUPAC strings via decode_012.
- __init__(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: MostFrequentConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None) None[source]
Initialize the Most-Frequent (mode) imputer from a unified config.
This constructor ensures that the provided configuration is valid and initializes the imputer’s internal state. It sets up logging, random number generation, genotype encoding, and various parameters based on the configuration. The imputer is prepared to handle population-specific modes if specified in the configuration.
- Parameters:
genotype_data (GenotypeData) – Backing genotype data.
tree_parser (TreeParser | None) – Optional SNPio phylogenetic tree parser for nonrandom sim_strategy modes.
config (MostFrequentConfig | dict | str | None) – Configuration as a dataclass, nested dict, or YAML path. If None, defaults are used.
overrides (Optional[dict]) – Flat dot-key overrides applied last with highest precedence, e.g. {'algo.by_populations': True, 'split.test_size': 0.3}.
simulate_missing (bool) – Whether to simulate missing data if enabled in config. Defaults to True.
sim_strategy (Literal["random", "random_weighted", "random_weighted_inv", "nonrandom", "nonrandom_weighted"]) – Strategy for simulating missing data if enabled in config.
sim_prop (float) – Proportion of data to simulate as missing if enabled in config. Default is 0.2.
sim_kwargs (Optional[dict]) – Additional keyword arguments for the simulated missing data transformer.
Notes
This mirrors other config-driven models (AE/VAE).
Evaluation split behavior uses cfg.split; plotting uses cfg.plot.
I/O/logging seeds and verbosity use cfg.io.
- fit() ImputeMostFrequent[source]
Learn per-locus modes on TRAIN rows; mask simulated cells on TEST rows.
This method computes the most frequent genotype (mode) for each locus based on the training set and prepares the evaluation masks for the test set. It supports both global modes and population-specific modes if population data is provided. The method sets up the internal state required for imputation and evaluation.
- Returns:
The fitted imputer instance.
- Return type:
ImputeMostFrequent
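The following sketch illustrates one way per-locus modes could be learned from TRAIN rows, globally and per population. It is an assumption-laden illustration, not the library's code: `fit_modes` is a hypothetical helper, and the fallback to the global mode for loci with no calls in a population is a choice made here for the example.

```python
import numpy as np

def fit_modes(X_train, pops_train=None, default=0):
    """Return {population: per-locus modes}; key None holds the global modes."""
    def column_modes(block):
        modes = np.full(block.shape[1], default, dtype=int)
        for j in range(block.shape[1]):
            observed = block[:, j][block[:, j] >= 0]
            if observed.size:
                modes[j] = np.bincount(observed, minlength=3).argmax()
        return modes

    X_train = np.asarray(X_train, dtype=int)
    modes = {None: column_modes(X_train)}
    if pops_train is not None:
        pops_train = np.asarray(pops_train)
        for pop in np.unique(pops_train):
            block = X_train[pops_train == pop]
            pop_modes = column_modes(block)
            # Assumption: loci with no calls in this population
            # inherit the global mode.
            unseen = (block >= 0).sum(axis=0) == 0
            pop_modes[unseen] = modes[None][unseen]
            modes[str(pop)] = pop_modes
    return modes

X_train = np.array([
    [0, 2],
    [0, 2],
    [2, -1],
    [2, -1],
])
modes = fit_modes(X_train, pops_train=["A", "A", "B", "B"])
```

In this toy input, population B has no calls at the second locus, so its mode there is taken from the global column.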
- transform() ndarray[source]
Impute missing cells in the FULL dataset; evaluate on masked test cells.
This method first imputes the evaluation-masked training DataFrame to compute metrics, then imputes the full dataset (only true missings) for final output. It produces the same evaluation reports and plots as the DL models, including both 0/1/2 zygosity and 10-class IUPAC reports.
- Returns:
Imputed genotypes as IUPAC strings, shape (n_samples, n_variants).
- Return type:
np.ndarray
- Raises:
NotFittedError – If fit() has not been called prior to transform().
- decode_012(X: ndarray | DataFrame | list[list[int]], is_nuc: bool = False) ndarray[source]
Decode 012-encodings to IUPAC chars with metadata repair.
Supports:
- is_nuc=True: direct 0..9 -> IUPAC mapping
- is_nuc=False: ref/alt-based decoding with metadata repair
Additional behavior:
- Multiallelic ALT is allowed. The ALT used for decoding is chosen as the most common alternate base (A/C/G/T) observed in the source SNP column.
- If REF/ALT are missing or ambiguous, they are inferred from observed base counts in the source SNP column (if available).
- Returns:
IUPAC strings as a 2D array of shape (n_samples, n_snps).
- Return type:
np.ndarray
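For intuition about ref/alt-based decoding, the sketch below maps 0/1/2 codes to IUPAC characters given one REF and one ALT base per locus. It covers only the basic mapping: the metadata repair, multiallelic handling, and is_nuc=True path described above are not reproduced, and `decode_012_sketch` is a hypothetical name.

```python
import numpy as np

# Standard IUPAC ambiguity codes for heterozygous pairs of bases.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}

def decode_012_sketch(X, ref, alt):
    """Decode 0 -> REF, 1 -> heterozygote code, 2 -> ALT, negative -> 'N'."""
    X = np.asarray(X, dtype=int)
    out = np.full(X.shape, "N", dtype="<U1")
    for j, (r, a) in enumerate(zip(ref, alt)):
        het = IUPAC_HET.get(frozenset((r, a)), "N")
        col = X[:, j]
        out[col == 0, j] = r
        out[col == 1, j] = het
        out[col == 2, j] = a
    return out

decoded = decode_012_sketch(
    [[0, 1, 2], [2, -1, 0]],
    ref=["A", "C", "G"],
    alt=["G", "T", "T"],
)
```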
ImputeRefAllele
- pgsui.impute.deterministic.imputers.ref_allele.ensure_refallele_config(config: RefAlleleConfig | dict | str | None) RefAlleleConfig[source]
Return a concrete RefAlleleConfig (dataclass, dict, YAML path, or None).
This function normalizes the input configuration for the RefAllele imputer. It accepts a RefAlleleConfig instance, a dictionary of parameters, a path to a YAML file, or None. If None is provided, it returns a default RefAlleleConfig instance. If a dictionary is provided, it flattens any nested structures and applies the parameters to a base configuration, honoring any top-level ‘preset’ key. If a string path is provided, it loads the configuration from the specified YAML file.
- Parameters:
config (Union[RefAlleleConfig, dict, str, None]) – Configuration input which can be a RefAlleleConfig instance, a dictionary of parameters, a path to a YAML file, or None.
- Returns:
A concrete RefAlleleConfig instance.
- Return type:
RefAlleleConfig
- Raises:
TypeError – If the input type is not supported.
- class pgsui.impute.deterministic.imputers.ref_allele.ImputeRefAllele(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: RefAlleleConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None)[source]
Bases:
object
Deterministic imputer that fills missing genotypes with REF (0).
Operates on 0/1/2 encodings with missing values represented by any negative integer. Evaluation splits samples into TRAIN/TEST once, then evaluates on either all observed test cells or a simulated-missing subset (depending on config). Produces 0/1/2 (zygosity) and 10-class IUPAC reports plus confusion matrices, and plots genotype distributions before/after imputation. Output is returned as IUPAC strings via decode_012.
- __init__(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: RefAlleleConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None) None[source]
Initialize the Ref-Allele imputer from a unified config.
This constructor ensures that the provided configuration is valid and initializes the imputer’s internal state. It sets up logging, random number generation, genotype encoding, and simulated-missing controls.
- Parameters:
genotype_data (GenotypeData) – Backing genotype data.
tree_parser (Optional[TreeParser]) – Optional SNPio tree parser for nonrandom simulated-missing modes.
config (RefAlleleConfig | dict | str | None) – Configuration as a dataclass, nested dict, or YAML path. If None, defaults are used.
overrides (Optional[dict]) – Flat dot-key overrides applied last with highest precedence, e.g. {'split.test_size': 0.25, 'algo.missing': -1}.
simulate_missing (bool) – Whether to simulate missing data during evaluation. Default is True.
sim_strategy (Literal["random", "random_weighted", "random_weighted_inv", "nonrandom", "nonrandom_weighted"]) – Strategy for simulating missing data if enabled in config.
sim_prop (float) – Proportion of data to simulate as missing if enabled in config. Default is 0.2.
sim_kwargs (Optional[dict]) – Additional keyword arguments for the simulated missing data transformer.
- fit() ImputeRefAllele[source]
Create TRAIN/TEST split and build eval mask, with optional sim-missing.
This method prepares the imputer by splitting the data into training and testing sets and constructing an evaluation mask. If cfg.sim.simulate_missing is False (default), it masks all originally observed genotype entries on TEST rows. If cfg.sim.simulate_missing is True, it uses SimMissingTransformer to select a subset of observed cells as simulated-missing, then restricts that mask to TEST rows only. Evaluation is then performed only on these simulated-missing cells, mirroring the deep learning models.
- Returns:
The fitted imputer instance.
- Return type:
ImputeRefAllele
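The evaluation-mask logic described for fit() can be sketched as follows. This is a simplified illustration under stated assumptions: `build_eval_mask` is a hypothetical helper, the sample split is a plain random draw, and independent per-cell masking stands in for the "random" strategy rather than calling SimMissingTransformer.

```python
import numpy as np

def build_eval_mask(X, test_size=0.25, simulate_missing=True,
                    sim_prop=0.2, seed=0):
    """Return (is_test, eval_mask) for a 0/1/2 matrix with negatives missing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    n_samples = X.shape[0]

    # Split samples (rows) into TRAIN/TEST once.
    n_test = max(1, int(round(test_size * n_samples)))
    is_test = np.zeros(n_samples, dtype=bool)
    is_test[rng.choice(n_samples, size=n_test, replace=False)] = True

    observed = X >= 0
    if simulate_missing:
        # Stand-in for a "random" strategy: pick observed cells
        # independently with probability sim_prop.
        candidate = observed & (rng.random(X.shape) < sim_prop)
    else:
        # Otherwise every observed cell is a candidate.
        candidate = observed

    # Restrict the mask to TEST rows only.
    eval_mask = candidate & is_test[:, None]
    return is_test, eval_mask

demo_rng = np.random.default_rng(1)
X = demo_rng.integers(-1, 3, size=(8, 5))  # values in {-1, 0, 1, 2}
is_test, eval_mask = build_eval_mask(X, test_size=0.25, sim_prop=0.5)
```

Either way, the mask never includes originally missing cells, because there is no ground truth to score them against.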
- transform() ndarray[source]
Impute missing values with REF genotype (0) and evaluate on masked test cells.
This method performs the imputation by replacing all missing genotype values with the REF genotype (0). It evaluates the imputation performance on the masked test cells, producing classification reports and plots that mirror those generated by deep learning models. The final output is the fully imputed genotype matrix in IUPAC string format.
- Returns:
The fully imputed genotype matrix in IUPAC string format.
- Return type:
np.ndarray
- Raises:
NotFittedError – If the model has not been fitted yet.
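As a rough picture of the two passes involved (scoring on masked test cells, then filling only truly missing cells for the final output), consider the sketch below. It stays in 0/1/2 space and omits IUPAC decoding, reports, and plots. `ref_fill_and_score` is a hypothetical helper, and `eval_mask` is assumed to mark observed cells only.

```python
import numpy as np

def ref_fill_and_score(X_true, eval_mask):
    """REF-fill with a 3x3 confusion matrix and accuracy on eval_mask cells."""
    X_true = np.asarray(X_true, dtype=int)
    eval_mask = np.asarray(eval_mask, dtype=bool)

    # Evaluation pass: hide held-out cells, then fill all missing with REF (0).
    X_eval = X_true.copy()
    X_eval[eval_mask] = -1
    X_eval[X_eval < 0] = 0
    y_true = X_true[eval_mask]
    y_pred = X_eval[eval_mask]

    # Rows are true genotypes, columns are predicted genotypes.
    confusion = np.zeros((3, 3), dtype=int)
    np.add.at(confusion, (y_true, y_pred), 1)
    accuracy = float((y_true == y_pred).mean()) if y_true.size else float("nan")

    # Final pass: only originally missing cells are filled.
    X_final = np.where(X_true < 0, 0, X_true)
    return X_final, confusion, accuracy

X_true = np.array([
    [0, 1, -1],
    [2, 0, 0],
    [0, -1, 1],
])
eval_mask = np.array([
    [False, False, False],
    [True, True, False],
    [True, False, True],
])
X_final, confusion, accuracy = ref_fill_and_score(X_true, eval_mask)
```

Since every prediction is 0, only the first column of the confusion matrix is ever populated, which is why this imputer serves as a floor for comparing other models.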
- decode_012(X: ndarray | DataFrame | list[list[int]], is_nuc: bool = False) ndarray[source]
Decode 012-encodings to IUPAC chars with metadata repair.
Supports:
- is_nuc=True: direct 0..9 -> IUPAC mapping
- is_nuc=False: ref/alt-based decoding with metadata repair
Additional behavior:
- Multiallelic ALT is allowed. The ALT used for decoding is chosen as the most common alternate base (A/C/G/T) observed in the source SNP column.
- If REF/ALT are missing or ambiguous, they are inferred from observed base counts in the source SNP column (if available).
- Returns:
IUPAC strings as a 2D array of shape (n_samples, n_snps).
- Return type:
np.ndarray