Deterministic (non-Machine Learning) Imputers

Overview

The deterministic imputers provide fast, interpretable baselines that mirror the fit/transform contract used across PG-SUI:

Instantiate the imputer with a GenotypeData object and a dataclass config (or a YAML path).
Call fit() with no arguments to set up evaluation (a TRAIN/TEST split with simulated masking).
Call transform() with no arguments to impute the data and write plots and metrics.

Both imputers operate on SNPio’s 0/1/2 working encoding (with -1 or -9 as missing), and produce the same evaluation artifacts as the deep models (zygosity reports, IUPAC-10 reports, confusion matrices, and distribution plots).

What’s included

ImputeMostFrequent — per-locus mode imputation. Supports global modes and population-aware modes when a popmap is available.
ImputeRefAllele — replaces all missing values with REF genotype (0).
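
To make the mode strategy concrete, here is a minimal pure-Python sketch of per-locus mode imputation on a 0/1/2 matrix, with an optional per-population variant. It is an illustration only, not PG-SUI's implementation: the function names, the list-of-lists layout, the tie-breaking (first value seen wins), and the fallback to `default` when a locus has no observed calls are all assumptions.

```python
from collections import Counter


def locus_modes(rows, default=0):
    """Return the most frequent non-missing genotype for each column."""
    n_loci = len(rows[0]) if rows else 0
    modes = []
    for j in range(n_loci):
        # Any negative value (-1, -9, ...) is treated as missing.
        observed = [row[j] for row in rows if row[j] >= 0]
        # Assumption: fall back to `default` when nothing was observed.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else default)
    return modes


def impute_mode(rows, pops=None, default=0):
    """Fill missing cells with global modes, or per-population modes if `pops` is given."""
    global_modes = locus_modes(rows, default)
    per_pop = {}
    if pops is not None:
        for pop in set(pops):
            members = [row for row, label in zip(rows, pops) if label == pop]
            per_pop[pop] = locus_modes(members, default)
    imputed = []
    for i, row in enumerate(rows):
        modes = per_pop[pops[i]] if pops is not None else global_modes
        imputed.append([g if g >= 0 else modes[j] for j, g in enumerate(row)])
    return imputed


X = [[0, 2, -1], [0, -1, 1], [-1, 2, 1], [2, 2, -9]]
pops = ["A", "A", "B", "B"]
print(impute_mode(X))        # global modes
print(impute_mode(X, pops))  # population-aware modes
```

In this toy matrix the two variants differ only for the third sample: its missing first locus is filled with 0 globally but with 2 when population B is considered on its own.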

Shared behavior & outputs

Evaluation protocol: a single TRAIN/TEST split by samples. Scoring uses TEST rows only: the simulated-missing cells when sim.simulate_missing=True, or all originally observed cells when simulation is disabled.
Metrics & figures: macro F1/PR/accuracy at zygosity level (REF/HET/ALT, with haploid folding), and IUPAC-10 classification plus confusion matrices; genotype distribution plots pre/post imputation.
I/O layout: artifacts are written under {prefix}_output/Deterministic/{plots,metrics,models,optimize,parameters}/{Model}/.
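
The I/O layout above can be expanded into concrete directories as in the following sketch. The helper name `artifact_dirs` is hypothetical, and it assumes that {Model} is the imputer's class name and that the tree is rooted in the working directory; check an actual run before relying on either.

```python
from pathlib import Path

# Subdirectory names taken from the layout pattern in this section.
SUBDIRS = ("plots", "metrics", "models", "optimize", "parameters")


def artifact_dirs(prefix, model, root="."):
    """Map each artifact kind to its expected output directory."""
    base = Path(root) / f"{prefix}_output" / "Deterministic"
    return {name: base / name / model for name in SUBDIRS}


for kind, path in artifact_dirs("mf_demo", "ImputeMostFrequent").items():
    print(f"{kind:<10} {path.as_posix()}")
```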

Quick start (Python)

from snpio import VCFReader
from pgsui.data_processing.containers import MostFrequentConfig, RefAlleleConfig
from pgsui.impute.deterministic.imputers.mode import ImputeMostFrequent
from pgsui.impute.deterministic.imputers.ref_allele import ImputeRefAllele

gd = VCFReader(
    filename="data.vcf.gz",
    popmapfile="pops.popmap",   # optional but recommended
    prefix="demo"
)

# Most-frequent (global)
mf_cfg = MostFrequentConfig.from_preset("fast")
mf_cfg.io.prefix = "mf_demo"
mf = ImputeMostFrequent(genotype_data=gd, config=mf_cfg)
mf.fit()
X_mf = mf.transform()   # IUPAC array (n_samples, n_loci)

# Most-frequent (population-aware)
mf_pop = MostFrequentConfig.from_preset("balanced")
mf_pop.io.prefix = "mf_perpop"
mf_pop.algo.by_populations = True
mf2 = ImputeMostFrequent(genotype_data=gd, config=mf_pop)
mf2.fit()
X_mf_perpop = mf2.transform()

# Reference-allele filler
ra_cfg = RefAlleleConfig.from_preset("fast")
ra_cfg.io.prefix = "ref_demo"
ra = ImputeRefAllele(genotype_data=gd, config=ra_cfg)
ra.fit()
X_ref = ra.transform()

YAML configuration

Both imputers accept a YAML file that overlays the dataclass defaults (or a CLI-selected preset), and both support dot-path overrides in the CLI (and in Python via helpers). A preset cannot be selected from within the YAML file: choose it with --preset on the CLI or with *.from_preset(...) in Python. Minimal examples:

MostFrequent (``mostfrequent.yaml``)

io:
  prefix: "mf_yaml"
split:
  test_size: 0.2
algo:
  by_populations: true
  missing: -1
  default: 0
plot:
  fmt: "pdf"
  dpi: 300
  show: false

RefAllele (``refallele.yaml``)

io:
  prefix: "ref_yaml"
split:
  test_size: 0.25
algo:
  missing: -1
plot:
  fmt: "png"
  dpi: 150
  show: false
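
Each nested YAML key corresponds to one dot-path of the kind accepted by --set and by the overrides argument. The sketch below shows that correspondence for the RefAllele example, using a plain dict in place of a parsed YAML file; `flatten` is a hypothetical helper written for this illustration, not a PG-SUI function.

```python
def flatten(nested, parent=""):
    """Flatten a nested mapping into {'section.key': value} dot-paths."""
    flat = {}
    for key, value in nested.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Same content as refallele.yaml, written as a Python dict.
refallele_cfg = {
    "io": {"prefix": "ref_yaml"},
    "split": {"test_size": 0.25},
    "algo": {"missing": -1},
    "plot": {"fmt": "png", "dpi": 150, "show": False},
}

for path, value in flatten(refallele_cfg).items():
    print(f"{path} = {value!r}")
```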

CLI usage (concept)

These deterministic models follow the same CLI precedence model as the rest of PG-SUI (code defaults < preset < YAML < explicit flags < --set k=v). Example:

# Most-frequent, population-aware, YAML + a final override
pg-sui \
  --input data.vcf.gz \
  --popmap pops.popmap \
  --models ImputeMostFrequent \
  --preset balanced \
  --config mostfrequent.yaml \
  --set io.prefix=mf_cli \
  --set algo.by_populations=true

# REF-allele baseline
pg-sui \
  --input data.vcf.gz \
  --models ImputeRefAllele \
  --preset fast \
  --set io.prefix=ref_cli

Even deterministic runs honour --sim-strategy/--sim-prop and --disable-simulate-missing; you can also set sim.simulate_missing=False via YAML or --set for cross-family comparisons.
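
The precedence chain (code defaults < preset < YAML < explicit flags < --set k=v) amounts to merging layers in order so that later layers win. A minimal sketch over flat dot-path dicts follows; the `resolve` helper and all values in the example are invented for illustration and do not reflect PG-SUI's real defaults or presets.

```python
def resolve(defaults, preset=None, yaml_cfg=None, flags=None, set_overrides=None):
    """Merge config layers from lowest to highest precedence."""
    merged = dict(defaults)
    for layer in (preset, yaml_cfg, flags, set_overrides):
        if layer:
            merged.update(layer)  # later layers overwrite earlier ones
    return merged


resolved = resolve(
    defaults={"io.prefix": "pgsui", "split.test_size": 0.2, "algo.by_populations": False},
    preset={"split.test_size": 0.25},
    yaml_cfg={"io.prefix": "mf_yaml", "algo.by_populations": True},
    flags={},
    set_overrides={"io.prefix": "mf_cli"},
)
print(resolved)
```

Here io.prefix ends up as "mf_cli" because --set is applied last, while split.test_size keeps the preset value because no higher layer touches it.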

Configuration dataclasses

class pgsui.data_processing.containers.MostFrequentConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: MostFrequentAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]

Bases: object

Top-level configuration for ImputeMostFrequent.

Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeMostFrequent.

io

I/O configuration.

Type:: IOConfig

plot

Plotting configuration.

Type:: PlotConfig

split

Data splitting configuration.

Type:: DeterministicSplitConfig

algo

Algorithmic configuration.

Type:: MostFrequentAlgoConfig

sim

Simulation configuration.

Type:: SimConfig

tune

Hyperparameter tuning configuration.

Type:: TuneConfig

train

Training configuration.

Type:: TrainConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → MostFrequentConfig[source]

Construct a preset configuration.

Parameters:: preset (Literal["fast", "balanced", "thorough"]) – Preset name.
Returns:: Configuration instance corresponding to the preset.
Return type:: MostFrequentConfig

apply_overrides(overrides: Dict[str, Any] | None) → MostFrequentConfig[source]: Apply dot-key overrides.
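
As a rough picture of what dot-key overrides do to a nested dataclass config, consider the sketch below. It uses small stand-in dataclasses (`AlgoSketch`, `SplitSketch`, `ConfigSketch`) rather than the real PG-SUI classes, and the defaults shown are placeholders. Whether the real apply_overrides mutates in place or returns a copy is not stated here; this sketch returns a copy.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class AlgoSketch:  # hypothetical stand-in for the algo section
    by_populations: bool = False
    missing: int = -1
    default: int = 0


@dataclass
class SplitSketch:  # hypothetical stand-in for the split section
    test_size: float = 0.2


@dataclass
class ConfigSketch:  # hypothetical stand-in for the top-level config
    algo: AlgoSketch = field(default_factory=AlgoSketch)
    split: SplitSketch = field(default_factory=SplitSketch)


def apply_dot_overrides(cfg, overrides):
    """Return a copy of `cfg` with {'section.key': value} overrides applied."""
    cfg = copy.deepcopy(cfg)
    for dotted, value in (overrides or {}).items():
        *parents, leaf = dotted.split(".")
        target = cfg
        for name in parents:
            target = getattr(target, name)
        if not hasattr(target, leaf):
            raise KeyError(f"unknown config key: {dotted}")
        setattr(target, leaf, value)
    return cfg


base = ConfigSketch()
tuned = apply_dot_overrides(base, {"algo.by_populations": True, "split.test_size": 0.3})
print(tuned)
```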

class pgsui.data_processing.containers.RefAlleleConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: RefAlleleAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]

Bases: object

Top-level configuration for ImputeRefAllele.

Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeRefAllele.

io

I/O configuration.

Type:: IOConfig

plot

Plotting configuration.

Type:: PlotConfig

split

Data splitting configuration.

Type:: DeterministicSplitConfig

algo

Algorithmic configuration.

Type:: RefAlleleAlgoConfig

sim

Simulation configuration.

Type:: SimConfig

tune

Hyperparameter tuning configuration.

Type:: TuneConfig

train

Training configuration.

Type:: TrainConfig

classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') → RefAlleleConfig[source]

Presets mainly keep parity with logging/IO and split test_size.

Parameters:: preset (Literal["fast", "balanced", "thorough"]) – Preset name.
Returns:: Configuration instance corresponding to the preset.
Return type:: RefAlleleConfig

apply_overrides(overrides: Dict[str, Any] | None) → RefAlleleConfig[source]: Apply dot-key overrides.

API reference

ImputeMostFrequent

pgsui.impute.deterministic.imputers.mode.ensure_mostfrequent_config(config: MostFrequentConfig | dict | str | None) → MostFrequentConfig[source]

Return a concrete MostFrequentConfig (dataclass, dict, YAML path, or None).

Parameters:: config (Union[MostFrequentConfig, dict, str, None]) – The configuration to ensure is a MostFrequentConfig.
Returns:: The ensured MostFrequentConfig.
Return type:: MostFrequentConfig

class pgsui.impute.deterministic.imputers.mode.ImputeMostFrequent(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: MostFrequentConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None)[source]

Bases: object

Most-frequent (mode) deterministic imputer for 0/1/2 genotypes.

Computes the per-locus mode (globally or per population) from the training set and uses it to fill missing values. The evaluation protocol mirrors the DL imputers: train/test split with evaluation on either all observed test cells or a simulated-missing subset (depending on config), plus classification reports and plots. It handles both diploid and haploid data. Input genotypes are expected in 0/1/2 encoding with missing values represented by any negative integer. Output is returned as IUPAC strings via decode_012.

__init__(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: MostFrequentConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None) → None[source]

Initialize the Most-Frequent (mode) imputer from a unified config.

This constructor ensures that the provided configuration is valid and initializes the imputer’s internal state. It sets up logging, random number generation, genotype encoding, and various parameters based on the configuration. The imputer is prepared to handle population-specific modes if specified in the configuration.

Parameters:

genotype_data (GenotypeData) – Backing genotype data.
tree_parser (TreeParser | None) – Optional SNPio phylogenetic tree parser for nonrandom sim_strategy modes.
config (MostFrequentConfig | dict | str | None) – Configuration as a dataclass, nested dict, or YAML path. If None, defaults are used.
overrides (Optional[dict]) – Flat dot-key overrides applied last with highest precedence, e.g. {'algo.by_populations': True, 'split.test_size': 0.3}.
simulate_missing (bool) – Whether to simulate missing data if enabled in config. Defaults to True.
sim_strategy (Literal["random", "random_weighted", "random_weighted_inv", "nonrandom", "nonrandom_weighted"]) – Strategy for simulating missing data if enabled in config.
sim_prop (float) – Proportion of data to simulate as missing if enabled in config. Default is 0.2.
sim_kwargs (Optional[dict]) – Additional keyword arguments for the simulated missing data transformer.

Notes

This mirrors other config-driven models (AE/VAE).
Evaluation split behavior uses cfg.split; plotting uses cfg.plot.
I/O/logging seeds and verbosity use cfg.io.

fit() → ImputeMostFrequent[source]

Learn per-locus modes on TRAIN rows; mask simulated cells on TEST rows.

This method computes the most frequent genotype (mode) for each locus based on the training set and prepares the evaluation masks for the test set. It supports both global modes and population-specific modes if population data is provided. The method sets up the internal state required for imputation and evaluation.

Returns:: The fitted imputer instance.
Return type:: ImputeMostFrequent

transform() → ndarray[source]

Impute missing cells in the FULL dataset; evaluate on masked test cells.

This method first imputes the evaluation-masked training DataFrame to compute metrics, then imputes the full dataset (only true missings) for final output. It produces the same evaluation reports and plots as the DL models, including both 0/1/2 zygosity and 10-class IUPAC reports.

Returns:: Imputed genotypes as IUPAC strings, shape (n_samples, n_variants).
Return type:: np.ndarray
Raises:: NotFittedError – If fit() has not been called prior to transform().
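
To illustrate the zygosity-level scoring, here is a minimal sketch of macro-averaged F1 over the three classes (0=REF, 1=HET, 2=ALT), computed only on cells flagged by an evaluation mask. The function names and the fixed label set are assumptions; PG-SUI's reporting code may differ, for example in how haploid data are folded or how classes without any samples are handled (this sketch scores them as 0).

```python
def masked_pairs(truth, imputed, eval_mask):
    """Collect (true, predicted) genotypes for cells where the mask is True."""
    y_true, y_pred = [], []
    for i, mask_row in enumerate(eval_mask):
        for j, flagged in enumerate(mask_row):
            if flagged:
                y_true.append(truth[i][j])
                y_pred.append(imputed[i][j])
    return y_true, y_pred


def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


truth = [[0, 0], [1, 2]]
imputed = [[0, 0], [0, 2]]
mask = [[True, True], [True, True]]
print(macro_f1(*masked_pairs(truth, imputed, mask)))
```

A mode or REF baseline typically scores well on the REF class and poorly on HET, which is why a macro average is more informative here than plain accuracy.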

decode_012(X: ndarray | DataFrame | list[list[int]], is_nuc: bool = False) → ndarray[source]

Decode 012-encodings to IUPAC chars with metadata repair.

Supports:

is_nuc=True: direct 0..9 -> IUPAC mapping.
is_nuc=False: REF/ALT-based decoding with metadata repair.

Additional behavior:

Multiallelic ALT is allowed. The ALT used for decoding is chosen as the most common alternate base (A/C/G/T) observed in the source SNP column.
If REF/ALT are missing or ambiguous, they are inferred from observed base counts in the source SNP column (if available).

Returns:: IUPAC strings as a 2D array of shape (n_samples, n_snps).
Return type:: np.ndarray
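
The REF/ALT-based path (is_nuc=False) can be pictured with the following simplified sketch, which maps 0/1/2 calls to IUPAC characters given one REF and one ALT base per locus. It deliberately omits the metadata repair and multiallelic handling described above, and the names `decode_cell` and `decode_matrix` are hypothetical. The heterozygote codes are the standard IUPAC two-base ambiguity codes.

```python
# Standard IUPAC ambiguity codes for two-base heterozygotes.
IUPAC_HET = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def decode_cell(genotype, ref, alt):
    """Decode one 0/1/2 call; negative values become 'N' (missing)."""
    if genotype < 0:
        return "N"
    if genotype == 0:
        return ref
    if genotype == 2:
        return alt
    if genotype == 1:
        return IUPAC_HET[frozenset((ref, alt))]
    raise ValueError(f"unexpected genotype code: {genotype}")


def decode_matrix(X, refs, alts):
    """Decode a samples-by-loci matrix using per-locus REF and ALT bases."""
    return [[decode_cell(g, refs[j], alts[j]) for j, g in enumerate(row)] for row in X]


decoded = decode_matrix([[0, 1, 2], [-1, 1, 0]], refs=["A", "C", "G"], alts=["G", "T", "T"])
print(decoded)
```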

ImputeRefAllele

pgsui.impute.deterministic.imputers.ref_allele.ensure_refallele_config(config: RefAlleleConfig | dict | str | None) → RefAlleleConfig[source]

Return a concrete RefAlleleConfig (dataclass, dict, YAML path, or None).

This function normalizes the input configuration for the RefAllele imputer. It accepts a RefAlleleConfig instance, a dictionary of parameters, a path to a YAML file, or None. If None is provided, it returns a default RefAlleleConfig instance. If a dictionary is provided, it flattens any nested structures and applies the parameters to a base configuration, honoring any top-level ‘preset’ key. If a string path is provided, it loads the configuration from the specified YAML file.

Parameters:: config (Union[RefAlleleConfig, dict, str, None]) – Configuration input which can be a RefAlleleConfig instance, a dictionary of parameters, a path to a YAML file, or None.
Returns:: A concrete RefAlleleConfig instance.
Return type:: RefAlleleConfig
Raises:: TypeError – If the input type is not supported.

class pgsui.impute.deterministic.imputers.ref_allele.ImputeRefAllele(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: RefAlleleConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None)[source]

Bases: object

Deterministic imputer that fills missing genotypes with REF (0).

Operates on 0/1/2 encodings with missing values represented by any negative integer. Evaluation splits samples into TRAIN/TEST once, then evaluates on either all observed test cells or a simulated-missing subset (depending on config). Produces 0/1/2 (zygosity) and 10-class IUPAC reports plus confusion matrices, and plots genotype distributions before/after imputation. Output is returned as IUPAC strings via decode_012.

__init__(genotype_data: GenotypeData, *, tree_parser: TreeParser | None = None, config: RefAlleleConfig | dict | str | None = None, overrides: dict | None = None, simulate_missing: bool = True, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None) → None[source]

Initialize the Ref-Allele imputer from a unified config.

This constructor ensures that the provided configuration is valid and initializes the imputer’s internal state. It sets up logging, random number generation, genotype encoding, and simulated-missing controls.

Parameters:

genotype_data (GenotypeData) – Backing genotype data.
tree_parser (Optional[TreeParser]) – Optional SNPio tree parser for nonrandom simulated-missing modes.
config (RefAlleleConfig | dict | str | None) – Configuration as a dataclass, nested dict, or YAML path. If None, defaults are used.
overrides (Optional[dict]) – Flat dot-key overrides applied last with highest precedence, e.g. {'split.test_size': 0.25, 'algo.missing': -1}.
simulate_missing (bool) – Whether to simulate missing data during evaluation. Default is True.
sim_strategy (Literal["random", "random_weighted", "random_weighted_inv", "nonrandom", "nonrandom_weighted"]) – Strategy for simulating missing data if enabled in config.
sim_prop (float) – Proportion of data to simulate as missing if enabled in config. Default is 0.2.
sim_kwargs (Optional[dict]) – Additional keyword arguments for the simulated missing data transformer.

fit() → ImputeRefAllele[source]

Create TRAIN/TEST split and build eval mask, with optional sim-missing.

This method prepares the imputer by splitting the data into training and testing sets and constructing an evaluation mask. If cfg.sim.simulate_missing is False (default), it masks all originally observed genotype entries on TEST rows. If cfg.sim.simulate_missing is True, it uses SimMissingTransformer to select a subset of observed cells as simulated-missing, then restricts that mask to TEST rows only. Evaluation is then performed only on these simulated-missing cells, mirroring the deep learning models.
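
The two masking modes can be sketched as follows. This is a simplified stand-in, not the actual SimMissingTransformer: it supports only uniform random selection, draws the simulated cells directly on TEST rows rather than selecting globally and then restricting, and the function name, signature, and rounding of the test-set size are assumptions.

```python
import random


def build_eval_mask(X, test_size=0.2, simulate_missing=True, sim_prop=0.2, seed=0):
    """Split samples into TRAIN/TEST and flag the TEST cells used for scoring."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    rng.shuffle(indices)
    n_test = max(1, round(test_size * len(X)))
    test_rows = sorted(indices[:n_test])

    mask = [[False] * len(row) for row in X]
    for i in test_rows:
        for j, genotype in enumerate(X[i]):
            observed = genotype >= 0  # truly missing cells are never scored
            if not observed:
                continue
            # Simulation off: score every observed TEST cell.
            # Simulation on: score a random subset of observed TEST cells.
            if not simulate_missing or rng.random() < sim_prop:
                mask[i][j] = True
    return test_rows, mask


X = [[0, 1, -1], [2, -9, 0], [1, 1, 1], [0, 0, 2], [-1, 2, 2]]
test_rows, mask = build_eval_mask(X, test_size=0.4, simulate_missing=False)
print(test_rows)
print(mask)
```

In both modes the mask never covers TRAIN rows or cells that were missing in the input, so metrics are always computed against known genotypes.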

Returns:: The fitted imputer instance.
Return type:: ImputeRefAllele

transform() → ndarray[source]

Impute missing values with REF genotype (0) and evaluate on masked test cells.

This method performs the imputation by replacing all missing genotype values with the REF genotype (0). It evaluates the imputation performance on the masked test cells, producing classification reports and plots that mirror those generated by deep learning models. The final output is the fully imputed genotype matrix in IUPAC string format.

Returns:: The fully imputed genotype matrix in IUPAC string format.
Return type:: np.ndarray
Raises:: NotFittedError – If the model has not been fitted yet.
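
The fill step itself is simple enough to show in a few lines. This sketch works on the 0/1/2 matrix before any IUPAC decoding and treats every negative code as missing; `fill_ref` is a hypothetical name, not part of the PG-SUI API.

```python
def fill_ref(X):
    """Replace every missing (negative) genotype with REF (0)."""
    return [[0 if genotype < 0 else genotype for genotype in row] for row in X]


X = [[0, -1, 2], [-9, 1, -1]]
print(fill_ref(X))
```

Because observed calls are left untouched, the baseline's accuracy on masked cells simply reflects how often the true genotype is homozygous REF.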

decode_012(X: ndarray | DataFrame | list[list[int]], is_nuc: bool = False) → ndarray[source]

Decode 012-encodings to IUPAC chars with metadata repair.

Supports:

is_nuc=True: direct 0..9 -> IUPAC mapping.
is_nuc=False: REF/ALT-based decoding with metadata repair.

Additional behavior:

Multiallelic ALT is allowed. The ALT used for decoding is chosen as the most common alternate base (A/C/G/T) observed in the source SNP column.
If REF/ALT are missing or ambiguous, they are inferred from observed base counts in the source SNP column (if available).

Returns:: IUPAC strings as a 2D array of shape (n_samples, n_snps).
Return type:: np.ndarray