Supervised Imputers
Overview
PG-SUI’s supervised imputers frame genotype imputation as multiclass prediction (0 = REF, 1 = HET, 2 = ALT for diploids; haploids collapse to two classes). They use typed *Config dataclasses with presets (fast, balanced, thorough), optional YAML, and a consistent instantiate → fit() → transform() workflow. Under the hood each model wraps sklearn.impute.IterativeImputer with a tree-based estimator, evaluates on simulated missingness, and reports both 0/1/2 and IUPAC metrics.
from snpio import VCFReader
from pgsui import ImputeRandomForest, RFConfig
gd = VCFReader("data.vcf.gz", popmapfile="pops.popmap", prefix="demo")
cfg = RFConfig.from_preset("balanced")
cfg.io.prefix = "rf_demo"
model = ImputeRandomForest(genotype_data=gd, config=cfg)
model.fit()
X_iupac = model.transform()
Config Dataclasses (API)
Random Forest (Config)
Captures sklearn.ensemble.RandomForestClassifier knobs plus IterativeImputer and simulation settings used by pgsui.impute.supervised.imputers.random_forest.ImputeRandomForest.
- class pgsui.data_processing.containers.RFConfig(io: IOConfigSupervised = <factory>, model: RFModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)[source]
Bases:
object
Configuration for ImputeRandomForest.
- io
Run identity, logging, and seeds.
- Type:
IOConfigSupervised
- model
RandomForest hyperparameters.
- Type:
RFModelConfig
- train
Sample split for validation.
- Type:
TrainConfigSupervised
- imputer
IterativeImputer scaffolding.
- Type:
ImputerConfigSupervised
- sim
Simulated missingness.
- Type:
SimConfigSupervised
- plot
Plot styling.
- Type:
PlotConfigSupervised
- tune
Optuna knobs.
- Type:
TuningConfigSupervised
- classmethod from_preset(preset: str = 'balanced') RFConfig[source]
Build a config from a named preset.
- Parameters:
preset (str) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
RFConfig
- classmethod from_yaml(path: str) RFConfig[source]
Load from YAML; honors optional top-level ‘preset’.
- apply_overrides(overrides: Dict[str, Any] | None) RFConfig[source]
Apply flat dot-key overrides.
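To illustrate what "flat dot-key overrides" means, here is a minimal sketch that applies keys such as io.prefix and imputer.max_iter to nested dataclasses. The stand-in classes, their default values, and the helper apply_dot_overrides are hypothetical and written only for this example; this is not PG-SUI's implementation of apply_overrides().

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


# Hypothetical stand-ins for the nested config sections; the defaults
# are placeholders, not PG-SUI's real defaults.
@dataclass
class IOStandIn:
    prefix: str = "pgsui"


@dataclass
class ImputerStandIn:
    max_iter: int = 10


@dataclass
class ConfigStandIn:
    io: IOStandIn = field(default_factory=IOStandIn)
    imputer: ImputerStandIn = field(default_factory=ImputerStandIn)


def apply_dot_overrides(cfg: Any, overrides: Optional[Dict[str, Any]]) -> Any:
    """Set nested attributes in place from flat 'section.field' keys."""
    for dotted_key, value in (overrides or {}).items():
        *parents, leaf = dotted_key.split(".")
        target = cfg
        for name in parents:
            # Walk down one section at a time (AttributeError if unknown).
            target = getattr(target, name)
        if not hasattr(target, leaf):
            raise KeyError(f"Unknown config key: {dotted_key}")
        setattr(target, leaf, value)
    return cfg


cfg = apply_dot_overrides(
    ConfigStandIn(), {"io.prefix": "rf_demo", "imputer.max_iter": 12}
)
print(cfg.io.prefix, cfg.imputer.max_iter)  # rf_demo 12
```

Rejecting unknown keys, as this sketch does, is one possible design choice; it catches typos in override names early.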
HistGradientBoosting (Config)
Defines the histogram-based gradient boosting estimator parameters together with IterativeImputer and simulation envelopes consumed by pgsui.impute.supervised.imputers.hist_gradient_boosting.ImputeHistGradientBoosting.
- class pgsui.data_processing.containers.HGBConfig(io: IOConfigSupervised = <factory>, model: HGBModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)[source]
Bases:
object
Configuration for ImputeHistGradientBoosting.
- io
Run identity, logging, and seeds.
- Type:
IOConfigSupervised
- model
HistGradientBoosting hyperparameters.
- Type:
HGBModelConfig
- train
Sample split for validation.
- Type:
TrainConfigSupervised
- imputer
IterativeImputer scaffolding.
- Type:
ImputerConfigSupervised
- sim
Simulated missingness.
- Type:
SimConfigSupervised
- plot
Plot styling.
- Type:
PlotConfigSupervised
- tune
Optuna knobs.
- Type:
TuningConfigSupervised
- classmethod from_preset(preset: str = 'balanced') HGBConfig[source]
Build a config from a named preset.
- Parameters:
preset (str) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
HGBConfig
Supervised Imputer Models
Random Forest
Wraps sklearn.impute.IterativeImputer around a RandomForestClassifier to iteratively fill masked loci; metrics are reported on both 0/1/2 and decoded IUPAC targets using the simulated validation mask.
- class pgsui.impute.supervised.imputers.random_forest.ImputeRandomForest(genotype_data: GenotypeData, *, config: RFConfig | Dict | str | None = None, overrides: Dict | None = None)[source]
Bases:
BaseImputer
Supervised RF imputer driven by RFConfig.
- fit() BaseImputer[source]
Fit the imputer on self.genotype_data; the method takes no arguments.
This method trains the imputer on the genotype data supplied at construction.
- Steps:
1. Encode genotypes to 0/1/2, with -9/-1 as missing.
2. Split samples into train/test.
3. Train IterativeImputer on the training set (missing values are converted to NaN).
4. Evaluate reconstruction metrics on the non-missing positions of the test set, then produce class reports and plots via _make_class_reports().
- Returns:
self.
- Return type:
BaseImputer
- transform() ndarray[source]
Impute all samples and return imputed genotypes.
This method applies the trained imputer to the entire dataset, filling in missing genotype values. It ensures that any remaining missing values after imputation are set to -9, and decodes the imputed 0/1/2 genotypes back to their original format.
- Returns:
(n_samples, n_loci) IUPAC strings (single-character codes).
- Return type:
np.ndarray
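The fit() steps above can be sketched with scikit-learn alone. This is a minimal illustration on synthetic data, assuming a uniformly random hold-out mask on the test samples; the array sizes, the to_nan helper, and the hyperparameters are choices made for this example and do not reflect PG-SUI's internals or presets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Step 1: synthetic 0/1/2 genotypes; loci are noisy copies of one shared
# genotype so they are correlated. -9 marks missing calls.
n_samples, n_loci = 80, 6
founder = rng.integers(0, 3, size=(n_samples, 1))
noise = rng.random((n_samples, n_loci)) < 0.2
X = np.where(noise, rng.integers(0, 3, size=(n_samples, n_loci)), founder)
X = X.astype(float)
X[rng.random(X.shape) < 0.1] = -9


def to_nan(a):
    """Return a float copy with negative missing codes replaced by NaN."""
    a = a.astype(float).copy()
    a[a < 0] = np.nan
    return a


# Step 2: split samples (rows) into train and test.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Step 3: train IterativeImputer with a classifier on the training rows.
imputer = IterativeImputer(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    initial_strategy="most_frequent",
    max_iter=3,
    random_state=0,
)
imputer.fit(to_nan(X_train))

# Step 4: hide some observed test calls, impute them, and score them.
observed = X_test >= 0
hide = observed & (rng.random(X_test.shape) < 0.3)
X_masked = to_nan(X_test)
X_masked[hide] = np.nan
X_pred = imputer.transform(X_masked)
acc = accuracy_score(X_test[hide].astype(int), X_pred[hide].astype(int))
print(f"held-out accuracy: {acc:.2f}")
```

scikit-learn may emit a ConvergenceWarning with so few iterations; that is harmless for this illustration.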
HistGradientBoosting
Uses sklearn.ensemble.HistGradientBoostingClassifier as the IterativeImputer estimator, enabling faster training on wide genotype matrices while sharing the evaluation and plotting stack with the Random Forest variant.
- class pgsui.impute.supervised.imputers.hist_gradient_boosting.ImputeHistGradientBoosting(genotype_data: GenotypeData, *, config: HGBConfig | Dict | str | None = None, overrides: Dict | None = None)[source]
Bases:
BaseImputer
Supervised HGB imputer driven by HGBConfig.
- fit() BaseImputer[source]
Fit the imputer on self.genotype_data; the method takes no arguments.
This method trains the imputer on the genotype data supplied at construction: it splits the samples into training and test sets, fits the IterativeImputer on the training set, and evaluates on the observed (non-missing) genotypes of the test set.
- Steps:
1. Encode genotypes to 0/1/2, with -9/-1 as missing.
2. Split samples into train/test.
3. Train IterativeImputer on the training set (missing values are converted to NaN).
4. Evaluate reconstruction metrics on the non-missing positions of the test set, then produce class reports and plots via _make_class_reports().
- Returns:
self.
- Return type:
BaseImputer
- transform() ndarray[source]
Impute all samples and return imputed genotypes.
This method applies the trained imputer to the entire dataset, filling in missing genotype values. It ensures that any remaining missing values after imputation are set to -9, and decodes the imputed 0/1/2 genotypes back to their original format.
- Returns:
(n_samples, n_loci) IUPAC strings (single-character codes).
- Return type:
np.ndarray
- Raises:
NotFittedError – If fit() has not been called prior to transform().
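The decoding step in transform() maps 0/1/2 calls back to single-character IUPAC codes. The sketch below shows one way to do this for biallelic SNPs, assuming per-locus REF and ALT bases are available and differ from each other. The helper decode_012_to_iupac is hypothetical and is not PG-SUI's decoder; mapping unresolved calls such as -9 to "N" is likewise an assumption of this example.

```python
import numpy as np

# Standard IUPAC ambiguity codes for unordered heterozygous base pairs.
IUPAC_HET = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def decode_012_to_iupac(X, ref, alt):
    """Decode 0/1/2 genotypes to IUPAC characters, one locus per column.

    Any value other than 0, 1, or 2 (for example -9) is emitted as "N".
    """
    X = np.asarray(X)
    out = np.full(X.shape, "N", dtype="<U1")
    for j, (r, a) in enumerate(zip(ref, alt)):
        col = X[:, j]
        out[col == 0, j] = r                            # homozygous REF
        out[col == 1, j] = IUPAC_HET[frozenset((r, a))]  # heterozygote
        out[col == 2, j] = a                            # homozygous ALT
    return out


X = [[0, 1, 2],
     [2, -9, 1]]
decoded = decode_012_to_iupac(X, ref=["A", "C", "G"], alt=["G", "T", "T"])
print(decoded.tolist())  # [['A', 'Y', 'T'], ['G', 'N', 'K']]
```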
CLI Examples
Run both supervised models with the balanced preset and a shared prefix:
pg-sui \
--input data.vcf.gz \
--popmap pops.popmap \
--models ImputeRandomForest ImputeHistGradientBoosting \
--preset balanced \
--sim-strategy random_weighted_inv \
--sim-prop 0.35 \
--set io.prefix=supervised_demo
YAML + overrides:
pg-sui \
--input data.vcf.gz \
--popmap pops.popmap \
--models ImputeHistGradientBoosting \
--preset thorough \
--config hgb.yaml \
--set io.prefix=hgb_thorough \
--set imputer.max_iter=12 \
--sim-prop 0.40
Use --disable-simulate-missing to temporarily disable simulated masking for diagnostics or ablation studies; omit it to honour the preset/YAML defaults. Strategy definitions are listed in Overview.
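As an illustration of what a simulated-missingness proportion such as --sim-prop 0.35 controls, the sketch below hides a fixed fraction of the observed calls uniformly at random and keeps the mask for later scoring. It is a simplified stand-in and does not implement the random_weighted_inv strategy; the function simulate_missing and its parameters are hypothetical.

```python
import numpy as np


def simulate_missing(X, prop, missing=-9, seed=0):
    """Hide a proportion of the observed calls uniformly at random.

    Returns the masked copy and a boolean mask of the hidden entries.
    Entries that were already missing (negative) are never selected.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    observed = np.flatnonzero(X >= 0)            # flat indices of real calls
    n_hide = int(round(prop * observed.size))
    hidden = rng.choice(observed, size=n_hide, replace=False)
    sim_mask = np.zeros(X.size, dtype=bool)
    sim_mask[hidden] = True
    sim_mask = sim_mask.reshape(X.shape)
    X_sim = X.copy()
    X_sim[sim_mask] = missing
    return X_sim, sim_mask


rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(20, 10))
X[0, :5] = -9                                    # some truly missing calls
X_sim, sim_mask = simulate_missing(X, prop=0.35)
print(sim_mask.sum(), "of", (X >= 0).sum(), "observed calls hidden")
```

Because the true values under sim_mask are known, an imputer's predictions at those positions can be compared against X directly.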
Outputs
Plots:
{prefix}_output/Supervised/plots/{Model}/
Metrics (CSV/JSON):
{prefix}_output/Supervised/metrics/{Model}/
Cross-model radar summary compares macro-F1, macro-PR, accuracy, and HET-F1.
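For readers who want to reproduce the summary metrics on their own masked calls, the following sketch computes accuracy, macro-F1, and HET-F1 with scikit-learn on a toy set of 0/1/2 labels. The arrays are invented for illustration. Macro-PR is left out because it would presumably require predicted class probabilities rather than hard calls, and the exact definitions PG-SUI uses are not stated here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy true and imputed genotypes at simulated-missing positions
# (0 = REF, 1 = HET, 2 = ALT).
y_true = np.array([0, 0, 1, 1, 2, 2, 1, 0])
y_pred = np.array([0, 0, 1, 0, 2, 2, 1, 1])

accuracy = accuracy_score(y_true, y_pred)
# Macro-F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, labels=[0, 1, 2], average="macro")
# HET-F1: F1 restricted to the heterozygous class (label 1).
het_f1 = f1_score(y_true, y_pred, labels=[1], average="macro")

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} het_f1={het_f1:.3f}")
# accuracy=0.750 macro_f1=0.778 het_f1=0.667
```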