Supervised Imputers

Overview

PG-SUI’s supervised imputers frame genotype imputation as multiclass prediction (0 = REF, 1 = HET, 2 = ALT for diploids; haploids collapse to two classes). They use typed *Config dataclasses with presets (fast, balanced, thorough), optional YAML, and a consistent instantiate → fit() → transform() workflow. Under the hood each model wraps sklearn.impute.IterativeImputer with a tree-based estimator, evaluates on simulated missingness, and reports both 0/1/2 and IUPAC metrics.

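from snpio import VCFReader
from pgsui import ImputeRandomForest, RFConfig

# Load genotypes and the population map with SNPio.
gd = VCFReader("data.vcf.gz", popmapfile="pops.popmap", prefix="demo")

# Start from the "balanced" preset, then set the run prefix used for outputs.
cfg = RFConfig.from_preset("balanced")
cfg.io.prefix = "rf_demo"

model = ImputeRandomForest(genotype_data=gd, config=cfg)
model.fit()                  # train, then evaluate on simulated missingness
X_iupac = model.transform()  # (n_samples, n_loci) array of IUPAC strings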

Shared Arguments

The supervised configs expose the following sections (see the specific *Config definitions below for field-level defaults):

  • ``io`` - run prefix, logging toggles, seeds, and job counts for sklearn.

  • ``model`` - estimator hyperparameters (e.g., tree depth, number of estimators, learning rate).

  • ``train`` - validation split size used when carving out the hold-out set.

  • ``imputer`` - IterativeImputer settings such as max_iter and n_nearest_features.

  • ``sim`` - controls the pgsui.data_processing.transformers.SimGenotypeDataTransformer used to mask additional sites for evaluation (proportion, strategy, missing code).

  • ``plot`` - figure export format, DPI, font size, despine toggle, and interactive display.

  • ``tune`` - retained for API parity; presets disable Optuna for tree models but the structure accepts future tuning hooks.

Evaluation artefacts (metrics, plots, tuned parameters) land under {prefix}_output/Supervised/{plots,metrics,parameters}/ with one folder per model.
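The directory pattern above can be spelled out with a small helper. This is only a sketch: artifact_dirs is a hypothetical function, not part of PG-SUI, and using the class name as the per-model folder name is an assumption.

```python
from pathlib import PurePosixPath


def artifact_dirs(prefix: str, model: str) -> dict[str, PurePosixPath]:
    """Return the per-model output folders for one run (hypothetical helper)."""
    root = PurePosixPath(f"{prefix}_output") / "Supervised"
    # One folder per artefact kind, each with one subfolder per model.
    return {kind: root / kind / model for kind in ("plots", "metrics", "parameters")}


dirs = artifact_dirs("rf_demo", "ImputeRandomForest")
for kind, path in dirs.items():
    print(f"{kind:<10} {path}")
```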

Notes

  • Inputs are taken from SNPio’s GenotypeData at instantiation and encoded to 0/1/2 with -9/-1 treated as missing.

  • fit() splits with train.validation_split, trains an IterativeImputer that wraps the chosen estimator, simulates extra missingness via sim settings, and writes macro metrics plus plots.

  • transform() imputes the full cohort, coerces any residual negatives/NaNs to -9, and returns IUPAC strings in addition to cached plots.

  • Class imbalance can be managed via estimator class_weight plus the macro-averaged scoring suite reported for every run.
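The missing-code handling described in these notes can be illustrated with NumPy alone. The sketch below assumes a plain integer matrix; to_nan and to_missing_code are hypothetical helper names and do not reflect PG-SUI's internal functions.

```python
import numpy as np

MISSING_CODES = (-9, -1)  # sentinel values treated as missing, per the notes above


def to_nan(genotypes: np.ndarray) -> np.ndarray:
    """Copy a 0/1/2 matrix to float and replace missing codes with NaN."""
    out = genotypes.astype(float)
    out[np.isin(genotypes, MISSING_CODES)] = np.nan
    return out


def to_missing_code(imputed: np.ndarray, code: int = -9) -> np.ndarray:
    """Map residual NaNs or negative values to one missing code and cast to int."""
    out = imputed.copy()
    out[np.isnan(out) | (out < 0)] = code
    return out.astype(int)


raw = np.array([[0, 1, -9], [2, -1, 1]])
as_float = to_nan(raw)                # -9 and -1 become NaN
restored = to_missing_code(as_float)  # NaN becomes -9 again
```

Converting to NaN before fitting matters because scikit-learn imputers identify missing entries by NaN by default, not by a negative sentinel.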

Config Dataclasses (API)

Random Forest (Config)

Captures sklearn.ensemble.RandomForestClassifier knobs plus IterativeImputer and simulation settings used by pgsui.impute.supervised.imputers.random_forest.ImputeRandomForest.

class pgsui.data_processing.containers.RFConfig(io: IOConfigSupervised = <factory>, model: RFModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)

Bases: object

Configuration for ImputeRandomForest.

io

Run identity, logging, and seeds.

Type:

IOConfigSupervised

model

RandomForest hyperparameters.

Type:

RFModelConfig

train

Sample split for validation.

Type:

TrainConfigSupervised

imputer

IterativeImputer scaffolding.

Type:

ImputerConfigSupervised

sim

Simulated missingness.

Type:

SimConfigSupervised

plot

Plot styling.

Type:

PlotConfigSupervised

tune

Optuna knobs.

Type:

TuningConfigSupervised

classmethod from_preset(preset: str = 'balanced') → RFConfig

Build a config from a named preset.

Parameters:

preset (str) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

RFConfig

classmethod from_yaml(path: str) → RFConfig

Load a config from a YAML file; an optional top-level ‘preset’ key is honored.

apply_overrides(overrides: Dict[str, Any] | None) → RFConfig

Apply flat dot-key overrides.
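To make "flat dot-key overrides" concrete, here is a minimal sketch of the idea on toy dataclasses. ToyConfig, IOSection, and ImputerSection are hypothetical stand-ins with made-up defaults; the real RFConfig may validate keys differently or return a copy instead of mutating in place.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class IOSection:  # hypothetical stand-in for the io section
    prefix: str = "pgsui"


@dataclass
class ImputerSection:  # hypothetical stand-in for the imputer section
    max_iter: int = 10


@dataclass
class ToyConfig:
    io: IOSection = field(default_factory=IOSection)
    imputer: ImputerSection = field(default_factory=ImputerSection)

    def apply_overrides(self, overrides: Optional[Dict[str, Any]]) -> "ToyConfig":
        """Set nested fields from flat keys such as 'imputer.max_iter'."""
        for dotted, value in (overrides or {}).items():
            *parents, leaf = dotted.split(".")
            target = self
            for name in parents:
                target = getattr(target, name)  # walk down to the owning section
            if not hasattr(target, leaf):
                raise KeyError(f"Unknown config key: {dotted}")
            setattr(target, leaf, value)
        return self


cfg = ToyConfig().apply_overrides({"io.prefix": "hgb_thorough", "imputer.max_iter": 12})
```

The same key syntax appears on the command line as --set io.prefix=... and --set imputer.max_iter=....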

HistGradientBoosting (Config)

Defines the histogram-based gradient boosting estimator parameters, together with the IterativeImputer and simulation settings consumed by pgsui.impute.supervised.imputers.hist_gradient_boosting.ImputeHistGradientBoosting.

class pgsui.data_processing.containers.HGBConfig(io: IOConfigSupervised = <factory>, model: HGBModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)

Bases: object

Configuration for ImputeHistGradientBoosting.

io

Run identity, logging, and seeds.

Type:

IOConfigSupervised

model

HistGradientBoosting hyperparameters.

Type:

HGBModelConfig

train

Sample split for validation.

Type:

TrainConfigSupervised

imputer

IterativeImputer scaffolding.

Type:

ImputerConfigSupervised

sim

Simulated missingness.

Type:

SimConfigSupervised

plot

Plot styling.

Type:

PlotConfigSupervised

tune

Optuna knobs.

Type:

TuningConfigSupervised

classmethod from_preset(preset: str = 'balanced') → HGBConfig

Build a config from a named preset.

Parameters:

preset (str) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

HGBConfig

Supervised Imputer Models

Random Forest

Wraps sklearn.impute.IterativeImputer around a RandomForestClassifier to iteratively fill masked loci; metrics are reported on both 0/1/2 and decoded IUPAC targets using the simulated validation mask.

class pgsui.impute.supervised.imputers.random_forest.ImputeRandomForest(genotype_data: GenotypeData, *, config: RFConfig | Dict | str | None = None, overrides: Dict | None = None)

Bases: BaseImputer

Supervised RF imputer driven by RFConfig.

fit() → BaseImputer

Fit the imputer on self.genotype_data; the method takes no arguments.

Training uses the genotype data supplied at instantiation.

Steps:
  1. Encode to 0/1/2 with -9/-1 as missing.

  2. Split samples into train/test.

  3. Train IterativeImputer on train (convert missing -> NaN).

  4. Evaluate on the non-missing positions of the test set (reconstruction metrics) and generate class reports and plots via _make_class_reports().

Returns:

self.

Return type:

BaseImputer
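The steps listed for fit() can be sketched with plain scikit-learn. This is an illustration under stated assumptions, not PG-SUI's implementation: the toy matrix, the 30% evaluation mask, and all hyperparameter values are invented for the example, and initial_strategy="most_frequent" is chosen here only so that initial fills stay valid genotype classes.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy 0/1/2 genotype matrix (40 samples x 6 loci) with -9 as the missing code.
X = rng.integers(0, 3, size=(40, 6)).astype(float)
X[rng.random(X.shape) < 0.1] = -9

# Step 1: missing codes (-9/-1) become NaN for scikit-learn.
X[X < 0] = np.nan

# Step 2: split samples (rows) into train and test sets.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Step 3: fit IterativeImputer with a classifier as the per-locus estimator.
imputer = IterativeImputer(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    initial_strategy="most_frequent",
    max_iter=3,
    random_state=0,
)
imputer.fit(X_train)

# Step 4: hide some observed test entries, impute them, and score only those.
observed = ~np.isnan(X_test)
eval_mask = observed & (rng.random(X_test.shape) < 0.3)
X_masked = X_test.copy()
X_masked[eval_mask] = np.nan
X_imputed = imputer.transform(X_masked)
accuracy = float((X_imputed[eval_mask] == X_test[eval_mask]).mean())
print(f"accuracy on hidden entries: {accuracy:.2f}")
```

On random toy data the accuracy is near chance; the point is the flow of data through the four steps, not the score.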

transform() → ndarray

Impute all samples and return imputed genotypes.

This method applies the trained imputer to the entire dataset, filling in missing genotype values. Any values still missing after imputation are set to -9, and the imputed 0/1/2 genotypes are decoded to single-character IUPAC codes.

Returns:

(n_samples, n_loci) IUPAC strings (single-character codes).

Return type:

np.ndarray
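Decoding 0/1/2 calls to IUPAC characters can be sketched as follows, assuming biallelic SNPs with known REF and ALT bases per locus. decode_iupac is a hypothetical helper, and emitting "N" for anything outside 0/1/2 is an assumption about how residual missing calls are rendered.

```python
import numpy as np

# Standard IUPAC ambiguity codes for unordered pairs of distinct bases.
IUPAC_HET = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def decode_iupac(calls: np.ndarray, ref: list[str], alt: list[str]) -> np.ndarray:
    """Decode a (n_samples, n_loci) 0/1/2 matrix; any other value becomes 'N'."""
    out = np.full(calls.shape, "N", dtype="<U1")
    for j, (r, a) in enumerate(zip(ref, alt)):
        out[calls[:, j] == 0, j] = r                            # homozygous REF
        out[calls[:, j] == 1, j] = IUPAC_HET[frozenset((r, a))]  # heterozygote
        out[calls[:, j] == 2, j] = a                            # homozygous ALT
    return out


calls = np.array([[0, 1, 2], [2, -9, 1]])
decoded = decode_iupac(calls, ref=["A", "C", "G"], alt=["G", "T", "T"])
print(decoded)
```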

HistGradientBoosting

Uses sklearn.ensemble.HistGradientBoostingClassifier as the IterativeImputer estimator, enabling faster training on wide genotype matrices while sharing the evaluation and plotting stack with the Random Forest variant.

class pgsui.impute.supervised.imputers.hist_gradient_boosting.ImputeHistGradientBoosting(genotype_data: GenotypeData, *, config: HGBConfig | Dict | str | None = None, overrides: Dict | None = None)

Bases: BaseImputer

Supervised HGB imputer driven by HGBConfig.

fit() → BaseImputer

Fit the imputer on self.genotype_data; the method takes no arguments.

This method splits the samples into training and test sets, trains the IterativeImputer on the training set, and evaluates it on the originally observed genotype entries of the test set.

Steps:
  1. Encode to 0/1/2 with -9/-1 as missing.

  2. Split samples into train/test.

  3. Train IterativeImputer on train (convert missing -> NaN).

  4. Evaluate on the non-missing positions of the test set (reconstruction metrics) and generate class reports and plots via _make_class_reports().

Returns:

self.

Return type:

BaseImputer

transform() → ndarray

Impute all samples and return imputed genotypes.

This method applies the trained imputer to the entire dataset, filling in missing genotype values. Any values still missing after imputation are set to -9, and the imputed 0/1/2 genotypes are decoded to single-character IUPAC codes.

Returns:

(n_samples, n_loci) IUPAC strings (single-character codes).

Return type:

np.ndarray

Raises:

NotFittedError – If fit() has not been called prior to transform().

CLI Examples

Run both supervised models with the balanced preset and a shared prefix:

pg-sui \
   --input data.vcf.gz \
   --popmap pops.popmap \
   --models ImputeRandomForest ImputeHistGradientBoosting \
   --preset balanced \
   --sim-strategy random_weighted_inv \
   --sim-prop 0.35 \
   --set io.prefix=supervised_demo

YAML + overrides:

pg-sui \
   --input data.vcf.gz \
   --popmap pops.popmap \
   --models ImputeHistGradientBoosting \
   --preset thorough \
   --config hgb.yaml \
   --set io.prefix=hgb_thorough \
   --set imputer.max_iter=12 \
   --sim-prop 0.40

Use --disable-simulate-missing to temporarily disable simulated masking for diagnostics or ablation studies; omit it to honour the preset/YAML defaults. Strategy definitions are listed in Overview.
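As a rough picture of what a simulated-missingness proportion means, the sketch below hides a fixed fraction of the observed entries uniformly at random. It does not reproduce random_weighted_inv or any other PG-SUI strategy; mask_observed is a hypothetical helper used only for illustration.

```python
import numpy as np


def mask_observed(X: np.ndarray, prop: float, seed: int = 0) -> tuple[np.ndarray, np.ndarray]:
    """Hide a proportion of observed entries; return (masked copy, boolean mask)."""
    rng = np.random.default_rng(seed)
    observed = np.flatnonzero(~np.isnan(X))  # flat indices of known calls
    n_hide = int(round(prop * observed.size))
    hidden = rng.choice(observed, size=n_hide, replace=False)
    flat_mask = np.zeros(X.size, dtype=bool)
    flat_mask[hidden] = True
    mask = flat_mask.reshape(X.shape)
    X_masked = X.copy()
    X_masked[mask] = np.nan                  # hidden entries look missing to the imputer
    return X_masked, mask


X = np.array([[0, 1, 2, np.nan], [2, 0, np.nan, 1], [1, 1, 0, 2]], dtype=float)
X_masked, sim_mask = mask_observed(X, prop=0.4)
print(int(sim_mask.sum()), "of", int((~np.isnan(X)).sum()), "observed entries hidden")
```

Only originally observed entries are hidden, so the true genotype is known at every masked position and can be compared with the imputed value.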

Outputs

  • Plots: {prefix}_output/Supervised/plots/{Model}/

  • Metrics (CSV/JSON): {prefix}_output/Supervised/metrics/{Model}/

  • Cross-model radar summary compares macro-F1, macro-PR, accuracy, and HET-F1.
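For readers unfamiliar with the summary metrics, the following sketch computes per-class F1, macro-F1, HET-F1, and accuracy on a small made-up example. The labels and predictions are invented; the macro value matches what sklearn.metrics.f1_score returns with average="macro" when every class is present.

```python
import numpy as np


def per_class_f1(y_true: np.ndarray, y_pred: np.ndarray, labels=(0, 1, 2)) -> dict[int, float]:
    """F1 for each genotype class (0 = REF, 1 = HET, 2 = ALT)."""
    scores = {}
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores[c] = float(2 * tp / denom) if denom else 0.0
    return scores


# Imbalanced toy labels: REF dominates, and one HET call is misclassified as REF.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 2, 2])

f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)  # every class counts equally
het_f1 = f1[1]
accuracy = float(np.mean(y_true == y_pred))
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} het_f1={het_f1:.3f}")
```

Accuracy stays at 0.9 here while HET-F1 drops to about 0.667, which is why macro-averaged and HET-specific scores are informative when heterozygotes are rare.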