pgsui.data_processing package

Submodules

pgsui.data_processing.transformers module

class pgsui.data_processing.transformers.SimGenotypeDataTransformer(*, prop_missing: float = 0.1, strategy: Literal['random', 'random_inv_genotype'] = 'random', missing_val: int = -1, seed: int | None = None, logger: Logger | None = None, het_boost: float = 1.0)[source]

Simulates missing genotypes at the locus level on a 2D integer matrix.

This transformer masks a proportion of known genotypes in the input matrix X, setting them to a specified missing value. The masking can be done randomly or based on inverse genotype frequencies, with an option to boost the likelihood of masking heterozygous genotypes.

Parameters:
  • prop_missing (float) – Proportion of known loci to mask (0..1).

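  • strategy (Literal["random", "random_inv_genotype"]) – Masking strategy: “random” masks uniformly at random; “random_inv_genotype” weights masking by inverse genotype frequency.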

  • missing_val (int) – Missing code value (default: -1).

  • seed (int | None) – RNG seed.

  • logger (logging.Logger | None) – Logger for messages.

  • het_boost (float) – Multiplier for heterozygotes in inv-genotype mode.

fit(X, y=None) SimGenotypeDataTransformer[source]

No-op. The transformer is stateless and learns nothing from X.

Parameters:
  • X (np.ndarray) – (n_samples, n_features), integer codes {0..9} or <0 as missing.

  • y – Ignored.

transform(X: ndarray) tuple[ndarray, dict][source]

Apply missing-data simulation on a 2D genotype matrix.

Parameters:

X (np.ndarray) – (n_samples, n_features), integer codes {0..9} or <0 as missing.

Returns:

(X_masked, masks), where masks has the keys:
  • ‘original’ – original missing entries (boolean 2D).
  • ‘simulated’ – loci masked here (boolean 2D).
  • ‘all’ – union of original and simulated (boolean 2D).

Return type:

tuple[np.ndarray, dict]
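
The relationship between the three masks can be illustrated with a small dependency-free sketch. It is not the library's implementation: plain nested lists stand in for NumPy arrays, only the uniform “random” strategy is shown, and the helper name simulate_missing and the rounding rule for the number of masked entries are assumptions made for illustration.

```python
import random


def simulate_missing(X, prop_missing=0.1, missing_val=-1, seed=None):
    """Mask a proportion of the known entries of a 2D list-of-lists matrix."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]

    # Negative codes are treated as already missing and are never re-masked.
    original = [[X[i][j] < 0 for j in range(n_cols)] for i in range(n_rows)]
    known = [(i, j) for i, j in cells if not original[i][j]]

    # Assumption: the target count is the rounded share of *known* entries.
    n_mask = round(prop_missing * len(known))
    chosen = set(rng.sample(known, n_mask))

    simulated = [[(i, j) in chosen for j in range(n_cols)] for i in range(n_rows)]
    all_mask = [[original[i][j] or simulated[i][j] for j in range(n_cols)]
                for i in range(n_rows)]
    X_masked = [[missing_val if simulated[i][j] else X[i][j] for j in range(n_cols)]
                for i in range(n_rows)]
    masks = {"original": original, "simulated": simulated, "all": all_mask}
    return X_masked, masks


X = [[0, 1, 2, -1],
     [2, 0, -1, 1],
     [1, 1, 0, 2]]
X_masked, masks = simulate_missing(X, prop_missing=0.5, seed=0)
```

In this sketch the ‘simulated’ mask never overlaps the ‘original’ mask, so scoring an imputer on ‘simulated’ positions only uses entries whose true genotype is known.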

class pgsui.data_processing.transformers.SimMissingTransformer(genotype_data, *, tree_parser: TreeParser | None = None, prop_missing=0.1, strategy='random', missing_val=-9, mask_missing=True, verbose=0, tol=None, max_tries=None, seed: int | None = None, logger: Logger | None = None)[source]

Simulate missing data on genotypes encoded as 0/1/2 integers.

This transformer is designed to work with genotype data that has been preprocessed into a suitable format. It simulates missing data according to various strategies, allowing for the testing and evaluation of imputation methods. The simulated missing data can be controlled in terms of proportion and distribution across samples and loci.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance.

  • prop_missing (float, optional) – Proportion of missing data desired in output. Must be in the interval [0, 1]. Defaults to 0.1

  • strategy (Literal["nonrandom", "nonrandom_weighted", "random_weighted", "random_weighted_inv", "random"]) – Strategy for simulating missing data. “random”: Uniformly masks genotypes at random among eligible entries until the target missing proportion is reached. “random_weighted”: Masks genotypes at random with probabilities proportional to their observed genotype frequencies in each column (more common genotypes are more likely to be masked). “random_weighted_inv”: Masks genotypes at random with probabilities inversely proportional to their observed genotype frequencies in each column (rarer genotypes are more likely to be masked). “nonrandom”: Uses the supplied genotype tree to place missing data on clades that are sampled uniformly from internal and/or tip nodes, producing phylogenetically clustered missingness. “nonrandom_weighted”: As in “nonrandom”, but clades are sampled with probabilities proportional to their branch lengths, concentrating missingness on longer branches (e.g., mimicking locus dropout tied to evolutionary divergence). Defaults to “random”.

  • missing_val (int, optional) – Value that represents missing data. Defaults to -9.

  • mask_missing (bool, optional) – True if you want to skip original missing values when simulating new missing data, False otherwise. Defaults to True.

  • verbose (int, optional) – Verbosity level. Defaults to 0.

  • tol (float) – Tolerance to reach proportion specified in self.prop_missing. Defaults to 1 / (num_snps * num_inds).

  • max_tries (int) – Maximum number of tries to reach targeted missing data proportion within specified tol. If None, num_inds will be used. Defaults to None.

original_missing_mask_

Array with boolean mask for original missing locations.

Type:

numpy.ndarray

sim_missing_mask_

Array with boolean mask for simulated missing locations, excluding the original ones.

Type:

numpy.ndarray

all_missing_mask_

Array with boolean mask for all missing locations, including both simulated and original.

Type:

numpy.ndarray

__init__(genotype_data, *, tree_parser: TreeParser | None = None, prop_missing=0.1, strategy='random', missing_val=-9, mask_missing=True, verbose=0, tol=None, max_tries=None, seed: int | None = None, logger: Logger | None = None) None[source]

Initialize the SimMissingTransformer.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance.

  • tree_parser (TreeParser | None) – TreeParser instance with a loaded tree. Required for “nonrandom” and “nonrandom_weighted” strategies.

  • prop_missing (float, optional) – Proportion of missing data desired in output. Must be in the interval [0, 1]. Defaults to 0.1

  • strategy (Literal["nonrandom", "nonrandom_weighted", "random_weighted", "random_weighted_inv", "random"]) – Strategy for simulating missing data. “random”: Uniformly masks genotypes at random among eligible entries until the target missing proportion is reached. “random_weighted”: Masks genotypes at random with probabilities proportional to their observed genotype frequencies in each column (more common genotypes are more likely to be masked). “random_weighted_inv”: Masks genotypes at random with probabilities inversely proportional to their observed genotype frequencies in each column (rarer genotypes are more likely to be masked). “nonrandom”: Uses the supplied genotype tree to place missing data on clades that are sampled uniformly from internal and/or tip nodes, producing phylogenetically clustered missingness. “nonrandom_weighted”: As in “nonrandom”, but clades are sampled with probabilities proportional to their branch lengths, concentrating missingness on longer branches (e.g., mimicking locus dropout tied to evolutionary divergence). Defaults to “random”.

  • missing_val (int, optional) – Value that represents missing data. Defaults to -9.

  • mask_missing (bool, optional) – True if you want to skip original missing values when simulating new missing data, False otherwise. Defaults to True.

  • verbose (int, optional) – Verbosity level. Defaults to 0.

  • tol (float) – Tolerance to reach proportion specified in self.prop_missing. Defaults to 1 / (num_snps * num_inds).

  • max_tries (int) – Maximum number of tries to reach targeted missing data proportion within specified tol. If None, num_inds will be used. Defaults to None.

  • seed (int | None) – RNG seed.

  • logger (logging.Logger | None) – Logger for messages.

fit(X: ndarray, y=None) SimMissingTransformer[source]

Fit to input data X by simulating missing data.

Missing data will be simulated in varying ways depending on the strategy setting.

Parameters:

X (np.ndarray) – Data with which to simulate missing data. It should have already been imputed with one of the non-machine learning simple imputers. X may contain original missing values; simulation is applied to eligible entries depending on mask_missing.

Raises:
  • TypeError – SimGenotypeDataTreeTransformer.tree must not be NoneType when using strategy=“nonrandom” or “nonrandom_weighted”.

  • ValueError – Invalid strategy parameter provided.

transform(X: ndarray) ndarray[source]

Generate masked sites in a SimGenotypeData object.

Parameters:

X (np.ndarray) – Data to transform. No missing data should be present in X. It should have already been imputed with one of the non-machine learning simple imputers.

Returns:

Transformed data with missing data added.

Return type:

np.ndarray

sqrt_transform(proportions: ndarray) ndarray[source]

Apply the square root transformation to an array of proportions.

Parameters:

proportions (np.ndarray) – An array of proportions.

Returns:

The transformed proportions.

Return type:

np.ndarray
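
A square root compresses the spread between common and rare frequencies, which keeps rare genotypes from being swamped when the values are later used as weights. The sketch below shows only the element-wise transform on a plain list; whether the library method also normalizes or clips the result is not stated here, so none of that is modeled.

```python
import math


def sqrt_transform(proportions):
    """Element-wise square root of proportions in [0, 1]."""
    return [math.sqrt(p) for p in proportions]


freqs = [0.81, 0.16, 0.01]
transformed = sqrt_transform(freqs)  # approximately [0.9, 0.4, 0.1]

# The ratio between the largest and smallest value shrinks after the transform.
spread_before = max(freqs) / min(freqs)
spread_after = max(transformed) / min(transformed)
```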

random_weighted_missing_data(X: ndarray, transform_fn: Literal['sqrt', 'exp'] = 'sqrt', power: float = 0.5, inv: bool = False, rng: Generator | None = None, target_rate: float | None = None, *, mask_missing: bool = True) ndarray[source]

Simulate missing data proportional or inversely proportional to genotype frequencies.

This method simulates missing data in a genotype matrix based on genotype frequencies. It allows for different transformation functions to be applied to the base probabilities, and can optionally use inverse genotype frequencies.

Parameters:
  • X (np.ndarray) – Input genotype matrix.

  • transform_fn (Literal["sqrt", "exp"]) – Transformation function to apply to base probabilities.

  • power (float) – Exponent to raise transformed probabilities.

  • inv (bool) – If True, use inverse genotype frequencies. If False, use direct frequencies to weight missingness.

  • rng (Optional[np.random.Generator]) – Optional NumPy Generator for reproducibility.

  • target_rate (float | None) – If provided, scales the probabilities to achieve this target missing rate.

Returns:

Simulated missing mask.

Return type:

np.ndarray
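
As a rough illustration of frequency-weighted masking, the sketch below turns the genotype frequencies of one column into per-entry masking probabilities and then draws a mask. It is a hypothetical reading of the parameters, not the library's code: the helper name masking_probs, the order of operations (square root first, then power), the normalization by the largest weight, and the way target_rate rescales the probabilities are all assumptions.

```python
import math
import random
from collections import Counter


def masking_probs(column, inv=False, power=0.5, target_rate=None):
    """Per-entry masking probabilities for one locus (column)."""
    known = [g for g in column if g >= 0]  # negative codes = already missing
    if not known:
        return [0.0] * len(column)

    weights = {}
    for genotype, count in Counter(known).items():
        freq = count / len(known)
        base = 1.0 / freq if inv else freq      # inverse or direct frequency
        weights[genotype] = math.sqrt(base) ** power

    top = max(weights.values())                 # scale weights into (0, 1]
    probs = [weights[g] / top if g >= 0 else 0.0 for g in column]

    if target_rate is not None:
        mean_p = sum(probs) / len(known)        # mean over eligible entries only
        probs = [min(1.0, p * target_rate / mean_p) for p in probs]
    return probs


column = [0, 0, 0, 0, 0, 0, 1, 1, 2, -9]
probs = masking_probs(column, inv=True, target_rate=0.2)

rng = random.Random(42)
mask = [rng.random() < p for p in probs]        # Bernoulli draw per entry
```

With inv=True the rarest genotype (code 2 in this column) receives the highest probability; with inv=False the ordering is reversed. Already-missing entries get probability zero, which corresponds to the mask_missing=True behaviour described for the transformer.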

write_mask(filename_prefix: str)[source]

Write mask to file.

Parameters:

filename_prefix (str) – Prefix for the filenames to write to.

read_mask(filename_prefix: str) Tuple[ndarray, ndarray, ndarray][source]

Read mask from file.

Parameters:

filename_prefix (str) – Prefix for the filenames to read from.

Returns:

The read masks. (mask, original_missing_mask, all_missing_mask).

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]

property missing_count: int

Count of masked genotypes in SimGenotypeData.mask

Returns:

Integer count of masked genotypes.

Return type:

int

property prop_missing_real: float

Proportion of genotypes masked in SimGenotypeData.mask

Returns:

Total number of masked genotypes divided by SNP matrix size.

Return type:

float
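
The two properties amount to a count and a ratio over a boolean mask. A minimal sketch on a nested list follows; the function names mirror the properties, but which internal mask the library uses for them is an assumption.

```python
def missing_count(mask):
    """Number of True cells in a 2D boolean mask."""
    return sum(1 for row in mask for cell in row if cell)


def prop_missing_real(mask):
    """Masked cells divided by the total matrix size."""
    n_cells = sum(len(row) for row in mask)
    return missing_count(mask) / n_cells if n_cells else 0.0


mask = [[True, False, False, False],
        [False, False, True, False]]
count = missing_count(mask)          # 2 masked cells
realized = prop_missing_real(mask)   # 2 / 8
```

Comparing the realized proportion with the requested prop_missing is a quick way to check that a simulation landed within the tolerance.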

pgsui.data_processing.containers module

class pgsui.data_processing.containers.ModelConfig(latent_init: Literal['random', 'pca'] = 'random', latent_dim: int = 2, dropout_rate: float = 0.2, num_hidden_layers: int = 2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', layer_scaling_factor: float = 5.0, layer_schedule: Literal['pyramid', 'linear'] = 'pyramid')[source]

Model architecture configuration.

latent_init

Method for initializing the latent space.

Type:

Literal[“random”, “pca”]

latent_dim

Dimensionality of the latent space.

Type:

int

dropout_rate

Dropout rate for regularization.

Type:

float

num_hidden_layers

Number of hidden layers in the neural network.

Type:

int

activation

Activation function.

Type:

Literal[“relu”, “elu”, “selu”, “leaky_relu”]

layer_scaling_factor

Scaling factor for the number of neurons in hidden layers.

Type:

float

layer_schedule

Schedule for scaling hidden layer sizes.

Type:

Literal[“pyramid”, “linear”]

latent_init: Literal['random', 'pca'] = 'random'
latent_dim: int = 2
dropout_rate: float = 0.2
num_hidden_layers: int = 2
activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu'
layer_scaling_factor: float = 5.0
layer_schedule: Literal['pyramid', 'linear'] = 'pyramid'
class pgsui.data_processing.containers.TrainConfig(batch_size: int = 64, learning_rate: float = 0.001, l1_penalty: float = 0.0, early_stop_gen: int = 25, min_epochs: int = 100, max_epochs: int = 2000, validation_split: float = 0.2, device: Literal['gpu', 'cpu', 'mps'] = 'cpu', weights_max_ratio: float | None = None, weights_power: float = 1.0, weights_normalize: bool = True, weights_inverse: bool = False, gamma: float = 0.0, gamma_schedule: bool = False)[source]

Training procedure configuration.

batch_size

Number of samples per training batch.

Type:

int

learning_rate

Learning rate for the optimizer.

Type:

float

l1_penalty

L1 regularization penalty.

Type:

float

early_stop_gen

Number of generations with no improvement to wait before early stopping.

Type:

int

min_epochs

Minimum number of epochs to train.

Type:

int

max_epochs

Maximum number of epochs to train.

Type:

int

validation_split

Proportion of data to use for validation.

Type:

float

weights_max_ratio

Maximum ratio for class weights to prevent extreme values.

Type:

float | None

gamma

Focusing parameter for focal loss.

Type:

float

device

Device to use for computation.

Type:

Literal[“gpu”, “cpu”, “mps”]

batch_size: int = 64
learning_rate: float = 0.001
l1_penalty: float = 0.0
early_stop_gen: int = 25
min_epochs: int = 100
max_epochs: int = 2000
validation_split: float = 0.2
device: Literal['gpu', 'cpu', 'mps'] = 'cpu'
weights_max_ratio: float | None = None
weights_power: float = 1.0
weights_normalize: bool = True
weights_inverse: bool = False
gamma: float = 0.0
gamma_schedule: bool = False
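
The gamma field is described as the focusing parameter for focal loss. In the standard formulation the per-sample loss is the cross-entropy term scaled by (1 - p)^gamma, where p is the predicted probability of the true class, so gamma = 0 recovers ordinary cross-entropy. The sketch below illustrates that formula only; how the library combines it with the weights_* options or gamma_schedule is not covered, and the weight argument is a hypothetical stand-in for a class weight.

```python
import math


def focal_loss(p_true, gamma=0.0, weight=1.0):
    """Focal loss for one sample, given the predicted probability of its true class."""
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


easy, hard = 0.9, 0.3  # well-classified vs. poorly classified sample

# How much more the hard sample contributes than the easy one.
ratio_ce = focal_loss(hard, gamma=0.0) / focal_loss(easy, gamma=0.0)
ratio_focal = focal_loss(hard, gamma=2.0) / focal_loss(easy, gamma=2.0)
```

Raising gamma shrinks the loss of confidently correct predictions much faster than that of hard ones, which is why it can help with imbalanced genotype classes.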
class pgsui.data_processing.containers.TuneConfig(enabled: bool = False, metrics: Literal['f1', 'accuracy', 'pr_macro', 'average_precision', 'roc_auc', 'precision', 'recall', 'mcc', 'jaccard'] | list[str] | tuple[str, ...] = 'f1', n_trials: int = 100, resume: bool = False, save_db: bool = False, epochs: int = 500, batch_size: int = 64, patience: int = 10)[source]

Hyperparameter tuning configuration.

enabled

If True, enables hyperparameter tuning.

Type:

bool

metrics

Metric(s) to optimize during tuning. Multi-objective tuning is supported by providing a list or tuple of metric names.

Type:

Literal[“f1”, “accuracy”, “pr_macro”, “average_precision”, “roc_auc”, “precision”, “recall”, “mcc”, “jaccard”] | list[str] | tuple[str, …]

n_trials

Number of hyperparameter trials to run.

Type:

int

resume

If True, resumes tuning from a previous state.

Type:

bool

save_db

If True, saves the tuning results to a database.

Type:

bool

epochs

Number of epochs to train each trial.

Type:

int

batch_size

Batch size for training during tuning.

Type:

int

patience

Number of evaluations with no improvement before stopping early.

Type:

int

enabled: bool = False
metrics: Literal['f1', 'accuracy', 'pr_macro', 'average_precision', 'roc_auc', 'precision', 'recall', 'mcc', 'jaccard'] | list[str] | tuple[str, ...] = 'f1'
n_trials: int = 100
resume: bool = False
save_db: bool = False
epochs: int = 500
batch_size: int = 64
patience: int = 10
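
The patience setting follows the usual early-stopping pattern: stop once the monitored score has failed to improve for a given number of consecutive evaluations. The sketch below is a generic version of that pattern, assuming a higher score is better; the function name and the strict-improvement rule are illustrative choices, not taken from the library.

```python
def run_with_patience(val_scores, patience=10):
    """Return (best_score, evaluations_run) for a sequence of validation scores."""
    best = float("-inf")
    stale = 0  # consecutive evaluations without improvement
    for step, score in enumerate(val_scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return best, step
    return best, len(val_scores)


scores = [0.50, 0.55, 0.60, 0.59, 0.58, 0.60, 0.57, 0.61]
best, steps_run = run_with_patience(scores, patience=3)
```

Here the run stops at the sixth evaluation, so the later score of 0.61 is never reached; a larger patience trades longer trials for a lower chance of stopping too soon.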
class pgsui.data_processing.containers.PlotConfig(fmt: Literal['pdf', 'png', 'jpg', 'jpeg', 'svg'] = 'pdf', dpi: int = 300, fontsize: int = 18, despine: bool = True, show: bool = True, multiqc: bool = True)[source]

Plotting configuration.

fmt

Output file format.

Type:

Literal[“pdf”, “png”, “jpg”, “jpeg”, “svg”]

dpi

Dots per inch for the output figure.

Type:

int

fontsize

Font size for text in the plots.

Type:

int

despine

If True, removes the top and right spines from plots.

Type:

bool

show

If True, displays the plot interactively.

Type:

bool

multiqc

If True, generates MultiQC-compatible plots.

Type:

bool

fmt: Literal['pdf', 'png', 'jpg', 'jpeg', 'svg'] = 'pdf'
dpi: int = 300
fontsize: int = 18
despine: bool = True
show: bool = True
multiqc: bool = True
class pgsui.data_processing.containers.IOConfig(prefix: str = 'pgsui', ploidy: int = 2, verbose: bool = False, debug: bool = False, seed: int | None = None, n_jobs: int = 1, scoring_averaging: Literal['macro', 'weighted'] = 'macro')[source]

I/O configuration.

Dataclass that includes configuration settings for file naming, logging verbosity, random seed, and parallelism.

prefix

Prefix for output files. Default is “pgsui”.

Type:

str

ploidy

Ploidy level of the organism. Default is 2.

Type:

int

verbose

If True, enables verbose logging. Default is False.

Type:

bool

debug

If True, enables debug mode. Default is False.

Type:

bool

seed

Random seed for reproducibility. Default is None.

Type:

int | None

n_jobs

Number of parallel jobs to run. Default is 1.

Type:

int

scoring_averaging

Averaging method used when computing scoring metrics. Default is “macro”.

Type:

Literal[“macro”, “weighted”]

prefix: str = 'pgsui'
ploidy: int = 2
verbose: bool = False
debug: bool = False
seed: int | None = None
n_jobs: int = 1
scoring_averaging: Literal['macro', 'weighted'] = 'macro'
class pgsui.data_processing.containers.SimConfig(simulate_missing: bool = False, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None)[source]

Top-level configuration for data simulation and imputation.

simulate_missing

If True, simulates missing data.

Type:

bool

sim_strategy

Strategy for simulating missing data.

Type:

Literal[“random”, …]

sim_prop

Proportion of data to simulate as missing.

Type:

float

sim_kwargs

Additional keyword arguments for simulation.

Type:

dict | None

simulate_missing: bool = False
sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random'
sim_prop: float = 0.2
sim_kwargs: dict | None = None
class pgsui.data_processing.containers.AutoencoderConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>)[source]

Top-level configuration for ImputeAutoencoder.

This configuration class encapsulates all settings required for the ImputeAutoencoder model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

sim

Simulated-missing configuration.

Type:

SimConfig

io: IOConfig
model: ModelConfig
train: TrainConfig
tune: TuneConfig
plot: PlotConfig
sim: SimConfig
classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') AutoencoderConfig[source]

Build an AutoencoderConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

AutoencoderConfig

apply_overrides(overrides: Dict[str, Any] | None) AutoencoderConfig[source]

Apply flat dot-key overrides.

Parameters:

overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.

Returns:

New configuration instance with overrides applied.

Return type:

AutoencoderConfig
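
Flat dot-key overrides address a nested field by its section and attribute name, for example "model.latent_dim". The sketch below shows one way such a method can work, using small stand-in dataclasses (DemoConfig, ModelSection, TrainSection are hypothetical and carry only a few of the documented fields). It is not the library's implementation; in particular, raising KeyError for unknown keys is an assumption.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ModelSection:
    latent_dim: int = 2
    dropout_rate: float = 0.2


@dataclass
class TrainSection:
    batch_size: int = 64
    learning_rate: float = 0.001


@dataclass
class DemoConfig:
    model: ModelSection = field(default_factory=ModelSection)
    train: TrainSection = field(default_factory=TrainSection)

    def apply_overrides(self, overrides: Optional[Dict[str, Any]]) -> "DemoConfig":
        new = copy.deepcopy(self)  # leave the original untouched
        for dotted_key, value in (overrides or {}).items():
            *parents, leaf = dotted_key.split(".")
            target = new
            for name in parents:   # walk down to the owning section
                target = getattr(target, name)
            if not hasattr(target, leaf):
                raise KeyError(f"Unknown config key: {dotted_key}")
            setattr(target, leaf, value)
        return new


base = DemoConfig()
tuned = base.apply_overrides({"model.latent_dim": 8, "train.learning_rate": 0.01})
```

Returning a modified copy matches the documented return value ("new configuration instance") and keeps a preset reusable across several runs.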

to_dict() Dict[str, Any][source]
class pgsui.data_processing.containers.VAEExtraConfig(kl_beta: float = 1.0, kl_beta_schedule: bool = False)[source]
kl_beta: float = 1.0
kl_beta_schedule: bool = False
class pgsui.data_processing.containers.VAEConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, vae: VAEExtraConfig = <factory>, sim: SimConfig = <factory>)[source]

Top-level configuration for ImputeVAE (AE-parity + VAE extras).

Mirrors AutoencoderConfig sections and adds a vae block with KL-beta controls for the VAE loss.
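
For orientation, a beta-weighted VAE objective adds a KL term, scaled by beta, to the reconstruction loss. The sketch below uses the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. It is a generic illustration of what a kl_beta weight does, not the library's loss; the function names are hypothetical and kl_beta_schedule is not modeled.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def vae_loss(recon_loss, mu, logvar, kl_beta=1.0):
    """Reconstruction loss plus the beta-weighted KL term."""
    return recon_loss + kl_beta * kl_diag_gaussian(mu, logvar)


kl_at_prior = kl_diag_gaussian([0.0, 0.0], [0.0, 0.0])   # posterior equals prior
loss_half_beta = vae_loss(2.0, [1.0], [0.0], kl_beta=0.5)
```

A smaller kl_beta lets the latent code stray further from the prior in exchange for better reconstruction; a larger one regularizes more strongly.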

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

vae

VAE-specific configuration.

Type:

VAEExtraConfig

sim

Simulated-missing configuration.

Type:

SimConfig

io: IOConfig
model: ModelConfig
train: TrainConfig
tune: TuneConfig
plot: PlotConfig
vae: VAEExtraConfig
sim: SimConfig
classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') VAEConfig[source]

Build a VAEConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

VAEConfig

apply_overrides(overrides: Dict[str, Any] | None) VAEConfig[source]

Apply flat dot-key overrides.

to_dict() Dict[str, Any][source]
class pgsui.data_processing.containers.NLPCAExtraConfig(projection_lr: float = 0.05, projection_epochs: int = 100)[source]
projection_lr: float = 0.05
projection_epochs: int = 100
class pgsui.data_processing.containers.NLPCAConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, nlpca: NLPCAExtraConfig = <factory>)[source]

Top-level configuration for ImputeNLPCA.

This configuration class encapsulates all settings required for the ImputeNLPCA model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

sim

Simulated-missing configuration.

Type:

SimConfig

nlpca

NLPCA-specific configuration.

Type:

NLPCAExtraConfig

io: IOConfig
model: ModelConfig
train: TrainConfig
tune: TuneConfig
plot: PlotConfig
sim: SimConfig
nlpca: NLPCAExtraConfig
classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') NLPCAConfig[source]

Build an NLPCAConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

NLPCAConfig

apply_overrides(overrides: Dict[str, Any] | None) NLPCAConfig[source]

Apply flat dot-key overrides.

Parameters:

overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.

Returns:

New configuration instance with overrides applied.

Return type:

NLPCAConfig

to_dict() Dict[str, Any][source]
class pgsui.data_processing.containers.UBPExtraConfig(projection_lr: float = 0.05, projection_epochs: int = 100)[source]
projection_lr: float = 0.05
projection_epochs: int = 100
class pgsui.data_processing.containers.UBPConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, ubp: UBPExtraConfig = <factory>)[source]

Top-level configuration for ImputeUBP.

This configuration class encapsulates all settings required for the ImputeUBP model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.

io

I/O configuration.

Type:

IOConfig

model

Model architecture configuration.

Type:

ModelConfig

train

Training procedure configuration.

Type:

TrainConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

plot

Plotting configuration.

Type:

PlotConfig

sim

Simulated-missing configuration.

Type:

SimConfig

ubp

UBP-specific configuration.

Type:

UBPExtraConfig

io: IOConfig
model: ModelConfig
train: TrainConfig
tune: TuneConfig
plot: PlotConfig
sim: SimConfig
ubp: UBPExtraConfig
classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') UBPConfig[source]

Build a UBPConfig from a named preset.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

UBPConfig

apply_overrides(overrides: Dict[str, Any] | None) UBPConfig[source]

Apply flat dot-key overrides.

Parameters:

overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.

Returns:

New configuration instance with overrides applied.

Return type:

UBPConfig

to_dict() Dict[str, Any][source]
class pgsui.data_processing.containers.MostFrequentAlgoConfig(by_populations: bool = False, default: int = 0, missing: int = -1)[source]

Algorithmic knobs for ImputeMostFrequent.

by_populations

Whether to compute per-population modes. Default is False.

Type:

bool

default

Fallback mode if no valid entries in a locus. Default is 0.

Type:

int

missing

Code for missing genotypes in 0/1/2. Default is -1.

Type:

int

by_populations: bool = False
default: int = 0
missing: int = -1
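
The roles of default and missing can be seen in a small per-locus mode imputation sketch. It is a simplified stand-in rather than ImputeMostFrequent itself: nested lists replace arrays, the helper name is hypothetical, ties are broken by first occurrence, and the by_populations option is not modeled.

```python
from collections import Counter


def impute_most_frequent(X, default=0, missing=-1):
    """Fill missing entries in each column with that column's most common genotype."""
    n_cols = len(X[0])
    modes = []
    for j in range(n_cols):
        observed = [row[j] for row in X if row[j] != missing]
        # Fall back to `default` when a locus has no observed genotypes at all.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else default)
    return [[modes[j] if row[j] == missing else row[j] for j in range(n_cols)]
            for row in X]


X = [[0, -1, -1],
     [0, 2, -1],
     [-1, 2, -1],
     [1, 1, -1]]
imputed = impute_most_frequent(X)
```

The third column has no observed genotypes, so every entry there is filled with the default value.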
class pgsui.data_processing.containers.DeterministicSplitConfig(test_size: float = 0.2, test_indices: Sequence[int] | None = None)[source]

Evaluation split configuration shared by deterministic imputers.

test_size

Proportion of data to use as the test set. Default is 0.2.

Type:

float

test_indices

Specific indices to use as the test set. Default is None.

Type:

Optional[Sequence[int]]

test_size: float = 0.2
test_indices: Sequence[int] | None = None
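
One plausible way the two fields interact is sketched below: explicit test_indices, when given, define the test set directly, and test_size is used only to draw a random test set otherwise. That precedence rule, the helper name, and the rounding of the test-set size are assumptions for illustration and are not confirmed by the description above.

```python
import random


def split_indices(n_samples, test_size=0.2, test_indices=None, seed=None):
    """Return (train, test) index lists for n_samples samples."""
    if test_indices is not None:
        test = sorted(set(test_indices))          # explicit indices take precedence
    else:
        rng = random.Random(seed)
        n_test = max(1, round(test_size * n_samples))
        test = sorted(rng.sample(range(n_samples), n_test))
    held_out = set(test)
    train = [i for i in range(n_samples) if i not in held_out]
    return train, test


train_fixed, test_fixed = split_indices(10, test_indices=[7, 2])
train_rand, test_rand = split_indices(10, test_size=0.2, seed=1)
```

Fixing test_indices makes evaluation splits identical across imputers, which is useful when comparing deterministic baselines against trained models.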
class pgsui.data_processing.containers.MostFrequentConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: MostFrequentAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]

Top-level configuration for ImputeMostFrequent.

Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeMostFrequent.

io

I/O configuration.

Type:

IOConfig

plot

Plotting configuration.

Type:

PlotConfig

split

Data splitting configuration.

Type:

DeterministicSplitConfig

algo

Algorithmic configuration.

Type:

MostFrequentAlgoConfig

sim

Simulation configuration.

Type:

SimConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

train

Training configuration.

Type:

TrainConfig

io: IOConfig
plot: PlotConfig
split: DeterministicSplitConfig
algo: MostFrequentAlgoConfig
sim: SimConfig
tune: TuneConfig
train: TrainConfig
classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') MostFrequentConfig[source]

Construct a preset configuration.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

MostFrequentConfig

apply_overrides(overrides: Dict[str, Any] | None) MostFrequentConfig[source]

Apply dot-key overrides.

to_dict() Dict[str, Any][source]
class pgsui.data_processing.containers.RefAlleleAlgoConfig(missing: int = -1)[source]

Algorithmic knobs for ImputeRefAllele.

missing

Code for missing genotypes in 0/1/2.

Type:

int

missing: int = -1
class pgsui.data_processing.containers.RefAlleleConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: RefAlleleAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]

Top-level configuration for ImputeRefAllele.

Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeRefAllele.

io

I/O configuration.

Type:

IOConfig

plot

Plotting configuration.

Type:

PlotConfig

split

Data splitting configuration.

Type:

DeterministicSplitConfig

algo

Algorithmic configuration.

Type:

RefAlleleAlgoConfig

sim

Simulation configuration.

Type:

SimConfig

tune

Hyperparameter tuning configuration.

Type:

TuneConfig

train

Training configuration.

Type:

TrainConfig

io: IOConfig
plot: PlotConfig
split: DeterministicSplitConfig
algo: RefAlleleAlgoConfig
sim: SimConfig
tune: TuneConfig
train: TrainConfig
classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') RefAlleleConfig[source]

Presets mainly keep parity with logging/IO and split test_size.

Parameters:

preset (Literal["fast", "balanced", "thorough"]) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

RefAlleleConfig

apply_overrides(overrides: Dict[str, Any] | None) RefAlleleConfig[source]

Apply dot-key overrides.

to_dict() Dict[str, Any][source]
class pgsui.data_processing.containers.IOConfigSupervised(prefix: str = 'pgsui', seed: int | None = None, n_jobs: int = 1, verbose: bool = False, debug: bool = False)[source]

I/O, logging, and run identity.

prefix

Prefix for output files and logs.

Type:

str

seed

Random seed for reproducibility.

Type:

Optional[int]

n_jobs

Number of parallel jobs to use.

Type:

int

verbose

Whether to enable verbose logging.

Type:

bool

debug

Whether to enable debug mode.

Type:

bool

prefix: str = 'pgsui'
seed: int | None = None
n_jobs: int = 1
verbose: bool = False
debug: bool = False
class pgsui.data_processing.containers.PlotConfigSupervised(fmt: Literal['pdf', 'png', 'jpg', 'jpeg'] = 'pdf', dpi: int = 300, fontsize: int = 18, despine: bool = True, show: bool = False)[source]

Plot/figure styling.

fmt

File format.

Type:

Literal[“pdf”, “png”, “jpg”, “jpeg”]

dpi

Resolution in dots per inch.

Type:

int

fontsize

Base font size for plot text.

Type:

int

despine

Whether to remove top/right spines.

Type:

bool

show

Whether to display plots interactively.

Type:

bool

fmt: Literal['pdf', 'png', 'jpg', 'jpeg'] = 'pdf'
dpi: int = 300
fontsize: int = 18
despine: bool = True
show: bool = False
class pgsui.data_processing.containers.TrainConfigSupervised(validation_split: float = 0.2)[source]

Training/evaluation split (by samples).

validation_split

Proportion of data to use for validation.

Type:

float

validation_split: float = 0.2
class pgsui.data_processing.containers.ImputerConfigSupervised(n_nearest_features: int | None = 10, max_iter: int = 10)[source]

IterativeImputer-like scaffolding used by current supervised wrappers.

n_nearest_features

Number of nearest features to use.

Type:

Optional[int]

max_iter

Maximum number of imputation iterations to perform.

Type:

int

n_nearest_features: int | None = 10
max_iter: int = 10
class pgsui.data_processing.containers.SimConfigSupervised(prop_missing: float = 0.5, strategy: Literal['random', 'random_inv_genotype'] = 'random_inv_genotype', het_boost: float = 2.0, missing_val: int = -1)[source]

Simulation of missingness for evaluation.

prop_missing

Proportion of features to set as missing.

Type:

float

strategy

Strategy for simulating missing data.

Type:

Literal[“random”, “random_inv_genotype”]

het_boost

Multiplier applied to heterozygotes in “random_inv_genotype” mode.

Type:

float

missing_val

Internal code for missing genotypes.

Type:

int

prop_missing: float = 0.5
strategy: Literal['random', 'random_inv_genotype'] = 'random_inv_genotype'
het_boost: float = 2.0
missing_val: int = -1
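
To make the effect of het_boost concrete, the sketch below computes relative masking weights per genotype as inverse frequency, with the heterozygote weight multiplied by het_boost. It assumes a 0/1/2 encoding in which code 1 is the heterozygote; the helper name and the final normalization are also illustrative choices, not the library's code.

```python
from collections import Counter


def inv_genotype_weights(genotypes, het_boost=2.0, het_code=1):
    """Relative masking weight per genotype: inverse frequency, boosted for hets."""
    known = [g for g in genotypes if g >= 0]   # negative codes = missing
    counts = Counter(known)
    raw = {
        g: (len(known) / c) * (het_boost if g == het_code else 1.0)
        for g, c in counts.items()
    }
    total = sum(raw.values())
    return {g: w / total for g, w in raw.items()}  # weights sum to 1


column = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, -1]
plain = inv_genotype_weights(column, het_boost=1.0)
boosted = inv_genotype_weights(column, het_boost=2.0)
```

With het_boost above 1, heterozygotes take a larger share of the masking weight than inverse frequency alone would give them, so evaluation is stressed on the class that imputers often find hardest.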
class pgsui.data_processing.containers.TuningConfigSupervised(enabled: bool = True, n_trials: int = 100, metric: str = 'pr_macro', n_jobs: int = 8, fast: bool = True)[source]

Optuna tuning envelope.

enabled: bool = True
n_trials: int = 100
metric: str = 'pr_macro'
n_jobs: int = 8
fast: bool = True
class pgsui.data_processing.containers.RFModelConfig(n_estimators: int = 100, max_depth: int | None = None, min_samples_split: int = 2, min_samples_leaf: int = 1, max_features: Literal['sqrt', 'log2'] | float | None = 'sqrt', criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini', class_weight: Literal['balanced', 'balanced_subsample', None] = 'balanced')[source]

Random Forest hyperparameters.

n_estimators

Number of trees in the forest.

Type:

int

max_depth

Maximum depth of the trees.

Type:

Optional[int]

min_samples_split

Minimum number of samples required to split.

Type:

int

min_samples_leaf

Minimum number of samples required at a leaf.

Type:

int

max_features

Number or fraction of features to consider when looking for the best split.

Type:

Literal[“sqrt”, “log2”] | float | None

criterion

Split quality metric.

Type:

Literal[“gini”, “entropy”, “log_loss”]

class_weight

Class weights.

Type:

Literal[“balanced”, “balanced_subsample”, None]

n_estimators: int = 100
max_depth: int | None = None
min_samples_split: int = 2
min_samples_leaf: int = 1
max_features: Literal['sqrt', 'log2'] | float | None = 'sqrt'
criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'
class_weight: Literal['balanced', 'balanced_subsample', None] = 'balanced'
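
The field names listed above coincide with keyword arguments of scikit-learn's RandomForestClassifier. As a rough illustration, the sketch below uses a hypothetical stand-in dataclass (RFModelSketch, not part of PG-SUI) to show how such a config could be flattened into estimator keyword arguments; whether PG-SUI forwards the fields this way is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Literal, Optional, Union


@dataclass
class RFModelSketch:
    """Hypothetical stand-in mirroring the documented RFModelConfig fields."""
    n_estimators: int = 100
    max_depth: Optional[int] = None
    min_samples_split: int = 2
    min_samples_leaf: int = 1
    max_features: Union[Literal["sqrt", "log2"], float, None] = "sqrt"
    criterion: Literal["gini", "entropy", "log_loss"] = "gini"
    class_weight: Optional[Literal["balanced", "balanced_subsample"]] = "balanced"


# Override two fields and keep the documented defaults for the rest.
rf_kwargs = asdict(RFModelSketch(n_estimators=300, max_depth=12))
# With scikit-learn installed, these could be passed on as
# RandomForestClassifier(**rf_kwargs).
```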
class pgsui.data_processing.containers.HGBModelConfig(n_estimators: int = 100, learning_rate: float = 0.1, max_depth: int | None = None, min_samples_leaf: int = 1, max_features: float | None = 1.0, n_iter_no_change: int = 10, tol: float = 1e-07)[source]

Histogram-based Gradient Boosting hyperparameters.

n_estimators

Number of boosting iterations (max_iter).

Type:

int

learning_rate

Step size for each boosting iteration.

Type:

float

max_depth

Maximum depth of each tree.

Type:

Optional[int]

min_samples_leaf

Minimum number of samples required at a leaf.

Type:

int

max_features

Proportion of features to consider.

Type:

float | None

n_iter_no_change

Number of iterations without improvement before early stopping.

Type:

int

tol

Minimum improvement in the loss required to keep training (early-stopping tolerance).

Type:

float

n_estimators: int = 100
learning_rate: float = 0.1
max_depth: int | None = None
min_samples_leaf: int = 1
max_features: float | None = 1.0
n_iter_no_change: int = 10
tol: float = 1e-07
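
Because n_estimators is documented above as corresponding to max_iter, a translation step is needed before the values reach a histogram-based gradient boosting estimator. The sketch below shows one possible renaming using a hypothetical stand-in dataclass (HGBModelSketch) and helper (to_hgb_kwargs); neither name comes from PG-SUI.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class HGBModelSketch:
    """Hypothetical stand-in mirroring the documented HGBModelConfig fields."""
    n_estimators: int = 100
    learning_rate: float = 0.1
    max_depth: Optional[int] = None
    min_samples_leaf: int = 1
    max_features: Optional[float] = 1.0
    n_iter_no_change: int = 10
    tol: float = 1e-7


def to_hgb_kwargs(cfg):
    """Rename n_estimators to max_iter and return a plain dict."""
    kwargs = asdict(cfg)
    kwargs["max_iter"] = kwargs.pop("n_estimators")
    return kwargs


hgb_kwargs = to_hgb_kwargs(HGBModelSketch(n_estimators=250, learning_rate=0.05))
```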
class pgsui.data_processing.containers.RFConfig(io: IOConfigSupervised = <factory>, model: RFModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)[source]

Configuration for ImputeRandomForest.

io

Run identity, logging, and seeds.

Type:

IOConfigSupervised

model

RandomForest hyperparameters.

Type:

RFModelConfig

train

Sample split for validation.

Type:

TrainConfigSupervised

imputer

IterativeImputer scaffolding.

Type:

ImputerConfigSupervised

sim

Simulated missingness.

Type:

SimConfigSupervised

plot

Plot styling.

Type:

PlotConfigSupervised

tune

Optuna knobs.

Type:

TuningConfigSupervised

io: IOConfigSupervised
model: RFModelConfig
train: TrainConfigSupervised
imputer: ImputerConfigSupervised
sim: SimConfigSupervised
plot: PlotConfigSupervised
tune: TuningConfigSupervised
classmethod from_preset(preset: str = 'balanced') RFConfig[source]

Build a config from a named preset.

Parameters:

preset (str) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

RFConfig

classmethod from_yaml(path: str) RFConfig[source]

Load from YAML; honors optional top-level ‘preset’.

apply_overrides(overrides: Dict[str, Any] | None) RFConfig[source]

Apply flat dot-key overrides.

to_dict() Dict[str, Any][source]
to_imputer_kwargs() Dict[str, Any][source]
class pgsui.data_processing.containers.HGBConfig(io: IOConfigSupervised = <factory>, model: HGBModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)[source]

Configuration for ImputeHistGradientBoosting.

io

Run identity, logging, and seeds.

Type:

IOConfigSupervised

model

HistGradientBoosting hyperparameters.

Type:

HGBModelConfig

train

Sample split for validation.

Type:

TrainConfigSupervised

imputer

IterativeImputer scaffolding.

Type:

ImputerConfigSupervised

sim

Simulated missingness.

Type:

SimConfigSupervised

plot

Plot styling.

Type:

PlotConfigSupervised

tune

Optuna knobs.

Type:

TuningConfigSupervised

io: IOConfigSupervised
model: HGBModelConfig
train: TrainConfigSupervised
imputer: ImputerConfigSupervised
sim: SimConfigSupervised
plot: PlotConfigSupervised
tune: TuningConfigSupervised
classmethod from_preset(preset: str = 'balanced') HGBConfig[source]

Build a config from a named preset.

Parameters:

preset (str) – Preset name.

Returns:

Configuration instance corresponding to the preset.

Return type:

HGBConfig

classmethod from_yaml(path: str) HGBConfig[source]
apply_overrides(overrides: Dict[str, Any] | None) HGBConfig[source]
to_dict() Dict[str, Any][source]
to_imputer_kwargs() Dict[str, Any][source]

pgsui.data_processing.config module

class pgsui.data_processing.config.T

Config utilities for PG-SUI.

We keep nested configs as dataclasses at all times.

Public API:
  • load_yaml_to_dataclass

  • apply_dot_overrides

  • dataclass_to_yaml

  • save_dataclass_yaml

alias of TypeVar(‘T’)

pgsui.data_processing.config.dataclass_to_yaml(dc: Any) str[source]

Convert a dataclass instance to a YAML string.

This function uses the asdict function from the dataclasses module to convert the dataclass instance into a dictionary, which is then serialized to a YAML string using the yaml module.

Parameters:

dc (t.Any) – A dataclass instance.

Returns:

The YAML representation of the dataclass.

Return type:

str

Raises:

TypeError – If dc is not a dataclass instance.

pgsui.data_processing.config.save_dataclass_yaml(dc: Any, path: str) None[source]

Save a dataclass instance as a YAML file.

This function uses the dataclass_to_yaml function to convert the dataclass instance into a YAML string, which is then written to a file.

Parameters:
  • dc (T) – A dataclass instance.

  • path (str) – Path to save the YAML file.

Raises:

TypeError – If dc is not a dataclass instance.

pgsui.data_processing.config.load_yaml_to_dataclass(path: str, dc_type: Type[T], *, base: T | None = None, overlays: Dict[str, Any] | None = None, yaml_preset_behavior: Literal['ignore', 'error'] = 'ignore') T[source]

Load a YAML file and merge into a dataclass instance with strict precedence.

This function implements the argument hierarchy, listed from lowest to highest precedence: defaults < CLI preset (used to build base) < YAML file < CLI args/--set.

Notes

  • preset is CLI-only. If the YAML contains preset, it will be ignored (default) or cause an error depending on yaml_preset_behavior.

  • Pass a base instance that is already constructed from the CLI-selected preset (e.g., VAEConfig.from_preset(args.preset)), and this function will overlay the YAML on top of it. Any additional overlays (a nested dict) are applied last.

Parameters:
  • path (str) – Path to the YAML file.

  • dc_type (Type[T]) – Dataclass type to construct if base is not provided.

  • base (T | None) – A preconstructed dataclass instance to start from (typically built from the CLI preset). If provided, it takes precedence over any other starting point.

  • overlays (Dict[str, Any] | None) – A nested mapping to apply after the YAML (e.g., derived CLI flags). These win over YAML values.

  • yaml_preset_behavior (Literal["ignore","error"]) – What to do if the YAML contains a preset key. Default: “ignore”.

Returns:

The merged dataclass instance.

Return type:

T

Raises:
  • TypeError – If base is not a dataclass, or YAML root isn’t a mapping, or overlays isn’t a mapping when provided.

  • ValueError – If yaml_preset_behavior="error" and YAML contains preset.

  • KeyError – If any override path is invalid.
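
The precedence described above can be sketched with standard-library dataclasses. This is an illustration only: ConfigSketch, TrainSketch, overlay, and load_sketch are hypothetical names, and a plain dict stands in for the parsed YAML root so the example does not depend on a YAML parser.

```python
import copy
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class TrainSketch:
    batch_size: int = 32
    lr: float = 1e-3


@dataclass
class ConfigSketch:
    prefix: str = "run"
    train: TrainSketch = field(default_factory=TrainSketch)


def overlay(dc, mapping):
    """Return a copy of dc with a nested mapping applied on top."""
    if not isinstance(mapping, dict):
        raise TypeError("overlay must be a mapping")
    out = copy.deepcopy(dc)
    names = {f.name for f in fields(out)}
    for key, value in mapping.items():
        if key not in names:
            raise KeyError(key)  # unknown keys are rejected
        current = getattr(out, key)
        if is_dataclass(current) and isinstance(value, dict):
            value = overlay(current, value)  # recurse into nested configs
        setattr(out, key, value)
    return out


def load_sketch(yaml_root, base, overlays=None, yaml_preset_behavior="ignore"):
    """Apply the YAML mapping to base, then apply overlays last."""
    yaml_root = dict(yaml_root)
    if "preset" in yaml_root:
        if yaml_preset_behavior == "error":
            raise ValueError("preset is CLI-only")
        yaml_root.pop("preset")  # ignored by default
    return overlay(overlay(base, yaml_root), overlays or {})


# base stands in for an instance built from the CLI-selected preset.
base = ConfigSketch(train=TrainSketch(batch_size=64))
yaml_root = {"preset": "fast", "prefix": "from_yaml", "train": {"lr": 0.01}}
cfg = load_sketch(yaml_root, base, overlays={"train": {"batch_size": 128}})
```

Here the YAML value for prefix replaces the base value, the overlay for train.batch_size wins over both base and YAML, and the preset key in the YAML mapping is dropped.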

pgsui.data_processing.config.apply_dot_overrides(dc: Any, overrides: dict[str, Any] | None, *, root_cls: type | None = None, create_missing: bool = False, registry: dict[str, type] | None = None) Any[source]

Apply overrides like {'io.prefix': '…', 'train.batch_size': 64} to any Config dataclass.

This function updates the fields of a dataclass instance with values from a mapping of dot-key paths. Every path must resolve to a field of the dataclass; nested dataclass fields are reached by traversing the path segment by segment.

Parameters:
  • dc (t.Any) – A dataclass instance (or a dict that can be up-cast).

  • overrides (dict[str, t.Any] | None) – Mapping of dot-key paths to values.

  • root_cls (type | None) – Optional dataclass type to up-cast a root dict into (if dc is a dict).

  • create_missing (bool) – If True, instantiate missing intermediate dataclass nodes when the schema defines them.

  • registry (dict[str, type] | None) – Optional mapping from top-level segment → dataclass type to assist up-casting.

Returns:

The updated dataclass instance (same object identity is not guaranteed; a deep copy is made).

Return type:

t.Any

Notes

  • Dict payloads encountered at intermediate nodes are merged into the expected dataclass type using schema introspection.

  • Enforces unknown-key errors to keep configs honest.

Raises:
  • TypeError – If dc is not a dataclass or dict (for up-cast).

  • KeyError – If any override path is invalid.
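
The dot-key behavior described above can be illustrated with a small standard-library sketch. The names IOSketch, TrainSketch, ConfigSketch, and apply_dots are hypothetical stand-ins; the sketch covers only the deep copy, path traversal, and unknown-key check, and omits dict up-casting, create_missing, and registry.

```python
import copy
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class IOSketch:
    prefix: str = "pgsui"


@dataclass
class TrainSketch:
    batch_size: int = 32


@dataclass
class ConfigSketch:
    io: IOSketch = field(default_factory=IOSketch)
    train: TrainSketch = field(default_factory=TrainSketch)


def apply_dots(dc, overrides):
    """Apply {'a.b': value} style overrides to a deep copy of dc."""
    if not is_dataclass(dc) or isinstance(dc, type):
        raise TypeError("expected a dataclass instance")
    out = copy.deepcopy(dc)
    for path, value in (overrides or {}).items():
        *parents, leaf = path.split(".")
        node = out
        # Walk down to the dataclass that owns the final segment.
        for name in parents:
            if not is_dataclass(node) or name not in {f.name for f in fields(node)}:
                raise KeyError(path)
            node = getattr(node, name)
        if not is_dataclass(node) or leaf not in {f.name for f in fields(node)}:
            raise KeyError(path)
        setattr(node, leaf, value)
    return out


original = ConfigSketch()
updated = apply_dots(original, {"io.prefix": "demo", "train.batch_size": 64})
```

The input instance is left untouched because all changes are made on the copy, which matches the note above that object identity is not preserved.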

Module contents