pgsui.data_processing package
Submodules
pgsui.data_processing.transformers module
- class pgsui.data_processing.transformers.SimGenotypeDataTransformer(*, prop_missing: float = 0.1, strategy: Literal['random', 'random_inv_genotype'] = 'random', missing_val: int = -1, seed: int | None = None, logger: Logger | None = None, het_boost: float = 1.0)[source]
Simulates missing genotypes at the locus level on a 2D integer matrix.
This transformer masks a proportion of known genotypes in the input matrix X, setting them to a specified missing value. The masking can be done randomly or based on inverse genotype frequencies, with an option to boost the likelihood of masking heterozygous genotypes.
- Parameters:
prop_missing (float) – Proportion of known loci to mask (0..1).
strategy (Literal) – Masking strategy: “random” or “random_inv_genotype”.
missing_val (int) – Missing code value (default: -1).
seed (int | None) – RNG seed.
logger (logging.Logger | None) – Logger for messages.
het_boost (float) – Multiplier for heterozygotes in inv-genotype mode.
- fit(X, y=None) SimGenotypeDataTransformer[source]
No-op: the transformer is stateless, so fit learns nothing from X and returns self.
- Parameters:
X (np.ndarray) – (n_samples, n_features), integer codes {0..9} or <0 as missing.
y – Ignored.
- transform(X: ndarray) tuple[ndarray, dict][source]
Apply missing-data simulation on a 2D genotype matrix.
- Parameters:
X (np.ndarray) – (n_samples, n_features), integer codes {0..9} or <0 as missing.
- Returns:
(X_masked, masks), where masks has the keys ‘original’ (original missing entries, boolean 2D), ‘simulated’ (loci masked here, boolean 2D), and ‘all’ (union of original and simulated, boolean 2D).
- Return type:
tuple[np.ndarray, dict]
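The relationship between the three masks returned by transform can be sketched in plain Python. This is a minimal illustration of the “random” strategy, not PG-SUI's implementation: the helper name simulate_missing, the use of nested lists instead of NumPy arrays, and the rounding of the number of masked entries are all assumptions.

```python
import random

def simulate_missing(X, prop_missing=0.1, missing_val=-1, seed=None):
    """Hypothetical stand-in for the 'random' strategy; not PG-SUI's code."""
    rng = random.Random(seed)
    # Entries below zero are treated as already missing.
    original = [[v < 0 for v in row] for row in X]
    known = [(i, j) for i, row in enumerate(X) for j, v in enumerate(row) if v >= 0]
    # Mask a proportion of the known entries only (rounding is an assumption).
    n_mask = round(prop_missing * len(known))
    chosen = set(rng.sample(known, n_mask))
    simulated = [[(i, j) in chosen for j in range(len(row))] for i, row in enumerate(X)]
    X_masked = [
        [missing_val if simulated[i][j] else v for j, v in enumerate(row)]
        for i, row in enumerate(X)
    ]
    # 'all' is the element-wise union of the original and simulated masks.
    all_mask = [[o or s for o, s in zip(orow, srow)] for orow, srow in zip(original, simulated)]
    return X_masked, {"original": original, "simulated": simulated, "all": all_mask}

X = [[0, 1, 2, -1], [2, 0, 1, 1], [1, -1, 0, 2]]
X_masked, masks = simulate_missing(X, prop_missing=0.5, seed=0)
```

Because only known entries are candidates, the ‘original’ and ‘simulated’ masks never overlap in this sketch.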
- class pgsui.data_processing.transformers.SimMissingTransformer(genotype_data, *, tree_parser: TreeParser | None = None, prop_missing=0.1, strategy='random', missing_val=-9, mask_missing=True, verbose=0, tol=None, max_tries=None, seed: int | None = None, logger: Logger | None = None)[source]
Simulate missing data on genotypes encoded as 0/1/2 integers.
This transformer is designed to work with genotype data that has been preprocessed into a suitable format. It simulates missing data according to various strategies, allowing for the testing and evaluation of imputation methods. The simulated missing data can be controlled in terms of proportion and distribution across samples and loci.
- Parameters:
genotype_data (GenotypeData object) – GenotypeData instance.
prop_missing (float, optional) – Proportion of missing data desired in output. Must be in the interval [0, 1]. Defaults to 0.1
strategy (Literal["nonrandom", "nonrandom_weighted", "random_weighted", "random_weighted_inv", "random"]) – Strategy for simulating missing data. “random”: Uniformly masks genotypes at random among eligible entries until the target missing proportion is reached. “random_weighted”: Masks genotypes at random with probabilities proportional to their observed genotype frequencies in each column (more common genotypes are more likely to be masked). “random_weighted_inv”: Masks genotypes at random with probabilities inversely proportional to their observed genotype frequencies in each column (rarer genotypes are more likely to be masked). “nonrandom”: Uses the supplied genotype tree to place missing data on clades that are sampled uniformly from internal and/or tip nodes, producing phylogenetically clustered missingness. “nonrandom_weighted”: As in “nonrandom”, but clades are sampled with probabilities proportional to their branch lengths, concentrating missingness on longer branches (e.g., mimicking locus dropout tied to evolutionary divergence). Defaults to “random”.
missing_val (int, optional) – Value that represents missing data. Defaults to -9.
mask_missing (bool, optional) – True if you want to skip original missing values when simulating new missing data, False otherwise. Defaults to True.
verbose (int, optional) – Verbosity level. Defaults to 0.
tol (float, optional) – Tolerance for reaching the proportion specified in prop_missing. If None, defaults to 1 / (num_snps * num_inds).
max_tries (int) – Maximum number of tries to reach targeted missing data proportion within specified tol. If None, num_inds will be used. Defaults to None.
- original_missing_mask_
Array with boolean mask for original missing locations.
- Type:
numpy.ndarray
- sim_missing_mask_
Array with boolean mask for simulated missing locations, excluding the original ones.
- Type:
numpy.ndarray
- all_missing_mask_
Array with boolean mask for all missing locations, including both simulated and original.
- Type:
numpy.ndarray
- __init__(genotype_data, *, tree_parser: TreeParser | None = None, prop_missing=0.1, strategy='random', missing_val=-9, mask_missing=True, verbose=0, tol=None, max_tries=None, seed: int | None = None, logger: Logger | None = None) None[source]
Initialize the SimMissingTransformer.
- Parameters:
genotype_data (GenotypeData object) – GenotypeData instance.
tree_parser (TreeParser | None) – TreeParser instance with a loaded tree. Required for “nonrandom” and “nonrandom_weighted” strategies.
prop_missing (float, optional) – Proportion of missing data desired in output. Must be in the interval [0, 1]. Defaults to 0.1
strategy (Literal["nonrandom", "nonrandom_weighted", "random_weighted", "random_weighted_inv", "random"]) – Strategy for simulating missing data. “random”: Uniformly masks genotypes at random among eligible entries until the target missing proportion is reached. “random_weighted”: Masks genotypes at random with probabilities proportional to their observed genotype frequencies in each column (more common genotypes are more likely to be masked). “random_weighted_inv”: Masks genotypes at random with probabilities inversely proportional to their observed genotype frequencies in each column (rarer genotypes are more likely to be masked). “nonrandom”: Uses the supplied genotype tree to place missing data on clades that are sampled uniformly from internal and/or tip nodes, producing phylogenetically clustered missingness. “nonrandom_weighted”: As in “nonrandom”, but clades are sampled with probabilities proportional to their branch lengths, concentrating missingness on longer branches (e.g., mimicking locus dropout tied to evolutionary divergence). Defaults to “random”.
missing_val (int, optional) – Value that represents missing data. Defaults to -9.
mask_missing (bool, optional) – True if you want to skip original missing values when simulating new missing data, False otherwise. Defaults to True.
verbose (int, optional) – Verbosity level. Defaults to 0.
tol (float, optional) – Tolerance for reaching the proportion specified in prop_missing. If None, defaults to 1 / (num_snps * num_inds).
max_tries (int) – Maximum number of tries to reach targeted missing data proportion within specified tol. If None, num_inds will be used. Defaults to None.
seed (int | None) – RNG seed.
logger (logging.Logger | None) – Logger for messages.
- fit(X: ndarray, y=None) SimMissingTransformer[source]
Fit to input data X by simulating missing data.
Missing data will be simulated in varying ways depending on the strategy setting.
- Parameters:
X (np.ndarray) – Data with which to simulate missing data. It should have already been imputed with one of the non-machine learning simple imputers. X may contain original missing values; simulation is applied to eligible entries depending on mask_missing.
- Raises:
TypeError – SimGenotypeDataTreeTransformer.tree must not be NoneType when using strategy=“nonrandom” or “nonrandom_weighted”.
ValueError – Invalid strategy parameter provided.
- transform(X: ndarray) ndarray[source]
Function to generate masked sites in a SimGenotypeData object
- Parameters:
X (np.ndarray) – Data to transform. No missing data should be present in X. It should have already been imputed with one of the non-machine learning simple imputers.
- Returns:
Transformed data with missing data added.
- Return type:
np.ndarray
- sqrt_transform(proportions: ndarray) ndarray[source]
Apply the square root transformation to an array of proportions.
- Parameters:
proportions (np.ndarray) – An array of proportions.
- Returns:
The transformed proportions.
- Return type:
np.ndarray
- random_weighted_missing_data(X: ndarray, transform_fn: Literal['sqrt', 'exp'] = 'sqrt', power: float = 0.5, inv: bool = False, rng: Generator | None = None, target_rate: float | None = None, *, mask_missing: bool = True) ndarray[source]
Simulate missing data proportional or inversely proportional to genotype frequencies.
This method simulates missing data in a genotype matrix based on genotype frequencies. It allows for different transformation functions to be applied to the base probabilities, and can optionally use inverse genotype frequencies.
- Parameters:
X (np.ndarray) – Input genotype matrix.
transform_fn (Literal["sqrt", "exp"]) – Transformation function to apply to base probabilities.
power (float) – Exponent to raise transformed probabilities.
inv (bool) – If True, use inverse genotype frequencies. If False, use direct frequencies to weight missingness.
rng (Optional[np.random.Generator]) – Optional NumPy Generator for reproducibility.
target_rate (float | None) – If provided, scales the probabilities to achieve this target missing rate.
- Returns:
Simulated missing mask.
- Return type:
np.ndarray
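How direct and inverse genotype frequencies could translate into per-entry masking weights for a single locus is sketched below. It is a minimal illustration under stated assumptions, not the library's code: column_mask_weights is a hypothetical helper, only the “sqrt” transform is shown, the power and target_rate parameters are left out, and normalising the weights to sum to 1 is a choice made for the example.

```python
from collections import Counter
import math

def column_mask_weights(column, inv=False, missing_val=-9):
    """Per-entry masking weights for one locus (hypothetical helper)."""
    known = [g for g in column if g != missing_val]
    counts = Counter(known)
    total = len(known)
    weights = []
    for g in column:
        if g == missing_val:
            weights.append(0.0)  # already-missing entries are not eligible
            continue
        freq = counts[g] / total
        # inv=True favours rare genotypes; inv=False favours common ones.
        base = 1.0 / freq if inv else freq
        weights.append(math.sqrt(base))  # sqrt transform dampens extremes
    norm = sum(weights)
    return [w / norm for w in weights] if norm > 0 else weights

col = [0, 0, 0, 1, 2, -9]
direct = column_mask_weights(col)
inverse = column_mask_weights(col, inv=True)
```

With inv=False the common genotype 0 receives the larger weight; with inv=True the rarer genotypes 1 and 2 do.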
- write_mask(filename_prefix: str)[source]
Write mask to file.
- Parameters:
filename_prefix (str) – Prefix for the filenames to write to.
- read_mask(filename_prefix: str) Tuple[ndarray, ndarray, ndarray][source]
Read mask from file.
- Parameters:
filename_prefix (str) – Prefix for the filenames to read from.
- Returns:
The read masks. (mask, original_missing_mask, all_missing_mask).
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
- property missing_count: int
Count of masked genotypes in SimGenotypeData.mask
- Returns:
Integer count of masked alleles.
- Return type:
int
- property prop_missing_real: float
Proportion of genotypes masked in SimGenotypeData.mask
- Returns:
Total number of masked alleles divided by SNP matrix size.
- Return type:
float
pgsui.data_processing.containers module
- class pgsui.data_processing.containers.ModelConfig(latent_init: Literal['random', 'pca'] = 'random', latent_dim: int = 2, dropout_rate: float = 0.2, num_hidden_layers: int = 2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', layer_scaling_factor: float = 5.0, layer_schedule: Literal['pyramid', 'linear'] = 'pyramid')[source]
Model architecture configuration.
- latent_init
Method for initializing the latent space.
- Type:
Literal[“random”, “pca”]
- latent_dim
Dimensionality of the latent space.
- Type:
int
- dropout_rate
Dropout rate for regularization.
- Type:
float
- num_hidden_layers
Number of hidden layers in the neural network.
- Type:
int
- activation
Activation function.
- Type:
Literal[“relu”, “elu”, “selu”, “leaky_relu”]
- layer_scaling_factor
Scaling factor for the number of neurons in hidden layers.
- Type:
float
- layer_schedule
Schedule for scaling hidden layer sizes.
- Type:
Literal[“pyramid”, “linear”]
- latent_init: Literal['random', 'pca'] = 'random'
- latent_dim: int = 2
- dropout_rate: float = 0.2
- num_hidden_layers: int = 2
- activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu'
- layer_scaling_factor: float = 5.0
- layer_schedule: Literal['pyramid', 'linear'] = 'pyramid'
- class pgsui.data_processing.containers.TrainConfig(batch_size: int = 64, learning_rate: float = 0.001, l1_penalty: float = 0.0, early_stop_gen: int = 25, min_epochs: int = 100, max_epochs: int = 2000, validation_split: float = 0.2, device: Literal['gpu', 'cpu', 'mps'] = 'cpu', weights_max_ratio: float | None = None, weights_power: float = 1.0, weights_normalize: bool = True, weights_inverse: bool = False, gamma: float = 0.0, gamma_schedule: bool = False)[source]
Training procedure configuration.
- batch_size
Number of samples per training batch.
- Type:
int
- learning_rate
Learning rate for the optimizer.
- Type:
float
- l1_penalty
L1 regularization penalty.
- Type:
float
- early_stop_gen
Number of generations with no improvement to wait before early stopping.
- Type:
int
- min_epochs
Minimum number of epochs to train.
- Type:
int
- max_epochs
Maximum number of epochs to train.
- Type:
int
- validation_split
Proportion of data to use for validation.
- Type:
float
- weights_max_ratio
Maximum ratio for class weights to prevent extreme values.
- Type:
float | None
- gamma
Focusing parameter for focal loss.
- Type:
float
- device
Device to use for computation.
- Type:
Literal[“gpu”, “cpu”, “mps”]
- batch_size: int = 64
- learning_rate: float = 0.001
- l1_penalty: float = 0.0
- early_stop_gen: int = 25
- min_epochs: int = 100
- max_epochs: int = 2000
- validation_split: float = 0.2
- device: Literal['gpu', 'cpu', 'mps'] = 'cpu'
- weights_max_ratio: float | None = None
- weights_power: float = 1.0
- weights_normalize: bool = True
- weights_inverse: bool = False
- gamma: float = 0.0
- gamma_schedule: bool = False
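The gamma field is described as the focusing parameter for focal loss. The standard focal loss for one example is -(1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below shows only that formula; how PG-SUI combines it with class weights or gamma_schedule is not stated here, and focal_loss is a name chosen for the example.

```python
import math

def focal_loss(p_true, gamma=0.0):
    """Focal loss for one example, given the probability of the true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy, hard = 0.9, 0.3
ce_easy = focal_loss(easy, gamma=0.0)  # gamma=0 reduces to cross-entropy
fl_easy = focal_loss(easy, gamma=2.0)  # well-classified example is down-weighted
fl_hard = focal_loss(hard, gamma=2.0)  # poorly classified example keeps more weight
```

With the default gamma of 0.0 the loss is plain cross-entropy; larger values shrink the contribution of examples the model already classifies confidently.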
- class pgsui.data_processing.containers.TuneConfig(enabled: bool = False, metrics: Literal['f1', 'accuracy', 'pr_macro', 'average_precision', 'roc_auc', 'precision', 'recall', 'mcc', 'jaccard'] | list[str] | tuple[str, ...] = 'f1', n_trials: int = 100, resume: bool = False, save_db: bool = False, epochs: int = 500, batch_size: int = 64, patience: int = 10)[source]
Hyperparameter tuning configuration.
- enabled
If True, enables hyperparameter tuning.
- Type:
bool
- metrics
Metric(s) to optimize during tuning. Multi-objective tuning is supported by providing a list or tuple of metric names.
- Type:
Literal[“f1”, “accuracy”, “pr_macro”, “average_precision”, “roc_auc”, “precision”, “recall”, “mcc”, “jaccard”] | list[str] | tuple[str, …]
- n_trials
Number of hyperparameter trials to run.
- Type:
int
- resume
If True, resumes tuning from a previous state.
- Type:
bool
- save_db
If True, saves the tuning results to a database.
- Type:
bool
- epochs
Number of epochs to train each trial.
- Type:
int
- batch_size
Batch size for training during tuning.
- Type:
int
- patience
Number of evaluations with no improvement before stopping early.
- Type:
int
- enabled: bool = False
- metrics: Literal['f1', 'accuracy', 'pr_macro', 'average_precision', 'roc_auc', 'precision', 'recall', 'mcc', 'jaccard'] | list[str] | tuple[str, ...] = 'f1'
- n_trials: int = 100
- resume: bool = False
- save_db: bool = False
- epochs: int = 500
- batch_size: int = 64
- patience: int = 10
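Since metrics accepts either a single name or a list or tuple of names, code that consumes it typically normalises the value first. A minimal sketch, assuming a hypothetical helper named normalize_metrics that is not part of PG-SUI:

```python
def normalize_metrics(metrics):
    """Return metric names as a tuple, whether given one name or several."""
    if isinstance(metrics, str):
        return (metrics,)
    return tuple(metrics)

single = normalize_metrics("f1")
multi = normalize_metrics(["f1", "mcc"])
```

A tuple with more than one entry would correspond to the multi-objective case described for metrics.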
- class pgsui.data_processing.containers.PlotConfig(fmt: Literal['pdf', 'png', 'jpg', 'jpeg', 'svg'] = 'pdf', dpi: int = 300, fontsize: int = 18, despine: bool = True, show: bool = True, multiqc: bool = True)[source]
Plotting configuration.
- fmt
Output file format.
- Type:
Literal[“pdf”, “png”, “jpg”, “jpeg”, “svg”]
- dpi
Dots per inch for the output figure.
- Type:
int
- fontsize
Font size for text in the plots.
- Type:
int
- despine
If True, removes the top and right spines from plots.
- Type:
bool
- show
If True, displays the plot interactively.
- Type:
bool
- multiqc
If True, generates MultiQC-compatible plots.
- Type:
bool
- fmt: Literal['pdf', 'png', 'jpg', 'jpeg', 'svg'] = 'pdf'
- dpi: int = 300
- fontsize: int = 18
- despine: bool = True
- show: bool = True
- multiqc: bool = True
- class pgsui.data_processing.containers.IOConfig(prefix: str = 'pgsui', ploidy: int = 2, verbose: bool = False, debug: bool = False, seed: int | None = None, n_jobs: int = 1, scoring_averaging: Literal['macro', 'weighted'] = 'macro')[source]
I/O configuration.
Dataclass that includes configuration settings for file naming, logging verbosity, random seed, and parallelism.
- prefix
Prefix for output files. Default is “pgsui”.
- Type:
str
- ploidy
Ploidy level of the organism. Default is 2.
- Type:
int
- verbose
If True, enables verbose logging. Default is False.
- Type:
bool
- debug
If True, enables debug mode. Default is False.
- Type:
bool
- seed
Random seed for reproducibility. Default is None.
- Type:
int | None
- n_jobs
Number of parallel jobs to run. Default is 1.
- Type:
int
- scoring_averaging
Averaging method used for scoring metrics (“macro” or “weighted”).
- Type:
Literal[“macro”, “weighted”]
- prefix: str = 'pgsui'
- ploidy: int = 2
- verbose: bool = False
- debug: bool = False
- seed: int | None = None
- n_jobs: int = 1
- scoring_averaging: Literal['macro', 'weighted'] = 'macro'
- class pgsui.data_processing.containers.SimConfig(simulate_missing: bool = False, sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random', sim_prop: float = 0.2, sim_kwargs: dict | None = None)[source]
Configuration for simulating missing data.
- simulate_missing
If True, simulates missing data.
- Type:
bool
- sim_strategy
Strategy for simulating missing data.
- Type:
Literal[“random”, …]
- sim_prop
Proportion of data to simulate as missing.
- Type:
float
- sim_kwargs
Additional keyword arguments for simulation.
- Type:
dict | None
- simulate_missing: bool = False
- sim_strategy: Literal['random', 'random_weighted', 'random_weighted_inv', 'nonrandom', 'nonrandom_weighted'] = 'random'
- sim_prop: float = 0.2
- sim_kwargs: dict | None = None
- class pgsui.data_processing.containers.AutoencoderConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>)[source]
Top-level configuration for ImputeAutoencoder.
This configuration class encapsulates all settings required for the ImputeAutoencoder model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- io: IOConfig
- model: ModelConfig
- train: TrainConfig
- tune: TuneConfig
- plot: PlotConfig
- sim: SimConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') AutoencoderConfig[source]
Build an AutoencoderConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
AutoencoderConfig
- apply_overrides(overrides: Dict[str, Any] | None) AutoencoderConfig[source]
Apply flat dot-key overrides.
- Parameters:
overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
- Returns:
New configuration instance with overrides applied.
- Return type:
AutoencoderConfig
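One way flat dot-key overrides can be applied to nested dataclasses is sketched below. It illustrates the idea only: MiniIO, MiniTrain, MiniConfig and the standalone apply_overrides function are hypothetical stand-ins, and the real method may validate keys or coerce types differently.

```python
from dataclasses import dataclass, field, replace
from typing import Optional

@dataclass
class MiniTrain:  # hypothetical stand-in for TrainConfig
    batch_size: int = 64
    learning_rate: float = 0.001

@dataclass
class MiniIO:  # hypothetical stand-in for IOConfig
    prefix: str = "pgsui"
    seed: Optional[int] = None

@dataclass
class MiniConfig:  # hypothetical stand-in for AutoencoderConfig
    io: MiniIO = field(default_factory=MiniIO)
    train: MiniTrain = field(default_factory=MiniTrain)

def apply_overrides(cfg, overrides):
    """Return a config with 'section.field' keys replaced; cfg is not mutated."""
    for key, value in (overrides or {}).items():
        section, name = key.split(".", 1)
        sub = getattr(cfg, section)
        if not hasattr(sub, name):
            raise KeyError(f"unknown config key: {key}")
        # Rebuild the section, then the top-level config, with the new value.
        cfg = replace(cfg, **{section: replace(sub, **{name: value})})
    return cfg

base = MiniConfig()
tuned = apply_overrides(base, {"train.batch_size": 128, "io.seed": 42})
```

Building new instances with dataclasses.replace leaves the original configuration unchanged, which matches the documented behaviour of returning a new instance.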
- to_dict() Dict[str, Any][source]
- class pgsui.data_processing.containers.VAEExtraConfig(kl_beta: float = 1.0, kl_beta_schedule: bool = False)[source]
- kl_beta: float = 1.0
- kl_beta_schedule: bool = False
- class pgsui.data_processing.containers.VAEConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, vae: VAEExtraConfig = <factory>, sim: SimConfig = <factory>)[source]
Top-level configuration for ImputeVAE (AE-parity + VAE extras).
Mirrors AutoencoderConfig sections and adds a vae block with KL-beta controls for the VAE loss.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- vae
VAE-specific configuration.
- Type:
VAEExtraConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- io: IOConfig
- model: ModelConfig
- train: TrainConfig
- tune: TuneConfig
- plot: PlotConfig
- vae: VAEExtraConfig
- sim: SimConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') VAEConfig[source]
Build a VAEConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
VAEConfig
- apply_overrides(overrides: Dict[str, Any] | None) VAEConfig[source]
Apply flat dot-key overrides.
- to_dict() Dict[str, Any][source]
- class pgsui.data_processing.containers.NLPCAExtraConfig(projection_lr: float = 0.05, projection_epochs: int = 100)[source]
- projection_lr: float = 0.05
- projection_epochs: int = 100
- class pgsui.data_processing.containers.NLPCAConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, nlpca: NLPCAExtraConfig = <factory>)[source]
Top-level configuration for ImputeNLPCA.
This configuration class encapsulates all settings required for the ImputeNLPCA model, including I/O, model architecture, training, hyperparameter tuning, plotting, simulated-missing, and NLPCA-specific configuration.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- nlpca
NLPCA-specific configuration.
- Type:
NLPCAExtraConfig
- io: IOConfig
- model: ModelConfig
- train: TrainConfig
- tune: TuneConfig
- plot: PlotConfig
- sim: SimConfig
- nlpca: NLPCAExtraConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') NLPCAConfig[source]
Build an NLPCAConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
NLPCAConfig
- apply_overrides(overrides: Dict[str, Any] | None) NLPCAConfig[source]
Apply flat dot-key overrides.
- Parameters:
overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
- Returns:
New configuration instance with overrides applied.
- Return type:
NLPCAConfig
- to_dict() Dict[str, Any][source]
- class pgsui.data_processing.containers.UBPExtraConfig(projection_lr: float = 0.05, projection_epochs: int = 100)[source]
- projection_lr: float = 0.05
- projection_epochs: int = 100
- class pgsui.data_processing.containers.UBPConfig(io: IOConfig = <factory>, model: ModelConfig = <factory>, train: TrainConfig = <factory>, tune: TuneConfig = <factory>, plot: PlotConfig = <factory>, sim: SimConfig = <factory>, ubp: UBPExtraConfig = <factory>)[source]
Top-level configuration for ImputeUBP.
This configuration class encapsulates all settings required for the ImputeUBP model, including I/O, model architecture, training, hyperparameter tuning, plotting, and simulated-missing configuration.
- io
I/O configuration.
- Type:
IOConfig
- model
Model architecture configuration.
- Type:
ModelConfig
- train
Training procedure configuration.
- Type:
TrainConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- sim
Simulated-missing configuration.
- Type:
SimConfig
- ubp
UBP-specific configuration.
- Type:
UBPExtraConfig
- io: IOConfig
- model: ModelConfig
- train: TrainConfig
- tune: TuneConfig
- plot: PlotConfig
- sim: SimConfig
- ubp: UBPExtraConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') UBPConfig[source]
Build a UBPConfig from a named preset.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
UBPConfig
- apply_overrides(overrides: Dict[str, Any] | None) UBPConfig[source]
Apply flat dot-key overrides.
- Parameters:
overrides (Dict[str, Any] | None) – Dictionary of overrides with dot-separated keys.
- Returns:
New configuration instance with overrides applied.
- Return type:
UBPConfig
- to_dict() Dict[str, Any][source]
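For nested dataclass configurations, a to_dict method is commonly a thin wrapper around dataclasses.asdict. The sketch below assumes that approach; MiniExtra and MiniUBPConfig are hypothetical stand-ins, and the real method may post-process the result.

```python
from dataclasses import asdict, dataclass, field

@dataclass
class MiniExtra:  # hypothetical stand-in for UBPExtraConfig
    projection_lr: float = 0.05
    projection_epochs: int = 100

@dataclass
class MiniUBPConfig:  # hypothetical stand-in for UBPConfig
    ubp: MiniExtra = field(default_factory=MiniExtra)

    def to_dict(self):
        # asdict recurses into nested dataclasses and returns plain dicts.
        return asdict(self)

as_dict = MiniUBPConfig().to_dict()
```

The result is a plain nested dictionary keyed by section name, which is convenient for serialising to YAML or JSON.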
- class pgsui.data_processing.containers.MostFrequentAlgoConfig(by_populations: bool = False, default: int = 0, missing: int = -1)[source]
Algorithmic knobs for ImputeMostFrequent.
- by_populations
Whether to compute per-population modes. Default is False.
- Type:
bool
- default
Fallback mode if no valid entries in a locus. Default is 0.
- Type:
int
- missing
Code for missing genotypes in 0/1/2. Default is -1.
- Type:
int
- by_populations: bool = False
- default: int = 0
- missing: int = -1
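How the default and missing fields interact can be sketched with a per-locus mode imputation in plain Python. This is an illustration under assumptions, not ImputeMostFrequent itself: impute_most_frequent is a hypothetical helper, it works on nested lists, and the by_populations option is not covered.

```python
from collections import Counter

def impute_most_frequent(X, default=0, missing=-1):
    """Fill missing entries with each locus's most common genotype."""
    n_loci = len(X[0])
    modes = []
    for j in range(n_loci):
        observed = [row[j] for row in X if row[j] != missing]
        # Fall back to `default` when a locus has no observed genotypes.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else default)
    return [[modes[j] if v == missing else v for j, v in enumerate(row)] for row in X]

X = [[0, 2, -1], [0, -1, -1], [-1, 2, -1], [1, 1, -1]]
imputed = impute_most_frequent(X)
# imputed == [[0, 2, 0], [0, 2, 0], [0, 2, 0], [1, 1, 0]]
```

The third locus has no observed genotypes, so every entry there receives the default value.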
- class pgsui.data_processing.containers.DeterministicSplitConfig(test_size: float = 0.2, test_indices: Sequence[int] | None = None)[source]
Evaluation split configuration shared by deterministic imputers.
- test_size
Proportion of data to use as the test set. Default is 0.2.
- Type:
float
- test_indices
Specific indices to use as the test set. Default is None.
- Type:
Optional[Sequence[int]]
- test_size: float = 0.2
- test_indices: Sequence[int] | None = None
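The two fields suggest two ways of choosing a test set: explicit indices, or a random fraction of the samples. The sketch below assumes explicit test_indices take precedence over test_size. That precedence, the choose_test_indices helper, and the rounding rule are assumptions, not documented behaviour.

```python
import random

def choose_test_indices(n_samples, test_size=0.2, test_indices=None, seed=None):
    """Pick test-set row indices (hypothetical helper)."""
    if test_indices is not None:
        return sorted(set(test_indices))  # explicit indices take precedence
    # Otherwise draw a random subset sized by test_size (at least one row).
    n_test = max(1, round(test_size * n_samples))
    return sorted(random.Random(seed).sample(range(n_samples), n_test))

auto = choose_test_indices(10, test_size=0.2, seed=1)
fixed = choose_test_indices(10, test_indices=[7, 3, 3])
```

Passing a seed makes the random split repeatable between runs.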
- class pgsui.data_processing.containers.MostFrequentConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: MostFrequentAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]
Top-level configuration for ImputeMostFrequent.
Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeMostFrequent.
- io
I/O configuration.
- Type:
IOConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- split
Data splitting configuration.
- Type:
DeterministicSplitConfig
- algo
Algorithmic configuration.
- Type:
MostFrequentAlgoConfig
- sim
Simulation configuration.
- Type:
SimConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- train
Training configuration.
- Type:
TrainConfig
- io: IOConfig
- plot: PlotConfig
- split: DeterministicSplitConfig
- algo: MostFrequentAlgoConfig
- sim: SimConfig
- tune: TuneConfig
- train: TrainConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') MostFrequentConfig[source]
Construct a preset configuration.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
MostFrequentConfig
- apply_overrides(overrides: Dict[str, Any] | None) MostFrequentConfig[source]
Apply dot-key overrides.
- to_dict() Dict[str, Any][source]
- class pgsui.data_processing.containers.RefAlleleAlgoConfig(missing: int = -1)[source]
Algorithmic knobs for ImputeRefAllele.
- missing
Code for missing genotypes in 0/1/2.
- Type:
int
- missing: int = -1
- class pgsui.data_processing.containers.RefAlleleConfig(io: IOConfig = <factory>, plot: PlotConfig = <factory>, split: DeterministicSplitConfig = <factory>, algo: RefAlleleAlgoConfig = <factory>, sim: SimConfig = <factory>, tune: TuneConfig = <factory>, train: TrainConfig = <factory>)[source]
Top-level configuration for ImputeRefAllele.
Deterministic imputers primarily use io, plot, split, algo, and sim. The train and tune sections are retained for schema parity with NN models but are not currently used by ImputeRefAllele.
- io
I/O configuration.
- Type:
IOConfig
- plot
Plotting configuration.
- Type:
PlotConfig
- split
Data splitting configuration.
- Type:
DeterministicSplitConfig
- algo
Algorithmic configuration.
- Type:
RefAlleleAlgoConfig
- sim
Simulation configuration.
- Type:
SimConfig
- tune
Hyperparameter tuning configuration.
- Type:
TuneConfig
- train
Training configuration.
- Type:
TrainConfig
- io: IOConfig
- plot: PlotConfig
- split: DeterministicSplitConfig
- algo: RefAlleleAlgoConfig
- sim: SimConfig
- tune: TuneConfig
- train: TrainConfig
- classmethod from_preset(preset: Literal['fast', 'balanced', 'thorough'] = 'balanced') RefAlleleConfig[source]
Presets mainly keep parity with logging/IO and split test_size.
- Parameters:
preset (Literal["fast", "balanced", "thorough"]) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
RefAlleleConfig
- apply_overrides(overrides: Dict[str, Any] | None) RefAlleleConfig[source]
Apply dot-key overrides.
- to_dict() Dict[str, Any][source]
- class pgsui.data_processing.containers.IOConfigSupervised(prefix: str = 'pgsui', seed: int | None = None, n_jobs: int = 1, verbose: bool = False, debug: bool = False)[source]
I/O, logging, and run identity.
- prefix
Prefix for output files and logs.
- Type:
str
- seed
Random seed for reproducibility.
- Type:
Optional[int]
- n_jobs
Number of parallel jobs to use.
- Type:
int
- verbose
Whether to enable verbose logging.
- Type:
bool
- debug
Whether to enable debug mode.
- Type:
bool
- prefix: str = 'pgsui'
- seed: int | None = None
- n_jobs: int = 1
- verbose: bool = False
- debug: bool = False
- class pgsui.data_processing.containers.PlotConfigSupervised(fmt: Literal['pdf', 'png', 'jpg', 'jpeg'] = 'pdf', dpi: int = 300, fontsize: int = 18, despine: bool = True, show: bool = False)[source]
Plot/figure styling.
- fmt
File format.
- Type:
Literal[“pdf”, “png”, “jpg”, “jpeg”]
- dpi
Resolution in dots per inch.
- Type:
int
- fontsize
Base font size for plot text.
- Type:
int
- despine
Whether to remove top/right spines.
- Type:
bool
- show
Whether to display plots interactively.
- Type:
bool
- fmt: Literal['pdf', 'png', 'jpg', 'jpeg'] = 'pdf'
- dpi: int = 300
- fontsize: int = 18
- despine: bool = True
- show: bool = False
- class pgsui.data_processing.containers.TrainConfigSupervised(validation_split: float = 0.2)[source]
Training/evaluation split (by samples).
- validation_split
Proportion of data to use for validation.
- Type:
float
- validation_split: float = 0.2
- class pgsui.data_processing.containers.ImputerConfigSupervised(n_nearest_features: int | None = 10, max_iter: int = 10)[source]
IterativeImputer-like scaffolding used by current supervised wrappers.
- n_nearest_features
Number of nearest features to use.
- Type:
Optional[int]
- max_iter
Maximum number of imputation iterations to perform.
- Type:
int
- n_nearest_features: int | None = 10
- max_iter: int = 10
- class pgsui.data_processing.containers.SimConfigSupervised(prop_missing: float = 0.5, strategy: Literal['random', 'random_inv_genotype'] = 'random_inv_genotype', het_boost: float = 2.0, missing_val: int = -1)[source]
Simulation of missingness for evaluation.
- prop_missing
Proportion of features to set as missing.
- Type:
float
- strategy
Missingness simulation strategy.
- Type:
Literal[“random”, “random_inv_genotype”]
- het_boost
Boosting factor for heterozygous genotypes.
- Type:
float
- missing_val
Internal code for missing genotypes.
- Type:
int
- prop_missing: float = 0.5
- strategy: Literal['random', 'random_inv_genotype'] = 'random_inv_genotype'
- het_boost: float = 2.0
- missing_val: int = -1
- class pgsui.data_processing.containers.TuningConfigSupervised(enabled: bool = True, n_trials: int = 100, metric: str = 'pr_macro', n_jobs: int = 8, fast: bool = True)[source]
Optuna tuning envelope.
- enabled: bool = True
- n_trials: int = 100
- metric: str = 'pr_macro'
- n_jobs: int = 8
- fast: bool = True
- class pgsui.data_processing.containers.RFModelConfig(n_estimators: int = 100, max_depth: int | None = None, min_samples_split: int = 2, min_samples_leaf: int = 1, max_features: Literal['sqrt', 'log2'] | float | None = 'sqrt', criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini', class_weight: Literal['balanced', 'balanced_subsample', None] = 'balanced')[source]
Random Forest hyperparameters.
- n_estimators
Number of trees in the forest.
- Type:
int
- max_depth
Maximum depth of the trees.
- Type:
Optional[int]
- min_samples_split
Minimum number of samples required to split.
- Type:
int
- min_samples_leaf
Minimum number of samples required at a leaf.
- Type:
int
- max_features
Features to consider.
- Type:
Literal[“sqrt”, “log2”] | float | None
- criterion
Split quality metric.
- Type:
Literal[“gini”, “entropy”, “log_loss”]
- class_weight
Class weights.
- Type:
Literal['balanced', 'balanced_subsample', None]
- n_estimators: int = 100
- max_depth: int | None = None
- min_samples_split: int = 2
- min_samples_leaf: int = 1
- max_features: Literal['sqrt', 'log2'] | float | None = 'sqrt'
- criterion: Literal['gini', 'entropy', 'log_loss'] = 'gini'
- class_weight: Literal['balanced', 'balanced_subsample', None] = 'balanced'
- class pgsui.data_processing.containers.HGBModelConfig(n_estimators: int = 100, learning_rate: float = 0.1, max_depth: int | None = None, min_samples_leaf: int = 1, max_features: float | None = 1.0, n_iter_no_change: int = 10, tol: float = 1e-07)[source]
Histogram-based Gradient Boosting hyperparameters.
- n_estimators
Number of boosting iterations (max_iter).
- Type:
int
- learning_rate
Step size for each boosting iteration.
- Type:
float
- max_depth
Maximum depth of each tree.
- Type:
Optional[int]
- min_samples_leaf
Minimum number of samples required at a leaf.
- Type:
int
- max_features
Proportion of features to consider.
- Type:
float | None
- n_iter_no_change
Iterations to wait for early stopping.
- Type:
int
- tol
Minimum improvement in the loss.
- Type:
float
- n_estimators: int = 100
- learning_rate: float = 0.1
- max_depth: int | None = None
- min_samples_leaf: int = 1
- max_features: float | None = 1.0
- n_iter_no_change: int = 10
- tol: float = 1e-07
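The n_estimators description above notes that it corresponds to max_iter. A minimal sketch of that renaming follows, using a stand-in dataclass that mirrors the documented fields and defaults. HGBModelSketch and to_estimator_kwargs are hypothetical names, and how PG-SUI actually forwards these values to the estimator is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class HGBModelSketch:
    """Stand-in mirroring the documented HGBModelConfig fields and defaults."""

    n_estimators: int = 100
    learning_rate: float = 0.1
    max_depth: Optional[int] = None
    min_samples_leaf: int = 1
    max_features: Optional[float] = 1.0
    n_iter_no_change: int = 10
    tol: float = 1e-07


def to_estimator_kwargs(cfg: HGBModelSketch) -> Dict[str, Any]:
    """Rename n_estimators to max_iter; all other fields pass through unchanged."""
    kwargs = asdict(cfg)
    kwargs["max_iter"] = kwargs.pop("n_estimators")
    return kwargs


hgb_kwargs = to_estimator_kwargs(HGBModelSketch(n_estimators=250))
```

The resulting dict uses the parameter names of scikit-learn's HistGradientBoostingClassifier, so it could presumably be passed as keyword arguments there, provided the installed scikit-learn version supports max_features.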
- class pgsui.data_processing.containers.RFConfig(io: IOConfigSupervised = <factory>, model: RFModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)[source]
Configuration for ImputeRandomForest.
- io
Run identity, logging, and seeds.
- Type:
IOConfigSupervised
- model
RandomForest hyperparameters.
- Type:
RFModelConfig
- train
Sample split for validation.
- Type:
TrainConfigSupervised
- imputer
IterativeImputer scaffolding.
- Type:
ImputerConfigSupervised
- sim
Simulated missingness.
- Type:
SimConfigSupervised
- plot
Plot styling.
- Type:
PlotConfigSupervised
- tune
Optuna knobs.
- Type:
TuningConfigSupervised
- io: IOConfigSupervised
- model: RFModelConfig
- train: TrainConfigSupervised
- imputer: ImputerConfigSupervised
- sim: SimConfigSupervised
- plot: PlotConfigSupervised
- tune: TuningConfigSupervised
- classmethod from_preset(preset: str = 'balanced') RFConfig[source]
Build a config from a named preset.
- Parameters:
preset (str) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
RFConfig
- classmethod from_yaml(path: str) RFConfig[source]
Load from YAML; honors optional top-level ‘preset’.
- apply_overrides(overrides: Dict[str, Any] | None) RFConfig[source]
Apply flat dot-key overrides.
- to_dict() Dict[str, Any][source]
- to_imputer_kwargs() Dict[str, Any][source]
- class pgsui.data_processing.containers.HGBConfig(io: IOConfigSupervised = <factory>, model: HGBModelConfig = <factory>, train: TrainConfigSupervised = <factory>, imputer: ImputerConfigSupervised = <factory>, sim: SimConfigSupervised = <factory>, plot: PlotConfigSupervised = <factory>, tune: TuningConfigSupervised = <factory>)[source]
Configuration for ImputeHistGradientBoosting.
- io
Run identity, logging, and seeds.
- Type:
IOConfigSupervised
- model
HistGradientBoosting hyperparameters.
- Type:
HGBModelConfig
- train
Sample split for validation.
- Type:
TrainConfigSupervised
- imputer
IterativeImputer scaffolding.
- Type:
ImputerConfigSupervised
- sim
Simulated missingness.
- Type:
SimConfigSupervised
- plot
Plot styling.
- Type:
PlotConfigSupervised
- tune
Optuna knobs.
- Type:
TuningConfigSupervised
- io: IOConfigSupervised
- model: HGBModelConfig
- train: TrainConfigSupervised
- imputer: ImputerConfigSupervised
- sim: SimConfigSupervised
- plot: PlotConfigSupervised
- tune: TuningConfigSupervised
- classmethod from_preset(preset: str = 'balanced') HGBConfig[source]
Build a config from a named preset.
- Parameters:
preset (str) – Preset name.
- Returns:
Configuration instance corresponding to the preset.
- Return type:
HGBConfig
- classmethod from_yaml(path: str) HGBConfig[source]
- apply_overrides(overrides: Dict[str, Any] | None) HGBConfig[source]
- to_dict() Dict[str, Any][source]
- to_imputer_kwargs() Dict[str, Any][source]
pgsui.data_processing.config module
- class pgsui.data_processing.config.T
Config utilities for PG-SUI.
We keep nested configs as dataclasses at all times.
Public API: load_yaml_to_dataclass, apply_dot_overrides, dataclass_to_yaml, save_dataclass_yaml.
alias of TypeVar('T')
- pgsui.data_processing.config.dataclass_to_yaml(dc: Any) str[source]
Convert a dataclass instance to a YAML string.
This function uses the asdict function from the dataclasses module to convert the dataclass instance into a dictionary, which is then serialized to a YAML string using the yaml module.
- Parameters:
dc (t.Any) – A dataclass instance.
- Returns:
The YAML representation of the dataclass.
- Return type:
str
- Raises:
TypeError – If dc is not a dataclass instance.
- pgsui.data_processing.config.save_dataclass_yaml(dc: Any, path: str) None[source]
Save a dataclass instance as a YAML file.
This function uses the dataclass_to_yaml function to convert the dataclass instance into a YAML string, which is then written to a file.
- Parameters:
dc (T) – A dataclass instance.
path (str) – Path to save the YAML file.
- Raises:
TypeError – If dc is not a dataclass instance.
- pgsui.data_processing.config.load_yaml_to_dataclass(path: str, dc_type: Type[T], *, base: T | None = None, overlays: Dict[str, Any] | None = None, yaml_preset_behavior: Literal['ignore', 'error'] = 'ignore') T[source]
Load a YAML file and merge into a dataclass instance with strict precedence.
This function is designed for the new argument hierarchy, from lowest to highest precedence: defaults < CLI preset (build base from it) < YAML file < CLI args/--set.
Notes
preset is CLI-only. If the YAML contains preset, it will be ignored (default) or cause an error depending on yaml_preset_behavior.
Pass a base instance that is already constructed from the CLI-selected preset (e.g., VAEConfig.from_preset(args.preset)), and this function will overlay the YAML on top of it. Any additional overlays (a nested dict) are applied last.
- Parameters:
path (str) – Path to the YAML file.
dc_type (Type[T]) – Dataclass type to construct if base is not provided.
base (T | None) – A preconstructed dataclass instance to start from (typically built from the CLI preset). If provided, it takes precedence over any other starting point.
overlays (Dict[str, Any] | None) – A nested mapping to apply after the YAML (e.g., derived CLI flags). These win over YAML values.
yaml_preset_behavior (Literal['ignore', 'error']) – What to do if the YAML contains a preset key. Default: 'ignore'.
- Returns:
The merged dataclass instance.
- Return type:
T
- Raises:
TypeError – If base is not a dataclass, or YAML root isn’t a mapping, or overlays isn’t a mapping when provided.
ValueError – If yaml_preset_behavior='error' and YAML contains preset.
KeyError – If any override path is invalid.
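The precedence order described above (defaults < preset < YAML < overlays) can be sketched with plain nested dicts. This is only an illustration of the layering, not the function's actual code: deep_merge and resolve are hypothetical helpers, the example keys are made up for the demonstration, and the real function operates on dataclasses rather than dicts.

```python
import copy
from collections.abc import Mapping
from typing import Any, Dict, Optional


def deep_merge(base: Dict[str, Any], overlay: Mapping) -> Dict[str, Any]:
    """Return a copy of base with overlay merged in; overlay wins on conflicts."""
    out = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, Mapping) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = copy.deepcopy(value)
    return out


def resolve(defaults: Dict[str, Any], preset: Dict[str, Any],
            yaml_doc: Dict[str, Any], overlays: Optional[Dict[str, Any]] = None,
            yaml_preset_behavior: str = "ignore") -> Dict[str, Any]:
    """Layer the sources from lowest to highest precedence."""
    yaml_doc = dict(yaml_doc)
    if "preset" in yaml_doc:
        if yaml_preset_behavior == "error":
            raise ValueError("preset is CLI-only and must not appear in the YAML")
        yaml_doc.pop("preset")  # "ignore": drop the key silently
    merged = deep_merge(defaults, preset)   # the base built from the CLI preset
    merged = deep_merge(merged, yaml_doc)   # YAML overrides the base
    if overlays:
        merged = deep_merge(merged, overlays)  # CLI-derived overlays win last
    return merged


# Hypothetical example values.
defaults = {"io": {"prefix": "default"}, "tune": {"n_trials": 100, "enabled": True}}
preset = {"tune": {"n_trials": 50}}
yaml_doc = {"preset": "fast", "io": {"prefix": "from_yaml"}, "tune": {"n_trials": 25}}
overlays = {"tune": {"n_trials": 10}}
merged = resolve(defaults, preset, yaml_doc, overlays)
```

Here tune.n_trials ends at 10 because the overlay is applied last, io.prefix comes from the YAML, and tune.enabled keeps its default because no later layer touches it.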
- pgsui.data_processing.config.apply_dot_overrides(dc: Any, overrides: dict[str, Any] | None, *, root_cls: type | None = None, create_missing: bool = False, registry: dict[str, type] | None = None) Any[source]
Apply overrides like {'io.prefix': '…', 'train.batch_size': 64} to any Config dataclass.
This function updates the fields of a dataclass instance with values from a flat mapping of dot-key paths. Each path is resolved segment by segment through nested dataclass fields, and every key must correspond to a field in the dataclass schema.
- Parameters:
dc (t.Any) – A dataclass instance (or a dict that can be up-cast).
overrides (dict[str, t.Any] | None) – Mapping of dot-key paths to values.
root_cls (type | None) – Optional dataclass type to up-cast a root dict into (if dc is a dict).
create_missing (bool) – If True, instantiate missing intermediate dataclass nodes when the schema defines them.
registry (dict[str, type] | None) – Optional mapping from top-level segment → dataclass type to assist up-casting.
- Returns:
The updated dataclass instance (same object identity is not guaranteed; a deep copy is made).
- Return type:
t.Any
Notes
Dict payloads encountered at intermediate nodes are merged into the expected dataclass type using schema introspection.
Enforces unknown-key errors to keep configs honest.
- Raises:
TypeError – If dc is not a dataclass or dict (for up-cast).
KeyError – If any override path is invalid.
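A minimal sketch of dot-key overrides on nested dataclasses, covering only the behaviours documented above: a deep copy is returned, unknown paths raise KeyError, and a non-dataclass root raises TypeError. The stand-in classes and the name sketch_apply_dot_overrides are hypothetical; dict up-casting, root_cls, create_missing, and registry are deliberately left out.

```python
import copy
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict, Optional


@dataclass
class IOSketch:
    prefix: str = "run"


@dataclass
class TuneSketch:
    n_trials: int = 100


@dataclass
class ConfigSketch:
    io: IOSketch = field(default_factory=IOSketch)
    tune: TuneSketch = field(default_factory=TuneSketch)


def _has_field(node: Any, name: str) -> bool:
    """True if node is a dataclass instance that declares a field called name."""
    return is_dataclass(node) and name in {f.name for f in fields(node)}


def sketch_apply_dot_overrides(dc: Any, overrides: Optional[Dict[str, Any]]) -> Any:
    if not is_dataclass(dc) or isinstance(dc, type):
        raise TypeError("expected a dataclass instance")
    out = copy.deepcopy(dc)  # the input instance is left untouched
    for path, value in (overrides or {}).items():
        *parents, leaf = path.split(".")
        node = out
        for name in parents:
            if not _has_field(node, name):
                raise KeyError(path)  # unknown intermediate segment
            node = getattr(node, name)
        if not _has_field(node, leaf):
            raise KeyError(path)      # unknown leaf field
        setattr(node, leaf, value)
    return out


base_cfg = ConfigSketch()
new_cfg = sketch_apply_dot_overrides(
    base_cfg, {"io.prefix": "demo", "tune.n_trials": 64}
)
```

Checking every segment against the declared fields is what turns a typo such as 'io.prefx' into an immediate KeyError instead of a silently ignored setting.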