Unsupervised Imputers

Shared Arguments

This section documents the arguments shared by all unsupervised imputers, i.e., those that are not specific to a given model. They include the number of epochs, batch size, number of components to reduce the input to, and learning rate, as well as parameters for grid searches, validation, and various other settings.

class pgsui.impute.estimators.UnsupervisedImputer(genotype_data, clf, clf_type, *, prefix='imputer', gridparams=None, cv: int = 5, validation_split=0.2, column_subset=1.0, epochs=100, batch_size=32, n_components=3, early_stop_gen=25, num_hidden_layers=1, hidden_layer_sizes='midpoint', optimizer='adam', hidden_activation='elu', learning_rate=0.01, weights_initializer='glorot_normal', l1_penalty=1e-06, l2_penalty=1e-06, dropout_rate=0.2, kl_beta=1.0, sample_weights=None, gridsearch_method='gridsearch', grid_iter=80, scoring_metric='f1_weighted', population_size='auto', tournament_size=3, elitism=True, crossover_probability=0.2, mutation_probability=0.8, ga_algorithm='eaMuPlusLambda', sim_strategy='random_weighted', sim_prop_missing=0.2, disable_progressbar=False, n_jobs=-1, verbose=0, **kwargs)

Bases: Impute

Parent class for unsupervised imputers. Contains all common arguments and code between unsupervised imputers.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.

  • prefix (str) – Prefix for output directory. Defaults to “imputer”.

  • gridparams (Dict[str, Any] or None, optional) – Dictionary with keys=keyword arguments for the specified estimator and values=lists of parameter values or distributions. If gridparams=None, a grid search is not performed; otherwise gridparams specifies the parameter ranges or distributions for the grid search. If using gridsearch_method="gridsearch", the gridparams values can be lists or numpy arrays. If using gridsearch_method="randomized_gridsearch", distributions can be specified with scipy.stats.uniform(loc, scale) (a uniform distribution on [loc, loc + scale]) or scipy.stats.loguniform(low, high) (useful if the range of values spans orders of magnitude). If using the genetic algorithm grid search by setting gridsearch_method="genetic_algorithm", the parameters can be specified as sklearn_genetic.space objects. The grid search selects the parameters that maximize the scoring metric (see scoring_metric). If the search takes a long time, run it on a small subset of the data just to find the optimal parameters for the classifier, then run a full imputation using those parameters. Defaults to None (no gridsearch).

  • cv (int, optional) – Number of cross-validation folds to use with grid search. Defaults to 5.

  • validation_split (float, optional) – Proportion of training dataset to set aside for loss validation during model training. Defaults to 0.2.

  • column_subset (int or float, optional) – Number or proportion of columns to randomly subset for the grid search or validation. If a float is provided, int(n_features * column_subset) columns are used, and the value should be in the range [0, 1]. If an int is provided, column_subset columns are used. A small value can help if the grid search or validation takes a long time. Defaults to 1.0.

  • epochs (int, optional) – Number of epochs (cycles through the data) to run during training. Defaults to 100.

  • batch_size (int, optional) – Batch size to train the model with. Model training per epoch is performed over multiple subsets of samples (rows) of size batch_size. Defaults to 32.

  • n_components (int, optional) – Number of components (latent dimensions) to compress the input features to. Defaults to 3.

  • early_stop_gen (int, optional) – Only used with the genetic algorithm grid search option. Stop training early if the model sees early_stop_gen consecutive generations without improvement to the scoring metric. This can save training time by reducing the number of epochs and generations that are performed. Defaults to 25.

  • num_hidden_layers (int, optional) – Number of hidden layers to use in the model. Adjust if overfitting or underfitting occurs. Defaults to 1.

  • hidden_layer_sizes (str, List[int], List[str], or int, optional) – Number of neurons to use in the hidden layers. If string or a list of strings is passed, the strings must be either “midpoint”, “sqrt”, or “log2”. “midpoint” will calculate the midpoint as (n_features + n_components) / 2. If “sqrt” is supplied, the square root of the number of features will be used to calculate the output units. If “log2” is supplied, the units will be calculated as log2(n_features). hidden_layer_sizes will calculate and set the number of output units for each hidden layer. If multiple hidden layers are supplied, each subsequent layer’s dimensions are further reduced by the “midpoint”, “sqrt”, or “log2”. E.g., if using num_hidden_layers=3 and n_components=2, and there are 100 features (columns), the hidden layer sizes for midpoint will be: [51, 27, 14]. If a single string or integer is supplied, the model will use the same number of output units for each hidden layer. If a list of integers or strings is supplied, the model will use the values supplied in the list. The list length must be equal to the num_hidden_layers and all hidden layer sizes must be > n_components. Defaults to “midpoint”.

  • hidden_activation (str, optional) – The activation function to use for the hidden layers. See tf.keras.activations for more info. Supported activation functions include: [“elu”, “selu”, “leaky_relu”, “prelu”, “relu”]. Each activation function has advantages and disadvantages; the choice determines the non-linearity applied by the hidden layers and affects how gradients propagate during training. Some are also faster to compute than others. See https://towardsdatascience.com/7-popular-activation-functions-you-should-know-in-deep-learning-and-how-to-use-them-with-keras-and-27b4d838dfe6 for more information. Note that using hidden_activation="selu" will force weights_initializer to be “lecun_normal”. Defaults to “elu”.

  • optimizer (str, optional) – The optimizer to use with gradient descent. Supported options are: “adam”, “sgd”, and “adagrad”. See tf.keras.optimizers for more info. Defaults to “adam”.

  • learning_rate (float, optional) – The learning rate for the optimizer. Adjust if the model is learning too slowly or too quickly. If the model is overfitting, the learning rate is likely too high; likewise, underfitting can occur when it is too low. Defaults to 0.01.

  • lr_patience (int, optional) – Number of epochs without loss improvement to wait before reducing the learning rate. Defaults to 1.

  • weights_initializer (str, optional) – Initializer to use for the model weights. See tf.keras.initializers for more info. Defaults to “glorot_normal”.

  • l1_penalty (float, optional) – L1 regularization penalty to apply. Adjust if the model is over or underfitting. If this value is too high, underfitting can occur, and vice versa. Defaults to 1e-6.

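  • l2_penalty (float, optional) – L2 regularization penalty to apply. Adjust if the model is over or underfitting. If this value is too high, underfitting can occur, and vice versa. Defaults to 1e-6.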

  • dropout_rate (float, optional) – Neuron dropout rate during training. Dropout randomly disables a dropout_rate proportion of neurons during training, which can reduce overfitting. E.g., if dropout_rate is set to 0.2, then 20% of the neurons are randomly dropped out at each training step. Adjust if the model is over or underfitting. Must be a float in the range [0, 1]. Defaults to 0.2.

  • sample_weights (str, Dict[int, float], or None, optional) – Weights for the ACTG-encoded classes during training. If None, then does not weight classes. If set to “auto”, then class weights are automatically calculated for each column to balance classes. If a dictionary is passed, it must contain “A”, “C”, “G”, and “T” as the keys and the class weights as the values. E.g., {“A”: 1.0, “C”: 1.0, “G”: 1.0, “T”: 1.0}. The dictionary is then used as the overall class weights. Defaults to None (no weighting).

  • gridsearch_method (str, optional) – Grid search method to use. Supported options include: {“gridsearch”, “randomized_gridsearch”, “genetic_algorithm”}. “gridsearch” uses GridSearchCV to test every possible parameter combination. “randomized_gridsearch” picks grid_iter random combinations of parameters to test. “genetic_algorithm” uses a genetic algorithm via the sklearn-genetic-opt GASearchCV module to do the grid search. If doing a grid search, “randomized_gridsearch” takes the least amount of time because it does not have to test all parameter combinations. “genetic_algorithm” takes the longest. See the scikit-learn GridSearchCV and RandomizedSearchCV documentation for the “gridsearch” and “randomized_gridsearch” options, and the sklearn-genetic-opt GASearchCV documentation for the “genetic_algorithm” option. Defaults to “gridsearch”.

  • grid_iter (int, optional) – Number of iterations to use for randomized and genetic algorithm grid searches. For randomized grid search, grid_iter parameter combinations will be randomly sampled. For the genetic algorithm, this determines how many generations the genetic algorithm will run. Defaults to 80.

  • scoring_metric (str, optional) – Scoring metric to use for grid searches. The neural network imputers use a multimetric scorer and use different string values for the grid searches. Supported options include: {“accuracy”, “hamming”, “roc_auc_micro”, “roc_auc_macro”, “roc_auc_weighted”, “average_precision_micro”, “average_precision_macro”, “average_precision_weighted”, “f1_micro”, “f1_macro”, and “f1_weighted”}. All of the above metrics are calculated during the grid search, but the provided string just sets the metric that the grid search refits to (i.e., which one is used in the best estimator). See the scikit-learn documentation (https://scikit-learn.org/stable/modules/model_evaluation.html) for more information. Defaults to “f1_weighted”.

  • population_size (int or str, optional) – Only used for the genetic algorithm grid search. Size of the initial population to sample randomly generated individuals. If set to “auto”, then population_size is calculated as 15 * n_parameters. If set to an integer, then uses the integer value as population_size. If you need to speed up the genetic algorithm grid search, try decreasing this parameter. See GASearchCV in the sklearn-genetic-opt documentation (https://sklearn-genetic-opt.readthedocs.io) for more info. Defaults to “auto”.

  • tournament_size (int, optional) – For genetic algorithm grid search only. Number of individuals to perform tournament selection. See GASearchCV in the sklearn-genetic-opt documentation (https://sklearn-genetic-opt.readthedocs.io) for more info. Defaults to 3.

  • elitism (bool, optional) – For genetic algorithm grid search only. If set to True, carries the tournament_size best solutions over to the next generation. See GASearchCV in the sklearn-genetic-opt documentation (https://sklearn-genetic-opt.readthedocs.io) for more info. Defaults to True.

  • crossover_probability (float, optional) – For genetic algorithm grid search only. Probability of crossover operation between two individuals. See GASearchCV in the sklearn-genetic-opt documentation (https://sklearn-genetic-opt.readthedocs.io) for more info. Defaults to 0.2.

  • mutation_probability (float, optional) – For genetic algorithm grid search only. Probability of child mutation. See GASearchCV in the sklearn-genetic-opt documentation (https://sklearn-genetic-opt.readthedocs.io) for more info. Defaults to 0.8.

  • ga_algorithm (str, optional) – For genetic algorithm grid search only. Evolutionary algorithm to use. Supported options include: {“eaMuPlusLambda”, “eaMuCommaLambda”, “eaSimple”}. If you need to speed up the genetic algorithm grid search, try setting ga_algorithm to “eaSimple”, at the expense of evolutionary model robustness. See more details in the DEAP algorithms documentation (https://deap.readthedocs.io). Defaults to “eaMuPlusLambda”.

  • sim_strategy (str, optional) – Strategy to use for simulating missing data. Only used to validate the accuracy of the imputation. The final model will be trained with the non-simulated dataset. Supported options include: {“random”, “random_weighted”, “nonrandom”, “nonrandom_weighted”}. “random” randomly simulates missing data, while “random_weighted” also does this but balances selection of reference, heterozygous, and alternate alleles. When set to “nonrandom”, branches from GenotypeData.guidetree will be randomly sampled to generate missing data on descendant nodes. For “nonrandom_weighted”, missing data will be placed on nodes proportionally to their branch lengths (e.g., to generate data distributed as might be the case with mutation-disruption of RAD sites). If using the “nonrandom” or “nonrandom_weighted” options, a guide tree must have been initialized in the passed genotype_data object. Defaults to “random_weighted”.

  • sim_prop_missing (float, optional) – Proportion of missing data to use with missing data simulation. Defaults to 0.2.

  • disable_progressbar (bool, optional) – Whether to disable the tqdm progress bar. Useful if you are doing the imputation on e.g. a high-performance computing cluster, where sometimes tqdm does not work correctly when being written to a file. If False, uses tqdm progress bar. If True, does not use tqdm. Defaults to False.

  • n_jobs (int, optional) – Number of parallel jobs to use in the grid search if gridparams is not None. -1 means use all available processors. Defaults to -1 (all CPUs).

  • verbose (int, optional) – Verbosity flag. The higher, the more verbose. Possible values are 0, 1, or 2. 0 = silent, 1 = progress bar, 2 = one line per epoch. Note that the progress bar is not particularly useful when logged to a file, so verbose=0 or verbose=2 is recommended when not running interactively. Setting verbose higher than 0 is useful for initial runs and debugging, but can slow down training. Defaults to 0.

  • kwargs (Dict[str, Any], optional) – Possible options include: {“testing”: True/False}. If testing is True, a confusion matrix plot will be created showing model performance. Arrays of the true and predicted values will also be printed to STDOUT. testing defaults to False.
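As a rough illustration of the “midpoint” rule described for hidden_layer_sizes above, the sketch below reproduces the documented example of [51, 27, 14]. The function name midpoint_layer_sizes and the rounding scheme (carry the unrounded value forward and round half up) are assumptions chosen to match that example; the library's internal implementation may differ.

```python
import math


def midpoint_layer_sizes(n_features, n_components, num_hidden_layers):
    """Successive midpoints between the previous layer size and n_components.

    Hypothetical helper for illustration only; not part of PG-SUI.
    """
    sizes = []
    current = float(n_features)
    for _ in range(num_hidden_layers):
        # Midpoint between the previous (unrounded) size and the latent size.
        current = (current + n_components) / 2
        # Assumption: round half up to obtain an integer number of units.
        sizes.append(math.floor(current + 0.5))
    return sizes


sizes = midpoint_layer_sizes(n_features=100, n_components=2, num_hidden_layers=3)
print(sizes)  # [51, 27, 14]
```

Every size stays above n_components, which is the constraint the hidden_layer_sizes description places on user-supplied lists as well.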

Attributes:
  • imputed (GenotypeData) – New GenotypeData instance with imputed data.

  • best_params (Dict[str, Any]) – Best found parameters from grid search.
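The following is a minimal sketch of how a gridparams dictionary might be built for the “gridsearch” and “randomized_gridsearch” methods. The choice of learning_rate and dropout_rate as search keys and the value ranges are illustrative assumptions; only numpy and scipy.stats calls are used, and no imputer is actually run here.

```python
import numpy as np
from scipy import stats

# gridsearch_method="gridsearch": explicit lists or arrays of candidate values.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_rate": np.linspace(0.0, 0.4, 5),
}

# gridsearch_method="randomized_gridsearch": distributions to sample from.
# scipy.stats.uniform(loc, scale) covers [loc, loc + scale];
# scipy.stats.loguniform(a, b) covers [a, b] on a log scale.
randomized = {
    "learning_rate": stats.loguniform(1e-4, 1e-1),
    "dropout_rate": stats.uniform(loc=0.0, scale=0.4),
}

# An exhaustive search evaluates every combination in the grid.
n_grid_combinations = len(grid["learning_rate"]) * len(grid["dropout_rate"])

# A randomized search draws one value per parameter for each iteration.
sample = {name: dist.rvs(random_state=0) for name, dist in randomized.items()}
print(n_grid_combinations, sample)
```

With gridsearch_method="randomized_gridsearch", grid_iter controls how many such draws are evaluated, which is why it can be cheaper than the exhaustive grid.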

Unsupervised Imputer Models (Neural Networks)

Non-linear Principal Component Analysis

class pgsui.impute.estimators.ImputeNLPCA(*args, **kwargs)

Bases: ImputeUBP

Class to impute missing data using inverse non-linear principal component analysis (NLPCA) neural network models. For training, missing values are simulated and the model is trained on the simulated missing values. The real missing values are then predicted by the trained model. The strategy for simulating missing values can be set with the sim_strategy argument.

NLPCA [2] trains randomly generated, reduced-dimensionality input to predict the correct output. In the case of imputation, the model is trained only on known values, and the trained model is then used to predict the missing values.
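The core idea (randomly initialized low-dimensional inputs refined together with the weights, with the loss computed on known values only) can be sketched in a few lines of numpy. This toy uses a linear decoder in place of a neural network, plain gradient descent, and made-up continuous data, all of which are simplifying assumptions; it is not the ImputeNLPCA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 samples x 6 features, with roughly 20% of entries missing.
X = rng.normal(size=(20, 6))
known = rng.random(X.shape) > 0.2  # True where a value is observed

n_components = 2
# Randomly generated, reduced-dimensionality inputs (one row per sample).
V = rng.normal(scale=0.1, size=(X.shape[0], n_components))
# Decoder weights mapping the latent inputs back to feature space.
W = rng.normal(scale=0.1, size=(n_components, X.shape[1]))
lr = 0.01


def masked_mse(V, W):
    """Mean squared error over the known entries only."""
    err = (V @ W - X) * known
    return (err ** 2).sum() / known.sum()


loss_before = masked_mse(V, W)
for _ in range(2000):
    # Errors at missing positions are zeroed, so they contribute no gradient.
    err = (V @ W - X) * known
    grad_V = err @ W.T  # constant factors are folded into lr
    grad_W = V.T @ err
    V -= lr * grad_V  # refine the inputs
    W -= lr * grad_W  # refine the weights
loss_after = masked_mse(V, W)

# Keep observed values and fill the missing ones with model predictions.
X_filled = np.where(known, X, V @ W)
print(loss_before, loss_after)
```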

Parameters:

genotype_data (GenotypeData object) – Input data initialized as GenotypeData object. Required positional argument.

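Example

>>> from pgsui.impute.estimators import ImputeNLPCA
>>>
>>> # GenotypeData is assumed to be imported already.
>>> data = GenotypeData(
...     filename="test.str",
...     filetype="auto",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree",
... )
>>>
>>> nlpca = ImputeNLPCA(
...     genotype_data=data,
...     learning_rate=0.001,
...     epochs=200,
... )
>>>
>>> # Get the imputed data.
>>> nlpca_gtdata = nlpca.imputed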

Unsupervised Backpropagation

class pgsui.impute.estimators.ImputeUBP(genotype_data, **kwargs)

Bases: UnsupervisedImputer

Class to impute missing data using an unsupervised backpropagation (UBP) neural network model. For training, missing values are simulated and the model is trained on the simulated missing values. The real missing values are then predicted by the trained model. The strategy for simulating missing values can be set with the sim_strategy argument.

UBP [1] is an extension of NLPCA with the input being randomly generated and of reduced dimensionality that gets trained to predict the supplied output based on only known values. It then uses the trained model to predict missing values. However, in contrast to NLPCA, UBP trains the model over three phases. The first is a single layer perceptron used to refine the randomly generated input. The second phase is a multi-layer perceptron that uses the refined reduced-dimension data from the first phase as input. In the second phase, the model weights are refined but not the input. In the third phase, the model weights and the inputs are then refined.

Parameters:

genotype_data (GenotypeData object) – Input data initialized as GenotypeData object. Required positional argument.

Example

>>> from pgsui.impute.estimators import ImputeUBP
>>>
>>> # GenotypeData is assumed to be imported already.
>>> data = GenotypeData(
...     filename="test.str",
...     filetype="auto",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree",
... )
>>>
>>> ubp = ImputeUBP(
...     genotype_data=data,
...     learning_rate=0.001,
...     n_components=5,
... )
>>>
>>> # Get the imputed data.
>>> ubp_gtdata = ubp.imputed

fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]

Fit and predict imputations with IterativeImputer(estimator).

Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams!=None, then a RandomizedSearchCV is performed on a subset of the data and a final imputation is done on the whole dataset using the best found parameters.

Parameters:

X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.

Returns:

  • GenotypeData object with missing genotypes imputed.

  • Dict[str, Any]: Best parameters found during grid search.

Return type:

GenotypeData

Standard AutoEncoder

class pgsui.impute.estimators.ImputeStandardAutoEncoder(genotype_data, **kwargs)

Bases: UnsupervisedImputer

Class to impute missing data using a standard Autoencoder (SAE) neural network model. For training, missing values are simulated and the model is trained on the simulated missing values. The real missing values are then predicted by the trained model. The strategy for simulating missing values can be set with the sim_strategy argument.
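To make the simulation step concrete, here is a minimal sketch of the “random” strategy: a proportion sim_prop_missing of the known genotypes is masked, so that predictions can later be scored against values whose truth is known. The -9 missing-value code, the toy 012-encoded matrix, and the choice to take the proportion over known calls only are assumptions for illustration, not details confirmed by this section.

```python
import numpy as np

rng = np.random.default_rng(42)

MISSING = -9  # assumed code for a missing genotype call

# Toy 012-encoded genotypes (10 samples x 8 loci) with some real missing data.
X = rng.integers(0, 3, size=(10, 8))
X[rng.random(X.shape) < 0.1] = MISSING

sim_prop_missing = 0.2

# Only currently known calls are eligible to be masked.
known_idx = np.flatnonzero(X != MISSING)
n_sim = int(round(sim_prop_missing * known_idx.size))
sim_idx = rng.choice(known_idx, size=n_sim, replace=False)

# Mask the selected calls in a copy; the original keeps the true values.
X_sim = X.copy()
X_sim.flat[sim_idx] = MISSING

# After imputing X_sim, accuracy would be scored at sim_idx against `truth`.
truth = X.flat[sim_idx]
print(n_sim, truth[:5])
```

The weighted and tree-based strategies differ in how sim_idx is chosen, not in how the masked positions are later scored.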

Parameters:

genotype_data (GenotypeData object) – Input data initialized as GenotypeData object. Required positional argument.

Example

>>> from pgsui.impute.estimators import ImputeStandardAutoEncoder
>>>
>>> # GenotypeData is assumed to be imported already.
>>> data = GenotypeData(
...     filename="test.str",
...     filetype="auto",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree",
... )
>>>
>>> sae = ImputeStandardAutoEncoder(
...     genotype_data=data,
...     learning_rate=0.001,
...     n_components=5,
...     epochs=200,
... )
>>>
>>> # Get the imputed data.
>>> sae_gtdata = sae.imputed

fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]

Fit and predict imputations with IterativeImputer(estimator).

Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams!=None, then a RandomizedSearchCV is performed on a subset of the data and a final imputation is done on the whole dataset using the best found parameters.

Parameters:

X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.

Returns:

  • GenotypeData object with missing genotypes imputed.

  • Dict[str, Any]: Best parameters found during grid search.

Return type:

GenotypeData

Variational AutoEncoder

class pgsui.impute.estimators.ImputeVAE(genotype_data, kl_beta=1.0, **kwargs)

Bases: UnsupervisedImputer

Class to impute missing data using a Variational Autoencoder neural network model. For training, missing values are simulated and the model is trained on the simulated missing values. The real missing values are then predicted by the trained model. The strategy for simulating missing values can be set with the sim_strategy argument.

Parameters:
  • genotype_data (GenotypeData object) – Input data initialized as GenotypeData object. Required positional argument.

  • kl_beta (float, optional) – Weight to apply to the Kullback-Leibler (KL) divergence loss. If the latent distribution is not learned well, this weight can be adjusted to change how much the KL divergence affects the total loss. Should be in the range [0, 1]. If set to 1.0, the KL loss is unweighted. If set to 0.0, the KL loss is disabled entirely and does not affect the total loss. Defaults to 1.0.
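The effect of kl_beta can be illustrated with the usual VAE objective, in which the total loss is the reconstruction loss plus a weighted KL term. The sketch below assumes a diagonal Gaussian posterior against a standard normal prior and made-up values for the reconstruction loss and latent statistics; the function names are hypothetical and the exact form of the loss in ImputeVAE is not stated in this section.

```python
import math


def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def total_loss(reconstruction_loss, mu, log_var, kl_beta=1.0):
    """Reconstruction loss plus the KL term weighted by kl_beta."""
    return reconstruction_loss + kl_beta * kl_divergence(mu, log_var)


# Made-up values for a 3-dimensional latent space.
mu = [0.5, -0.2, 0.0]
log_var = [0.0, 0.1, -0.3]
recon = 0.8

unweighted = total_loss(recon, mu, log_var, kl_beta=1.0)  # full KL contribution
half = total_loss(recon, mu, log_var, kl_beta=0.5)        # KL down-weighted
disabled = total_loss(recon, mu, log_var, kl_beta=0.0)    # reconstruction only
print(unweighted, half, disabled)
```

Lower values of kl_beta let the model prioritize reconstruction accuracy at the cost of a less regularized latent space.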

Example

>>> from pgsui.impute.estimators import ImputeVAE
>>>
>>> # GenotypeData is assumed to be imported already.
>>> data = GenotypeData(
...     filename="test.str",
...     filetype="auto",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree",
... )
>>>
>>> vae = ImputeVAE(
...     genotype_data=data,
...     learning_rate=0.001,
...     epochs=200,
... )
>>>
>>> # Get the imputed data.
>>> vae_gtdata = vae.imputed

fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]

Fit and predict imputations with IterativeImputer(estimator).

Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams!=None, then a RandomizedSearchCV is performed on a subset of the data and a final imputation is done on the whole dataset using the best found parameters.

Parameters:

X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.

Returns:

  • GenotypeData object with missing genotypes imputed.

  • Dict[str, Any]: Best parameters found during grid search.

Return type:

GenotypeData