Supervised Imputers

Shared Arguments

This section documents the arguments shared by all supervised imputers, i.e., those that are not specific to a given model. These include the number of nearest features to use for the IterativeImputer, parameters for grid searches and validation, and other settings such as the number of CPUs to use.

class pgsui.impute.estimators.SupervisedImputer(genotype_data, clf, clf_type, *, prefix: str = 'imputer', gridparams: Dict[str, Any] | None = None, do_validation: bool = False, column_subset: int | float = 0.1, cv: int = 5, max_iter: int = 10, tol: float = 0.001, n_nearest_features: int | None = 10, initial_strategy: str = 'most_frequent', str_encodings: Dict[str, int] = {'A': 1, 'C': 2, 'G': 3, 'N': -9, 'T': 4}, imputation_order: str = 'ascending', skip_complete: bool = False, random_state: int | None = None, gridsearch_method: str = 'gridsearch', grid_iter: int = 80, population_size: int | str = 'auto', tournament_size: int = 3, elitism: bool = True, crossover_probability: float = 0.2, mutation_probability: float = 0.8, ga_algorithm: str = 'eaMuPlusLambda', early_stop_gen: int = 5, scoring_metric: str = 'f1_weighted', chunk_size: int | float = 1.0, disable_progressbar: bool = False, progress_update_percent: int | None = None, n_jobs: int = -1, verbose: int = 0, **kwargs)[source]

Bases: Impute

Parent class for the supervised imputers. Contains all common arguments and code between supervised imputers.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.

  • prefix (str) – Prefix for imputed data’s output directory.

  • gridparams (Dict[str, Any] or None, optional) – Dictionary with keys=keyword arguments for the specified estimator and values=lists of parameter values or distributions. If gridparams=None, a grid search is not performed; otherwise gridparams specifies the parameter ranges or distributions for the grid search. If using gridsearch_method="gridsearch", the gridparams values can be lists or numpy arrays. If using gridsearch_method="randomized_gridsearch", distributions can be specified with scipy.stats.uniform(loc, scale) (a uniform distribution on [loc, loc + scale]) or scipy.stats.loguniform(low, high) (useful if the range of values spans orders of magnitude). If using the genetic algorithm grid search by setting gridsearch_method="genetic_algorithm", the parameters can be specified as sklearn_genetic.space objects. The grid search selects the parameters that maximize scoring_metric. NOTE: A grid search takes a long time, so you can run it on a small subset of the data using the column_subset argument to find the optimal parameters for the classifier; a full imputation is then run automatically using those parameters. Defaults to None (no grid search).

  • do_validation (bool, optional) – Whether to validate the imputation if not doing a grid search. This validation method randomly replaces between 15% and 50% of the known, non-missing genotypes in n_features * column_subset of the features. It then imputes the newly missing genotypes for which we know the true values and calculates validation scores. This procedure is replicated cv times and a mean, median, minimum, maximum, lower 95% confidence interval (CI) of the mean, and the upper 95% CI are calculated and saved to a CSV file. gridparams must be set to None for do_validation to work. Calculating a validation score can be turned off altogether by setting do_validation to False. Defaults to False.

  • column_subset (int or float, optional) – Number or proportion of columns to randomly subset for the grid search or validation. If float, it is a proportion between 0 and 1, and int(n_features * column_subset) columns are used. If int, column_subset columns are used. Keep the subset small, because the grid search or validation takes a long time. Defaults to 0.1.

  • cv (int, optional) – Number of folds for cross-validation during grid search. Defaults to 5.

  • max_iter (int, optional) – Maximum number of imputation rounds to perform before returning the imputations computed during the final round. A round is a single imputation of each feature with missing values. Defaults to 10.

  • tol (float, optional) – Tolerance of the stopping condition for the iterations. Defaults to 1e-3.

  • n_nearest_features (int or None, optional) – Number of other features to use to estimate the missing values of each feature column. If None, then all features will be used, but this can consume an intractable amount of computing resources. Nearness between features is measured using the absolute correlation coefficient between each feature pair (after initial imputation). To ensure coverage of features throughout the imputation process, the neighbor features are not necessarily nearest, but are drawn with probability proportional to correlation for each imputed target feature. Reducing this can provide significant speed-up when the number of features is large. Defaults to 10.

  • initial_strategy (str, optional) – Which strategy to use for initializing the missing values in the training data (neighbor columns). IterativeImputer must initially impute the training data (neighbor columns) using a simple, quick imputation in order to predict the missing values for each target column. The initial_strategy argument specifies which method to use for this initial imputation. Valid options include: “most_frequent”, “populations”, “phylogeny”, or “mf”. “most_frequent” uses the overall mode of each column. “populations” uses the mode per population per column via a population map file and the ImputeAlleleFreq class. “phylogeny” uses an input phylogenetic tree and a rate matrix with the ImputePhylo class. “mf” performs the imputation via matrix factorization with the ImputeMF class. Note that the “mean” and “median” options from the original IterativeImputer are not supported because they are not sensible settings for the type of input data used here. Defaults to “most_frequent”.

  • str_encodings (Dict[str, int], optional) – Integer encodings for nucleotides if input file was in STRUCTURE format. Only used if initial_strategy="phylogeny". Defaults to {“A”: 1, “C”: 2, “G”: 3, “T”: 4, “N”: -9}.

  • imputation_order (str, optional) – The order in which the features will be imputed. Possible values: “ascending” (from features with fewest missing values to most), “descending” (from features with most missing values to fewest), “roman” (left to right), “arabic” (right to left), “random” (a random order for each round). Defaults to “ascending”.

  • skip_complete (bool, optional) – If True, then features with missing values during transform that did not have any missing values during fit will be imputed with the initial imputation method only. Set to True if you have many features with no missing values at both fit and transform time to save compute time. Defaults to False.

  • random_state (int or None, optional) – The seed of the pseudo random number generator to use for the iterative imputer. Randomizes selection of estimator features if n_nearest_features is not None or the imputation_order is “random”. Use an integer for determinism. If None, then uses a different random seed each time. Defaults to None.

  • gridsearch_method (str, optional) – Grid search method to use. Supported options include: {“gridsearch”, “randomized_gridsearch”, and “genetic_algorithm”}. “gridsearch” uses GridSearchCV to test every possible parameter combination. “randomized_gridsearch” picks grid_iter random combinations of parameters to test. “genetic_algorithm” uses a genetic algorithm via sklearn-genetic-opt GASearchCV to do the grid search. If doing a grid search, “randomized_gridsearch” takes the least amount of time because it does not have to test all parameter combinations. “genetic_algorithm” takes the longest. See the scikit-learn GridSearchCV and RandomizedSearchCV documentation for the “gridsearch” and “randomized_gridsearch” options, and the sklearn-genetic-opt GASearchCV documentation (https://sklearn-genetic-opt.readthedocs.io) for the “genetic_algorithm” option. Defaults to “gridsearch”.

  • grid_iter (int, optional) – Number of iterations for randomized and genetic algorithm grid searches. Defaults to 80.

  • population_size (int or str, optional) – For genetic algorithm grid search: Size of the initial population to sample randomly generated individuals. If set to “auto”, then population_size is calculated as 15 * n_parameters. If set to an integer, then uses the integer value as population_size. If you need to speed up the genetic algorithm grid search, try decreasing this parameter. See GASearchCV in the sklearn-genetic-opt documentation (https://sklearn-genetic-opt.readthedocs.io). Defaults to “auto”.

  • tournament_size (int, optional) – For genetic algorithm grid search: Number of individuals to perform tournament selection. See GASearchCV documentation. Defaults to 3.

  • elitism (bool, optional) – For genetic algorithm grid search: If True, carries the tournament_size best solutions over to the next generation. See GASearchCV documentation. Defaults to True.

  • crossover_probability (float, optional) – For genetic algorithm grid search: Probability of crossover operation between two individuals. See GASearchCV documentation. Defaults to 0.2.

  • mutation_probability (float, optional) – For genetic algorithm grid search: Probability of child mutation. See GASearchCV documentation. Defaults to 0.8.

  • ga_algorithm (str, optional) – For genetic algorithm grid search: Evolutionary algorithm to use. Supported options include: {“eaMuPlusLambda”, “eaMuCommaLambda”, “eaSimple”}. If you need to speed up the genetic algorithm grid search, try setting ga_algorithm to “eaSimple”, at the expense of evolutionary model robustness. See more details in the DEAP algorithms documentation (https://deap.readthedocs.io). Defaults to “eaMuPlusLambda”.

  • early_stop_gen (int, optional) – If the genetic algorithm sees early_stop_gen consecutive generations without improvement in the scoring metric, an early stopping callback is implemented. This saves time by reducing the number of generations the genetic algorithm has to perform. Defaults to 5.

  • scoring_metric (str, optional) – Scoring metric to use for grid searches. See the classification metrics in the scikit-learn documentation (https://scikit-learn.org/stable/modules/model_evaluation.html) for supported options. Defaults to “f1_weighted”.

  • chunk_size (int or float, optional) – Number of loci for which to perform IterativeImputer at one time. Useful for reducing the memory usage if you are running out of RAM. If an integer is specified, selects chunk_size loci at a time. If a float is specified, selects math.ceil(total_loci * chunk_size) loci at a time. Defaults to 1.0 (all features).

  • disable_progressbar (bool, optional) – Whether or not to disable the tqdm progress bar when doing the imputation. If True, progress bar is disabled, which is useful when running the imputation on e.g. an HPC cluster. If the bar is disabled, a status update will be printed to standard output for each iteration and feature instead. If False, the tqdm progress bar will be used. Defaults to False.

  • progress_update_percent (int or None, optional) – Print status updates for features every progress_update_percent%. IterativeImputer iterations will always be printed, but progress_update_percent involves iteration progress through the features of each IterativeImputer iteration. If None, then does not print progress through features. Defaults to None.

  • n_jobs (int, optional) – Number of parallel jobs to use. If gridparams is not None, n_jobs is used for the grid search. Otherwise it is used for the classifier. -1 means using all available processors. Defaults to -1 (all CPUs).

  • verbose (int, optional) – Verbosity flag; controls the debug messages that are issued as functions are evaluated. The higher, the more verbose. Possible values are 0, 1, or 2. Defaults to 0.
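The gridparams formats described above can be sketched as follows for two XGBoost arguments. This is only an illustration: the ranges are arbitrary choices, not recommended values, and the dictionaries are built without being passed to an imputer. Note that scipy.stats.uniform takes (loc, scale) rather than (low, high).

```python
import numpy as np
from scipy.stats import loguniform, uniform

# gridsearch_method="gridsearch": every combination of the listed values is tried.
grid_space = {
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": np.linspace(0.5, 1.0, 3),
}

# gridsearch_method="randomized_gridsearch": values are sampled from distributions.
# scipy.stats.uniform(loc, scale) spans [loc, loc + scale], so this is [0.5, 1.0].
# scipy.stats.loguniform(a, b) spans [a, b] and samples evenly on a log scale.
random_space = {
    "learning_rate": loguniform(1e-3, 1e-1),
    "subsample": uniform(loc=0.5, scale=0.5),
}

# Number of parameter combinations an exhaustive search has to evaluate.
n_combinations = int(np.prod([len(values) for values in grid_space.values()]))

# Draw a few candidate values the way a randomized search would.
draws = {
    name: dist.rvs(size=5, random_state=0) for name, dist in random_space.items()
}
```

With an exhaustive search the cost grows multiplicatively with each added parameter, which is why the randomized search with a fixed grid_iter is usually the cheaper option.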

imputed

New GenotypeData instance with imputed data.

Type:

GenotypeData

best_params

Best found parameters from grid search.

Type:

Dict[str, Any]
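The validation procedure described under do_validation can be sketched roughly as below. This is not the library's implementation: the function names are hypothetical, the missing-value marker of -9 is an assumption, and the 95% confidence interval uses a normal approximation that may differ from what the package computes.

```python
import numpy as np


def mask_known_genotypes(X, columns, rng, low=0.15, high=0.5, missing=-9):
    """Hide a random 15-50% of the known genotypes in the given columns."""
    X_masked = X.copy()
    hidden = np.zeros(X.shape, dtype=bool)
    for col in columns:
        known = np.flatnonzero(X[:, col] != missing)
        proportion = rng.uniform(low, high)
        n_hide = int(round(proportion * known.size))
        rows = rng.choice(known, size=n_hide, replace=False)
        hidden[rows, col] = True
    X_masked[hidden] = missing
    return X_masked, hidden


def summarize(scores):
    """Summary statistics over the replicate validation scores."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    # Normal-approximation 95% CI of the mean (an assumption of this sketch).
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(scores.size)
    return {
        "mean": mean,
        "median": float(np.median(scores)),
        "min": scores.min(),
        "max": scores.max(),
        "lower_95_ci": mean - half_width,
        "upper_95_ci": mean + half_width,
    }


rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(20, 6))  # toy 012-encoded genotypes, none missing
X_masked, hidden = mask_known_genotypes(X, columns=[0, 3], rng=rng)
```

The hidden mask records which true genotypes were withheld, so imputed values at those positions can be scored against X; repeating this cv times yields the scores passed to summarize.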

Supervised Imputer Models

K Nearest Neighbors

class pgsui.impute.estimators.ImputeKNN(genotype_data: Any, *, n_neighbors: int = 5, weights: str = 'distance', algorithm: str = 'auto', leaf_size: int = 30, p: int = 2, metric: str = 'minkowski', **kwargs)[source]

Bases: SupervisedImputer

Does K-Nearest Neighbors Iterative Imputation of missing data. Iterative imputation uses the n_nearest_features to inform the imputation at each feature (i.e., SNP site), using the N most correlated features per site. The N most correlated features are drawn with probability proportional to correlation for each imputed target feature to ensure coverage of features throughout the imputation process.
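The sampling of correlated features described above can be sketched as follows. This is an illustrative approximation with a hypothetical function name, not the package's code; it assumes the data have already been through the initial imputation so that correlations can be computed.

```python
import numpy as np


def sample_neighbor_features(X_filled, target, n_nearest_features, rng):
    """Pick predictor columns for `target`, weighted by absolute correlation."""
    n_features = X_filled.shape[1]
    abs_corr = np.abs(np.corrcoef(X_filled, rowvar=False))
    abs_corr[np.isnan(abs_corr)] = 0.0  # constant columns give NaN correlations

    weights = abs_corr[:, target].copy()
    weights[target] = 0.0  # the target never predicts itself
    if weights.sum() == 0.0:
        # No usable correlations: fall back to equal weights for the other columns.
        weights = np.ones(n_features)
        weights[target] = 0.0

    p = weights / weights.sum()
    # Sampling without replacement cannot return more columns than have p > 0.
    k = min(n_nearest_features, int(np.count_nonzero(p)))
    return rng.choice(np.arange(n_features), size=k, replace=False, p=p)


rng = np.random.default_rng(1)
X_filled = rng.integers(0, 3, size=(30, 8)).astype(float)
neighbors = sample_neighbor_features(X_filled, target=2, n_nearest_features=3, rng=rng)
```

Because the draw is random rather than a strict top-N ranking, weakly correlated features still get selected occasionally, which is the coverage property the description refers to.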

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.

  • n_neighbors (int, optional) – Number of neighbors to use for K-Nearest Neighbors queries. Defaults to 5.

  • weights (str, optional) – Weight function used in prediction. Possible values: ‘uniform’: Uniform weights with all points in each neighborhood weighted equally; ‘distance’: Weight points by the inverse of their distance, in which case closer neighbors of a query point will have a greater influence than neighbors that are further away; ‘callable’: A user-defined function that accepts an array of distances and returns an array of the same shape containing the weights. Defaults to “distance”.

  • algorithm (str, optional) – Algorithm used to compute the nearest neighbors. Possible values: ‘ball_tree’, ‘kd_tree’, ‘brute’, ‘auto’. Defaults to “auto”.

  • leaf_size (int, optional) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. Defaults to 30.

  • p (int, optional) – Power parameter for the Minkowski metric. When p=1, this is equivalent to using manhattan_distance (l1), and if p=2 it is equivalent to using euclidean distance (l2). For arbitrary p, minkowski_distance (l_p) is used. Defaults to 2.

  • metric (str, optional) – The distance metric to use for the tree. The default metric is minkowski, and with p=2 this is equivalent to the standard Euclidean metric. See the documentation of sklearn.DistanceMetric for a list of available metrics. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square during fit. Defaults to “minkowski”.
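To make the p and weights arguments above concrete, here is a small self-contained sketch of the Minkowski distance and of a weighted neighbor vote. The function names are hypothetical, and the zero-distance handling is a simplification chosen for this example.

```python
import numpy as np


def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan (l1), p=2 is Euclidean (l2)."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))


def knn_vote(distances, labels, weights="distance"):
    """Return the winning label among the k nearest neighbors."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    if weights == "uniform":
        w = np.ones_like(distances)
    elif np.any(distances == 0):
        # Exact matches would divide by zero; let them decide the vote.
        w = (distances == 0).astype(float)
    else:
        w = 1.0 / distances
    classes = np.unique(labels)
    totals = np.array([w[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]


# One close neighbor with genotype 2, two farther neighbors with genotype 0.
uniform_call = knn_vote([0.5, 2.0, 2.0], [2, 0, 0], weights="uniform")
distance_call = knn_vote([0.5, 2.0, 2.0], [2, 0, 0], weights="distance")
```

With uniform weights the two farther neighbors outvote the close one; with distance weights the close neighbor carries a weight of 2.0 against a combined 1.0 and wins.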

Example

>>> # GenotypeData is assumed to be imported already.
>>> from pgsui.impute.estimators import ImputeKNN
>>> from sklearn_genetic.space import Integer
>>>
>>> data = GenotypeData(
...     filename="test.str",
...     filetype="auto",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree"
... )
>>>
>>> # Genetic Algorithm grid_params
>>> grid_params = {
...     "n_neighbors": Integer(3, 10),
...     "leaf_size": Integer(10, 50),
... }
>>>
>>> knn = ImputeKNN(
...     genotype_data=data,
...     gridparams=grid_params,
...     cv=5,
...     gridsearch_method="genetic_algorithm",
...     n_nearest_features=10,
...     initial_strategy="phylogeny",
... )
>>>
>>> # New GenotypeData instance holding the imputed genotypes
>>> knn_gtdata = knn.imputed

fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]

Fit and predict imputations with IterativeImputer(estimator).

Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams is not None, then a grid search of the type set by gridsearch_method is performed on a subset of the data, and a final imputation is done on the whole dataset using the best found parameters.
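The size of the data subset used for the search follows the shared column_subset argument. A minimal sketch of how an int or float value might resolve to a set of columns is shown below; the function names and the clamping to the valid range are assumptions of this example.

```python
import numpy as np


def resolve_column_subset(n_features, column_subset):
    """Turn column_subset (int count or float proportion) into a column count."""
    if isinstance(column_subset, float):
        if not 0.0 < column_subset <= 1.0:
            raise ValueError("A float column_subset must be in (0, 1].")
        n_cols = int(n_features * column_subset)
    else:
        n_cols = int(column_subset)
    # Clamp so at least one and at most all columns are used (assumed behavior).
    return max(1, min(n_cols, n_features))


def pick_columns(n_features, column_subset, rng):
    """Randomly choose the columns used for the grid search or validation."""
    n_cols = resolve_column_subset(n_features, column_subset)
    return np.sort(rng.choice(n_features, size=n_cols, replace=False))


columns = pick_columns(200, 0.1, np.random.default_rng(0))
```

With 200 loci and the default of 0.1, the search runs on 20 randomly chosen columns before the full dataset is imputed.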

Parameters:

X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.

Returns:

Imputed 012-encoded genotypes, and a Dict[str, Any] with the best parameters found during grid search.

Return type:

Tuple[DataFrame, Dict[str, Any]]

Random Forest (and Extra Trees)

class pgsui.impute.estimators.ImputeRandomForest(genotype_data: Any, *, extratrees: bool = True, n_estimators: int = 100, criterion: str = 'gini', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_features: str | int | float | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = False, oob_score: bool = False, max_samples: int | float | None = None, **kwargs)[source]

Bases: SupervisedImputer

Does Random Forest or Extra Trees Iterative imputation of missing data. Iterative imputation uses the n_nearest_features to inform the imputation at each feature (i.e., SNP site), using the N most correlated features per site. The N most correlated features are drawn with probability proportional to correlation for each imputed target feature to ensure coverage of features throughout the imputation process.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.

  • extratrees (bool, optional) – Whether to use ExtraTreesClassifier (if True) instead of RandomForestClassifier (if False). ExtraTreesClassifier is faster, but is not supported by the scikit-learn-intelex patch, whereas RandomForestClassifier is. If using an Intel CPU, the optimizations provided by the scikit-learn-intelex patch might make setting extratrees=False worthwhile. If you are not using an Intel CPU, the scikit-learn-intelex library is not supported and ExtraTreesClassifier will be faster with similar performance. NOTE: If using scikit-learn-intelex, criterion must be set to “gini” and oob_score to False, as those parameters are not currently supported herein. Defaults to True.

  • n_estimators (int, optional) – The number of trees in the forest. Increasing this value can improve the fit, but at the cost of compute time and resources. Defaults to 100.

  • criterion (str, optional) – The function to measure the quality of a split. Supported values are “gini” for the Gini impurity and “entropy” for the information gain. Defaults to “gini”.

  • max_depth (int or None, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. Defaults to None.

  • min_samples_split (int or float, optional) – The minimum number of samples required to split an internal node. If value is an integer, then considers min_samples_split as the minimum number. If value is a floating point, then min_samples_split is a fraction and (min_samples_split * n_samples), rounded up to the nearest integer, are the minimum number of samples for each split. Defaults to 2.

  • min_samples_leaf (int or float, optional) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If value is an integer, then min_samples_leaf is the minimum number. If value is floating point, then min_samples_leaf is a fraction and int(min_samples_leaf * n_samples) is the minimum number of samples for each node. Defaults to 1.

  • min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.

  • max_features (str, int, float, or None, optional) – The number of features to consider when looking for the best split. If int, then consider “max_features” features at each split. If float, then “max_features” is a fraction and int(max_features * n_features) features are considered at each split. If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features. Defaults to “sqrt”.

  • max_leaf_nodes (int or None, optional) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. Defaults to None.

  • min_impurity_decrease (float, optional) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value. See sklearn.ensemble.ExtraTreesClassifier documentation for more information. Defaults to 0.0.

  • bootstrap (bool, optional) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. Defaults to False.

  • oob_score (bool, optional) – Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True. Defaults to False.

  • max_samples (int, float, or None, optional) – If bootstrap is True, the number of samples to draw from X to train each base estimator. If None (default), then draws X.shape[0] samples. If int, then draws max_samples samples. If float, then draws int(max_samples * X.shape[0]) samples, with max_samples in the interval (0, 1). Defaults to None.
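As a rough sketch of what the extratrees flag selects, the snippet below builds either scikit-learn estimator from the same keyword arguments and fits it on toy 012-encoded data. The make_forest helper is hypothetical; how the imputer wires the estimator into IterativeImputer internally is not shown here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier


def make_forest(extratrees=True, **params):
    """Return ExtraTreesClassifier if extratrees is True, else RandomForestClassifier."""
    estimator_class = ExtraTreesClassifier if extratrees else RandomForestClassifier
    return estimator_class(**params)


# Both classifiers accept the same tree-related arguments documented above.
params = dict(
    n_estimators=10,  # kept small here so the example runs quickly
    criterion="gini",
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=False,
    oob_score=False,
    random_state=0,
)

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(40, 5))  # neighbor features (toy 012 genotypes)
y = rng.integers(0, 3, size=40)       # target feature to be imputed

et = make_forest(extratrees=True, **params).fit(X, y)
rf = make_forest(extratrees=False, **params).fit(X, y)
predictions = et.predict(X[:3])
```

Since the two classes share their constructor arguments, switching between them is a one-flag change, which is presumably why a single imputer class covers both.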

Example

>>> # GenotypeData is assumed to be imported already.
>>> from pgsui.impute.estimators import ImputeRandomForest
>>> from sklearn_genetic.space import Integer
>>>
>>> data = GenotypeData(
...     filename="test.str",
...     filetype="auto",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree"
... )
>>>
>>> # Genetic Algorithm grid_params
>>> grid_params = {
...     "min_samples_leaf": Integer(1, 10),
...     "max_depth": Integer(2, 110),
... }
>>>
>>> rf = ImputeRandomForest(
...     genotype_data=data,
...     gridparams=grid_params,
...     cv=5,
...     gridsearch_method="genetic_algorithm",
...     n_nearest_features=10,
...     n_estimators=100,
...     initial_strategy="phylogeny",
... )
>>>
>>> # New GenotypeData instance holding the imputed genotypes
>>> rf_gtdata = rf.imputed

fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]

Fit and predict imputations with IterativeImputer(estimator).

Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams is not None, then a grid search of the type set by gridsearch_method is performed on a subset of the data, and a final imputation is done on the whole dataset using the best found parameters.
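For the whole-dataset imputation, the shared chunk_size argument limits how many loci are handled at once. The sketch below shows one way the int or float value could translate into chunks; the helper names are hypothetical, and the use of contiguous chunks is an assumption of this example.

```python
import math


def resolve_chunk_size(total_loci, chunk_size):
    """Turn chunk_size (int count or float proportion) into loci per chunk."""
    if isinstance(chunk_size, float):
        n_loci = math.ceil(total_loci * chunk_size)
    else:
        n_loci = int(chunk_size)
    # Keep the result within [1, total_loci] (assumed behavior).
    return max(1, min(n_loci, total_loci))


def iter_chunks(total_loci, chunk_size):
    """Yield consecutive ranges of locus indices, one range per chunk."""
    step = resolve_chunk_size(total_loci, chunk_size)
    for start in range(0, total_loci, step):
        yield range(start, min(start + step, total_loci))


chunks = list(iter_chunks(total_loci=10, chunk_size=4))
```

Smaller chunks lower peak memory use, but each chunk can only draw its neighbor features from the loci it contains.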

Parameters:

X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.

Returns:

Imputed 012-encoded genotypes, and a Dict[str, Any] with the best parameters found during grid search.

Return type:

Tuple[DataFrame, Dict[str, Any]]

XGBoost

class pgsui.impute.estimators.ImputeXGBoost(genotype_data: Any, *, n_estimators: int = 100, max_depth: int = 3, learning_rate: float = 0.1, booster: str = 'gbtree', gamma: float = 0.0, min_child_weight: float = 1.0, max_delta_step: float = 0.0, subsample: float = 1.0, colsample_bytree: float = 1.0, reg_lambda: float = 1.0, reg_alpha: float = 0.0, **kwargs)[source]

Bases: SupervisedImputer

Does XGBoost (Extreme Gradient Boosting) Iterative imputation of missing data. Iterative imputation uses the n_nearest_features to inform the imputation at each feature (i.e., SNP site), using the N most correlated features per site. The N most correlated features are drawn with probability proportional to correlation for each imputed target feature to ensure coverage of features throughout the imputation process.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.

  • n_estimators (int, optional) – The number of boosting rounds. Increasing this value can improve the fit, but at the cost of compute time and RAM usage. Defaults to 100.

  • max_depth (int, optional) – Maximum tree depth for base learners. Defaults to 3.

  • learning_rate (float, optional) – Boosting learning rate (eta). Basically, it serves as a weighting factor for correcting new trees when they are added to the model. Typical values are between 0.1 and 0.3. Lower learning rates generally find the best optimum at the cost of requiring far more compute time and resources. Defaults to 0.1.

  • booster (str, optional) – Specify which booster to use. Possible values include “gbtree”, “gblinear”, and “dart”. Defaults to “gbtree”.

  • gamma (float, optional) – Minimum loss reduction required to make a further partition on a leaf node of the tree. Defaults to 0.0.

  • min_child_weight (float, optional) – Minimum sum of instance weight(hessian) needed in a child. Defaults to 1.0.

  • max_delta_step (float, optional) – Maximum delta step we allow each tree’s weight estimation to be. Defaults to 0.0.

  • subsample (float, optional) – Subsample ratio of the training instance. Defaults to 1.0.

  • colsample_bytree (float, optional) – Subsample ratio of columns when constructing each tree. Defaults to 1.0.

  • reg_lambda (float, optional) – L2 regularization term on weights (xgb’s lambda parameter). Defaults to 1.0.

  • reg_alpha (float, optional) – L1 regularization term on weights (xgb’s alpha parameter). Defaults to 0.0.
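The interplay between learning_rate and n_estimators described above can be illustrated with a deliberately simplified toy. This is not XGBoost's algorithm: each "learner" here just predicts the current residual of a single target value, and its contribution is shrunk by the learning rate before being added.

```python
def boost_constant(target, learning_rate, n_estimators):
    """Toy boosting on one value: each round corrects a shrunken share of the residual."""
    prediction = 0.0
    history = []
    for _ in range(n_estimators):
        residual = target - prediction          # what the next learner tries to fit
        prediction += learning_rate * residual  # shrink its contribution by eta
        history.append(prediction)
    return history


slow = boost_constant(target=1.0, learning_rate=0.1, n_estimators=10)
fast = boost_constant(target=1.0, learning_rate=0.3, n_estimators=10)

# After m rounds the remaining error is (1 - learning_rate) ** m.
remaining_slow = 1.0 - slow[-1]
remaining_fast = 1.0 - fast[-1]
```

After the same ten rounds, the lower learning rate leaves a much larger remaining error, so it needs more boosting rounds (a larger n_estimators) to reach a comparable fit.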

Example

>>> # GenotypeData is assumed to be imported already.
>>> from pgsui.impute.estimators import ImputeXGBoost
>>> from sklearn_genetic.space import Continuous, Integer
>>>
>>> data = GenotypeData(
...     filename="test.str",
...     filetype="auto",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree"
... )
>>>
>>> # Genetic Algorithm grid_params
>>> grid_params = {
...     "learning_rate": Continuous(lower=0.01, upper=0.1),
...     "max_depth": Integer(2, 110),
... }
>>>
>>> xgb = ImputeXGBoost(
...     genotype_data=data,
...     gridparams=grid_params,
...     cv=5,
...     gridsearch_method="genetic_algorithm",
...     n_nearest_features=10,
...     n_estimators=100,
...     initial_strategy="phylogeny",
... )
>>>
>>> # New GenotypeData instance holding the imputed genotypes
>>> xgb_gtdata = xgb.imputed

fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]

Fit and predict imputations with IterativeImputer(estimator).

Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams is not None, then a grid search of the type set by gridsearch_method is performed on a subset of the data, and a final imputation is done on the whole dataset using the best found parameters.

Parameters:

X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.

Returns:

Imputed 012-encoded genotypes, and a Dict[str, Any] with the best parameters found during grid search.

Return type:

Tuple[DataFrame, Dict[str, Any]]