Supervised Imputers
Supervised Imputer Models
K Nearest Neighbors
- class pgsui.impute.estimators.ImputeKNN(genotype_data: Any, *, n_neighbors: int = 5, weights: str = 'distance', algorithm: str = 'auto', leaf_size: int = 30, p: int = 2, metric: str = 'minkowski', **kwargs)
Bases: SupervisedImputer
Performs iterative K-Nearest Neighbors imputation of missing data. At each feature (i.e., SNP site), iterative imputation uses the n_nearest_features most correlated features to inform the imputation. These features are drawn with probability proportional to their correlation with the imputed target feature, which ensures coverage of features throughout the imputation process.
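The idea described above can be sketched with scikit-learn's own IterativeImputer. This is a minimal illustration, not the pgsui implementation: the toy genotype matrix, the use of NaN as the missing-data marker, and the chosen settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Toy 012-encoded genotype matrix: 40 samples x 8 SNP sites (made-up data).
X_true = rng.integers(0, 3, size=(40, 8)).astype(float)

# Mask roughly 10% of the calls; NaN as the missing marker is an assumption.
mask = rng.random(X_true.shape) < 0.1
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Each SNP is predicted from 3 other SNPs, sampled in proportion to their
# absolute correlation with the target SNP.
imputer = IterativeImputer(
    estimator=KNeighborsClassifier(n_neighbors=5, weights="distance"),
    n_nearest_features=3,
    initial_strategy="most_frequent",
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```

Because the estimator is a classifier, every imputed call is one of the genotype classes already observed at that site, and observed calls are left unchanged.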
- Parameters:
genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.
n_neighbors (int, optional) – Number of neighbors to use for K-Nearest Neighbors queries. Defaults to 5.
weights (str or callable, optional) – Weight function used in prediction. Possible values: ‘uniform’: all points in each neighborhood are weighted equally; ‘distance’: points are weighted by the inverse of their distance, so closer neighbors of a query point have a greater influence than neighbors that are farther away; a callable: a user-defined function that accepts an array of distances and returns an array of the same shape containing the weights. Defaults to “distance”.
algorithm (str, optional) – Algorithm used to compute the nearest neighbors. Possible values: ‘ball_tree’, ‘kd_tree’, ‘brute’, ‘auto’. Defaults to “auto”.
leaf_size (int, optional) – Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. Defaults to 30.
p (int, optional) – Power parameter for the Minkowski metric. When p=1, this is equivalent to using manhattan_distance (l1), and if p=2 it is equivalent to using euclidean distance (l2). For arbitrary p, minkowski_distance (l_p) is used. Defaults to 2.
metric (str, optional) – The distance metric to use for the tree. The default metric is minkowski, and with p=2 this is equivalent to the standard Euclidean metric. See the documentation of sklearn.DistanceMetric for a list of available metrics. If metric is ‘precomputed’, X is assumed to be a distance matrix and must be square during fit. Defaults to “minkowski”.
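To make the p and weights parameters concrete, here is a small pure-Python sketch of the Minkowski distance and of a uniform versus inverse-distance neighbor vote. The helper names minkowski and knn_vote are hypothetical and written only for this illustration; they are not part of pgsui or scikit-learn.

```python
from collections import defaultdict


def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan (l1), p=2 is Euclidean (l2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_vote(query, samples, labels, n_neighbors=5, weights="distance", p=2):
    """Predict a label for `query` from its nearest neighbors (hypothetical helper)."""
    # Keep the n_neighbors closest (distance, label) pairs.
    nearest = sorted(
        (minkowski(query, s, p), lab) for s, lab in zip(samples, labels)
    )[:n_neighbors]

    votes = defaultdict(float)
    for dist, lab in nearest:
        if weights == "uniform":
            votes[lab] += 1.0
        elif dist == 0:
            # An exact match dominates any inverse-distance weight.
            return lab
        else:
            votes[lab] += 1.0 / dist
    return max(votes, key=votes.get)


# Three-locus 012 genotypes for four samples, with a class label per sample.
samples = [(0, 0, 0), (2, 2, 2), (2, 2, 1), (2, 1, 2)]
labels = [0, 2, 2, 2]
query = (0, 0, 1)

uniform_pred = knn_vote(query, samples, labels, n_neighbors=3, weights="uniform")
distance_pred = knn_vote(query, samples, labels, n_neighbors=3, weights="distance")
```

In this toy case the two weighting schemes disagree: two of the three nearest neighbors carry label 2, but the single label-0 neighbor is much closer and wins under inverse-distance weighting.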
Example
>>> from pgsui.impute.estimators import ImputeKNN
>>> from sklearn_genetic.space import Integer
>>>
>>> data = GenotypeData(
>>>     filename="test.str",
>>>     filetype="auto",
>>>     guidetree="test.tre",
>>>     qmatrix_iqtree="test.iqtree"
>>> )
>>>
>>> # Genetic Algorithm grid_params
>>> grid_params = {
>>>     "n_neighbors": Integer(3, 10),
>>>     "leaf_size": Integer(10, 50),
>>> }
>>>
>>> knn = ImputeKNN(
>>>     genotype_data=data,
>>>     gridparams=grid_params,
>>>     cv=5,
>>>     gridsearch_method="genetic_algorithm",
>>>     n_nearest_features=10,
>>>     n_estimators=100,
>>>     initial_strategy="phylogeny",
>>> )
>>>
>>> knn_gtdata = knn.imputed
- fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]
Fit and predict imputations with IterativeImputer(estimator).
Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams!=None, then a RandomizedSearchCV is performed on a subset of the data and a final imputation is done on the whole dataset using the best found parameters.
- Parameters:
X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.
- Returns:
DataFrame with missing genotypes imputed, and a Dict[str, Any] holding the best parameters found during the grid search.
- Return type:
Tuple[DataFrame, Dict[str, Any]]
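The search-then-refit workflow described for fit_predict can be sketched with plain scikit-learn. This is an illustration under stated assumptions, not the pgsui code path: the toy data, the choice of a random subset of samples, and the small parameter lists are all made up for the example, and this section does not say which subset pgsui actually uses.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

# Toy data: predict one 012-encoded SNP (y) from 10 other SNPs (X).
X = rng.integers(0, 3, size=(120, 10)).astype(float)
y = np.tile([0, 1, 2], 40)  # balanced genotype classes

# Step 1: tune hyperparameters on a subset of the data only.
subset = rng.choice(len(X), size=60, replace=False)
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={"n_neighbors": [3, 5, 7], "leaf_size": [10, 30, 50]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X[subset], y[subset])
best_params = search.best_params_

# Step 2: refit on the whole dataset using the best parameters found.
final_model = KNeighborsClassifier(**best_params).fit(X, y)
predictions = final_model.predict(X)
```

Searching on a subset keeps the cross-validation cost down, while the final fit still uses every sample.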
Random Forest (and Extra Trees)
- class pgsui.impute.estimators.ImputeRandomForest(genotype_data: Any, *, extratrees: bool = True, n_estimators: int = 100, criterion: str = 'gini', max_depth: int | None = None, min_samples_split: int | float = 2, min_samples_leaf: int | float = 1, min_weight_fraction_leaf: float = 0.0, max_features: str | int | float | None = 'sqrt', max_leaf_nodes: int | None = None, min_impurity_decrease: float = 0.0, bootstrap: bool = False, oob_score: bool = False, max_samples: int | float | None = None, **kwargs)
Bases: SupervisedImputer
Performs iterative Random Forest or Extra Trees imputation of missing data. At each feature (i.e., SNP site), iterative imputation uses the n_nearest_features most correlated features to inform the imputation. These features are drawn with probability proportional to their correlation with the imputed target feature, which ensures coverage of features throughout the imputation process.
- Parameters:
genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.
extratrees (bool, optional) – Whether to use ExtraTreesClassifier (if True) instead of RandomForestClassifier (if False). ExtraTreesClassifier is faster, but is not supported by the scikit-learn-intelex patch, whereas RandomForestClassifier is. If using an Intel CPU, the optimizations provided by the scikit-learn-intelex patch might make setting extratrees=False worthwhile. If you are not using an Intel CPU, the scikit-learn-intelex library is not supported and ExtraTreesClassifier will be faster with similar performance. NOTE: If using scikit-learn-intelex, criterion must be set to “gini” and oob_score to False, as those parameters are not currently supported herein. Defaults to True.
n_estimators (int, optional) – The number of trees in the forest. Increasing this value can improve the fit, but at the cost of compute time and resources. Defaults to 100.
criterion (str, optional) – The function to measure the quality of a split. Supported values are “gini” for the Gini impurity and “entropy” for the information gain. Defaults to “gini”.
max_depth (int, optional) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Defaults to None.
min_samples_split (int or float, optional) – The minimum number of samples required to split an internal node. If value is an integer, then considers min_samples_split as the minimum number. If value is a floating point, then min_samples_split is a fraction and (min_samples_split * n_samples), rounded up to the nearest integer, are the minimum number of samples for each split. Defaults to 2.
min_samples_leaf (int or float, optional) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If value is an integer, then min_samples_leaf is the minimum number. If value is floating point, then min_samples_leaf is a fraction and int(min_samples_leaf * n_samples) is the minimum number of samples for each node. Defaults to 1.
min_weight_fraction_leaf (float, optional) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Defaults to 0.0.
max_features (str, int, float, or None, optional) – The number of features to consider when looking for the best split. If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features. Defaults to “sqrt”.
max_leaf_nodes (int or None, optional) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then the number of leaf nodes is unlimited. Defaults to None.
min_impurity_decrease (float, optional) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value. See sklearn.ensemble.ExtraTreesClassifier documentation for more information. Defaults to 0.0.
bootstrap (bool, optional) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. Defaults to False.
oob_score (bool, optional) – Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True. Defaults to False.
max_samples (int or float, optional) – If bootstrap is True, the number of samples to draw from X to train each base estimator. If None (default), then draws X.shape[0] samples. If int, then draws max_samples samples. If float, then draws int(max_samples * X.shape[0]) samples, with max_samples in the interval (0, 1). Defaults to None.
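Several parameters above accept either a count or a fraction. The following sketch shows how such values could be resolved to concrete counts, following the rules stated in the parameter descriptions. The helper names are hypothetical, and the lower bounds (max(1, ...) and max(2, ...)) are assumptions added so the result is always usable; this is not the pgsui or scikit-learn source.

```python
import math


def resolve_max_features(max_features, n_features):
    """Turn a max_features setting into a number of features per split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # A float is a fraction of the features, not of the samples.
        return max(1, int(max_features * n_features))
    return int(max_features)


def resolve_min_samples_split(min_samples_split, n_samples):
    """Turn a min_samples_split setting into a sample count."""
    if isinstance(min_samples_split, float):
        # Fractions are rounded up to the nearest integer.
        return max(2, math.ceil(min_samples_split * n_samples))
    return min_samples_split


n_snps = 100
per_split_sqrt = resolve_max_features("sqrt", n_snps)
per_split_half = resolve_max_features(0.5, n_snps)
split_threshold = resolve_min_samples_split(0.1, 95)
```

With 100 SNP sites, “sqrt” considers 10 features per split and 0.5 considers 50; a min_samples_split of 0.1 on 95 samples rounds 9.5 up to 10.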
Example
>>> from pgsui.impute.estimators import ImputeRandomForest
>>> from sklearn_genetic.space import Integer
>>>
>>> data = GenotypeData(
>>>     filename="test.str",
>>>     filetype="auto",
>>>     guidetree="test.tre",
>>>     qmatrix_iqtree="test.iqtree"
>>> )
>>>
>>> # Genetic Algorithm grid_params
>>> grid_params = {
>>>     "min_samples_leaf": Integer(1, 10),
>>>     "max_depth": Integer(2, 110),
>>> }
>>>
>>> rf = ImputeRandomForest(
>>>     genotype_data=data,
>>>     gridparams=grid_params,
>>>     cv=5,
>>>     gridsearch_method="genetic_algorithm",
>>>     n_nearest_features=10,
>>>     n_estimators=100,
>>>     initial_strategy="phylogeny",
>>> )
>>>
>>> rf_gtdata = rf.imputed
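As a rough illustration of what the extratrees flag selects between, the sketch below builds either scikit-learn classifier from the same keyword arguments. The make_forest helper is hypothetical and exists only for this example; how pgsui constructs its estimator internally is not shown in this section.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier


def make_forest(extratrees=True, **params):
    """Return an ExtraTreesClassifier if extratrees is True, else a RandomForestClassifier."""
    estimator_cls = ExtraTreesClassifier if extratrees else RandomForestClassifier
    return estimator_cls(**params)


# The same settings as the documented defaults, passed to either estimator.
shared = dict(
    n_estimators=100,
    criterion="gini",
    max_features="sqrt",
    bootstrap=False,
    oob_score=False,
)
et_model = make_forest(extratrees=True, **shared)
rf_model = make_forest(extratrees=False, **shared)
```

Both classes accept the same tree-growing parameters, which is why a single boolean is enough to switch between them.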
- fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]
Fit and predict imputations with IterativeImputer(estimator).
Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams!=None, then a RandomizedSearchCV is performed on a subset of the data and a final imputation is done on the whole dataset using the best found parameters.
- Parameters:
X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.
- Returns:
DataFrame with missing genotypes imputed, and a Dict[str, Any] holding the best parameters found during the grid search.
- Return type:
Tuple[DataFrame, Dict[str, Any]]
XGBoost
- class pgsui.impute.estimators.ImputeXGBoost(genotype_data: Any, *, n_estimators: int = 100, max_depth: int = 3, learning_rate: float = 0.1, booster: str = 'gbtree', gamma: float = 0.0, min_child_weight: float = 1.0, max_delta_step: float = 0.0, subsample: float = 1.0, colsample_bytree: float = 1.0, reg_lambda: float = 1.0, reg_alpha: float = 0.0, **kwargs)
Bases: SupervisedImputer
Performs iterative XGBoost (Extreme Gradient Boosting) imputation of missing data. At each feature (i.e., SNP site), iterative imputation uses the n_nearest_features most correlated features to inform the imputation. These features are drawn with probability proportional to their correlation with the imputed target feature, which ensures coverage of features throughout the imputation process.
- Parameters:
genotype_data (GenotypeData object) – GenotypeData instance that was used to read in the sequence data.
n_estimators (int, optional) – The number of boosting rounds. Increasing this value can improve the fit, but at the cost of compute time and RAM usage. Defaults to 100.
max_depth (int, optional) – Maximum tree depth for base learners. Defaults to 3.
learning_rate (float, optional) – Boosting learning rate (eta). It scales the contribution of each new tree when it is added to the model. Typical values are between 0.1 and 0.3. Lower learning rates often reach a better optimum, but require more boosting rounds and therefore more compute time and resources. Defaults to 0.1.
booster (str, optional) – Specify which booster to use. Possible values include “gbtree”, “gblinear”, and “dart”. Defaults to “gbtree”.
gamma (float, optional) – Minimum loss reduction required to make a further partition on a leaf node of the tree. Defaults to 0.0.
min_child_weight (float, optional) – Minimum sum of instance weight (hessian) needed in a child. Defaults to 1.0.
max_delta_step (float, optional) – Maximum delta step we allow each tree’s weight estimation to be. Defaults to 0.0.
subsample (float, optional) – Subsample ratio of the training instance. Defaults to 1.0.
colsample_bytree (float, optional) – Subsample ratio of columns when constructing each tree. Defaults to 1.0.
reg_lambda (float, optional) – L2 regularization term on weights (xgb’s lambda parameter). Defaults to 1.0.
reg_alpha (float, optional) – L1 regularization term on weights (xgb’s alpha parameter). Defaults to 0.0.
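To give a feel for how learning_rate, reg_lambda, and reg_alpha interact, here is a toy pure-Python sketch based on the textbook regularized leaf-weight formula for gradient boosting. It is an illustration only: the function names are hypothetical, each "tree" is a single leaf, and nothing here is taken from the pgsui or XGBoost source.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf value for summed gradients G and hessians H.

    reg_alpha (L1) soft-thresholds G toward zero; reg_lambda (L2) is added
    to the denominator, shrinking the weight.
    """
    if grad_sum > reg_alpha:
        thresholded = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        thresholded = grad_sum + reg_alpha
    else:
        thresholded = 0.0
    return -thresholded / (hess_sum + reg_lambda)


def boost_constant(target, n_estimators=100, learning_rate=0.1, reg_lambda=0.0):
    """Fit one number with squared-error boosting, one single-leaf tree per round."""
    prediction = 0.0
    for _ in range(n_estimators):
        grad = prediction - target  # derivative of 0.5 * (prediction - target) ** 2
        hess = 1.0                  # second derivative of the same loss
        # learning_rate scales how much of each new tree is added to the model.
        prediction += learning_rate * leaf_weight(grad, hess, reg_lambda=reg_lambda)
    return prediction


after_10_rounds = boost_constant(2.0, n_estimators=10, learning_rate=0.1)
after_100_rounds = boost_constant(2.0, n_estimators=100, learning_rate=0.1)
```

With no regularization, each round closes a fraction learning_rate of the remaining gap, so the residual after k rounds is (1 - learning_rate)^k times the initial one. This is why a lower learning rate needs a larger n_estimators to reach the same fit.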
Example
>>> from pgsui.impute.estimators import ImputeXGBoost
>>> from sklearn_genetic.space import Continuous, Integer
>>>
>>> data = GenotypeData(
>>>     filename="test.str",
>>>     filetype="auto",
>>>     guidetree="test.tre",
>>>     qmatrix_iqtree="test.iqtree"
>>> )
>>>
>>> # Genetic Algorithm grid_params
>>> grid_params = {
>>>     "learning_rate": Continuous(lower=0.01, upper=0.1),
>>>     "max_depth": Integer(2, 110),
>>> }
>>>
>>> xgb = ImputeXGBoost(
>>>     genotype_data=data,
>>>     gridparams=grid_params,
>>>     cv=5,
>>>     gridsearch_method="genetic_algorithm",
>>>     n_nearest_features=10,
>>>     n_estimators=100,
>>>     initial_strategy="phylogeny",
>>> )
>>>
>>> xgb_gtdata = xgb.imputed
- fit_predict(X: DataFrame) → Tuple[DataFrame, Dict[str, Any]]
Fit and predict imputations with IterativeImputer(estimator).
Fits and predicts imputed 012-encoded genotypes using IterativeImputer with any of the supported estimator objects. If gridparams=None, then a grid search is not performed. If gridparams!=None, then a RandomizedSearchCV is performed on a subset of the data and a final imputation is done on the whole dataset using the best found parameters.
- Parameters:
X (pandas.DataFrame) – DataFrame with 012-encoded genotypes.
- Returns:
DataFrame with missing genotypes imputed, and a Dict[str, Any] holding the best parameters found during the grid search.
- Return type:
Tuple[DataFrame, Dict[str, Any]]