Non-Machine Learning Imputers

ImputePhylo, ImputeAlleleFreq, ImputeMF, ImputeRefAllele

class pgsui.impute.simple_imputers.ImputePhylo(genotype_data: Any | None, minbr: float | None = 1e-10, *, str_encodings: Dict[str, int] = {'A': 1, 'C': 2, 'G': 3, 'N': -9, 'T': 4}, prefix: str = 'imputer', save_plots: bool = False, disable_progressbar: bool = False, **kwargs: Dict[str, Any] | None)

Impute missing data using a phylogenetic tree to inform the imputation.

Parameters:
  • genotype_data (GenotypeData instance) – GenotypeData instance. Must have the q, tree, and optionally site_rates attributes defined.

  • minbr (float or None, optional) – Minimum branch length. Defaults to 1e-10.

  • str_encodings (Dict[str, int], optional) – Integer encodings used in STRUCTURE-formatted file. Should be a dictionary with keys=nucleotides and values=integer encodings. The missing data encoding should also be included. Argument is ignored if using a PHYLIP-formatted file. Defaults to {“A”: 1, “C”: 2, “G”: 3, “T”: 4, “N”: -9}

  • prefix (str, optional) – Prefix to use with output files. Defaults to “imputer”.

  • save_plots (bool, optional) – Whether to save PDF files with genotype imputations for each site to disk. It makes one PDF file per locus, so if you have a lot of loci it will make a lot of PDF files. Defaults to False.

  • disable_progressbar (bool, optional) – Whether to disable the progress bar during the imputation. Defaults to False.

  • kwargs (Dict[str, Any] or None, optional) – Additional keyword arguments intended for internal purposes only. Possible arguments: {“column_subset”: List[int] or numpy.ndarray[int]}; subsets SNPs by a list of indices for IterativeImputer. Defaults to None.
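To illustrate what the str_encodings mapping describes, here is a minimal sketch that decodes one row of integer-encoded STRUCTURE alleles back to nucleotides. The helper name decode_row is hypothetical and is not part of PG-SUI; only the default dictionary comes from the signature above.

```python
# Default mapping taken from the ImputePhylo signature.
str_encodings = {"A": 1, "C": 2, "G": 3, "T": 4, "N": -9}

# Reverse lookup: integer code -> nucleotide (hypothetical helper data).
decode = {code: base for base, code in str_encodings.items()}


def decode_row(row):
    """Translate a list of integer allele codes into a nucleotide string."""
    return "".join(decode[code] for code in row)


example = decode_row([1, 2, -9, 4])
```

The missing-data code (-9 here) has to appear in the dictionary, otherwise the lookup fails on missing alleles.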

imputed

New GenotypeData instance with imputed data.

Type:

GenotypeData

Example

>>> data = GenotypeData(
...     filename="test.str",
...     filetype="structure",
...     popmapfile="test.popmap",
...     guidetree="test.tre",
...     qmatrix_iqtree="test.iqtree",
...     siterates_iqtree="test.rates",
... )
>>>
>>> phylo = ImputePhylo(
...     genotype_data=data,
...     save_plots=True,
... )
>>> # Get GenotypeData object.
>>> gd_phylo = phylo.imputed

property genotypes_012
property snp_data
property alignment
impute_phylo(tree: ToyTree, genotypes: Dict[str, List[str | int]], Q: DataFrame, site_rates=None, minbr=1e-10) -> DataFrame

Imputes genotype values with a guide tree.

Imputes genotype values by using a provided guide tree to inform the imputation, assuming maximum parsimony.

Process Outline:

For each SNP:

1) If site_rates is provided, get the site-transformed Q matrix.

2) Postorder traversal of tree to compute ancestral state likelihoods for internal nodes (tips -> root). If exclude_N==True, then ignore N tips for this step.

3) Preorder traversal of tree to populate missing genotypes with the maximum likelihood state (root -> tips).
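The outline above can be sketched on a toy tree as follows. This is a simplified illustration, not PG-SUI's implementation: it assumes a Jukes-Cantor (JC69) model in place of the Q matrix read from IQ-TREE, applies the site rate by scaling branch lengths, and represents the tree with plain dictionaries instead of a toytree object. The names jc69 and impute_site are hypothetical.

```python
import math

STATES = "ACGT"


def jc69(i, j, t):
    """JC69 transition probability from state i to state j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def impute_site(children, brlen, tips, root, site_rate=1.0, minbr=1e-10):
    """Impute 'N' tips at one site; returns a dict of tip name -> base."""

    def length(node):
        # Branches shorter than minbr are treated as minbr, then rate-scaled.
        return max(brlen[node], minbr) * site_rate

    like = {}  # conditional likelihoods per node and state

    def postorder(node):
        if node not in children:  # tip
            base = tips[node]
            # A missing tip ('N') is uninformative: likelihood 1 for every state.
            like[node] = {s: 1.0 if base in ("N", s) else 0.0 for s in STATES}
            return
        like[node] = {s: 1.0 for s in STATES}
        for child in children[node]:
            postorder(child)
            for s in STATES:
                like[node][s] *= sum(
                    jc69(s, j, length(child)) * like[child][j] for j in STATES
                )

    def preorder(node, state, out):
        for child in children.get(node, []):
            best = max(
                STATES,
                key=lambda j: jc69(state, j, length(child)) * like[child][j],
            )
            if child not in children:
                out[child] = best
            preorder(child, best, out)

    postorder(root)  # tips -> root
    root_state = max(STATES, key=lambda s: like[root][s])
    out = {}
    preorder(root, root_state, out)  # root -> tips
    return out


# Toy tree ((a, b), (c, d)); tip b is missing.
children = {"root": ["AB", "CD"], "AB": ["a", "b"], "CD": ["c", "d"]}
brlen = {"AB": 0.1, "CD": 0.1, "a": 0.05, "b": 0.05, "c": 0.05, "d": 0.05}
tips = {"a": "G", "b": "N", "c": "T", "d": "T"}
imputed = impute_site(children, brlen, tips, "root")
```

In this toy case the missing tip b takes the state of its sister tip a, and observed tips keep their original bases because their likelihood is zero for every other state.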

Parameters:
  • tree (toytree.tree object) – Input tree.

  • genotypes (Dict[str, List[Union[str, int]]]) – Dictionary with key=sampleids, value=sequences.

  • Q (pandas.DataFrame) – Rate Matrix Q from .iqtree or separate file.

  • site_rates (List) – Site-specific substitution rates (used to weight per-site Q)

  • minbr (float) – Minimum branch length (those below this value will be treated as == minbr)

Returns:

Imputed genotypes.

Return type:

pandas.DataFrame

Raises:
  • IndexError – If index does not exist when trying to read genotypes.

  • AssertionError – Sites must have same lengths.

  • AssertionError – Missing data still found after imputation.

nbiallelic() -> int

Get the number of remaining bi-allelic sites after imputation.

Returns:

Number of bi-allelic sites remaining after imputation.

Return type:

int

class pgsui.impute.simple_imputers.ImputeAlleleFreq(genotype_data: GenotypeData, *, by_populations: bool = False, diploid: bool = True, default: int = 0, missing: int = -9, verbose: bool = True, prefix='imputer', **kwargs: Dict[str, Any])

Impute missing data by allele frequency. If by_populations is False, imputation uses the global allele frequency. If by_populations is True, imputation uses the population-wise allele frequency. A list of population IDs in the appropriate format can be obtained from the GenotypeData object as GenotypeData.populations.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance.

  • by_populations (bool, optional) – Whether or not to impute by-population or globally. Defaults to False (global allele frequency).

  • diploid (bool, optional) – When diploid=True, the function assumes 0=homozygous ref; 1=heterozygous; 2=homozygous alt. 0-1-2 genotypes are decomposed to compute p (=frequency of ref) and q (=frequency of alt). In this case, p and q alleles are sampled to generate either 0 (hom-p), 1 (het), or 2 (hom-q) genotypes. When diploid=False, 0-1-2 genotypes are sampled according to their observed frequency. Defaults to True.

  • default (int, optional) – Value to set if no alleles sampled at a locus. Defaults to 0.

  • missing (int, optional) – Missing data value. Defaults to -9.

  • verbose (bool, optional) – Whether to print status updates. Set to False for no status updates. Defaults to True.

  • kwargs (Dict[str, Any]) – Additional keyword arguments to supply. Primarily for internal purposes. Options include: {“iterative_mode”: bool, “validation_mode”: bool, “gt”: List[List[int]]}. “iterative_mode” determines whether ImputeAlleleFreq is being used as the initial imputer in IterativeImputer. gt is used internally for the simple imputers during grid searches and validation. If genotype_data is None then gt cannot also be None, and vice versa. Only one of gt or genotype_data can be set.
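As a rough illustration of the diploid=True behaviour described above, the sketch below imputes a single 0-1-2 encoded locus by estimating the alt-allele frequency q from the observed genotypes and drawing two alleles per missing genotype. The function name impute_column_diploid and its rng argument are hypothetical, and the fallback to default when nothing is observed is an assumption based on the parameter description, not a statement about PG-SUI's internals.

```python
import random


def impute_column_diploid(column, missing=-9, default=0, rng=None):
    """Fill missing 0/1/2 genotypes at one locus by sampling two alleles."""
    rng = rng or random.Random()
    observed = [g for g in column if g != missing]
    if not observed:
        # No alleles to sample from: fall back to the default value (assumed).
        return [default for _ in column]

    # Each diploid genotype carries two alleles; the genotype value counts alt copies.
    q = sum(observed) / (2 * len(observed))  # frequency of alt; p = 1 - q

    def draw():
        # Draw two alleles independently; the result is 0, 1, or 2 alt copies.
        return sum(1 for _ in range(2) if rng.random() < q)

    return [draw() if g == missing else g for g in column]


fixed_ref = impute_column_diploid([0, 0, -9, 0])
fixed_alt = impute_column_diploid([2, -9, 2])
all_missing = impute_column_diploid([-9, -9], default=0)
```

When a locus is fixed for one allele the draw is deterministic, which makes the behaviour easy to check; at polymorphic loci the result is random unless a seeded rng is supplied.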

Raises:

TypeError – genotype_data cannot be NoneType.

imputed

New GenotypeData instance with imputed data.

Type:

GenotypeData

Example

>>> data = GenotypeData(
...     filename="test.str",
...     filetype="structure2rowPopID",
...     popmapfile="test.popmap",
... )
>>>
>>> afpop = ImputeAlleleFreq(
...     genotype_data=data,
...     by_populations=True,
... )
>>>
>>> gd_afpop = afpop.imputed

property genotypes_012
property snp_data
property alignment
fit_predict(X: List[List[int]]) -> Tuple[DataFrame | ndarray | List[List[int | float]], List[int]]

Impute missing genotypes using allele frequencies.

Impute using global or by_population allele frequencies. Missing alleles are primarily coded as negative; usually -9.
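The by-population case can be sketched as below for one locus. The example samples observed 0-1-2 genotypes within each population (the diploid=False style of sampling) and treats any negative value as missing. The function name impute_by_population, the pops list, and the fallback to default for a population with no observed genotypes are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def impute_by_population(column, pops, default=0, rng=None):
    """Fill negative (missing) genotypes using per-population genotype counts."""
    rng = rng or random.Random()

    # Count observed genotypes separately for each population.
    counts = defaultdict(Counter)
    for genotype, pop in zip(column, pops):
        if genotype >= 0:
            counts[pop][genotype] += 1

    imputed = []
    for genotype, pop in zip(column, pops):
        if genotype >= 0:
            imputed.append(genotype)
        elif counts[pop]:
            # Sample a genotype in proportion to its frequency in this population.
            genotypes, weights = zip(*counts[pop].items())
            imputed.append(rng.choices(genotypes, weights=weights)[0])
        else:
            imputed.append(default)
    return imputed


column = [0, 0, -9, 2, 2, -9, -9]
pops = ["A", "A", "A", "B", "B", "B", "C"]
result = impute_by_population(column, pops)
```

Passing a single shared population label for every sample reduces this to the global case.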

Parameters:

X (List[List[int]], numpy.ndarray, or pandas.DataFrame) – 012-encoded genotypes obtained from the GenotypeData object.

Returns:

Imputed genotypes of same shape as data.

List[int]: Column indexes that were retained.

Return type:

pandas.DataFrame, numpy.ndarray, or List[List[Union[int, float]]]

Raises:

TypeError – X must be either list, np.ndarray, or pd.DataFrame.

write2file(X: DataFrame | ndarray | List[List[int | float]]) -> None

Write imputed data to file on disk.

Parameters:

X (pandas.DataFrame, numpy.ndarray, List[List[Union[int, float]]]) – Imputed data to write to file.

Raises:

TypeError – If X is of unsupported type.

class pgsui.impute.simple_imputers.ImputeMF(genotype_data, *, latent_features: int = 2, max_iter: int = 100, learning_rate: float = 0.0002, regularization_param: float = 0.02, tol: float = 0.1, n_fail: int = 20, missing: int = -9, prefix: str = 'imputer', verbose: bool = True, **kwargs: Dict[str, Any])

Impute missing data using matrix factorization. If by_populations=False then imputation is by global allele frequency. If by_populations=True then imputation is by population-wise allele frequency.

Parameters:
  • genotype_data (GenotypeData object or None, optional) – GenotypeData instance.

  • latent_features (int, optional) – The number of latent variables used to reduce dimensionality of the data. Defaults to 2.

  • learning_rate (float, optional) – The learning rate for the optimizer. Adjust if the loss decreases too slowly. Defaults to 0.0002.

  • tol (float, optional) – Tolerance of the stopping condition. Defaults to 0.1.

  • missing (int, optional) – Missing data value. Defaults to -9.

  • prefix (str, optional) – Prefix for writing output files. Defaults to “imputer”.

  • verbose (bool, optional) – Whether to print status updates. Set to False for no status updates. Defaults to True.

  • **kwargs (Dict[str, Any]) – Additional keyword arguments to supply. Primarily for internal purposes. Options include: {“iterative_mode”: bool}. “iterative_mode” determines whether ImputeMF is being used as the initial imputer in IterativeImputer.
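For orientation, here is a minimal sketch of matrix-factorization imputation using the hyperparameter names from the signature above. It fits two low-rank factor matrices to the observed entries by stochastic gradient descent with L2 regularization, then fills missing entries with rounded predictions clipped to 0-2. This is a generic textbook formulation, not PG-SUI's code: the function name mf_impute, the seed argument, the stopping rule on squared error, and the rounding step are all assumptions.

```python
import random


def mf_impute(X, latent_features=2, max_iter=100, learning_rate=0.0002,
              regularization_param=0.02, tol=0.1, missing=-9, seed=0):
    """Fill missing entries of a 0/1/2 matrix with a low-rank approximation."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    # Row factors P (samples) and column factors Q (loci), randomly initialised.
    P = [[rng.random() for _ in range(latent_features)] for _ in range(n_rows)]
    Q = [[rng.random() for _ in range(latent_features)] for _ in range(n_cols)]
    observed = [(i, j) for i in range(n_rows) for j in range(n_cols)
                if X[i][j] != missing]

    def predict(i, j):
        return sum(P[i][k] * Q[j][k] for k in range(latent_features))

    for _ in range(max_iter):
        for i, j in observed:
            err = X[i][j] - predict(i, j)
            for k in range(latent_features):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error plus L2 penalty.
                P[i][k] += learning_rate * (2 * err * q - regularization_param * p)
                Q[j][k] += learning_rate * (2 * err * p - regularization_param * q)
        loss = sum((X[i][j] - predict(i, j)) ** 2 for i, j in observed)
        if loss < tol:  # assumed stopping rule
            break

    filled = [row[:] for row in X]
    for i in range(n_rows):
        for j in range(n_cols):
            if X[i][j] == missing:
                # Round to the nearest valid genotype code.
                filled[i][j] = min(2, max(0, round(predict(i, j))))
    return filled


X = [[0, 1, 2], [0, -9, 2], [1, 1, -9]]
filled = mf_impute(X, max_iter=200, learning_rate=0.01)
```

Only observed entries contribute to the loss, so observed genotypes are returned unchanged and only the missing cells are predicted.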

imputed

New GenotypeData instance with imputed data.

Type:

GenotypeData

Example

>>> data = GenotypeData(
...     filename="test.str",
...     filetype="structure",
...     popmapfile="test.popmap",
... )
>>>
>>> nmf = ImputeMF(
...     genotype_data=data,
...     by_populations=True,
... )
>>>
>>> # Get GenotypeData instance.
>>> gd_nmf = nmf.imputed

Raises:

TypeError – genotype_data cannot be NoneType.

property genotypes_012
property snp_data
property alignment
fit_predict(X)
transform(original, predicted)
accuracy(expected, predicted)
write2file(X: DataFrame | ndarray | List[List[int | float]]) -> None

Write imputed data to file on disk.

Parameters:

X (pandas.DataFrame, numpy.ndarray, List[List[Union[int, float]]]) – Imputed data to write to file.

Raises:

TypeError – If X is of unsupported type.

class pgsui.impute.simple_imputers.ImputeRefAllele(genotype_data: GenotypeData, *, missing: int = -9, prefix='imputer', verbose: bool = True, **kwargs: Dict[str, Any])

Impute missing data by reference allele.

Parameters:
  • genotype_data (GenotypeData object) – GenotypeData instance.

  • missing (int, optional) – Missing data value. Defaults to -9.

  • verbose (bool, optional) – Whether to print status updates. Set to False for no status updates. Defaults to True.

  • kwargs (Dict[str, Any]) – Additional keyword arguments to supply. Primarily for internal purposes. Options include: {“iterative_mode”: bool, “validation_mode”: bool, “gt”: List[List[int]]}. “iterative_mode” determines whether ImputeRefAllele is being used as the initial imputer in IterativeImputer. gt is used internally for the simple imputers during grid searches and validation. If genotype_data is None then gt cannot also be None, and vice versa. Only one of gt or genotype_data can be set.

Raises:

TypeError – genotype_data cannot be NoneType.

imputed

New GenotypeData instance with imputed data.

Type:

GenotypeData

Example

>>> data = GenotypeData(
...     filename="test.str",
...     filetype="structure2rowPopID",
...     popmapfile="test.popmap",
... )
>>>
>>> refallele = ImputeRefAllele(
...     genotype_data=data
... )
>>>
>>> gd_refallele = refallele.imputed

property genotypes_012
property snp_data
property alignment
fit_predict(X: List[List[str | int]]) -> DataFrame | ndarray | List[List[str | int]]

Impute missing genotypes using reference alleles.

Impute using reference alleles. Missing alleles are primarily coded as negative; usually -9.
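A minimal sketch of the idea, assuming 0-1-2 encoded input in which 0 denotes the homozygous reference genotype (the encoding described for ImputeAlleleFreq). The function name impute_ref_allele and its ref argument are hypothetical and only illustrate the replacement step.

```python
def impute_ref_allele(X, missing=-9, ref=0):
    """Replace every missing genotype with the reference genotype code."""
    return [[ref if genotype == missing else genotype for genotype in row]
            for row in X]


genotypes = [[0, -9, 2], [-9, 1, 1]]
filled_ref = impute_ref_allele(genotypes)
```

The input is left unmodified and the output has the same shape, matching the return description of fit_predict.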

Parameters:

X (List[List[Union[int, str]]], numpy.ndarray, or pandas.DataFrame) – Genotypes obtained from the GenotypeData object.

Returns:

Imputed genotypes of same shape as data.

Return type:

pandas.DataFrame, numpy.ndarray, or List[List[Union[int, str]]]

Raises:
  • TypeError – X must be of type list(list(int or str)), numpy.ndarray, or pandas.DataFrame.

write2file(X: DataFrame | ndarray | List[List[int | float]]) -> None

Write imputed data to file on disk.

Parameters:

X (pandas.DataFrame, numpy.ndarray, List[List[Union[int, float]]]) – Imputed data to write to file.

Raises:

TypeError – If X is of unsupported type.