About PG-SUI

PG-SUI: Population Genomic Supervised and Unsupervised Imputation

PG-SUI Philosophy

PG-SUI is a Python 3 API that uses machine learning to impute missing values in population genomic SNP data. Several supervised and unsupervised machine learning algorithms are available for imputing missing data, as well as several useful non-machine learning imputers.

Supervised Imputation Methods

Supervised methods utilize scikit-learn's IterativeImputer, which is based on the MICE (Multivariate Imputation by Chained Equations) algorithm [1]. IterativeImputer iterates over each SNP site (i.e., feature), using the N nearest-neighbor features to inform the imputation; the number of nearest features can be adjusted by the user. IterativeImputer currently works with any of the following scikit-learn-compatible classifiers:

  • K-Nearest Neighbors

  • Random Forest

  • Extra Trees

  • XGBoost

See the scikit-learn documentation for more information on IterativeImputer and each of the classifiers.
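To make the workflow concrete, below is a minimal sketch of the underlying scikit-learn machinery, not PG-SUI's own API. For simplicity it plugs a RandomForestRegressor into IterativeImputer and rounds predictions back to discrete genotype classes; the toy matrix and parameter values are assumptions for illustration.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor

    # Toy genotype matrix encoded 0/1/2, with np.nan marking missing calls.
    X = np.array(
        [[0.0, 2.0, np.nan],
         [1.0, np.nan, 0.0],
         [np.nan, 2.0, 1.0],
         [0.0, 1.0, 1.0]]
    )

    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=42),
        n_nearest_features=2,  # inform each site with its 2 nearest features
        max_iter=10,
        random_state=42,
    )
    X_imputed = imputer.fit_transform(X)
    print(np.rint(X_imputed))  # round back to discrete genotype classes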

Unsupervised Imputation Methods

Unsupervised imputers include four custom neural network models:

  • Variational Autoencoder (VAE) [2]

  • Standard Autoencoder (SAE) [3]

  • Non-linear Principal Component Analysis (NLPCA) [4]

  • Unsupervised Backpropagation (UBP) [5]

To use the unsupervised neural networks for imputation, the real missing values are masked and additional missing values are simulated for training. The model is trained to reconstruct only the known values; once trained, it is used to predict the real missing values.
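A hypothetical sketch of this masking scheme in NumPy (array names and missingness rates are illustrative, not PG-SUI's internals):

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.integers(0, 3, size=(100, 50)).astype(float)  # toy 0/1/2 genotypes
    X[rng.random(X.shape) < 0.1] = np.nan                 # "real" missing data

    known = ~np.isnan(X)                                  # observed genotypes
    sim = known & (rng.random(X.shape) < 0.2)             # simulated missingness
    X_train = X.copy()
    X_train[sim] = np.nan                                 # hide known values

    # Training scores reconstruction loss only where genotypes are known;
    # the simulated-missing entries serve as held-out targets.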

SAE models encode the input features (i.e., loci) into a reduced-dimensional layer (typically 2 or 3 dimensions). This reduced-dimensional representation is then passed to the decoder, and the model trains itself to reconstruct its input (i.e., the genotypes).
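For concreteness, here is an illustrative standard autoencoder in PyTorch. PG-SUI's actual architecture, layer sizes, and framework may differ; the one-hot genotype encoding and hidden size are assumptions.

    import torch
    import torch.nn as nn

    n_loci, n_classes, latent_dim = 50, 3, 2  # assumed toy sizes

    class SAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_loci * n_classes, 64), nn.ReLU(),
                nn.Linear(64, latent_dim),          # 2-D bottleneck
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 64), nn.ReLU(),
                nn.Linear(64, n_loci * n_classes),  # logits per locus/class
            )

        def forward(self, x):
            # x: flattened one-hot genotypes, shape (batch, n_loci * n_classes)
            return self.decoder(self.encoder(x)).view(-1, n_loci, n_classes)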

VAE models also have an encoder and a decoder and train themselves to reconstruct their input (i.e., the genotypes), but the reduced-dimensional layer in a VAE (the latent dimension) represents a sampling distribution with a mean and a variance (the latent variables). This distribution is sampled during training, and the samples are passed to the decoder.
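This sampling step is commonly implemented with the reparameterization trick; a minimal sketch, assuming the encoder outputs a mean and log-variance per latent variable:

    import torch

    def sample_latent(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
        # Draw z ~ N(mu, sigma^2) in a way that stays differentiable.
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std  # passed on to the decoder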

NLPCA initializes a random, reduced-dimensional input, then trains itself by using the known values (i.e., genotypes) as targets and refining the random input until it accurately predicts the genotype output. Essentially, NLPCA resembles the second half of a standard autoencoder.
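A hypothetical sketch of the NLPCA idea: the reduced-dimensional inputs are themselves trainable parameters, refined by backpropagation against the known genotypes through a decoder-only network (all names and sizes here are illustrative):

    import torch
    import torch.nn as nn

    n_samples, latent_dim, n_loci = 20, 3, 50
    decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                            nn.Linear(64, n_loci))
    V = torch.randn(n_samples, latent_dim, requires_grad=True)  # random input

    # Both the random input V and the decoder weights are refined; each step
    # computes the loss only over known genotypes, e.g.:
    #   loss = criterion(decoder(V)[known_mask], targets[known_mask])
    opt = torch.optim.Adam([V] + list(decoder.parameters()), lr=1e-3)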

UBP is an extension of NLPCA that runs over three phases. Phase 1 refines the randomly generated, reduced-dimensional input through a single-layer perceptron to obtain good initial input values. Phase 2 feeds the refined input from Phase 1 into a multi-layer perceptron (MLP) but refines only the neural network weights. Phase 3 uses the MLP to refine both the weights and the reduced-dimensional input.
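The three phases can be summarized by which parameters each one optimizes; a hypothetical PyTorch sketch (structure and names are illustrative, not PG-SUI's API):

    import torch
    import torch.nn as nn

    n_samples, latent_dim, n_loci = 20, 3, 50
    V = torch.randn(n_samples, latent_dim, requires_grad=True)
    slp = nn.Linear(latent_dim, n_loci)                  # single-layer net
    mlp = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                        nn.Linear(64, n_loci))

    phase1 = torch.optim.Adam([V])                           # refine input V via the SLP
    phase2 = torch.optim.Adam(mlp.parameters())              # refine MLP weights only
    phase3 = torch.optim.Adam([V] + list(mlp.parameters()))  # refine both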

Non-Machine Learning Methods

We also include several non-machine learning options for imputing missing data, including:

  • Per-population mode per SNP site

  • Global mode per SNP site

  • Ref allele per SNP site

  • Using a phylogeny as input to inform the imputation

  • Matrix Factorization

These five non-machine learning imputation methods can be used as standalone imputers, as the initial imputation strategy for IterativeImputer (at least one method must be chosen), or to validate the accuracy of both IterativeImputer and the neural network models.
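As an example of the simplest of these strategies, a hypothetical per-population mode imputation per SNP site (function and variable names are illustrative, not PG-SUI's API):

    import pandas as pd

    def impute_pop_mode(gt: pd.DataFrame, pops: pd.Series) -> pd.DataFrame:
        """Fill missing genotypes with each population's per-site mode."""
        def fill(group: pd.DataFrame) -> pd.DataFrame:
            modes = group.mode().iloc[0]      # per-column (per-site) mode
            return group.fillna(modes)
        return gt.groupby(pops, group_keys=False).apply(fill)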

References