pgsui.impute.unsupervised.models package
Submodules
pgsui.impute.unsupervised.models.autoencoder_model module
- class pgsui.impute.unsupervised.models.autoencoder_model.Encoder(n_features: int, num_classes: int, latent_dim: int, hidden_layer_sizes: List[int], dropout_rate: float, activation: Module)[source]
The Encoder module of a standard Autoencoder.
This module defines the encoder network, which takes high-dimensional input data and maps it to a deterministic, low-dimensional latent representation. The architecture consists of a series of fully-connected hidden layers that progressively compress the flattened input data into a single latent vector, z.
- __init__(n_features: int, num_classes: int, latent_dim: int, hidden_layer_sizes: List[int], dropout_rate: float, activation: Module)[source]
Initializes the Encoder module.
- Parameters:
n_features (int) – The number of features in the input data (e.g., SNPs).
num_classes (int) – Number of genotype states per locus (2 for haploid, 3 for diploid in practice).
latent_dim (int) – The dimensionality of the output latent space.
hidden_layer_sizes (List[int]) – A list of integers specifying the size of each hidden layer.
dropout_rate (float) – The dropout rate for regularization in the hidden layers.
activation (torch.nn.Module) – An instantiated activation function module (e.g., nn.ReLU()) for the hidden layers.
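The parameters above can be read as a recipe for a plain MLP encoder. The following is a minimal sketch under stated assumptions, not the pgsui implementation: the class name `SketchEncoder` is hypothetical, and the input is assumed to be a one-hot tensor of shape (B, n_features, num_classes) that is flattened before the first linear layer.

```python
from typing import List

import torch
from torch import nn


class SketchEncoder(nn.Module):
    """Illustrative MLP encoder (hypothetical; not the pgsui class)."""

    def __init__(
        self,
        n_features: int,
        num_classes: int,
        latent_dim: int,
        hidden_layer_sizes: List[int],
        dropout_rate: float,
        activation: nn.Module,
    ) -> None:
        super().__init__()
        layers: List[nn.Module] = [nn.Flatten()]
        # Assumption: input is one-hot, so the flattened width is
        # n_features * num_classes.
        in_dim = n_features * num_classes
        for width in hidden_layer_sizes:
            layers += [nn.Linear(in_dim, width), activation, nn.Dropout(dropout_rate)]
            in_dim = width
        # Final projection to the deterministic latent vector z.
        layers.append(nn.Linear(in_dim, latent_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


encoder = SketchEncoder(
    n_features=10,
    num_classes=3,
    latent_dim=2,
    hidden_layer_sizes=[16, 8],
    dropout_rate=0.2,
    activation=nn.ReLU(),
)
x = torch.zeros(4, 10, 3)  # batch of 4 samples, 10 loci, 3 genotype states
z = encoder(x)             # latent vectors, one row per sample
```

Note that `activation` is passed as an instantiated module, matching the documented parameter, so the same stateless instance is reused between layers.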
- class pgsui.impute.unsupervised.models.autoencoder_model.Decoder(n_features: int, num_classes: int, latent_dim: int, hidden_layer_sizes: List[int], dropout_rate: float, activation: Module)[source]
The Decoder module of a standard Autoencoder.
This module defines the decoder network, which takes a deterministic latent vector and maps it back to the high-dimensional data space, aiming to reconstruct the original input. The architecture typically mirrors the encoder, consisting of a series of fully-connected hidden layers that progressively expand the representation, followed by a final linear layer to produce the reconstructed data.
- __init__(n_features: int, num_classes: int, latent_dim: int, hidden_layer_sizes: List[int], dropout_rate: float, activation: Module) -> None[source]
Initializes the Decoder module.
- Parameters:
n_features (int) – The number of features in the output data (e.g., SNPs).
num_classes (int) – Number of genotype states per locus (2 or 3 in practice).
latent_dim (int) – The dimensionality of the input latent space.
hidden_layer_sizes (List[int]) – A list of integers specifying the size of each hidden layer (typically the reverse of the encoder’s).
dropout_rate (float) – The dropout rate for regularization in the hidden layers.
activation (torch.nn.Module) – An instantiated activation function module (e.g., nn.ReLU()) for the hidden layers.
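A decoder built from these parameters might look like the sketch below. It is illustrative only: `SketchDecoder` is a hypothetical name, and the output shape (B, n_features, num_classes) is an assumption about how the final linear layer is reshaped into per-locus logits.

```python
from typing import List

import torch
from torch import nn


class SketchDecoder(nn.Module):
    """Illustrative MLP decoder (hypothetical; not the pgsui class)."""

    def __init__(
        self,
        n_features: int,
        num_classes: int,
        latent_dim: int,
        hidden_layer_sizes: List[int],
        dropout_rate: float,
        activation: nn.Module,
    ) -> None:
        super().__init__()
        self.n_features = n_features
        self.num_classes = num_classes
        layers: List[nn.Module] = []
        in_dim = latent_dim
        for width in hidden_layer_sizes:
            layers += [nn.Linear(in_dim, width), activation, nn.Dropout(dropout_rate)]
            in_dim = width
        # Final linear layer emits one logit per (locus, genotype state) pair.
        layers.append(nn.Linear(in_dim, n_features * num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        flat = self.net(z)
        # Assumed output layout: (batch, n_features, num_classes).
        return flat.view(-1, self.n_features, self.num_classes)


encoder_sizes = [16, 8]
decoder = SketchDecoder(
    n_features=10,
    num_classes=3,
    latent_dim=2,
    hidden_layer_sizes=encoder_sizes[::-1],  # reverse of the encoder's sizes
    dropout_rate=0.2,
    activation=nn.ReLU(),
)
logits = decoder(torch.zeros(4, 2))
```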
- class pgsui.impute.unsupervised.models.autoencoder_model.AutoencoderModel(n_features: int, prefix: str, *, num_classes: int = 4, hidden_layer_sizes: List[int] | ndarray = [128, 64], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', gamma: Tensor = tensor(2.), device: Literal['cpu', 'gpu', 'mps'] = 'cpu', verbose: bool = False, debug: bool = False)[source]
A standard Autoencoder (AE) model for imputation.
This class combines an Encoder and a Decoder to form a standard autoencoder. The model is trained to learn a compressed, low-dimensional representation of the input data and then reconstruct it as accurately as possible. It is particularly useful for unsupervised dimensionality reduction and data imputation.
Model Architecture and Objective:
- The autoencoder consists of two parts: an encoder, $f_{\theta}$, and a decoder, $g_{\phi}$.
The encoder maps the input data $x$ to a latent representation $z$: $$ z = f_{\theta}(x) $$
The decoder reconstructs the data $\hat{x}$ from the latent representation: $$ \hat{x} = g_{\phi}(z) $$
The model is trained by minimizing a reconstruction loss, $L(x, \hat{x})$, which measures the dissimilarity between the original input and the reconstructed output. This implementation uses a FocalCELoss to handle missing values and class imbalance effectively.
- __init__(n_features: int, prefix: str, *, num_classes: int = 4, hidden_layer_sizes: List[int] | ndarray = [128, 64], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', gamma: Tensor = tensor(2.), device: Literal['cpu', 'gpu', 'mps'] = 'cpu', verbose: bool = False, debug: bool = False)[source]
Initializes the AutoencoderModel.
- Parameters:
n_features (int) – The number of features in the input data (e.g., SNPs).
prefix (str) – A prefix used for logging.
num_classes (int) – Number of genotype states per locus. Defaults to 4 for backward compatibility, but the genotype imputers pass 2 (haploid) or 3 (diploid).
hidden_layer_sizes (List[int] | np.ndarray) – A list of integers specifying the size of each hidden layer in the encoder. The decoder will use the reverse of this structure. Defaults to [128, 64].
latent_dim (int) – The dimensionality of the latent space (bottleneck). Defaults to 2.
dropout_rate (float) – The dropout rate for regularization in hidden layers. Defaults to 0.2.
activation (Literal["relu", "elu", "selu", "leaky_relu"]) – The name of the activation function for hidden layers. Defaults to “relu”.
gamma (torch.Tensor) – The focusing parameter for the focal loss function. Defaults to tensor(2.0).
device (Literal["cpu", "gpu", "mps"]) – The device to run the model on.
verbose (bool) – If True, enables detailed logging.
debug (bool) – If True, enables debug mode.
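To make the role of `gamma` concrete, here is a hedged sketch of a focal cross-entropy loss that skips missing genotypes. It does not reproduce pgsui's FocalCELoss: the function name `focal_ce_sketch`, the use of -1 as the missing-value code, and the mean reduction are all assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F


def focal_ce_sketch(
    logits: torch.Tensor,
    targets: torch.Tensor,
    gamma: float = 2.0,
    ignore_index: int = -1,
) -> torch.Tensor:
    """Focal cross-entropy over observed entries only (illustrative).

    logits:  (B, n_features, num_classes)
    targets: (B, n_features) integer classes; ignore_index marks missing calls
             (the -1 encoding is an assumption of this sketch).
    """
    num_classes = logits.shape[-1]
    flat_logits = logits.reshape(-1, num_classes)
    flat_targets = targets.reshape(-1)
    observed = flat_targets != ignore_index
    if not observed.any():
        # No observed entries: return a zero that still carries a graph.
        return flat_logits.sum() * 0.0
    log_probs = F.log_softmax(flat_logits[observed], dim=-1)
    # Log-probability assigned to the true class of each observed entry.
    log_pt = log_probs.gather(1, flat_targets[observed].unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # (1 - pt)^gamma down-weights entries the model already predicts well.
    loss = -((1.0 - pt) ** gamma) * log_pt
    return loss.mean()


torch.manual_seed(0)
logits = torch.randn(4, 5, 3)
targets = torch.randint(0, 3, (4, 5))
targets[0, 0] = -1  # one missing genotype
focal = focal_ce_sketch(logits, targets, gamma=2.0)
```

With `gamma=0` the modulating factor is 1 and the sketch reduces to ordinary masked cross-entropy; larger values shift the loss toward poorly predicted (often minority-class) entries.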
pgsui.impute.unsupervised.models.nlpca_model module
- class pgsui.impute.unsupervised.models.nlpca_model.NLPCAModel(num_embeddings: int, n_features: int, prefix: str, *, embedding_init: Tensor, num_classes: int = 3, hidden_layer_sizes: List[int] | ndarray = [64, 128], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', device: device | str = 'cpu', verbose: bool = False, debug: bool = False)[source]
Non-linear PCA (NLPCA) model, implemented as Phase 3 of UBP only.
- This model jointly learns, via backpropagation (i.e., the “non-linear refinement” phase of UBP):
V: per-sample latent embeddings (nn.Embedding)
W: decoder network weights (MLP)
Forward maps embeddings -> logits over genotype classes for each locus.
- __init__(num_embeddings: int, n_features: int, prefix: str, *, embedding_init: Tensor, num_classes: int = 3, hidden_layer_sizes: List[int] | ndarray = [64, 128], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', device: device | str = 'cpu', verbose: bool = False, debug: bool = False) -> None[source]
Initialize NLPCAModel.
- Parameters:
num_embeddings (int) – Total number of samples (rows).
n_features (int) – Number of loci/features (columns).
prefix (str) – Logging prefix.
embedding_init (torch.Tensor) – Tensor of shape (num_embeddings, latent_dim) used to initialize V (PCA warm-start).
num_classes (int) – Number of genotype classes (3 diploid, 2 haploid).
hidden_layer_sizes (List[int] | np.ndarray) – Hidden layer widths for the decoder MLP.
latent_dim (int) – Latent embedding dimension.
dropout_rate (float) – Dropout probability within the decoder.
activation (Literal["relu", "elu", "selu", "leaky_relu"]) – Activation function.
device (torch.device | str) – Torch device or device string.
verbose (bool) – Verbose logging.
debug (bool) – Debug logging.
- forward(indices: Tensor | None = None, override_embeddings: Tensor | None = None) -> Tensor[source]
Forward pass mapping latent embeddings -> genotype logits.
- Parameters:
indices (Optional[torch.Tensor]) – Tensor of sample indices, shape (B,).
override_embeddings (Optional[torch.Tensor]) – Direct embeddings, shape (B, latent_dim).
- Returns:
Logits tensor of shape (B, n_features, num_classes).
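The embedding-plus-decoder design described for this class can be sketched as follows. This is an assumption-laden illustration, not the NLPCAModel source: `SketchEmbeddingDecoder` is a hypothetical name, ReLU is hard-coded, and the random `embedding_init` stands in for the PCA warm-start mentioned above.

```python
from typing import Optional, Sequence

import torch
from torch import nn


class SketchEmbeddingDecoder(nn.Module):
    """Per-sample embeddings V feeding an MLP decoder W (illustrative)."""

    def __init__(
        self,
        num_embeddings: int,
        n_features: int,
        embedding_init: torch.Tensor,
        num_classes: int = 3,
        hidden_layer_sizes: Sequence[int] = (64, 128),
        latent_dim: int = 2,
        dropout_rate: float = 0.2,
    ) -> None:
        super().__init__()
        self.n_features = n_features
        self.num_classes = num_classes
        # V: one trainable latent vector per sample (row).
        self.embedding = nn.Embedding(num_embeddings, latent_dim)
        with torch.no_grad():
            self.embedding.weight.copy_(embedding_init)
        # W: decoder MLP from latent space to per-locus class logits.
        layers = []
        in_dim = latent_dim
        for width in hidden_layer_sizes:
            layers += [nn.Linear(in_dim, width), nn.ReLU(), nn.Dropout(dropout_rate)]
            in_dim = width
        layers.append(nn.Linear(in_dim, n_features * num_classes))
        self.decoder = nn.Sequential(*layers)

    def forward(
        self,
        indices: Optional[torch.Tensor] = None,
        override_embeddings: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if override_embeddings is not None:
            v = override_embeddings
        else:
            v = self.embedding(indices)
        return self.decoder(v).view(-1, self.n_features, self.num_classes)


torch.manual_seed(0)
model = SketchEmbeddingDecoder(
    num_embeddings=6, n_features=10, embedding_init=torch.randn(6, 2)
)
logits = model(indices=torch.tensor([0, 3, 5]))
logits.sum().backward()  # gradients reach both V and W
```

Because the embeddings are ordinary parameters, a single backward pass updates the latent coordinates and the decoder weights together, which is the joint learning the class description refers to.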
pgsui.impute.unsupervised.models.ubp_model module
- class pgsui.impute.unsupervised.models.ubp_model.UBPModel(num_embeddings: int, n_features: int, prefix: str, *, embedding_init: Tensor, num_classes: int = 3, hidden_layer_sizes: List[int] | ndarray = [64, 128], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', device: Literal['cpu', 'gpu', 'mps'] = 'cpu', verbose: bool = False, debug: bool = False)[source]
Unsupervised Backpropagation (UBP) Model.
Unlike a standard Autoencoder, UBP does not have an Encoder network. Instead, it learns a latent vector (embedding) for every sample index directly via backpropagation. The ‘forward’ pass acts as a Decoder, mapping indices (via embeddings) to reconstructed outputs.
- __init__(num_embeddings: int, n_features: int, prefix: str, *, embedding_init: Tensor, num_classes: int = 3, hidden_layer_sizes: List[int] | ndarray = [64, 128], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', device: Literal['cpu', 'gpu', 'mps'] = 'cpu', verbose: bool = False, debug: bool = False)[source]
Initialize UBP Model.
- Parameters:
num_embeddings (int) – Total number of samples (rows) in the dataset (n).
n_features (int) – Number of features/SNPs (d).
num_classes (int) – Genotype states (3 for diploid, 2 for haploid).
hidden_layer_sizes (List[int] | np.ndarray) – Sizes of hidden layers for the MLP (W).
latent_dim (int) – Size of the intrinsic/latent vector (t).
- forward(indices: Tensor | None = None, override_embeddings: Tensor | None = None) -> Tensor[source]
Forward pass mapping latent V -> output logits.
- Parameters:
indices (Optional[torch.Tensor]) – Sample indices (shape: [B]) used to look up embeddings. Must be integer dtype.
override_embeddings (Optional[torch.Tensor]) – Direct latent embeddings (shape: [B, latent_dim]) used instead of lookup.
- Returns:
Logits of shape (B, n_features, num_classes).
- Return type:
torch.Tensor
- Raises:
ValueError – If neither indices nor override_embeddings are provided, or shapes mismatch.
TypeError – If indices is provided with non-integer dtype.
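The documented error contract can be illustrated with a small helper that resolves the latent vectors before decoding. This is a sketch only: `resolve_latents` is a hypothetical function, and the exact checks and messages in UBPModel may differ.

```python
from typing import Optional

import torch
from torch import nn

# Integer dtypes accepted for index lookup in this sketch.
_INT_DTYPES = (torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64)


def resolve_latents(
    embedding: nn.Embedding,
    indices: Optional[torch.Tensor] = None,
    override_embeddings: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    """Return latent vectors of shape (B, latent_dim) or raise (illustrative)."""
    if override_embeddings is not None:
        # Direct embeddings take the place of the lookup.
        if (
            override_embeddings.dim() != 2
            or override_embeddings.shape[1] != embedding.embedding_dim
        ):
            raise ValueError("override_embeddings must have shape (B, latent_dim).")
        return override_embeddings
    if indices is None:
        raise ValueError("Provide either indices or override_embeddings.")
    if indices.dtype not in _INT_DTYPES:
        raise TypeError("indices must have an integer dtype.")
    return embedding(indices.long())


embedding = nn.Embedding(5, 2)
latents = resolve_latents(embedding, indices=torch.tensor([0, 4]))
```

Passing float indices such as `torch.tensor([0.0, 1.0])` raises `TypeError` in this sketch, and calling the helper with no arguments beyond the embedding raises `ValueError`.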
pgsui.impute.unsupervised.models.vae_model module
- class pgsui.impute.unsupervised.models.vae_model.Sampling(*args: Any, **kwargs: Any)[source]
A layer that samples from a latent distribution using the reparameterization trick.
- forward(z_mean: Tensor, z_log_var: Tensor) -> Tensor[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
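The reparameterization trick named in the Sampling description is commonly written as z = mean + exp(0.5 * log_var) * eps, with eps drawn from a standard normal. The sketch below shows that standard form; the function name `sample_latent` is hypothetical, and the actual layer may differ in detail.

```python
import torch


def sample_latent(z_mean: torch.Tensor, z_log_var: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(z_mean, exp(z_log_var)) in a differentiable way (illustrative)."""
    eps = torch.randn_like(z_mean)     # noise is sampled outside the graph
    std = torch.exp(0.5 * z_log_var)   # log-variance -> standard deviation
    return z_mean + std * eps


torch.manual_seed(0)
z_mean = torch.zeros(4, 2, requires_grad=True)
z_log_var = torch.zeros(4, 2, requires_grad=True)
z = sample_latent(z_mean, z_log_var)
z.sum().backward()  # gradients flow to both z_mean and z_log_var
```

Because the randomness lives entirely in `eps`, the sample is a deterministic, differentiable function of `z_mean` and `z_log_var`, which is what lets the encoder be trained by backpropagation.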
- class pgsui.impute.unsupervised.models.vae_model.Encoder(n_features: int, num_classes: int, latent_dim: int, hidden_layer_sizes: List[int], dropout_rate: float, activation: Module)[source]
The Encoder module of a Variational Autoencoder (VAE).
- forward(x: Tensor) -> Tuple[Tensor, Tensor, Tensor][source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
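The note about hooks applies to every `torch.nn.Module`. A short demonstration with a generic `nn.Linear` layer (used here as a stand-in, not one of the pgsui classes) shows the difference between calling the instance and calling `forward` directly:

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)
calls = []

# The hook records the output shape each time it fires.
handle = layer.register_forward_hook(
    lambda module, inputs, output: calls.append(tuple(output.shape))
)

x = torch.zeros(1, 3)
layer(x)          # goes through __call__, so the hook runs
layer.forward(x)  # bypasses __call__, so the hook is skipped

handle.remove()
n_hook_calls = len(calls)
```

Only the first call is recorded, which is why the models in this package should be invoked as `model(x)` rather than `model.forward(x)`.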
- class pgsui.impute.unsupervised.models.vae_model.Decoder(n_features: int, num_classes: int, latent_dim: int, hidden_layer_sizes: List[int], dropout_rate: float, activation: Module)[source]
The Decoder module of a Variational Autoencoder (VAE).
- forward(x: Tensor) -> Tensor[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class pgsui.impute.unsupervised.models.vae_model.VAEModel(n_features: int, prefix: str, *, num_classes: int = 4, hidden_layer_sizes: List[int] | ndarray = [128, 64], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', kl_beta: float = 1.0, device: Literal['cpu', 'gpu', 'mps'] = 'cpu', verbose: bool = False, debug: bool = False)[source]
- __init__(n_features: int, prefix: str, *, num_classes: int = 4, hidden_layer_sizes: List[int] | ndarray = [128, 64], latent_dim: int = 2, dropout_rate: float = 0.2, activation: Literal['relu', 'elu', 'selu', 'leaky_relu'] = 'relu', kl_beta: float = 1.0, device: Literal['cpu', 'gpu', 'mps'] = 'cpu', verbose: bool = False, debug: bool = False)[source]
Variational Autoencoder (VAE) model for unsupervised imputation.
- forward(x: Tensor) -> Tuple[Tensor, Tensor, Tensor][source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
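The `kl_beta` argument of VAEModel is not described in this reference. In a beta-VAE style objective it would weight the KL divergence between the approximate posterior and a standard normal prior; the sketch below shows that conventional formulation as an assumption, with hypothetical function names, and does not claim to match how pgsui combines the terms.

```python
import torch


def kl_to_standard_normal(z_mean: torch.Tensor, z_log_var: torch.Tensor) -> torch.Tensor:
    """Closed-form KL(N(mean, exp(log_var)) || N(0, I)), averaged over the batch."""
    per_sample = -0.5 * torch.sum(
        1.0 + z_log_var - z_mean.pow(2) - z_log_var.exp(), dim=1
    )
    return per_sample.mean()


def vae_objective_sketch(
    reconstruction_loss: torch.Tensor,
    z_mean: torch.Tensor,
    z_log_var: torch.Tensor,
    kl_beta: float = 1.0,
) -> torch.Tensor:
    """Reconstruction term plus a kl_beta-weighted KL term (assumed form)."""
    return reconstruction_loss + kl_beta * kl_to_standard_normal(z_mean, z_log_var)


z_mean = torch.ones(2, 2)
z_log_var = torch.zeros(2, 2)
kl = kl_to_standard_normal(z_mean, z_log_var)
total = vae_objective_sketch(torch.tensor(0.5), z_mean, z_log_var, kl_beta=2.0)
```

Under this formulation, `kl_beta=1.0` gives the standard VAE objective, while smaller values relax the pull toward the prior in favor of reconstruction accuracy.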