pgsui.utils package
Submodules
pgsui.utils.misc module
- class pgsui.utils.misc.OptunaParamSpec(keys: FrozenSet[str])[source]
Specification and validation for Optuna objective parameter keys.
- keys
Canonical keys used in the Optuna objective params dict.
- Type:
FrozenSet[str]
- keys: FrozenSet[str]
- validate(params: Mapping[str, Any], *, allow_extra: bool = False) None[source]
Validate that a params mapping matches this spec’s keys.
- Parameters:
params – Mapping of parameter names -> values (typically the objective params dict).
allow_extra – If True, extra keys are allowed; missing keys still error.
- Raises:
TypeError – If params is not a Mapping.
KeyError – If required keys are missing (or extras exist when allow_extra=False).
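The validation rules above can be sketched in plain Python. This is a minimal illustration of the documented behavior, not the library's source; the class name `ParamSpecSketch`, the example keys, and the error messages are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, Mapping


@dataclass(frozen=True)
class ParamSpecSketch:
    """Hypothetical stand-in that mirrors the documented validate() rules."""

    keys: FrozenSet[str]

    def validate(self, params: Mapping[str, Any], *, allow_extra: bool = False) -> None:
        if not isinstance(params, Mapping):
            raise TypeError("params must be a Mapping")
        missing = self.keys - set(params)
        extra = set(params) - self.keys
        # Missing keys always fail; extra keys fail only when allow_extra is False.
        if missing or (extra and not allow_extra):
            raise KeyError(f"missing={sorted(missing)}, extra={sorted(extra)}")


spec = ParamSpecSketch(keys=frozenset({"lr", "dropout"}))
spec.validate({"lr": 1e-3, "dropout": 0.2})  # exact match: passes silently
spec.validate({"lr": 1e-3, "dropout": 0.2, "seed": 1}, allow_extra=True)  # extras tolerated
```

Freezing the dataclass keeps the key set immutable, so one spec can be shared safely across trials.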
- pgsui.utils.misc.validate_input_type(X: DataFrame | ndarray | list | Tensor, return_type: Literal['array', 'df', 'list', 'tensor'] = 'array') DataFrame | ndarray | list | Tensor[source]
Validate the input type and return the data as the requested type.
This function validates the input type and returns the input data as a numpy array, pandas DataFrame, 2D list, or torch.Tensor, depending on return_type.
- Parameters:
X (pandas.DataFrame | numpy.ndarray | list | torch.Tensor) – Input data. Supported types include: pandas.DataFrame, numpy.ndarray, list, and torch.Tensor.
return_type (Literal["array", "df", "list", "tensor"]) – Type of returned object. Supported options include: “df”, “array”, “list”, and “tensor”. “df” corresponds to a pandas DataFrame. “array” corresponds to a numpy array. “list” corresponds to a 2D list. “tensor” corresponds to a torch.Tensor. Defaults to “array”.
- Returns:
Input data as the desired return_type.
- Return type:
pandas.DataFrame | numpy.ndarray | list | torch.Tensor
- Raises:
TypeError – X must be of type pandas.DataFrame, numpy.ndarray, list, or torch.Tensor.
ValueError – Unsupported return_type provided. Supported types are “df”, “array”, “list”, and “tensor”.
- pgsui.utils.misc.detect_computing_device(*, force_cpu: bool = False, verbose: bool = False) device[source]
Detects and returns the best available PyTorch compute device.
Prioritizes CUDA (NVIDIA) > MPS (Apple Silicon) > CPU.
- Parameters:
force_cpu (bool) – If True, forces the device to CPU regardless of available hardware. Defaults to False.
verbose (bool) – If True, prints the selected device to stdout. Defaults to False.
- Returns:
The selected computing device.
- Return type:
torch.device
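The priority order described above can be illustrated without importing PyTorch. The helper below is a hypothetical sketch that takes the availability flags as plain booleans; in a PyTorch program they would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. It returns a device name rather than a `torch.device`.

```python
def pick_device_name(cuda_available: bool, mps_available: bool, *, force_cpu: bool = False) -> str:
    """Hypothetical helper: choose a device name following CUDA > MPS > CPU."""
    if force_cpu:
        return "cpu"  # the override wins regardless of available hardware
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# A machine with both backends available still prefers CUDA.
chosen = pick_device_name(cuda_available=True, mps_available=True)
```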
- pgsui.utils.misc.get_missing_mask(X: DataFrame | Series | ndarray | list | Tensor) DataFrame | Series | ndarray | Tensor[source]
Returns a boolean mask indicating missing values (NaN, None).
Notes: Lists are converted to numpy arrays to compute the mask.
- Parameters:
X – Input data.
- Returns:
Boolean mask of the same shape as X (returned as a DataFrame, Series, ndarray, or Tensor).
- Return type:
pd.DataFrame | pd.Series | np.ndarray | torch.Tensor
- Raises:
TypeError – If input type is not supported.
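As a stand-alone illustration of the missing-value rule (None or NaN), the sketch below builds a boolean mask for a nested list in pure Python. The function names are hypothetical, and the sketch does not reproduce the library's numpy, pandas, or torch code paths.

```python
import math
from typing import Any, List


def is_missing(value: Any) -> bool:
    """True for None or a float NaN; every other value counts as observed."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_mask_2d(rows: List[List[Any]]) -> List[List[bool]]:
    """Element-wise mask with the same nested shape as the input."""
    return [[is_missing(v) for v in row] for row in rows]


mask = missing_mask_2d([[0, None, 2], [float("nan"), 1, 1]])
```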
- pgsui.utils.misc.ensure_2d(X: DataFrame | Series | ndarray | list | Tensor) DataFrame | ndarray | list | Tensor[source]
Ensures the input is at least 2-dimensional.
If input is 1D (e.g., shape (N,)), it is reshaped to (N, 1). Already 2D+ inputs are returned unchanged.
- Parameters:
X (pd.DataFrame | pd.Series | np.ndarray | list | torch.Tensor) – Input data.
- Returns:
Input data transformed to be at least 2D.
- Return type:
pd.DataFrame | np.ndarray | list | torch.Tensor
- Raises:
TypeError – If input type is not supported.
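For list input, the (N,) to (N, 1) reshape can be pictured as wrapping each scalar in its own row. The helper below is a hypothetical, list-only sketch of that idea and makes no claim about how the library handles DataFrames, arrays, or tensors.

```python
from typing import Any, List


def ensure_2d_list(x: List[Any]) -> List[Any]:
    """Hypothetical list-only version: turn a flat list into a single column."""
    if x and not isinstance(x[0], list):
        return [[v] for v in x]  # shape (N,) becomes (N, 1)
    return x  # already nested (or empty): returned unchanged


column = ensure_2d_list([0, 1, 2])
unchanged = ensure_2d_list([[0, 1], [2, 3]])
```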
- pgsui.utils.misc.flatten_1d(y: DataFrame | Series | ndarray | list | Tensor) Series | ndarray | list | Tensor[source]
Flattens input to a 1D structure.
- Parameters:
y (pd.DataFrame | pd.Series | np.ndarray | list | torch.Tensor) – Input data.
- Returns:
1D representation of the input.
- Return type:
pd.Series | np.ndarray | list | torch.Tensor
Notes
Inputs with multiple columns (e.g., DataFrame with >1 column) are flattened into a single 1D structure.
- Raises:
TypeError – If input type is not supported.
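A minimal sketch of flattening a nested list into one dimension follows. It assumes row-major order (rows are concatenated left to right), which is an assumption rather than something the documentation above states; the function name is hypothetical.

```python
from typing import Any, List


def flatten_list(y: List[Any]) -> List[Any]:
    """Recursively flatten nested lists into a single flat list (row-major)."""
    flat: List[Any] = []
    for item in y:
        if isinstance(item, list):
            flat.extend(flatten_list(item))
        else:
            flat.append(item)
    return flat


flat = flatten_list([[0, 1], [2, 3]])
```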
- pgsui.utils.misc.safe_shape(X: DataFrame | Series | ndarray | list | Tensor) tuple[int, ...][source]
Returns the shape of the input container as a tuple.
- Parameters:
X (pd.DataFrame | pd.Series | np.ndarray | list | torch.Tensor) – Input data.
- Returns:
Dimensions of the data (rows, cols, etc.).
- Return type:
tuple[int, …]
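For nested lists, which have no `.shape` attribute, a shape can be derived by walking the first element at each depth. The sketch below is a hypothetical illustration that assumes rectangular (non-ragged) input.

```python
from typing import Any, Tuple


def list_shape(x: Any) -> Tuple[int, ...]:
    """Shape of a rectangular nested list, read from the first element per level."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0] if x else None  # stop descending once a level is empty
    return tuple(shape)


shape_2d = list_shape([[1, 2, 3], [4, 5, 6]])
shape_1d = list_shape([1, 2])
```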
pgsui.utils.plotting module
- class pgsui.utils.plotting.Plotting(model_name: str, *, prefix: str = 'pgsui', plot_format: Literal['pdf', 'png', 'jpeg', 'jpg', 'svg'] = 'pdf', plot_fontsize: int = 18, plot_dpi: int = 300, title_fontsize: int = 20, despine: bool = True, show_plots: bool = False, verbose: int = 0, debug: bool = False, multiqc: bool = False, multiqc_section: str | None = None)[source]
Class for plotting imputer scoring and results.
This class is used to plot the performance metrics of imputation models. It can plot ROC and Precision-Recall curves, model history, and the distribution of genotypes in the dataset.
Example
>>> from pgsui import Plotting
>>> plotter = Plotting(model_name="ImputeVAE", prefix="pgsui_test", plot_format="png")
>>> plotter.plot_metrics(y_true, y_pred_proba, metrics)
>>> plotter.plot_history(history)
>>> plotter.plot_confusion_matrix(y_true_1d, y_pred_1d)
>>> plotter.plot_tuning(study, model_name, optimize_dir, target_name="Objective Value")
>>> plotter.plot_gt_distribution(df)
- model_name
Name of the model.
- Type:
str
- prefix
Prefix for the output directory.
- Type:
str
- plot_format
Format for the plots (‘pdf’, ‘png’, ‘jpeg’, ‘jpg’, ‘svg’).
- Type:
Literal[“pdf”, “png”, “jpeg”, “jpg”, “svg”]
- plot_fontsize
Font size for the plots.
- Type:
int
- plot_dpi
Dots per inch for the plots.
- Type:
int
- title_fontsize
Font size for the plot titles.
- Type:
int
- show_plots
Whether to display the plots inline or during execution.
- Type:
bool
- output_dir
Directory where plots will be saved.
- Type:
Path
- logger
Logger instance for logging messages.
- Type:
logging.Logger
- __init__(model_name: str, *, prefix: str = 'pgsui', plot_format: Literal['pdf', 'png', 'jpeg', 'jpg', 'svg'] = 'pdf', plot_fontsize: int = 18, plot_dpi: int = 300, title_fontsize: int = 20, despine: bool = True, show_plots: bool = False, verbose: int = 0, debug: bool = False, multiqc: bool = False, multiqc_section: str | None = None) None[source]
Initialize the Plotting object.
- Parameters:
model_name (str) – Name of the model.
prefix (str, optional) – Prefix for the output directory. Defaults to ‘pgsui’.
plot_format (Literal["pdf", "png", "jpeg", "jpg", "svg"]) – Format for the plots (‘pdf’, ‘png’, ‘jpeg’, ‘jpg’, ‘svg’). Defaults to ‘pdf’.
plot_fontsize (int) – Font size for the plots. Defaults to 18.
plot_dpi (int) – Dots per inch for the plots. Defaults to 300.
title_fontsize (int) – Font size for the plot titles. Defaults to 20.
despine (bool) – Whether to remove the top and right spines from the plots. Defaults to True.
show_plots (bool) – Whether to display the plots. Defaults to False.
verbose (int) – Verbosity level for logging. Defaults to 0.
debug (bool) – Whether to enable debug mode. Defaults to False.
multiqc (bool) – Whether to queue plots for a MultiQC HTML report. Defaults to False.
multiqc_section (Optional[str]) – Section name to use in MultiQC. Defaults to ‘PG-SUI (<model_name>)’.
- plot_tuning(study: Study, model_name: str, optimize_dir: Path, target_name: str = 'Objective Value') None[source]
Plot the optimization history of a study.
This method plots the optimization history of a study. The plot is saved to disk as a <plot_format> file.
- Parameters:
study (optuna.study.Study) – Optuna study object.
model_name (str) – Name of the model.
optimize_dir (Path) – Directory to save the optimization plots.
target_name (str) – Name of the target value. Defaults to ‘Objective Value’.
- plot_metrics(y_true: ndarray, y_pred_proba: ndarray, metrics: Dict[str, float], label_names: Sequence[str] | None = None, prefix: str = '') None[source]
Plot multi-class ROC-AUC and Precision-Recall curves.
This method plots the multi-class ROC-AUC and Precision-Recall curves. The plot is saved to disk as a <plot_format> file.
- Parameters:
y_true (np.ndarray) – 1D array of true integer labels in [0, n_classes-1].
y_pred_proba (np.ndarray) – (n_samples, n_classes) array of predicted probabilities.
metrics (Dict[str, float]) – Dict of summary metrics to annotate the figure.
label_names (Optional[Sequence[str]]) – Optional sequence of class names (length must equal n_classes). If provided, legends will use these names instead of ‘Class i’.
prefix (str) – Optional prefix for the output filename.
- Raises:
ValueError – If model_name is not recognized (legacy guard).
- plot_history(history: dict[str, list[float]] | dict[str, dict[str, list[float]]]) None[source]
Plot model history traces.
This method plots the history traces of a deep learning model. The plot is saved to disk as a <plot_format> file.
- Parameters:
history (dict[str, list[float]] | dict[str, dict[str, list[float]]]) – Dictionary with lists of history objects. Keys should be “Train” and “Validation”.
- Raises:
ValueError – self.model_name must be either ‘ImputeAutoencoder’ or ‘ImputeVAE’.
ValueError – history object passed to ‘plot_history’ is empty.
TypeError – history must be a dict containing {‘Train’, ‘Val’} or {‘Phase2’, ‘Phase3’}.
ValueError – For ImputeUBP, history must contain ‘Phase2’ and ‘Phase3’ keys.
- plot_confusion_matrix(y_true_1d: ndarray | DataFrame | List[str | int] | Tensor, y_pred_1d: ndarray | DataFrame | List[str | int] | Tensor, label_names: Sequence[str] | Dict[str, int] | None = None, prefix: str = '') None[source]
Plot a confusion matrix with optional class labels.
This method plots a confusion matrix using true and predicted labels. The plot is saved to disk as a <plot_format> file.
- Parameters:
y_true_1d (np.ndarray | pd.DataFrame | list | torch.Tensor) – 1D array of true integer labels in [0, n_classes-1].
y_pred_1d (np.ndarray | pd.DataFrame | list | torch.Tensor) – 1D array of predicted integer labels in [0, n_classes-1].
label_names (Sequence[str] | Dict[str, int] | None) – Optional sequence of class names (length must equal n_classes). If provided, both the internal label order and displayed tick labels will respect this order (assumed to be 0..n-1).
prefix (str) – Optional prefix for the output filename.
Notes
If label_names is None, the display labels default to the numeric class indices inferred from y_true_1d ∪ y_pred_1d.
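The label inference described in the note can be sketched as follows: the class set is the union of the labels seen in the true and predicted vectors, and counts are accumulated with true labels as rows and predicted labels as columns. This is a pure-Python illustration with a hypothetical function name, not the plotting code itself.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[List[int], List[List[int]]]:
    """Count matrix with rows = true labels and columns = predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))  # union of observed classes
    index: Dict[int, int] = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix


labels, matrix = confusion_counts([0, 1, 2, 2, 1], [0, 2, 2, 2, 1])
# labels -> [0, 1, 2]
# matrix -> [[1, 0, 0], [0, 1, 1], [0, 0, 2]]
```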
- plot_gt_distribution(X: ndarray | DataFrame | list | Tensor, X_compare: ndarray | DataFrame | list | Tensor | None = None, is_imputed: bool = False) None[source]
Plot genotype distribution, optionally comparing two datasets.
Plots counts for all genotypes. If X_compare is provided, it plots side-by-side bars and calculates the Jensen-Shannon distance between the distributions.
- Parameters:
X (np.ndarray | pd.DataFrame | list | torch.Tensor) – Primary genotype matrix (usually the imputed/final one).
X_compare (np.ndarray | pd.DataFrame | list | torch.Tensor | None) – Optional baseline genotype matrix to compare against (e.g., the original dataset with missing values).
is_imputed (bool) – Labeling flag. If True, X is labeled “Imputed”.
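The Jensen-Shannon distance mentioned above is the square root of the Jensen-Shannon divergence between two probability vectors. The sketch below computes it for two flat genotype lists in pure Python. The function names and the genotype codes are hypothetical, and the base-2 logarithm (which bounds the distance to [0, 1]) is an assumption; the library may use a different base.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def genotype_proportions(genotypes: Sequence[Hashable], categories: Sequence[Hashable]) -> List[float]:
    """Relative frequency of each category, in the order given by `categories`."""
    counts = Counter(genotypes)
    total = sum(counts[c] for c in categories)
    return [counts[c] / total for c in categories]


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance (base 2) between two probability vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with zero probability contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


categories = [0, 1, 2]  # hypothetical genotype codes
original = genotype_proportions([0, 0, 1, 2, 0, 1], categories)
imputed = genotype_proportions([0, 0, 1, 2, 0, 0], categories)
distance = js_distance(original, imputed)
```

A distance of 0 means the two genotype distributions are identical; larger values indicate that imputation shifted the genotype frequencies.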
pgsui.utils.scorers module
- class pgsui.utils.scorers.Scorer(prefix: str, average: Literal['macro', 'weighted'] = 'macro', verbose: bool = False, debug: bool = False)[source]
Class for evaluating the performance of a model using various metrics.
This class evaluates model predictions against ground truth labels using metrics such as accuracy, F1 score, precision, recall, average precision, and ROC AUC. It can also be used in objective mode for hyperparameter tuning.
- __init__(prefix: str, average: Literal['macro', 'weighted'] = 'macro', verbose: bool = False, debug: bool = False) None[source]
Initialize a Scorer object.
- Parameters:
prefix (str) – Prefix for logging messages.
average (Literal["macro", "weighted"]) – Average method for metrics. Must be one of ‘macro’ or ‘weighted’.
verbose (bool) – Whether to enable verbose logging. Default is False.
debug (bool) – Debug mode for logging messages. Default is False.
- Raises:
ValueError – If the average parameter is invalid. Must be one of ‘macro’ or ‘weighted’.
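To show what the two averaging modes mean, the sketch below computes per-class F1 by hand and then combines the scores either as a plain mean (‘macro’) or as a mean weighted by each class's number of true samples (‘weighted’). It is a pure-Python illustration with hypothetical function names; it is not the Scorer implementation.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """F1 for each class, treating that class as positive and the rest as negative."""
    scores: Dict[int, float] = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def averaged_f1(y_true: Sequence[int], y_pred: Sequence[int], average: str = "macro") -> float:
    if average not in ("macro", "weighted"):
        raise ValueError("average must be 'macro' or 'weighted'")
    scores = per_class_f1(y_true, y_pred)
    if average == "macro":
        return sum(scores.values()) / len(scores)  # every class counts equally
    support = {c: sum(t == c for t in y_true) for c in scores}
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
macro = averaged_f1(y_true, y_pred, "macro")        # (0.8 + 2/3) / 2
weighted = averaged_f1(y_true, y_pred, "weighted")  # (3 * 0.8 + 1 * 2/3) / 4
```

With imbalanced genotype classes, ‘macro’ gives rare classes the same influence as common ones, while ‘weighted’ tracks the majority class more closely.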
- accuracy(y_true: ndarray, y_pred: ndarray) float[source]
Calculate the accuracy of the model.
This method calculates the accuracy of the model by comparing the ground truth labels with the predicted labels.
- Parameters:
y_true (np.ndarray) – Ground truth labels.
y_pred (np.ndarray) – Predicted labels.
- Returns:
Accuracy score.
- Return type:
float
- f1(y_true: ndarray, y_pred: ndarray) float[source]
Calculate the F1 score of the model.
This method calculates the F1 score of the model by comparing the ground truth labels with the predicted labels.
- Parameters:
y_true (np.ndarray) – Ground truth labels.
y_pred (np.ndarray) – Predicted labels.
- Returns:
F1 score.
- Return type:
float
- precision(y_true: ndarray, y_pred: ndarray) float[source]
Calculate the precision of the model.
This method calculates the precision of the model by comparing the ground truth labels with the predicted labels.
- Parameters:
y_true (np.ndarray) – Ground truth labels.
y_pred (np.ndarray) – Predicted labels.
- Returns:
Precision score.
- Return type:
float
- recall(y_true: ndarray, y_pred: ndarray) float[source]
Calculate the recall of the model.
This method calculates the recall of the model by comparing the ground truth labels with the predicted labels.
- Parameters:
y_true (np.ndarray) – Ground truth labels.
y_pred (np.ndarray) – Predicted labels.
- Returns:
Recall score.
- Return type:
float
- roc_auc(y_true: ndarray, y_pred_proba: ndarray) float[source]
Multiclass ROC-AUC with label targets.
This method calculates the ROC-AUC score for multiclass classification problems. It handles both 1D integer labels and 2D one-hot/indicator matrices for the ground truth labels.
- Parameters:
y_true (np.ndarray) – 1D integer labels (shape: [n]). If a one-hot/indicator matrix is supplied, it is converted to labels.
y_pred_proba (np.ndarray) – 2D probabilities (shape: [n, n_classes]).
- evaluate(y_true: ndarray | Tensor | list, y_pred: ndarray | Tensor | list, y_true_ohe: ndarray | Tensor | list, y_pred_proba: ndarray | Tensor | list, objective_mode: bool = False, tune_metric: Literal['pr_macro', 'roc_auc', 'average_precision', 'accuracy', 'f1', 'precision', 'recall', 'mcc', 'jaccard'] = 'pr_macro') Dict[str, float] | None[source]
Evaluate the model using various metrics.
This method evaluates model predictions against ground truth labels using metrics such as accuracy, F1 score, precision, recall, average precision, and ROC AUC. It can also be run in objective mode for hyperparameter tuning.
- Parameters:
y_true (np.ndarray | torch.Tensor | list) – Ground truth labels.
y_pred (np.ndarray | torch.Tensor | list) – Predicted labels.
y_true_ohe (np.ndarray | torch.Tensor | list) – One-hot encoded ground truth labels.
y_pred_proba (np.ndarray | torch.Tensor | list) – Predicted probabilities.
objective_mode (bool) – Whether to use objective mode for evaluation. Default is False.
tune_metric (Literal["pr_macro", "roc_auc", "average_precision", "accuracy", "f1", "precision", "recall", "mcc", "jaccard"]) – Metric to use for tuning. Ignored if objective_mode is False. Default is ‘pr_macro’.
- Returns:
Dictionary of evaluation metrics. Keys are ‘accuracy’, ‘f1’, ‘precision’, ‘recall’, ‘roc_auc’, ‘average_precision’, and ‘pr_macro’.
- Return type:
Dict[str, float]
- Raises:
ValueError – If the input data is invalid.
ValueError – If an invalid tune_metric is provided.
- jaccard(y_true: ndarray, y_pred: ndarray) float[source]
Compute the Jaccard similarity coefficient.
The Jaccard similarity coefficient, also known as Intersection over Union (IoU), measures the similarity between two sets. It is defined as the size of the intersection divided by the size of the union of the sample sets.
- Parameters:
y_true (np.ndarray) – Ground truth (correct) target values.
y_pred (np.ndarray) – Predicted target values.
- Returns:
Jaccard similarity coefficient.
- Return type:
float
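For a single class, the intersection-over-union definition above reduces to comparing the set of sample positions labeled with that class in the ground truth against the set labeled with it in the predictions. The sketch below is a hypothetical per-class illustration; how the library averages across classes is not shown here.

```python
from typing import Sequence


def jaccard_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """|intersection| / |union| of the sample positions assigned to `cls`."""
    true_idx = {i for i, t in enumerate(y_true) if t == cls}
    pred_idx = {i for i, p in enumerate(y_pred) if p == cls}
    union = true_idx | pred_idx
    return len(true_idx & pred_idx) / len(union) if union else 0.0


y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
j0 = jaccard_for_class(y_true, y_pred, 0)  # 2 shared positions out of 3 -> 2/3
j1 = jaccard_for_class(y_true, y_pred, 1)  # 1 shared position out of 2 -> 0.5
```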
- mcc(y_true: ndarray, y_pred: ndarray) float[source]
Compute the Matthews correlation coefficient (MCC).
MCC is a balanced measure that can be used even if the classes are of very different sizes. It returns a value between -1 and +1, where +1 indicates a perfect prediction, 0 indicates no better than random prediction, and -1 indicates total disagreement between prediction and observation.
- Parameters:
y_true (np.ndarray) – Ground truth (correct) target values.
y_pred (np.ndarray) – Predicted target values.
- Returns:
Matthews correlation coefficient.
- Return type:
float
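The range of MCC is easiest to see in the binary case, where it is computed from the four confusion-matrix counts. The sketch below implements that binary formula in pure Python as an illustration. The function name is hypothetical, and the binary restriction is a simplification, since genotype labels are typically multiclass.

```python
import math
from typing import Sequence


def binary_mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 when undefined


perfect = binary_mcc([0, 1, 0, 1], [0, 1, 0, 1])   # 1.0
inverted = binary_mcc([0, 1, 0, 1], [1, 0, 1, 0])  # -1.0
```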
- average_precision(y_true: ndarray, y_pred_proba: ndarray) float[source]
Average precision with safe multiclass handling.
If y_true is 1D of class indices, it is binarized against the number of columns in y_pred_proba. If y_true is already one-hot or indicator, it is used as-is.
- Parameters:
y_true (np.ndarray) – Ground truth labels (1D class indices or 2D one-hot/indicator).
y_pred_proba (np.ndarray) – Predicted probabilities (2D array).
- Returns:
Average precision score.
- Return type:
float
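The binarization step described above can be pictured as one-hot encoding the class indices, with the number of columns taken from the probability matrix. The helper below is a hypothetical pure-Python sketch of that step only.

```python
from typing import List, Sequence


def one_hot(labels: Sequence[int], n_classes: int) -> List[List[int]]:
    """Binarize 1D class indices into an (n_samples, n_classes) indicator matrix."""
    return [[1 if label == k else 0 for k in range(n_classes)] for label in labels]


y_pred_proba = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
n_classes = len(y_pred_proba[0])  # column count of the probability matrix
y_true_ohe = one_hot([0, 2, 1], n_classes)
# y_true_ohe -> [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Taking the column count from y_pred_proba keeps the indicator matrix aligned with the predictions even when a class is absent from y_true.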
- pr_macro(y_true_ohe: ndarray, y_pred_proba: ndarray) float[source]
Macro-averaged average precision (precision-recall AUC) across classes.
- Parameters:
y_true_ohe (np.ndarray) – One-hot encoded ground truth labels (2D array).
y_pred_proba (np.ndarray) – Predicted probabilities (2D array).
- Returns:
Macro-averaged average precision score.
- Return type:
float
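Macro averaging here means computing average precision separately for each class column and taking the unweighted mean. The sketch below does this in pure Python, ranking samples by descending score and averaging the precision at each true positive. The function names are hypothetical, and tied scores are not given special treatment, which is a simplification.

```python
from typing import List, Sequence


def average_precision_binary(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@rank over the ranks where a positive sample appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0  # no positives: treated as 0.0 in this sketch
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank
    return total / n_pos


def macro_average_precision(y_true_ohe: List[List[int]], y_pred_proba: List[List[float]]) -> float:
    """Unweighted mean of the per-class average precision."""
    n_classes = len(y_pred_proba[0])
    per_class = [
        average_precision_binary([row[k] for row in y_true_ohe], [row[k] for row in y_pred_proba])
        for k in range(n_classes)
    ]
    return sum(per_class) / n_classes


y_true_ohe = [[1, 0], [0, 1], [1, 0], [0, 1]]
y_pred_proba = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
score = macro_average_precision(y_true_ohe, y_pred_proba)  # both classes score 5/6
```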