cellmaps_vnn package

Submodules

cellmaps_vnn.annotate module

class cellmaps_vnn.annotate.VNNAnnotate(outdir, model_predictions, disease=None, hierarchy=None, parent_network=None, ndexserver='ndexbio.org', ndexuser=None, ndexpassword='-', visibility=False, slurm=False, slurm_partition=None, slurm_account=None)[source]

Bases: object

Constructor. Sets up the hierarchy path either directly from the arguments or by looking for a hierarchy.cx2 file in the first RO-Crate directory provided. If neither is found, raises an error.

Raises:: CellmapsvnnError – If no hierarchy path is specified or found.

COMMAND = 'annotate'

DEFAULT_NDEX_SERVER = 'ndexbio.org'

DEFAULT_PASSWORD = '-'

static add_subparser(subparsers)[source]: Adds a subparser for the ‘annotate’ command.

register_outputs(outdir, description, keywords, provenance_utils)[source]

Registers the output files of the annotation process with the FAIRSCAPE service for data provenance. This includes the annotated hierarchy and the RLIPP output files.

Parameters:

outdir (str) – The output directory where the files are stored.
description (str) – A description of the files for provenance registration.
keywords (list) – A list of keywords associated with the files.
provenance_utils (ProvenanceUtility) – The utility class for provenance registration.

Returns:

A list of dataset IDs assigned to the registered files.

Return type:

list

run()[source]: The logic for annotating hierarchy with prediction results from cellmaps_vnn. It aggregates prediction scores from models, optionally filters them for a specific disease, and annotates the hierarchy with these scores.

cellmaps_vnn.ccc_loss module

class cellmaps_vnn.ccc_loss.CCCLoss(eps=1e-06)[source]

Bases: Module

A PyTorch module for calculating the Concordance Correlation Coefficient (CCC) Loss.

The CCC Loss is a measure used in regression tasks to evaluate the agreement between two variables.

Initializes the CCCLoss module.

Parameters:: eps (float) – A small epsilon value for numerical stability. Default is 1e-6.

forward(y_true, y_hat)[source]

Computes the CCC loss given true and predicted values.

Parameters:

y_true (Tensor) – The true values.
y_hat (Tensor) – The predicted values.

Returns:

The calculated CCC loss.

Return type:

Tensor
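As an illustration of the quantity behind this loss, the following is a minimal pure-Python sketch of the standard concordance correlation coefficient. It is not the module's implementation: the function name concordance_cc is hypothetical, the placement of eps in the denominator is an assumption, and this reference does not state whether forward returns the coefficient itself or a transformation such as 1 - CCC.

```python
from statistics import fmean


def concordance_cc(y_true, y_hat, eps=1e-6):
    """Concordance correlation coefficient of two equal-length sequences."""
    n = len(y_true)
    mean_true = fmean(y_true)
    mean_hat = fmean(y_hat)
    # Population (biased) variances and covariance.
    var_true = sum((a - mean_true) ** 2 for a in y_true) / n
    var_hat = sum((b - mean_hat) ** 2 for b in y_hat) / n
    cov = sum((a - mean_true) * (b - mean_hat)
              for a, b in zip(y_true, y_hat)) / n
    # eps (assumed placement) keeps the denominator non-zero for constant inputs.
    denom = var_true + var_hat + (mean_true - mean_hat) ** 2 + eps
    return 2.0 * cov / denom


# Identical sequences agree almost perfectly (slightly below 1 because of eps).
perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# A constant offset lowers the coefficient even though the correlation is perfect.
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike a plain correlation, the coefficient penalizes a shift between predictions and targets, which is why the shifted example scores well below the identical one.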

cellmaps_vnn.cellmaps_vnncmd module

cellmaps_vnn.cellmaps_vnncmd.main(args)[source]

Main entry point for program

Parameters:: args (list) – arguments passed on the command line, usually sys.argv[1:]
Returns:: return value of cellmaps_vnn.runner.CellmapsvnnRunner.run() or 2 if an exception is raised
Return type:: int

cellmaps_vnn.cellmaps_vnncmd.set_arguments_from_config_and_defaults(theargs, config)[source]: Sets default values for arguments if not already set.

cellmaps_vnn.constants module

Contains constants used by cellmaps vnn

cellmaps_vnn.constants.CRHO_SCORE = 'C_rho': C rho score

cellmaps_vnn.constants.C_PVAL_SCORE = 'C_pval': C pval score

cellmaps_vnn.constants.DEFAULT_CUDA = 0: Set of constants for VNNTrain and VNNPredict

cellmaps_vnn.constants.EDGE_IMPORTANCE_SCORE = 'edge_importance_score': Name of the edge importance score attribute

cellmaps_vnn.constants.GENE_IMPORTANCE_SCORE = 'importance_score': Gene importance scores

cellmaps_vnn.constants.GENE_RHO_FILE = 'gene_rho.out': Output file for gene Rho from rlipp algorithm

cellmaps_vnn.constants.GENE_SET_COLUMN_NAME = 'CD_MemberList': Name of the hierarchy node attribute that holds the list of genes/proteins of the node.

cellmaps_vnn.constants.GENE_SET_WITH_DATA = 'VNN_gene_set_with_data': Hierarchy node attribute that contains the genes with available data (e.g. mutation, deletion, amplification) for the VNN model

cellmaps_vnn.constants.HIERARCHY_FILENAME = 'hierarchy.cx2': Hierarchy filename.

cellmaps_vnn.constants.IMPORTANCE_SCORE = 'importance_score': Importance score (set to P_rho currently)

cellmaps_vnn.constants.ORIGINAL_HIERARCHY_FILENAME = 'original_hierarchy.cx2': Original hierarchy filename.

cellmaps_vnn.constants.PARENT_NETWORK_NAME = 'hierarchy_parent.cx2': Parent network of hierarchy filename.

cellmaps_vnn.constants.PRHO_SCORE = 'P_rho': P rho score

cellmaps_vnn.constants.P_PVAL_SCORE = 'P_pval': P pval score

cellmaps_vnn.constants.RLIPP_OUTPUT_FILE = 'rlipp.out': Output file from rlipp algorithm

cellmaps_vnn.constants.RLIPP_SCORE = 'RLIPP': RLIPP score

cellmaps_vnn.constants.SCORE_FILE_NAME_SUFFIX = '_gene_scores.out': Suffix for gene score file

cellmaps_vnn.constants.SYSTEM_INTERACTOME_FILE_SUFFIX = '_interactome.cx2': Suffix for system’s interactome file name

cellmaps_vnn.data_wrapper module

class cellmaps_vnn.data_wrapper.TrainingDataWrapper(outdir, inputdir, gene_attribute_name, training_data, cell2id, gene2id, mutations, cn_deletions, cn_amplifications, modelfile, genotype_hiddens, lr, wd, alpha, epoch, batchsize, cuda, zscore_method, stdfile, patience, delta, min_dropout_layer, dropout_fraction, hierarchy=None)[source]

Bases: object

Initializes the TrainingDataWrapper object with configuration and training data parameters.

cellmaps_vnn.exceptions module

exception cellmaps_vnn.exceptions.CellmapsvnnError[source]

Bases: Exception

Base exception for cellmaps_vnn

cellmaps_vnn.predict module

class cellmaps_vnn.predict.VNNPredict(outdir, inputdir, config_file=None, predict_data=None, gene2id=None, cell2id=None, mutations=None, cn_deletions=None, cn_amplifications=None, batchsize=64, zscore_method='auc', cpu_count=1, drug_count=0, genotype_hiddens=4, cuda=0, std=None, slurm=False, use_gpu=False, slurm_partition=None, slurm_account=None)[source]

Bases: object

Constructor for predicting with a trained model.

COMMAND = 'predict'

DEFAULT_CPU_COUNT = 1

DEFAULT_DRUG_COUNT = 0

static add_subparser(subparsers)[source]: Adds a subparser for the ‘predict’ command.

predict(predict_data, model_file, hidden_folder, batch_size, cell_features=None)[source]

Perform prediction using the trained model.

Parameters:

predict_data – Tuple of features and labels for prediction.
model_file – Path to the trained model file.
hidden_folder – Directory to store hidden layer outputs.
batch_size – Size of each batch for prediction.
cell_features – Additional cell features for prediction.

register_outputs(outdir, description, keywords, provenance_utils)[source]

Registers all output files (predictions, feature gradients, and hidden files) with the FAIRSCAPE service for data provenance.

Parameters:

outdir – The directory where the output files are stored.
description – Description for the output files.
keywords – List of keywords associated with the files.
provenance_utils – The utility class for provenance registration.

Returns:

A list of dataset IDs for the registered files.

run()[source]

The logic for running predictions with the model. It executes the prediction process using the trained model and input data.

Raises:: CellmapsvnnError – If an error occurs during the prediction process.

cellmaps_vnn.rlipp_calculator module

class cellmaps_vnn.rlipp_calculator.RLIPPCalculator(outdir, hierarchy, test_data, predicted_data, gene2idfile, cell2idfile, hidden_dir, cpu_count, num_hiddens_genotype, drug_count, excluded_terms=[])[source]

Bases: ImportanceScoreCalculator

A calculator for Relative Local Improvement in Predictive Power (RLIPP) scores.

Parameters:

outdir (str) – Output directory for the RLIPP scores and gene correlations.
hierarchy (CX2Network) – A hierarchy in HCX format.
test_data (str)
predicted_data (str) – Path to the file containing predicted values.
gene2idfile (str) – Path to the file mapping genes to IDs.
cell2idfile (str) – Path to the file mapping cells to IDs.
hidden_dir (str) – Directory containing hidden layer outputs.
rlipp_file (str) – Path of the output file where results of the rlipp algorithm will be saved.
gene_rho_file (str) – Path of the output file where gene rho scores will be saved.
cpu_count (int) – Number of available cores.
num_hiddens_genotype (int) – Mapping for the number of neurons in each term in genotype parts.
drug_count (int) – Number of top performing drugs.

Constructor

calc_gene_rho(gene_features, position_map, gene, drug)[source]

Calculates Spearman correlation between gene embeddings and predicted AUC.

Parameters:

gene_features (numpy.ndarray) – The features for the gene.
position_map (list) – A list of positions for which correlation is to be calculated.
gene (str) – The gene for which correlation is calculated.
drug (str) – The drug for which correlation is calculated.

Returns:

A formatted string containing the gene, Spearman correlation, and p-value.

Return type:

str
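For readers unfamiliar with the statistic, a Spearman correlation is a Pearson correlation computed on ranks. The sketch below shows that idea in pure Python; the helper names are hypothetical, it does not compute the p-value that calc_gene_rho reports, and it is not the code used by this class.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for constant input."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


tied_ranks = average_ranks([10, 20, 20, 30])
rho_increasing = spearman_rho([1, 2, 3, 4], [10, 20, 25, 100])
rho_decreasing = spearman_rho([1, 2, 3], [3, 2, 1])
```

Because only ranks matter, any strictly increasing relationship yields a correlation of 1, regardless of how nonlinear it is.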

calc_scores()[source]

Calculates RLIPP scores for top n drugs (n = drug_count), and prints the result in “Drug Term P_rho C_rho RLIPP” format.

This method runs the calculation in parallel for efficiency.

calc_term_rlipp(term_features, term_child_features, position_map, term, drug)[source]

Calculates the RLIPP score for a given term and drug.

Parameters:

term_features (numpy.ndarray) – The features for the parent term.
term_child_features (list) – The features for the children of the term.
position_map (list) – A list of positions for which RLIPP is to be calculated.
term (str) – The term for which RLIPP is calculated.
drug (str) – The drug for which RLIPP is calculated.

Returns:

A formatted string containing the term, Spearman correlations, p-values, and RLIPP score.

Return type:

str
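This reference does not state how the RLIPP score is derived from the parent correlation (P_rho) and the child correlation (C_rho). The sketch below assumes the relative-improvement form, (P_rho - C_rho) / C_rho; the implementation in this class may differ (for example, it could use a plain ratio), and the function name is hypothetical.

```python
def relative_improvement(parent_rho, child_rho):
    """Relative gain of the parent's predictive power over its children's.

    Assumed formula for illustration only: (parent - child) / child.
    """
    if child_rho == 0:
        raise ValueError("child_rho must be non-zero")
    return (parent_rho - child_rho) / child_rho


# A parent term that predicts somewhat better than its children combined.
score = relative_improvement(0.6, 0.5)
# A parent term that adds nothing over its children.
no_gain = relative_improvement(0.5, 0.5)
```

Under this reading, a positive score means the parent term's embedding predicts the response better than the concatenated embeddings of its children.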

create_child_feature_map(feature_map, term)[source]

Creates a map of child features for a given term.

Parameters:

feature_map (dict) – A dictionary mapping terms/genes to their features.
term (str) – The term for which child features are to be created.

Returns:

A list of child features for the given term.

Return type:

list

create_drug_corr_map_sorted(drug_pos_map)[source]

Creates a sorted mapping of drugs to their Spearman correlation values.

Parameters:: drug_pos_map (dict) – A dictionary mapping drugs to their positions in the test data.
Returns:: A dictionary of drugs sorted by their Spearman correlation values in descending order.
Return type:: dict

create_drug_pos_map()[source]

Creates a mapping from drugs to their positions in the test data file.

Returns:: A dictionary where keys are drugs and values are lists of positions in the test data.
Return type:: dict
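The two mappings described above can be sketched as follows. This is a minimal illustration under assumptions: test records are taken to be (cell, drug, value) tuples, which this reference does not confirm, and both function names are hypothetical.

```python
from collections import defaultdict


def build_drug_pos_map(rows):
    """Map each drug to the list of row indices where it appears."""
    pos_map = defaultdict(list)
    # Assumed record layout: (cell, drug, value).
    for index, (_cell, drug, _value) in enumerate(rows):
        pos_map[drug].append(index)
    return dict(pos_map)


def sort_by_score_desc(scores):
    """Return a dict ordered by value, highest first."""
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


rows = [
    ("cellA", "drug1", 0.8),
    ("cellB", "drug2", 0.4),
    ("cellC", "drug1", 0.6),
]
drug_pos_map = build_drug_pos_map(rows)
ranked = sort_by_score_desc({"drug1": 0.31, "drug2": 0.72})
```

Since Python dictionaries preserve insertion order, iterating over the sorted dictionary visits the best-correlated drugs first, which makes it straightforward to take the top n.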

exec_lm(X, y)[source]

Executes 5-fold cross-validated Ridge regression for a given hidden features matrix and returns the Spearman correlation value of the predicted output.

Parameters:

X (numpy.ndarray) – The input matrix for regression.
y (numpy.ndarray) – The target variable.

Returns:

A tuple containing the Spearman correlation coefficient and p-value.

Return type:

(float, float)

static get_child_features(term_child_features, position_map)[source]

Gets a matrix of hidden features for a given term’s children.

Parameters:

term_child_features (list) – A list of features for the children of a term.
position_map (list) – A list of positions for which features are to be extracted.

Returns:

A matrix of hidden features for the children of the given term.

Return type:

numpy.ndarray

load_all_features()[source]

Loads hidden features for all terms and genes.

Returns:: A tuple containing two dictionaries, one mapping terms/genes to their features and the other mapping terms to their child features.
Return type:: (dict, dict)

load_feature(element, size)[source]

Loads hidden features for a given element.

Parameters:

element (str) – The element (term or gene) whose features are to be loaded.
size (int) – The number of columns (features) to load.

Returns:

A numpy array of the hidden features for the given element.

Return type:

numpy.ndarray

load_gene_features(gene)[source]

Loads hidden features for a given gene.

Parameters:: gene (str) – The gene whose features are to be loaded.
Returns:: A numpy array of the hidden features for the given gene.
Return type:: numpy.ndarray

load_term_features(term)[source]

Loads hidden features for a given term.

Parameters:: term (str) – The term whose features are to be loaded.
Returns:: A numpy array of the hidden features for the given term.
Return type:: numpy.ndarray

cellmaps_vnn.runner module

class cellmaps_vnn.runner.CellmapsvnnRunner(outdir=None, command=None, inputdir=None, name=None, organization_name=None, project_name=None, exitcode=None, skip_logging=True, input_data_dict=None, provenance_utils=<cellmaps_utils.provenance.ProvenanceUtil object>)[source]

Bases: VnnRunner

Class to run algorithm

Constructor

Parameters:

outdir (str) – Directory to create and put results in
skip_logging (bool) – If True skip logging, if None or False do NOT skip logging
exitcode – value to return via CellmapsvnnRunner.run() method
input_data_dict (dict) – Command line arguments used to invoke this
provenance_utils (ProvenanceUtil) – Wrapper for fairscape-cli which is used for RO-Crate creation and population

run()[source]

Runs cellmaps_vnn

Returns:

class cellmaps_vnn.runner.SLURMCellmapsvnnRunner(outdir=None, command=None, inputdir=None, gene_attribute_name='CD_MemberList', gene2id=None, cell2id=None, mutations=None, cn_deletions=None, cn_amplifications=None, training_data=None, batchsize=64, cuda=0, zscore_method='auc', epoch=50, lr=0.001, wd=0.001, alpha=0.3, genotype_hiddens=4, optimize=0, n_trials=3, patience=30, delta=0.001, min_dropout_layer=2, dropout_fraction=0.3, skip_parent_copy=False, cpu_count=1, drug_count=0, predict_data=None, std=None, model_predictions=None, disease=None, hierarchy=None, parent_network=None, ndexserver=None, ndexuser=None, ndexpassword=None, visibility=False, gpu=False, slurm_partition=None, slurm_account=None, input_data_dict=None)[source]

Bases: VnnRunner

run()[source]

Runs CM4AI Pipeline

Returns:

class cellmaps_vnn.runner.VnnRunner(outdir)[source]

Bases: object

run()[source]: Runs VNN :raises NotImplementedError: Always raised cause subclasses need to implement

cellmaps_vnn.train module

class cellmaps_vnn.train.VNNTrain(outdir, inputdir, gene_attribute_name='CD_MemberList', config_file=None, training_data=None, gene2id=None, cell2id=None, mutations=None, cn_deletions=None, cn_amplifications=None, batchsize=64, zscore_method='auc', epoch=50, lr=0.001, wd=0.001, alpha=0.3, genotype_hiddens=4, patience=30, delta=0.001, min_dropout_layer=2, dropout_fraction=0.3, optimize=0, n_trials=3, cuda=0, skip_parent_copy=False, slurm=False, use_gpu=False, slurm_partition=None, slurm_account=None, hierarchy=None, parent_network=None)[source]

Bases: object

Constructor for training a Visible Neural Network.

Parameters:

outdir (str) – Directory to write results to.
inputdir (str) – Path to directory or RO-Crate with hierarchy.cx2 file.
gene_attribute_name (str) – Name of the node attribute with genes/proteins.
config_file (str, optional) – Path to configuration file for populating arguments.
training_data (str, optional) – Training data file path.
gene2id (str, optional) – File mapping genes to IDs.
cell2id (str, optional) – File mapping cells to IDs.
mutations (str, optional) – File with mutation information for cell lines.
cn_deletions (str, optional) – File with copy number deletions for cell lines.
cn_amplifications (str, optional) – File with copy number amplifications for cell lines.
batchsize (int) – Batch size for training. Default is 64.
zscore_method (str) – Z-score method. Default is ‘auc’.
epoch (int) – Number of epochs for training. Default is 50.
lr (float or list or tuple) – Learning rate. Default is 0.001.
wd (float) – Weight decay. Default is 0.001.
alpha (float) – Loss parameter alpha. Default is 0.3.
genotype_hiddens (int) – Number of neurons in genotype parts. Default is 4.
patience (int) – Early stopping epoch limit. Default is 30.
delta (float) – Minimum loss improvement for early stopping. Default is 0.001.
min_dropout_layer (int) – Layer number to start applying dropout. Default is 2.
dropout_fraction (float) – Dropout fraction. Default is 0.3.
optimize (int) – Hyperparameter optimization flag. Default is 0.
cuda (int) – GPU index. Default is 0.
skip_parent_copy (bool) – If True, do not copy hierarchy parent. Default is False.
slurm (bool) – If True, generate SLURM script for training. Default is False.
use_gpu (bool) – If True, adjust SLURM script to run on GPU. Default is False.
slurm_partition (str, optional) – SLURM partition to use. Default is ‘nrnb-gpu’ if use_gpu is True.
slurm_account (str, optional) – SLURM account name.
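The interaction between patience and delta can be illustrated with a small early-stopping loop. This is a sketch of the common pattern, not the trainer's code: the exact criterion used by cellmaps_vnn is not described here, and the function name is hypothetical.

```python
def run_with_early_stopping(val_losses, patience=30, delta=0.001):
    """Return (epochs executed, best loss) for a sequence of validation losses.

    An epoch counts as an improvement only if it lowers the best loss by
    more than delta; training stops after `patience` epochs in a row
    without such an improvement.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best
    return len(val_losses), best


# The last value is never reached: three epochs in a row improve by less than delta.
losses = [1.0, 0.5, 0.4999, 0.4998, 0.4997, 0.1]
epochs_run, best_loss = run_with_early_stopping(losses, patience=3, delta=0.001)
```

A larger delta makes the criterion stricter about what counts as progress, while a larger patience tolerates longer plateaus before stopping.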

COMMAND = 'train'

DEFAULT_ALPHA = 0.3

DEFAULT_DELTA = 0.001

DEFAULT_DROPOUT_FRACTION = 0.3

DEFAULT_EPOCH = 50

DEFAULT_LR = 0.001

DEFAULT_MIN_DROPOUT_LAYER = 2

DEFAULT_N_TRIALS = 3

DEFAULT_OPTIMIZE = 0

DEFAULT_PATIENCE = 30

DEFAULT_STD = 'std.txt'

DEFAULT_WD = 0.001

static add_subparser(subparsers)[source]: Adds a subparser for the ‘train’ command.

register_outputs(outdir, description, keywords, provenance_utils)[source]

Registers the model and standard deviation files with the FAIRSCAPE service for data provenance. It generates dataset IDs for each registered file.

Parameters:

outdir – The directory where the output files are stored.
description – Description for the output files.
keywords – List of keywords associated with the files.
provenance_utils – The utility class for provenance registration.

Returns:

A list of dataset IDs for the registered model and standard deviation files.

run()[source]: The logic for training the Visual Neural Network.

cellmaps_vnn.util module

cellmaps_vnn.util.build_input_vector(input_data, cell_features)[source]

Builds an input vector for model training using cell features.

Parameters:

input_data (Tensor) – Input data containing cell indices.
cell_features (numpy.ndarray) – Cell features array.

Returns:

Input feature tensor for the model.

Return type:

Tensor

cellmaps_vnn.util.calc_std_vals(df, zscore_method)[source]

Calculates standard deviation values for a given DataFrame based on the specified z-score method (‘zscore’ or ‘robustz’).

Parameters:

df (pandas.DataFrame) – the data to be standardized.
zscore_method (str) – Method to use for standardization (‘zscore’ or ‘robustz’).

Returns:

DataFrame with standard deviation values for each dataset.

Return type:

pandas.DataFrame

cellmaps_vnn.util.copy_and_register_gene2id_file(genet2id_in_file, outdir, description, keywords, provenance_utils)[source]

cellmaps_vnn.util.create_term_mask(term_direct_gene_map, gene_dim, cuda_id=None)[source]

Creates a term mask map for gene sets. This function generates a mask for each term, where the mask is a matrix with one row per gene in the term's gene set and one column per gene overall. An element is set to 1 if the corresponding gene is one of the term's genes.

Parameters:

term_direct_gene_map (dict) – Mapping of terms to their respective gene sets.
gene_dim (int) – Total number of genes.
cuda_id (int) – CUDA ID for tensor operations.

Returns:

Dictionary of term masks.

Return type:

dict
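One plausible reading of the mask layout is sketched below with plain nested lists instead of tensors. It assumes gene sets contain integer gene indices and that each row is a one-hot selector for one gene of the term; neither detail is confirmed by this reference, and the function name is hypothetical.

```python
def build_term_masks(term_direct_gene_map, gene_dim):
    """Build one mask per term: len(gene set) rows by gene_dim columns."""
    masks = {}
    for term, gene_ids in term_direct_gene_map.items():
        rows = []
        # Sorting makes the row order deterministic for set inputs.
        for gene_id in sorted(gene_ids):
            row = [0] * gene_dim
            row[gene_id] = 1  # mark the column of this directly annotated gene
            rows.append(row)
        masks[term] = rows
    return masks


# Term "T1" is directly annotated with genes 0 and 2 out of 4 genes in total.
term_masks = build_term_masks({"T1": {0, 2}}, gene_dim=4)
```

Multiplying such a mask element-wise with a weight matrix of the same shape zeroes out connections from genes that are not annotated to the term.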

cellmaps_vnn.util.get_grad_norm(model_params, norm_type)[source]

Computes the gradient norm of model parameters.

The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.

Parameters:

model_params (Iterable[Tensor] or Tensor) – Iterable of model parameters or a single Tensor that will have gradients normalized.
norm_type (float or int) – Type of the p-norm to use (can be ‘inf’ for infinity norm).

Returns:

Total norm of the model parameters (viewed as a single vector).

Return type:

Tensor
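The phrase "as if they were concatenated into a single vector" can be made concrete with a small pure-Python sketch. Gradients are represented here as lists of floats rather than tensors, and the function name is hypothetical; this is an illustration of the norm itself, not of the library function.

```python
import math


def total_norm(grads, norm_type=2.0):
    """p-norm of all gradient entries viewed as one flat vector."""
    flat = [abs(value) for grad in grads for value in grad]
    if not flat:
        return 0.0
    if math.isinf(norm_type):
        # The infinity norm is the largest absolute entry.
        return max(flat)
    return sum(value ** norm_type for value in flat) ** (1.0 / norm_type)


# Two "parameters" with one gradient entry each: sqrt(3^2 + 4^2).
norm_l2 = total_norm([[3.0], [4.0]])
norm_inf = total_norm([[3.0, -7.0], [4.0]], float("inf"))
```

The result depends only on the pooled entries, so splitting the same values across more or fewer parameters does not change the norm.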

cellmaps_vnn.util.load_cell_features(mutations, cn_deletions, cn_amplifications)[source]

Loads and combines cell/drug features from given mutation, CN deletion, and CN amplification files.

Each feature set is loaded as a NumPy array and then combined into a single array.

Parameters:

mutations (str) – Path to the mutations data file.
cn_deletions (str) – Path to the CN deletions data file.
cn_amplifications (str) – Path to the CN amplifications data file.

Returns:

Combined cell features.

Return type:

numpy.ndarray

cellmaps_vnn.util.load_mapping(mapping_file, mapping_type)[source]

Loads a mapping from a file and returns it as a dictionary.

Parameters:

mapping_file (str) – Path to the mapping file.
mapping_type (str) – Description of the mapping (e.g., ‘gene to ID’).

Returns:

Dictionary containing the mapping from the file.

Return type:

dict

Raises:

CellmapsvnnError – If the mapping file is not found.

cellmaps_vnn.util.load_numpy_data(file_path)[source]

Reads a file at the specified path and attempts to convert it into a NumPy array. If the file is not found or any other error occurs, an exception is raised.

Parameters:: file_path (str) – Path to the file to be loaded.
Returns:: Data loaded from the file.
Return type:: numpy.ndarray
Raises:: CellmapsvnnError – If the file is not found or an error occurs during loading.

cellmaps_vnn.util.pearson_corr(x, y)[source]

Computes the Pearson correlation coefficient between two tensors.

Parameters:

x (Tensor) – First variable tensor.
y (Tensor) – Second variable tensor.

Returns:

Pearson correlation coefficient.

Return type:

Tensor

cellmaps_vnn.util.standardize_data(df, std_df)[source]

Standardizes the data based on provided standard deviation values. This function applies z-score standardization to the ‘auc’ column of the DataFrame, using the standard deviation values provided.

Parameters:

df (pandas.DataFrame) – the data to be standardized.
std_df (pandas.DataFrame) – the standard deviation values.

Returns:

DataFrame with the standardized ‘z’ values.

Return type:

pandas.DataFrame
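Z-score standardization of this kind subtracts a center and divides by a scale. The sketch below works on a plain list of AUC values and uses the mean and sample standard deviation; how the library derives its center and scale for each method, and how it groups rows, is an assumption not confirmed here. Both function names are hypothetical.

```python
from statistics import fmean, stdev


def center_and_scale(values):
    """Return (mean, sample standard deviation), with a fallback scale of 1."""
    center = fmean(values)
    scale = stdev(values) if len(values) > 1 else 1.0
    if scale == 0.0:
        scale = 1.0  # avoid dividing by zero for constant data
    return center, scale


def to_z(values, center, scale):
    """Standardize each value as (value - center) / scale."""
    return [(value - center) / scale for value in values]


aucs = [0.2, 0.4, 0.6]
center, scale = center_and_scale(aucs)
z_values = to_z(aucs, center, scale)
```

Computing the center and scale once and reusing them, as the separation between calc_std_vals and standardize_data suggests, lets data seen later be standardized consistently with the training data.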

cellmaps_vnn.vnn module

class cellmaps_vnn.vnn.VNN(data_wrapper: TrainingDataWrapper)[source]

Bases: Module

Initializes the VNN model with the provided data wrapper.

This constructor sets up components of the VNN model, including term maps, gene mappings, dropout parameters, and initializes neural network layers based on the given data structure. It also calculates the dimensions for each term and constructs the direct gene layers and the neural network graph.

Parameters:: data_wrapper (TrainingDataWrapper) – The necessary data and configurations for initializing the VNN model.
Raises:: CellmapsvnnError – If an error occurs during the initialization of the neural network.

cal_term_dim(term_size_map)[source]

Calculates the dimensionality of each term based on the term sizes.

This method updates the term_dim_map attribute, which maps each term to its dimensionality. The dimensionality for each term is set to the number of hidden genotype variables.

Parameters:: term_size_map (dict) – A mapping of terms to their sizes.

construct_direct_gene_layer()[source]

Constructs layers for genes directly annotated with each term.

This method iterates through each gene and term to create specific layers in the neural network. For each gene, it adds a feature layer and a batch normalization layer. For each term that has directly annotated genes, it adds a linear layer that takes all genes as input and outputs only the genes directly annotated with that term. If a term has no directly associated genes, the method raises an exception.

construct_nn_graph(digraph)[source]

Constructs a neural network graph based on given hierarchy.

This method builds the neural network by starting from the bottom (leaves) of the given directed graph (digraph) and iteratively adding modules for each term in the hierarchy. The method stores the built neural network layers in term_layer_list and maintains a map (term_neighbor_map) of each term to its children.

For each term, the method calculates the input size, which is the sum of the dimensions of its children and the number of genes directly annotated by the term. It then adds a series of layers (dropout, linear, batch normalization, and auxiliary linear layers) for each term.

The process continues until all nodes (terms) in the digraph have been processed and added to the network.

Parameters:: digraph (networkx.DiGraph) – A directed graph representing the ontology, where nodes are terms and edges indicate term relationships.
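The bottom-up traversal and the input-size rule described above can be sketched without any neural network code. The example uses a plain dictionary instead of a networkx graph and a single hidden dimension for every term; the function name and data layout are hypothetical, and no layers are actually created.

```python
def plan_layers(children, direct_gene_count, hidden_dim):
    """Group terms into layers, leaves first, with each term's input size.

    children: dict mapping every term to the list of its child terms.
    direct_gene_count: dict mapping terms to their number of direct genes.
    """
    remaining = set(children)
    term_dim = {}
    layers = []
    while remaining:
        # A term is ready once all of its children have been processed.
        ready = sorted(
            term for term in remaining
            if not any(child in remaining for child in children[term])
        )
        if not ready:
            raise ValueError("hierarchy contains a cycle")
        layer = []
        for term in ready:
            # Input size: children's output dimensions plus direct genes.
            input_size = sum(term_dim[child] for child in children[term])
            input_size += direct_gene_count.get(term, 0)
            layer.append((term, input_size))
        for term in ready:
            term_dim[term] = hidden_dim
            remaining.discard(term)
        layers.append(layer)
    return layers


children = {"root": ["A", "B"], "A": [], "B": ["C"], "C": []}
direct_genes = {"root": 0, "A": 3, "B": 1, "C": 2}
layers = plan_layers(children, direct_genes, hidden_dim=4)
```

In this example the leaves A and C form the first layer, B follows once C is available, and the root comes last with an input size equal to the combined output dimensions of A and B.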

forward(x)[source]

Defines the forward function of the VNN model.

This method processes the input through the neural network constructed in the VNN class. It applies a series of transformations to the input data, including feature layer operations, batch normalization, and tanh activations. The method aggregates outputs from different terms in the network and finally produces two dictionaries: one for hidden embeddings and one for auxiliary outputs.

Parameters:: x (torch.Tensor) – Input tensor representing gene data. Each row corresponds to a gene, and columns are features.
Returns:: A tuple containing two dictionaries: - hidden_embeddings_map: A mapping from terms to their hidden embeddings. - aux_out_map: A mapping from terms to their auxiliary output.
Return type:: (dict, dict)

cellmaps_vnn.vnn_trainer module

class cellmaps_vnn.vnn_trainer.VNNTrainer(data_wrapper)[source]

Bases: object

Initialize the VNN Trainer.

Parameters:: data_wrapper (TrainingDataWrapper) – data wrapper containing data necessary for training

TRAINING_PROGRESS_FILE = 'training_progress.tsv'

train_model()[source]

Trains the VNN model.

Returns:: The minimum validation loss achieved during training.
Return type:: float

Module contents

Top-level package for cellmaps_vnn.