cellmaps_vnn package
Submodules
cellmaps_vnn.annotate module
- class cellmaps_vnn.annotate.VNNAnnotate(outdir, model_predictions, disease=None, hierarchy=None, parent_network=None, ndexserver='ndexbio.org', ndexuser=None, ndexpassword='-', visibility=False, slurm=False, slurm_partition=None, slurm_account=None)[source]
Bases:
object
Constructor. Sets up the hierarchy path either directly from the arguments or by looking for a hierarchy.cx2 file in the first RO-Crate directory provided. If neither is found, raises an error.
- Raises:
CellmapsvnnError – If no hierarchy path is specified or found.
- COMMAND = 'annotate'
- DEFAULT_NDEX_SERVER = 'ndexbio.org'
- DEFAULT_PASSWORD = '-'
- register_outputs(outdir, description, keywords, provenance_utils)[source]
Registers the output files of the annotation process with the FAIRSCAPE service for data provenance. This includes the annotated hierarchy and the RLIPP output files.
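- Parameters:
outdir – The directory where the output files are stored.
description – Description for the output files.
keywords – List of keywords associated with the files.
provenance_utils – The utility class for provenance registration.
- Returns:
A list of dataset IDs assigned to the registered files.
- Return type:
list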
cellmaps_vnn.ccc_loss module
- class cellmaps_vnn.ccc_loss.CCCLoss(eps=1e-06)[source]
Bases:
Module
A PyTorch module for calculating the Concordance Correlation Coefficient (CCC) Loss.
The CCC Loss is a measure used in regression tasks to evaluate the agreement between two variables.
Initializes the CCCLoss module.
- Parameters:
eps (float) – A small epsilon value for numerical stability. Default is 1e-6.
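As a rough illustration of the quantity this loss is built on, the sketch below computes the concordance correlation coefficient in plain Python and returns 1 - CCC. It is not the module's implementation: the function name `ccc_loss`, the use of population variances, the placement of `eps` in the denominator, and the `1 - CCC` form are all assumptions made for the example.

```python
from statistics import fmean


def ccc_loss(y_true, y_pred, eps=1e-6):
    """Return 1 - CCC for two equal-length sequences of floats (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    # Population variances and covariance (divide by n).
    var_t = fmean((t - mean_t) ** 2 for t in y_true)
    var_p = fmean((p - mean_p) ** 2 for p in y_pred)
    cov = fmean((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    # CCC = 2*cov / (var_t + var_p + (mean_t - mean_p)^2);
    # eps (assumed placement) guards against a zero denominator.
    ccc = 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2 + eps)
    return 1.0 - ccc
```

Unlike a plain correlation, the squared mean difference in the denominator penalizes predictions that are perfectly correlated with the targets but shifted away from them.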
cellmaps_vnn.cellmaps_vnncmd module
- cellmaps_vnn.cellmaps_vnncmd.main(args)[source]
Main entry point for program
- Parameters:
args (list) – arguments passed on the command line, usually sys.argv[1:]
- Returns:
return value of cellmaps_vnn.runner.CellmapsvnnRunner.run() or 2 if an exception is raised
- Return type:
int
cellmaps_vnn.constants module
Contains constants used by cellmaps vnn
- cellmaps_vnn.constants.CRHO_SCORE = 'C_rho'
C rho score
- cellmaps_vnn.constants.C_PVAL_SCORE = 'C_pval'
C pval score
- cellmaps_vnn.constants.DEFAULT_CUDA = 0
Set of constants for VNNTrain and VNNPredict
- cellmaps_vnn.constants.EDGE_IMPORTANCE_SCORE = 'edge_importance_score'
Name of the edge importance score attribute
- cellmaps_vnn.constants.GENE_IMPORTANCE_SCORE = 'importance_score'
Gene importance scores
- cellmaps_vnn.constants.GENE_RHO_FILE = 'gene_rho.out'
Output file for gene Rho from rlipp algorithm
- cellmaps_vnn.constants.GENE_SET_COLUMN_NAME = 'CD_MemberList'
Name of the hierarchy node attribute that holds the list of genes/proteins of this node.
- cellmaps_vnn.constants.GENE_SET_WITH_DATA = 'VNN_gene_set_with_data'
Hierarchy node attribute that contains genes with available data (e.g., mutation, deletion, amplification) for the VNN model
- cellmaps_vnn.constants.HIERARCHY_FILENAME = 'hierarchy.cx2'
Hierarchy filename.
- cellmaps_vnn.constants.IMPORTANCE_SCORE = 'importance_score'
Importance score (set to P_rho currently)
- cellmaps_vnn.constants.ORIGINAL_HIERARCHY_FILENAME = 'original_hierarchy.cx2'
Original hierarchy filename.
- cellmaps_vnn.constants.PARENT_NETWORK_NAME = 'hierarchy_parent.cx2'
Parent network of hierarchy filename.
- cellmaps_vnn.constants.PRHO_SCORE = 'P_rho'
P rho score
- cellmaps_vnn.constants.P_PVAL_SCORE = 'P_pval'
P pval score
- cellmaps_vnn.constants.RLIPP_OUTPUT_FILE = 'rlipp.out'
Output file from rlipp algorithm
- cellmaps_vnn.constants.RLIPP_SCORE = 'RLIPP'
RLIPP score
- cellmaps_vnn.constants.SCORE_FILE_NAME_SUFFIX = '_gene_scores.out'
Suffix for gene score file
- cellmaps_vnn.constants.SYSTEM_INTERACTOME_FILE_SUFFIX = '_interactome.cx2'
Suffix for system’s interactome file name
cellmaps_vnn.data_wrapper module
- class cellmaps_vnn.data_wrapper.TrainingDataWrapper(outdir, inputdir, gene_attribute_name, training_data, cell2id, gene2id, mutations, cn_deletions, cn_amplifications, modelfile, genotype_hiddens, lr, wd, alpha, epoch, batchsize, cuda, zscore_method, stdfile, patience, delta, min_dropout_layer, dropout_fraction, hierarchy=None)[source]
Bases:
object
Initializes the TrainingDataWrapper object with configuration and training data parameters.
cellmaps_vnn.exceptions module
cellmaps_vnn.predict module
- class cellmaps_vnn.predict.VNNPredict(outdir, inputdir, config_file=None, predict_data=None, gene2id=None, cell2id=None, mutations=None, cn_deletions=None, cn_amplifications=None, batchsize=64, zscore_method='auc', cpu_count=1, drug_count=0, genotype_hiddens=4, cuda=0, std=None, slurm=False, use_gpu=False, slurm_partition=None, slurm_account=None)[source]
Bases:
object
Constructor for predicting with a trained model.
- COMMAND = 'predict'
- DEFAULT_CPU_COUNT = 1
- DEFAULT_DRUG_COUNT = 0
- predict(predict_data, model_file, hidden_folder, batch_size, cell_features=None)[source]
Perform prediction using the trained model.
- Parameters:
predict_data – Tuple of features and labels for prediction.
model_file – Path to the trained model file.
hidden_folder – Directory to store hidden layer outputs.
batch_size – Size of each batch for prediction.
cell_features – Additional cell features for prediction.
- register_outputs(outdir, description, keywords, provenance_utils)[source]
Registers all output files (predictions, feature gradients, and hidden files) with the FAIRSCAPE service for data provenance.
- Parameters:
outdir – The directory where the output files are stored.
description – Description for the output files.
keywords – List of keywords associated with the files.
provenance_utils – The utility class for provenance registration.
- Returns:
A list of dataset IDs for the registered files.
- run()[source]
The logic for running predictions with the model. It executes the prediction process using the trained model and input data.
- Raises:
CellmapsvnnError – If an error occurs during the prediction process.
cellmaps_vnn.rlipp_calculator module
- class cellmaps_vnn.rlipp_calculator.RLIPPCalculator(outdir, hierarchy, test_data, predicted_data, gene2idfile, cell2idfile, hidden_dir, cpu_count, num_hiddens_genotype, drug_count, excluded_terms=[])[source]
Bases:
ImportanceScoreCalculator
A calculator for Relative Local Improvement in Predictive Power (RLIPP) scores.
Constructor
- Parameters:
outdir (str) – Output directory for the RLIPP scores and gene correlations.
hierarchy (CX2Network) – A hierarchy in HCX format.
test_data (str)
predicted_data (str) – Path to the file containing predicted values.
gene2idfile (str) – Path to the file mapping genes to IDs.
cell2idfile (str) – Path to the file mapping cells to IDs.
hidden_dir (str) – Directory containing hidden layer outputs.
rlipp_file (str) – Path of the output file where results of the RLIPP algorithm will be saved.
gene_rho_file (str) – Path of the output file where gene rho scores will be saved.
cpu_count (int) – Number of available cores.
num_hiddens_genotype (int) – Mapping for the number of neurons in each term in genotype parts.
drug_count (int) – Number of top-performing drugs.
- calc_gene_rho(gene_features, position_map, gene, drug)[source]
Calculates Spearman correlation between gene embeddings and predicted AUC.
- Parameters:
gene_features (numpy.ndarray) – The features for the gene.
position_map (list) – A list of positions for which correlation is to be calculated.
gene (str) – The gene for which correlation is calculated.
drug (str) – The drug for which correlation is calculated.
- Returns:
A formatted string containing the gene, Spearman correlation, and p-value.
- Return type:
str
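For readers unfamiliar with the statistic, the following plain-Python sketch shows how a Spearman correlation coefficient can be computed: rank both inputs (averaging ranks over ties), then take the Pearson correlation of the ranks. The helper names are hypothetical, no p-value is computed, and the calculator itself may rely on a library routine such as scipy.stats.spearmanr instead.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because only ranks enter the calculation, any strictly increasing relationship between the two inputs yields a coefficient of 1, whether or not it is linear.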
- calc_scores()[source]
Calculates RLIPP scores for top n drugs (n = drug_count), and prints the result in “Drug Term P_rho C_rho RLIPP” format.
This method runs the calculation in parallel for efficiency.
- calc_term_rlipp(term_features, term_child_features, position_map, term, drug)[source]
Calculates the RLIPP score for a given term and drug.
- Parameters:
term_features (numpy.ndarray) – The features for the parent term.
term_child_features (list) – The features for the children of the term.
position_map (list) – A list of positions for which RLIPP is to be calculated.
term (str) – The term for which RLIPP is calculated.
drug (str) – The drug for which RLIPP is calculated.
- Returns:
A formatted string containing the term, Spearman correlations, p-values, and RLIPP score.
- Return type:
str
- create_child_feature_map(feature_map, term)[source]
Creates a map of child features for a given term.
- create_drug_corr_map_sorted(drug_pos_map)[source]
Creates a sorted mapping of drugs to their Spearman correlation values.
- create_drug_pos_map()[source]
Creates a mapping from drugs to their positions in the test data file.
- Returns:
A dictionary where keys are drugs and values are lists of positions in the test data.
- Return type:
dict
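A minimal sketch of such a drug-to-positions map is shown below. It assumes a tab-separated layout with the drug identifier in the second column (cell, drug, response); that layout, the function name `build_drug_pos_map`, and the sample rows are illustrative assumptions rather than details taken from the package.

```python
from collections import defaultdict


def build_drug_pos_map(lines, drug_column=1):
    """Map each drug to the list of row positions where it appears.

    `lines` is any iterable of tab-separated rows; `drug_column` is the
    assumed index of the drug identifier (hypothetical layout).
    """
    drug_pos_map = defaultdict(list)
    for position, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        drug_pos_map[fields[drug_column]].append(position)
    return dict(drug_pos_map)


rows = [
    "cellA\tdrugX\t0.71",
    "cellB\tdrugY\t0.42",
    "cellC\tdrugX\t0.88",
]
print(build_drug_pos_map(rows))  # {'drugX': [0, 2], 'drugY': [1]}
```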
- exec_lm(X, y)[source]
Executes 5-fold cross-validated Ridge regression for a given hidden features matrix and returns the Spearman correlation value of the predicted output.
- Parameters:
X (numpy.ndarray) – The input matrix for regression.
y (numpy.ndarray) – The target variable.
- Returns:
A tuple containing the Spearman correlation coefficient and p-value.
- Return type:
tuple
- static get_child_features(term_child_features, position_map)[source]
Gets a matrix of hidden features for a given term’s children.
- Parameters:
- Returns:
A matrix of hidden features for the children of the given term.
- Return type:
- load_feature(element, size)[source]
Loads hidden features for a given element.
- Parameters:
- Returns:
A numpy array of the hidden features for the given element.
- Return type:
numpy.ndarray
- load_gene_features(gene)[source]
Loads hidden features for a given gene.
- Parameters:
gene (str) – The gene whose features are to be loaded.
- Returns:
A numpy array of the hidden features for the given gene.
- Return type:
numpy.ndarray
cellmaps_vnn.runner module
- class cellmaps_vnn.runner.CellmapsvnnRunner(outdir=None, command=None, inputdir=None, name=None, organization_name=None, project_name=None, exitcode=None, skip_logging=True, input_data_dict=None, provenance_utils=<cellmaps_utils.provenance.ProvenanceUtil object>)[source]
Bases:
VnnRunner
Class to run algorithm
Constructor
- Parameters:
outdir (str) – Directory to create and put results in
skip_logging (bool) – If True, skip logging; if None or False, do NOT skip logging
exitcode – value to return via CellmapsvnnRunner.run() method
input_data_dict (dict) – Command line arguments used to invoke this
provenance_utils (ProvenanceUtil) – Wrapper for fairscape-cli which is used for RO-Crate creation and population
- class cellmaps_vnn.runner.SLURMCellmapsvnnRunner(outdir=None, command=None, inputdir=None, gene_attribute_name='CD_MemberList', gene2id=None, cell2id=None, mutations=None, cn_deletions=None, cn_amplifications=None, training_data=None, batchsize=64, cuda=0, zscore_method='auc', epoch=50, lr=0.001, wd=0.001, alpha=0.3, genotype_hiddens=4, optimize=0, n_trials=3, patience=30, delta=0.001, min_dropout_layer=2, dropout_fraction=0.3, skip_parent_copy=False, cpu_count=1, drug_count=0, predict_data=None, std=None, model_predictions=None, disease=None, hierarchy=None, parent_network=None, ndexserver=None, ndexuser=None, ndexpassword=None, visibility=False, gpu=False, slurm_partition=None, slurm_account=None, input_data_dict=None)[source]
Bases:
VnnRunner
cellmaps_vnn.train module
- class cellmaps_vnn.train.VNNTrain(outdir, inputdir, gene_attribute_name='CD_MemberList', config_file=None, training_data=None, gene2id=None, cell2id=None, mutations=None, cn_deletions=None, cn_amplifications=None, batchsize=64, zscore_method='auc', epoch=50, lr=0.001, wd=0.001, alpha=0.3, genotype_hiddens=4, patience=30, delta=0.001, min_dropout_layer=2, dropout_fraction=0.3, optimize=0, n_trials=3, cuda=0, skip_parent_copy=False, slurm=False, use_gpu=False, slurm_partition=None, slurm_account=None, hierarchy=None, parent_network=None)[source]
Bases:
object
Constructor for training a Visible Neural Network.
- Parameters:
outdir (str) – Directory to write results to.
inputdir (str) – Path to directory or RO-Crate with hierarchy.cx2 file.
gene_attribute_name (str) – Name of the node attribute with genes/proteins.
config_file (str, optional) – Path to configuration file for populating arguments.
training_data (str, optional) – Training data file path.
gene2id (str, optional) – File mapping genes to IDs.
cell2id (str, optional) – File mapping cells to IDs.
mutations (str, optional) – File with mutation information for cell lines.
cn_deletions (str, optional) – File with copy number deletions for cell lines.
cn_amplifications (str, optional) – File with copy number amplifications for cell lines.
batchsize (int) – Batch size for training. Default is 64.
zscore_method (str) – Z-score method. Default is ‘auc’.
epoch (int) – Number of epochs for training. Default is 50.
lr (float or list or tuple) – Learning rate. Default is 0.001.
wd (float) – Weight decay. Default is 0.001.
alpha (float) – Loss parameter alpha. Default is 0.3.
genotype_hiddens (int) – Number of neurons in genotype parts. Default is 4.
patience (int) – Early stopping epoch limit. Default is 30.
delta (float) – Minimum loss improvement for early stopping. Default is 0.001.
min_dropout_layer (int) – Layer number to start applying dropout. Default is 2.
dropout_fraction (float) – Dropout fraction. Default is 0.3.
optimize (int) – Hyperparameter optimization flag. Default is 0.
cuda (int) – GPU index. Default is 0.
skip_parent_copy (bool) – If True, do not copy hierarchy parent. Default is False.
slurm (bool) – If True, generate SLURM script for training. Default is False.
use_gpu (bool) – If True, adjust SLURM script to run on GPU. Default is False.
slurm_partition (str, optional) – SLURM partition to use. Default is ‘nrnb-gpu’ if use_gpu is True.
slurm_account (str, optional) – SLURM account name.
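The patience and delta parameters above describe an early-stopping rule. The sketch below shows one common way such a pair is used: training stops once the validation loss has failed to improve by more than delta for patience consecutive epochs. The class name `EarlyStopper` and its exact comparison are assumptions for illustration; the trainer's own rule may differ.

```python
class EarlyStopper:
    """Stop when the loss has not improved by more than `delta` for `patience` epochs."""

    def __init__(self, patience=30, delta=0.001):
        self.patience = patience
        self.delta = delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        """Record one epoch's validation loss and report whether to stop."""
        if val_loss < self.best_loss - self.delta:
            # Meaningful improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop, `should_stop` would be called once per epoch with that epoch's validation loss, and the loop would break on the first `True`.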
- COMMAND = 'train'
- DEFAULT_ALPHA = 0.3
- DEFAULT_DELTA = 0.001
- DEFAULT_DROPOUT_FRACTION = 0.3
- DEFAULT_EPOCH = 50
- DEFAULT_LR = 0.001
- DEFAULT_MIN_DROPOUT_LAYER = 2
- DEFAULT_N_TRIALS = 3
- DEFAULT_OPTIMIZE = 0
- DEFAULT_PATIENCE = 30
- DEFAULT_STD = 'std.txt'
- DEFAULT_WD = 0.001
- register_outputs(outdir, description, keywords, provenance_utils)[source]
Registers the model and standard deviation files with the FAIRSCAPE service for data provenance. It generates dataset IDs for each registered file.
- Parameters:
outdir – The directory where the output files are stored.
description – Description for the output files.
keywords – List of keywords associated with the files.
provenance_utils – The utility class for provenance registration.
- Returns:
A list of dataset IDs for the registered model and standard deviation files.
cellmaps_vnn.util module
- cellmaps_vnn.util.build_input_vector(input_data, cell_features)[source]
Builds an input vector for model training using cell features.
- Parameters:
input_data (Tensor) – Input data containing cell indices.
cell_features (numpy.ndarray) – Cell features array.
- Returns:
Input feature tensor for the model.
- Return type:
Tensor
- cellmaps_vnn.util.calc_std_vals(df, zscore_method)[source]
Calculates standard deviation values for a given DataFrame based on the specified z-score method (‘zscore’ or ‘robustz’).
- Parameters:
df (pandas.DataFrame) – the data to be standardized.
zscore_method (str) – Method to use for standardization (‘zscore’ or ‘robustz’).
- Returns:
DataFrame with standard deviation values for each dataset.
- Return type:
pandas.DataFrame
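To make the two method names concrete, here is a plain-Python sketch of what a "zscore" and a "robustz" standardization commonly mean. The definitions are assumptions: mean and population standard deviation for "zscore", median and interquartile range for "robustz". The package may use different statistics, and the function names here are hypothetical; the sketch also works on a list rather than a DataFrame.

```python
from statistics import fmean, median, pstdev, quantiles


def center_and_scale(values, zscore_method):
    """Return (center, scale) for a list of floats under an assumed definition."""
    if zscore_method == "zscore":
        return fmean(values), pstdev(values)
    if zscore_method == "robustz":
        # Assumed robust variant: median as center, interquartile range as scale.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        return median(values), q3 - q1
    raise ValueError(f"unknown zscore_method: {zscore_method!r}")


def standardize(values, center, scale):
    """Apply z = (value - center) / scale to every value."""
    return [(v - center) / scale for v in values]
```

Computing the center and scale once on the training data and reusing them at prediction time keeps both datasets on the same scale.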
- cellmaps_vnn.util.copy_and_register_gene2id_file(genet2id_in_file, outdir, description, keywords, provenance_utils)[source]
- cellmaps_vnn.util.create_term_mask(term_direct_gene_map, gene_dim, cuda_id=None)[source]
Creates a term mask map for gene sets. This function generates a mask for each term, where the mask is a matrix with rows equal to the number of genes in the term's relevant gene set and columns equal to the total number of genes. Each element is set to 1 if the corresponding gene is one of the relevant genes.
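One reading of this description is sketched below with nested lists in place of tensors: each term gets one row per directly annotated gene, with a single 1 in that gene's column. The function name `build_term_masks`, the assumption that `term_direct_gene_map` maps terms to collections of integer gene indices, and the row ordering are illustrative choices, not confirmed details of the package.

```python
def build_term_masks(term_direct_gene_map, gene_dim):
    """Return {term: mask} where mask has one row per directly annotated gene."""
    term_mask_map = {}
    for term, gene_ids in term_direct_gene_map.items():
        ordered = sorted(gene_ids)
        # One all-zero row of length gene_dim per gene in the term's set.
        mask = [[0] * gene_dim for _ in ordered]
        for row, gene_id in enumerate(ordered):
            mask[row][gene_id] = 1
        term_mask_map[term] = mask
    return term_mask_map


masks = build_term_masks({"termA": {0, 2}, "termB": {1}}, gene_dim=3)
print(masks["termA"])  # [[1, 0, 0], [0, 0, 1]]
```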
- cellmaps_vnn.util.get_grad_norm(model_params, norm_type)[source]
Computes the gradient norm of model parameters.
The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
- Parameters:
- Returns:
Total norm of the model parameters (viewed as a single vector).
- Return type:
Tensor
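The phrase "as if they were concatenated into a single vector" can be illustrated with a small plain-Python sketch that uses lists of floats as stand-ins for gradient tensors. The function name `total_grad_norm` is hypothetical, and unlike the description above this sketch only computes the norm and modifies nothing.

```python
import math


def total_grad_norm(grads, norm_type=2.0):
    """p-norm of all gradient values viewed as one flat vector.

    `grads` is a list of flat lists of floats, one per parameter (a stand-in
    for tensors in this sketch).
    """
    flat = [abs(g) for grad in grads for g in grad]
    if not flat:
        return 0.0
    if math.isinf(norm_type):
        # The infinity norm is the largest absolute value.
        return max(flat)
    return sum(g ** norm_type for g in flat) ** (1.0 / norm_type)
```

For example, two parameters with gradients `[3.0]` and `[4.0]` give a total 2-norm of 5.0, the same as the norm of the single vector `[3.0, 4.0]`.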
- cellmaps_vnn.util.load_cell_features(mutations, cn_deletions, cn_amplifications)[source]
Loads and combines cell/drug features from given mutation, CN deletion, and CN amplification files.
Each feature set is loaded as a NumPy array and then combined into a single array.
- Parameters:
- Returns:
Combined cell features.
- Return type:
numpy.ndarray
- cellmaps_vnn.util.load_mapping(mapping_file, mapping_type)[source]
Loads a mapping from a file and returns it as a dictionary.
- Parameters:
- Returns:
Dictionary containing the mapping from the file.
- Return type:
dict
- Raises:
CellmapsvnnError – If the mapping file is not found.
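A minimal sketch of this kind of loader follows. The file layout (one `index<TAB>name` pair per line, returned as a name-to-index dictionary), the function name `read_mapping`, and the stand-in exception `MappingFileError` are assumptions for illustration; the real function raises CellmapsvnnError and also takes a mapping_type argument that is not modelled here.

```python
import os
import tempfile


class MappingFileError(Exception):
    """Stand-in for the package's own error type in this sketch."""


def read_mapping(mapping_file):
    """Read 'index<TAB>name' lines into a {name: index} dictionary (assumed layout)."""
    mapping = {}
    try:
        with open(mapping_file, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                index, name = line.split("\t")
                mapping[name] = int(index)
    except FileNotFoundError as exc:
        # Translate the low-level error into the sketch's domain error.
        raise MappingFileError(f"mapping file not found: {mapping_file}") from exc
    return mapping


# Usage with a throwaway file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "gene2id.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("0\tTP53\n1\tBRCA1\n")
    gene2id = read_mapping(path)
print(gene2id)  # {'TP53': 0, 'BRCA1': 1}
```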
- cellmaps_vnn.util.load_numpy_data(file_path)[source]
Reads a file at the specified path and attempts to convert it into a NumPy array. If the file is not found or any other error occurs, an exception is raised.
- Parameters:
file_path (str) – Path to the file to be loaded.
- Returns:
Data loaded from the file.
- Return type:
numpy.ndarray
- Raises:
CellmapsvnnError – If the file is not found or an error occurs during loading.
- cellmaps_vnn.util.pearson_corr(x, y)[source]
Computes the Pearson correlation coefficient between two tensors.
- Parameters:
x (Tensor) – First variable tensor.
y (Tensor) – Second variable tensor.
- Returns:
Pearson correlation coefficient.
- Return type:
Tensor
- cellmaps_vnn.util.standardize_data(df, std_df)[source]
Standardizes the data based on provided standard deviation values. This function applies z-score standardization to the ‘auc’ column of the DataFrame, using the standard deviation values provided.
- Parameters:
df (pandas.DataFrame) – the data to be standardized.
std_df (pandas.DataFrame) – the standard deviation values.
- Returns:
DataFrame with the standardized ‘z’ values.
- Return type:
pandas.DataFrame
cellmaps_vnn.vnn module
- class cellmaps_vnn.vnn.VNN(data_wrapper: TrainingDataWrapper)[source]
Bases:
Module
Initializes the VNN model with the provided data wrapper.
This constructor sets up components of the VNN model, including term maps, gene mappings, dropout parameters, and initializes neural network layers based on the given data structure. It also calculates the dimensions for each term and constructs the direct gene layers and the neural network graph.
- Parameters:
data_wrapper (TrainingDataWrapper) – The necessary data and configurations for initializing the VNN model.
- Raises:
CellmapsvnnError – If an error occurs during the initialization of the neural network.
- cal_term_dim(term_size_map)[source]
Calculates the dimensionality of each term based on the term sizes.
This method updates the term_dim_map attribute, which maps each term to its dimensionality. The dimensionality for each term is set to the number of hidden genotype variables.
- Parameters:
term_size_map (dict) – A mapping of terms to their sizes.
- construct_direct_gene_layer()[source]
Constructs layers for genes directly annotated with each term.
This method iterates through each gene and term to create specific layers in the neural network. For each gene, it adds a feature layer and a batch normalization layer. For each term, if there are genes directly annotated with it, it adds a linear layer that takes all genes as input and outputs only those genes directly annotated with the term. If a term has no directly associated genes, the method raises an exception.
- construct_nn_graph(digraph)[source]
Constructs a neural network graph based on given hierarchy.
This method builds the neural network by starting from the bottom (leaves) of the given directed graph (digraph) and iteratively adding modules for each term in the hierarchy. The method stores the built neural network layers in term_layer_list and maintains a map (term_neighbor_map) of each term to its children.
For each term, the method calculates the input size, which is the sum of the dimensions of its children and the number of genes directly annotated by the term. It then adds a series of layers (dropout, linear, batch normalization, and auxiliary linear layers) for each term.
The process continues until all nodes (terms) in the digraph have been processed and added to the network.
- Parameters:
digraph (networkx.DiGraph) – A directed graph representing the ontology, where nodes are terms and edges indicate term relationships.
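The bottom-up sizing described above can be sketched without any graph or tensor library. The example uses plain dictionaries for the hierarchy and assumes every term emits the same number of hidden values (`hidden_dim`), so a term's input size is `hidden_dim` times its number of children plus its count of directly annotated genes. The function name, argument names, and sample hierarchy are hypothetical, and no layers are actually built.

```python
def term_input_sizes(children, direct_gene_counts, hidden_dim):
    """Compute each term's input size, processing leaves first.

    children: {term: [child terms]}, with every term present as a key;
    direct_gene_counts: {term: int}; hidden_dim: output dimension assigned
    to every term (assumed uniform). Returns (input_sizes, layers), where
    layers lists the terms handled in each bottom-up pass.
    """
    remaining = {term: set(kids) for term, kids in children.items()}
    input_sizes, layers = {}, []
    while remaining:
        # Current leaves: terms whose children have all been processed.
        leaves = sorted(t for t, kids in remaining.items()
                        if kids.isdisjoint(remaining))
        if not leaves:
            raise ValueError("hierarchy contains a cycle")
        for term in leaves:
            child_dims = hidden_dim * len(children[term])
            input_sizes[term] = child_dims + direct_gene_counts.get(term, 0)
            del remaining[term]
        layers.append(leaves)
    return input_sizes, layers


hierarchy = {"root": ["A", "B"], "A": ["C"], "B": [], "C": []}
genes = {"root": 0, "A": 2, "B": 3, "C": 5}
sizes, layers = term_input_sizes(hierarchy, genes, hidden_dim=4)
print(layers)  # [['B', 'C'], ['A'], ['root']]
```

Processing in passes guarantees that a term is only sized after all of its children, which is what lets the parent's input concatenate the children's outputs.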
- forward(x)[source]
Defines the forward function of the VNN model.
This method processes the input through the neural network constructed in the VNN class. It applies a series of transformations to the input data, including feature layer operations, batch normalization, and tanh activations. The method aggregates outputs from different terms in the network and finally produces two dictionaries: one for hidden embeddings and one for auxiliary outputs.
- Parameters:
x (torch.Tensor) – Input tensor representing gene data. Each row corresponds to a gene, and columns are features.
- Returns:
A tuple containing two dictionaries:
- hidden_embeddings_map: A mapping from terms to their hidden embeddings.
- aux_out_map: A mapping from terms to their auxiliary output.
- Return type:
tuple
cellmaps_vnn.vnn_trainer module
- class cellmaps_vnn.vnn_trainer.VNNTrainer(data_wrapper)[source]
Bases:
object
Initialize the VNN Trainer.
- Parameters:
data_wrapper (TrainingDataWrapper) – data wrapper containing data necessary for training
- TRAINING_PROGRESS_FILE = 'training_progress.tsv'
Module contents
Top-level package for cellmaps_vnn.