Statistical frameworks

All tools implements robust statistical frameworks. In respect to statistical methodology employed, available web-tools can be divided into two types. The first group (ProfCom_GO, GeneSet2MiRNA, CCancer, ProfCom_PROT_MOTIF, ProfCom_UPSTREAM_MOTIF) employs advanced enrichment schema referred to as "ProfCom". The second group (KEGG spider, R spider, PPI spider, CCancer spider) implements statistical methodology for the network based interpretation of a gene list referred to as "Global network".

ProfCom

In this case the reference knowledge represent split of genes into groups, like, split of genes into functional classes (Gene Ontology) or split of genes into groups based on whether or not they are regulated by the same microRNA. Let us denote each class (i.e. GO term, microRNA, amino acid triplet) as f and the set of all classes as F. In a standard enrichment schema a query list of genes (referred to as list A) and a reference list (referred to as list B, usually all genes from the genome) are compared. For each class f from the set F the number of genes that have been annotated with f in the list A and in the list B is counted and compared. In the next step, the null hypothesis H0 (genes that belong to the set A are independent of having attribute f) is tested. Hypergeometric, binomial or Xi2-tests are usually employed to find over/under represented attributes.

ProfCom extends enrichment schema by construction "complex classes", which are Boolean combination of available classes F. ProfCom uses Boolean operations: intersection, union and difference. For example, intersection of two categories f1 and f2 (AND operator) is the set of genes that belong to both categories f1 and f2. The difference between two categories f1 and f2 (NOT operator) is genes from f1 which are not f2.

Consideration of all possible "complex classes" can lead to combinatorial complexity. For this reason, a search algorithm should be used. ProfCom employs the algorithm based on greedy heuristics. Greedy heuristics does not guarantee to find the optimal solution in every case but significantly reduce the computational complexity. To adjust P-values for multiple testing ProfCom uses both Bonferoni correction and Monte-Carlo simulation approach.

Global Network

In this case the reference knowledge represents gene pairwise associations of any biological essence. Two genes are connected by edge if they affect the state of each other. The model inference procedure is based on natural assumptions:

  • We believe that most genes from the input list are related and most of the genes that are not from the input list are unrelated
These assumptions can be reformulated as standard MINIMAX optimization principle:
  • We want to connect into non-interrupted subnetwork component as many input genes as possible using minimal number of missing genes (genes that are not from the input list)

To implement the above formulated optimal criteria we propose a simple network inference algorithm. We want to connect only input genes with each other even if they are not connected by edge according to the global reference network. For this reason, we introduce a parameter m which is a maximal number of missing genes between any two input nodes (according to the global reference network) to be connected. The model inferred in 3 steps by fixing m to be 0,1,2. At each step we connect by edge any two input genes which have less then or equal to m genes in between in accordance with global reference network. At each step (m = 0, 1, 2) a connected component with maximal number of input genes is inferred an referred to as model D1, D2, D3.

It is clear that for any gene list and a reference network, one can always infer some model, which will cover all genes from the list by relaxing the number of possible intermediate genes. Statistical treatment plays an important role for the quality control of inferred models. All spider tools implement robust statistical framework to estimate p-value of the inferred models. Let us give more strict description of the employed procedure to etimate p-values.

Let us assume that we have N genes from the input list to be mapped to the reference network. Next we refer to the value N as the size of the input list. We infer the network models D1, D2, D3. Let us denote S1, S2, S3 to be the number of input nodes in the inferred network models. We also refer to S1, S2, S3 as the sizes of the respective models D1, D2, D3. Given the number of mapped genes from the input list (N), we consider the sizes of the models (values S1, S2, S3) as statistics. We have to estimate the probability to get models of the same or larger sizes from a randomly generated gene list which has N genes mapped to the reference network.

To generate the background distributions BD1, BD2, BD3 we repeat the following simulation procedure k times (1000 times), where k specifies the upper significance level. A random gene list Lj of size N (equal to the size of the input list) is generated by sampling genes from global gene network. Index j = 1 . . . k specifies each of the k random simulations. The network inference procedure described above is applied to the random list Lj and the network models D1, D2, D3 are inferred. Let us denote the size (the number of input genes) of the inferred models D1, D2, D3 for the random list Lj as R1j, R2j, R3j. Thus, after repeating the simulation procedure k times, we get the background distribution R1j (j = 1 ... k) for models D1, the background distribution R2j (j = 1 ... k) for models D2, and the background distribution R3j (j = 1 . . . k) for models D3. To estimate significance of the inferred network model D1 for the input gene list, the value S1 is compared with the distribution R1j. Let n be the number of values from the distribution R1j that are equal or greater than S1. The estimate of P (p-value) of the inferred network model D1 is computed as P = (n + 1) / k. In the same way the pvalues for the model D2 and D3 are estimated.

There is a very simple test for any similar tool: the tool must be able to recognize a random gene list and return on average insignificant p-values for the random case. In 20 submissions of different randomly generated gene lists on average only 1 case is expected to be significant at the level of 0.05 (1/20). The estimate of the p-value provided by the Monte Carlo procedure corresponds exactly to the definition of p-value: the probability to get a model of the same quality for a random gene list.