Description

The output of genome-wide association studies (GWAS) of various kinds can usually be reduced to a set of genomic regions (loci) associated with a particular phenotype or disease. The bioinformatics challenge is to prioritize the genes residing in the reported loci and to provide a plausible gene model that explains the role of the prioritized genes in the studied phenomenon. Locus Spider is a tool for the functional analysis of a list of genomic regions (loci). It implements a network-based optimization principle to prioritize genes from the submitted loci and to identify the molecular mechanisms (gene subnetworks) implicated by those loci.

As input, Locus Spider requires a list of loci and a reference gene network. The reference gene network can represent any type of gene-gene relation (regulatory, physical protein interaction and so on). By default, Locus Spider currently uses an integrated gene network constructed from three public databases (IntAct, Reactome and KEGG). The network-based optimization principle can be stated as follows: Locus Spider searches the reference gene network for a gene subnetwork that has minimal length and covers as many input loci as possible. The subnetwork length accounts both for the number of edges and for the distances from the implicated genes to the corresponding SNPs.
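The optimization principle can be illustrated with a minimal Python sketch for the simplest case of two loci. This is not Locus Spider's actual implementation. The function names, the toy network, the gene-to-SNP distances and the way the two terms are combined (edge count plus a weighted sum of distances, controlled by the hypothetical parameter distance_weight) are all assumptions made for illustration. The sketch picks one gene per locus so that the connecting path is short and the chosen genes lie close to their SNPs.

```python
from collections import deque
from itertools import product


def shortest_path_edges(network, source, target):
    """Return the number of edges on the shortest path, or None if disconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        gene, depth = queue.popleft()
        for neighbour in network.get(gene, ()):
            if neighbour == target:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


def best_pair_model(network, locus_a, locus_b, distance_weight=1.0):
    """Pick one gene per locus so that the assumed model length is minimal.

    Assumed length = path edges + distance_weight * (gene-to-SNP distances).
    Returns (length, gene_a, gene_b), or None if the loci cannot be connected.
    """
    best = None
    for (gene_a, dist_a), (gene_b, dist_b) in product(locus_a.items(), locus_b.items()):
        n_edges = shortest_path_edges(network, gene_a, gene_b)
        if n_edges is None:
            continue  # these two genes are not connected in the reference network
        length = n_edges + distance_weight * (dist_a + dist_b)
        if best is None or length < best[0]:
            best = (length, gene_a, gene_b)
    return best


# Toy reference network (undirected adjacency lists); gene names are made up.
network = {
    "G1": ["G2"],
    "G2": ["G1", "G3"],
    "G3": ["G2", "G4"],
    "G4": ["G3"],
    "G5": [],
}
# Each locus maps candidate genes to a hypothetical, already scaled distance to the SNP.
locus_a = {"G1": 0.25, "G5": 0.0}
locus_b = {"G4": 0.5, "G3": 1.0}

print(best_pair_model(network, locus_a, locus_b))  # (3.25, 'G1', 'G3')
```

In this toy example G5 lies closest to its SNP but is isolated in the network, so the model uses G1 instead. G3 is preferred over G4 because the shorter path outweighs its larger distance to the SNP. Covering more than two loci requires a search over connected subnetworks, which this sketch does not attempt.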

As output, Locus Spider provides a set of optimal gene network models. For each number of covered loci from 2 to k, the best model is provided, where k is the maximal number of loci that can be connected given the reference gene network (in most cases it is not possible to connect all input loci). The significance (p-value) of the inferred models is estimated by Monte Carlo simulation, in which random loci are sampled 100 times. The random loci are sampled to have the same distribution across chromosomes as the input loci and a similar distribution of the number of genes residing nearby. In each of the 100 simulations, Locus Spider is applied to the random loci to infer the set of optimal gene network models, and for each number of covered loci from 2 to k (where k is the size of the largest model) the length of the best model is recorded. Locus Spider thus derives, for random loci, a distribution (of size 100) of optimal model lengths for each possible number of covered loci (from 2 to n, where n is the number of input loci). The minimal model length for each number of covered input loci is then compared with the corresponding distribution, and the resulting p-value indicates the probability of obtaining the same or a smaller model length for random loci.
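The p-value estimation described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the tool's own code. The function names and data layout are hypothetical. The p-value is taken as the plain fraction of simulations whose best model is as short as or shorter than the observed one. A simulation that yields no model for a given number of covered loci is treated as having infinite length, which is also an assumption.

```python
import math


def empirical_p_value(observed_length, random_lengths):
    """Fraction of random model lengths that are <= the observed length."""
    if not random_lengths:
        raise ValueError("at least one simulation is required")
    hits = sum(1 for length in random_lengths if length <= observed_length)
    return hits / len(random_lengths)


def p_values_by_coverage(observed, simulations):
    """Compute one p-value per number of covered loci.

    observed:    {number_of_covered_loci: minimal model length} for the input loci
    simulations: list of dicts of the same shape, one per random run
    """
    result = {}
    for covered, length in observed.items():
        # Assumption: a run with no model of this size counts as infinitely long.
        null_lengths = [sim.get(covered, math.inf) for sim in simulations]
        result[covered] = empirical_p_value(length, null_lengths)
    return result


# Toy data with 4 simulations instead of 100; all lengths are made up.
observed = {2: 3.0, 3: 6.5}
simulations = [
    {2: 2.5, 3: 8.0},
    {2: 4.0},            # this random run could not connect 3 loci
    {2: 3.0, 3: 6.0},
    {2: 5.0, 3: 9.0},
]

print(p_values_by_coverage(observed, simulations))  # {2: 0.5, 3: 0.25}
```

With 100 simulations the smallest non-zero p-value obtainable this way is 0.01, so an estimate of 0 only means that none of the random runs produced a model as short as the observed one.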