DataNaut
Services
Articles & Whitepapers
The best way to understand what we do is to learn what we’ve done for other businesses and how we did it.



TIGR MEV Algorithms - Putting Math to Work

TIGR MEV uses clustering algorithms to help researchers interpret experimental data. These algorithms are driven by gene expression levels and built on complex mathematical formulas. Many of them are also used in other sectors, such as the military, finance, and even computer games. Without clustering algorithms, researchers would have to sift through hundreds of megabytes of data and perform millions of comparisons, so thankfully we have mathematics to lend a hand.

DataNaut adapted or implemented six algorithms in Java and C++ and performed extensive research on how to parallelize each algorithm. The following is a short list of algorithms that are supported by TIGR MEV.

Hierarchical Clustering (Eisen et al. 1998) - Hierarchical cluster analysis is a statistical method for finding relatively homogeneous clusters of cases based on measured characteristics. In TIGR MEV, the analysis starts with each case in a separate cluster and then combines clusters sequentially, reducing the number of clusters at each step until only one cluster is left. When there are N cases, this involves N-1 clustering steps, or fusions. The process can be represented as a tree, or dendrogram, in which each join corresponds to one step in the clustering process.
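The fusion process can be illustrated with a minimal Python sketch. It is not TIGR MEV's implementation: the average-linkage rule, the Euclidean distance, the function names, and the toy expression profiles are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist

def average_linkage(a, b, points):
    """Mean Euclidean distance over all cross-cluster pairs (assumed linkage rule)."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points):
    """Fuse the two closest clusters until one remains; return the fusion history."""
    clusters = [(i,) for i in range(len(points))]  # every case starts in its own cluster
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)   # the fused cluster replaces its two parents
        merges.append((a, b))    # one entry per join in the dendrogram
    return merges

# Hypothetical expression profiles for five cases
profiles = [(1.0, 1.1), (1.2, 0.9), (5.0, 5.2), (5.1, 4.9), (9.0, 0.5)]
history = agglomerate(profiles)
print(len(history))  # N - 1 = 4 fusions for N = 5 cases
```

Each tuple in `history` names the two clusters joined at that step, which is the information a dendrogram draws. This naive version recomputes every pairwise distance at every step, so it is only practical for small inputs.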

K-Means Clustering (Soukas et al. 2000) - K-means clustering is useful when the user has an a priori hypothesis about the number of clusters that the genes should group into. Objects are partitioned into a fixed number (k) of clusters, such that the clusters are internally similar but externally dissimilar. The result is k clusters of genes; no dendrograms are produced. The process is conceptually simple but can be computationally intensive.
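A small Python sketch of the standard assign-then-update loop shows why the process is simple in concept yet potentially expensive: every pass compares every gene with every centroid. The seeding rule (the first k points), the function name, and the sample data are assumptions made for illustration, and results generally depend on how the centroids are seeded.

```python
from math import dist

def kmeans(points, k, iterations=100):
    """Partition points into k clusters by alternating assignment and centroid update."""
    centroids = [points[i] for i in range(k)]  # assumption: seed with the first k points
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid in place
        if updated == centroids:  # no centroid moved, so the partition is stable
            break
        centroids = updated
    return labels, centroids

# Hypothetical two-condition expression profiles
genes = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.9, 5.1)]
labels, centroids = kmeans(genes, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```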

Self-Organizing Maps (Tamayo et al. 1999) - A self-organizing map (SOM) is a neural network-based divisive clustering approach whose result is a set of clusters. A SOM assigns genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition. It is the process of defining these reference vectors that distinguishes SOMs from k-means clustering. Before initiating the analysis, the user defines a geometric configuration for the partitions, typically a two-dimensional rectangular or hexagonal grid. Random vectors are generated for each partition, but before genes can be assigned to partitions, the vectors are first ‘trained’ using an iterative process that continues until convergence, so that the data are most effectively separated.
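The training idea can be sketched in a few lines of Python. This is a simplified illustration rather than TIGR MEV's code: it assumes a rectangular grid, a Gaussian neighborhood, linearly decaying learning rate and radius, and a fixed number of epochs in place of a true convergence test. All names and parameter values are hypothetical.

```python
import math
import random

def best_match(gene, refs):
    """Index of the partition whose reference vector is closest to the gene."""
    return min(range(len(refs)), key=lambda n: math.dist(gene, refs[n]))

def train_som(genes, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train one reference vector per grid partition; return the trained vectors."""
    rng = random.Random(seed)
    dim = len(genes[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    refs = [[rng.random() for _ in range(dim)] for _ in grid]  # random starting vectors
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                    # learning rate shrinks over time
        radius = max(radius0 * decay, 0.5)  # so does the neighborhood radius
        for gene in genes:
            win_r, win_c = grid[best_match(gene, refs)]
            for (r, c), ref in zip(grid, refs):
                d2 = (r - win_r) ** 2 + (c - win_c) ** 2
                pull = lr * math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    # Move the winner and its grid neighbors toward the gene
                    ref[i] += pull * (gene[i] - ref[i])
    return refs

# Hypothetical expression vectors scaled to the range 0..1
genes = [(0.1, 0.2), (0.15, 0.1), (0.9, 0.8), (0.85, 0.95)]
refs = train_som(genes, rows=2, cols=2)
assignments = [best_match(g, refs) for g in genes]
print(assignments)
```

Because neighboring partitions on the grid are pulled together during training, partitions that end up adjacent tend to hold similar expression patterns, which k-means does not provide.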

Relevance Networks (Atul J. Butte, Pablo Tamayo, Donna Slonim, Todd R. Golub, and Isaac S. Kohane) - This algorithm builds a relevance network in which nodes correspond to genes and links correspond to the degree of similarity between them. Only the statistically significant links are retained. The Pearson correlation coefficient is used as the similarity measure between genes, so the algorithm calculates all pairwise correlation coefficients. If a coefficient is large enough, the corresponding genes are connected; otherwise they are left disconnected. After the network is built, a graph layout algorithm arranges the nodes on the chart. For more detailed information about the algorithm, see A Closer Look at the Relevance Network Algorithm.
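The pairwise-correlation step can be sketched as follows. The threshold value, the use of the absolute value (so that strong negative correlations also connect genes), the function names, and the gene data are assumptions for illustration. The layout step is omitted.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # undefined for constant vectors

def relevance_network(expression, threshold=0.9):
    """Connect every pair of genes whose |r| meets the (assumed) threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

# Hypothetical expression levels across five experiments
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.5],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneD": [5.0, 4.0, 3.0, 2.0, 1.0],
}
edges = relevance_network(expression)
print([(g1, g2) for g1, g2, _ in edges])
# [('geneA', 'geneB'), ('geneA', 'geneD'), ('geneB', 'geneD')]
```

With G genes the loop evaluates G(G-1)/2 pairs, which is where the millions of comparisons come from on real microarray data.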

Principal Component Analysis (Raychaudhuri et al. 2000) - Analyzing microarray data means searching for genes that have similar, correlated patterns of expression. Such correlation indicates that some of the data might contain redundant information. For example, if a group of experiments were more closely related than we had expected, we could ignore some of the redundant experiments, or use an average of them, with little loss of information. PCA (closely related to, and often computed with, singular value decomposition) is a mathematical technique that exploits this redundancy to pick out patterns in the data while reducing the effective dimensionality of gene-expression space without significant loss of information. This technique can be applied to both genes and experiments as a means of classification.
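To make the dimensionality reduction concrete, the sketch below finds only the first principal component. It uses power iteration on the covariance matrix rather than a full singular value decomposition, a substitution chosen to keep the example free of external libraries. The function name and the data are hypothetical.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return the dominant direction of variation and each row's score along it."""
    n, dim = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    # Sample covariance matrix between experiments (columns)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration; assumes the start vector is not orthogonal to the answer
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Project every gene onto the component: one number replaces dim measurements
    scores = [sum(c * v for c, v in zip(row, vec)) for row in centered]
    return vec, scores

# Hypothetical genes (rows) by experiments (columns); the second experiment
# is nearly a scaled copy of the first, so the two are largely redundant
data = [(1.0, 2.1, 0.5), (2.0, 3.9, 0.4), (3.0, 6.0, 0.6), (4.0, 8.1, 0.5)]
component, scores = first_principal_component(data)
print(component)
print(scores)
```

The component weights show which experiments move together, and the scores give each gene a single coordinate that captures most of the variation in this toy data set.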

Support Vector Machines (Brown et al. 2000) - Support Vector Machines (SVMs) are considered a supervised machine learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers.
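The supervised idea, learning from genes of known function and then classifying unknown ones, can be sketched with a linear SVM trained by subgradient descent on the hinge loss. This is a deliberately simplified stand-in: it uses no kernel (so it does not show the choice of similarity function), and the function names, hyperparameters, and data are assumptions for illustration.

```python
def train_linear_svm(samples, labels, epochs=100, lr=0.1, lam=0.01):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: push the boundary away from x
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with room to spare: only apply the regularization shrink
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical training set: +1 = gene known to belong to a functional class,
# -1 = gene known not to belong; values are log expression ratios
known = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-2.0, -3.0)]
function = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(known, function)

unknown_gene = (2.8, 2.6)
print(predict(w, b, unknown_gene))  # 1: predicted to share the known function
```

Only the training genes that fall inside the margin trigger the corrective update, which hints at the sparseness property: the final boundary is determined by a small subset of the data.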




TIGR MEV is an open source bioinformatics system used for computational microarray analysis. Portions of this software were developed by DataNaut Inc.; however, all rights and title in and to this software are owned and retained by The Institute for Genomic Research. If you are interested in obtaining the software, visit the TIGR web site.

DataNaut provides software development consulting services and has extensive expertise in microarray technologies. Organizations that are interested in using DataNaut consulting services or in having TIGR MEV customized for specific research applications can send email to info@datanaut.com.

© 2012 Datanaut, Inc. All Rights Reserved.