
The best way to understand what we do is to learn what we’ve done for
other businesses and how we did it.
TIGR MEV Algorithms - Putting Math to Work
TIGR MEV uses clustering algorithms to help researchers interpret experimental data. These clustering
algorithms are driven by gene expression levels and powered by complex mathematical formulas. Many of the
clustering algorithms used by TIGR MEV are also used in other sectors, such as the military, finance, and
even computer games. Without clustering algorithms, researchers would have to sift through hundreds of
megabytes of data and perform millions of comparisons, so thankfully we have mathematics to lend a hand.
DataNaut adapted or implemented six algorithms in Java and C++ and performed extensive research on how to
parallelize each algorithm. The following is a short list of algorithms that are supported by TIGR MEV.
Hierarchical Clustering (Eisen et al. 1998) -
Hierarchical cluster analysis is a statistical method for finding relatively homogeneous clusters of cases based on
measured characteristics. In TIGR MEV this starts with each case in a separate cluster and then combines the clusters
sequentially, reducing the number of clusters at each step until only one cluster is left. When there are N cases,
this involves N-1 clustering steps, or fusions. This hierarchical clustering process can be represented as a tree, or
dendrogram, in which each join illustrates one step in the clustering process.
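The bottom-up process described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the TIGR MEV implementation: the Euclidean distance, the average-linkage rule, and the names `agglomerate` and `profiles` are assumptions chosen for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters (assumed linkage rule)."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data):
    """Start with one cluster per case and fuse the closest pair until one cluster is left.

    Returns the fusions in order, each as (cluster_a, cluster_b, distance).
    """
    clusters = [(i,) for i in range(len(data))]
    fusions = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        fusions.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return fusions

# Hypothetical expression profiles for five genes (two measurements each).
profiles = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.9], [9.0, 9.0]]
fusions = agglomerate(profiles)
```

With N = 5 profiles the loop performs N-1 = 4 fusions, and the ordered list of fusions is exactly the information a dendrogram draws.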
K-Means Clustering (Soukas et al. 2000) -
K-means clustering is useful when the user has an a priori hypothesis about the number of clusters that the genes
should group into. The result of the algorithm is k clusters of genes. In k-means clustering, objects are partitioned
into a fixed number (k) of clusters, such that the clusters are internally similar but externally dissimilar; no
dendrograms are produced. The process involved in k-means clustering is conceptually simple, but can
be computationally intensive.
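A minimal sketch of the partitioning loop follows. It uses the standard assign-then-update iteration; seeding the centroids with the first k profiles, the squared Euclidean distance, and the names `k_means` and `profiles` are assumptions made for the example, not details taken from TIGR MEV.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def k_means(data, k, max_iter=100):
    """Partition data into k clusters; returns (labels, centroids)."""
    # Assumption: seed the centroids with the first k profiles to keep the example deterministic.
    centroids = [list(p) for p in data[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in data
        ]
        if new_assignment == assignment:
            break  # no profile changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

# Hypothetical expression profiles; the user hypothesises k = 2 groups.
profiles = [[1.0, 1.1], [5.0, 5.2], [1.2, 0.9], [5.1, 4.9], [0.9, 1.0]]
labels, centroids = k_means(profiles, k=2)
```

Every pass compares each profile with each centroid, which is where the computational cost comes from on large data sets.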
Self-Organizing Maps (Tamayo et al. 1999) -
A self-organizing map (SOM) is a neural network-based divisive clustering approach where the result is a
set of clusters. A SOM assigns genes to a series of partitions on the basis of the similarity of their
expression vectors to reference vectors that are defined for each partition. It is the process of defining these
reference vectors that distinguishes SOMs from k-means clustering. Before initiating the analysis, the user defines
a geometric configuration for the partitions, typically a two-dimensional rectangular or hexagonal grid.
Random vectors are generated for each partition, but before genes can be assigned to partitions, the vectors are
first ‘trained’ using an iterative process that continues until convergence so that the data are most effectively separated.
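The training step can be illustrated with a small sketch. It assumes a rectangular grid, a Gaussian neighbourhood, a linearly decaying learning rate, and a fixed number of iterations in place of a convergence test; these choices, and the names `train_som` and `assign`, are assumptions for the example rather than the TIGR MEV settings.

```python
import math
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, rows, cols, iterations=500, alpha0=0.5, seed=0):
    """Train one reference vector per partition of a rows x cols rectangular grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Random reference vectors are generated for each partition.
    refs = [[rng.random() for _ in range(dim)] for _ in grid]
    radius0 = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = 1.0 - t / iterations          # shrinks from 1 towards 0
        alpha = alpha0 * frac                # learning rate
        radius = max(radius0 * frac, 0.5)    # neighbourhood width on the grid
        x = rng.choice(data)
        # The winner is the partition whose reference vector is closest to x.
        winner = min(range(len(grid)), key=lambda i: squared_distance(x, refs[i]))
        for i, pos in enumerate(grid):
            # Partitions near the winner on the grid are pulled towards x as well.
            h = math.exp(-squared_distance(pos, grid[winner]) / (2.0 * radius ** 2))
            refs[i] = [w + alpha * h * (xi - w) for w, xi in zip(refs[i], x)]
    return refs

def assign(data, refs):
    """Assign each gene to the partition with the most similar reference vector."""
    return [min(range(len(refs)), key=lambda i: squared_distance(x, refs[i])) for x in data]

# Hypothetical expression vectors and a 2 x 2 grid of partitions.
profiles = [[0.1, 0.2], [0.15, 0.1], [0.9, 0.8], [0.85, 0.95], [0.5, 0.5]]
refs = train_som(profiles, rows=2, cols=2)
partitions = assign(profiles, refs)
```

Genes are assigned only after training has finished, which matches the order of steps described above.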
Relevance Networks (Atul J. Butte, Pablo Tamayo, Donna Slonim, Todd R. Golub, and Isaac S. Kohane) -
This algorithm builds a relevance network in which nodes correspond to genes and edges correspond to the
degree of similarity between them. Only the statistically significant edges are retained. The Pearson
correlation coefficient is used as the similarity measure between genes, so the algorithm calculates all
pairwise correlation coefficients. If a coefficient is large enough, the corresponding genes are connected;
otherwise they remain disconnected. After the network is built, a graph layout algorithm is used to arrange
the nodes on the chart.
For more detailed information about the Relevance Networks algorithm, see
A Closer Look at the Relevance Network Algorithm.
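The pairwise comparison and thresholding can be sketched as follows. The threshold of 0.9, the use of the absolute value of the coefficient (so that strong negative correlations also form edges), and the names `relevance_network` and `expression` are assumptions for illustration; the graph layout step is omitted.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sx * sy)

def relevance_network(expression, threshold=0.9):
    """Return {(gene1, gene2): r} for every pair whose |r| reaches the threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        # Assumption: compare the magnitude of r against the threshold.
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges

# Hypothetical expression levels across four experiments.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 1.0, 3.0],
}
edges = relevance_network(expression)
```

In this example geneA, geneB, and geneC end up mutually connected, while geneD correlates too weakly with the others and stays disconnected.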
Principal Component Analysis (Raychaudhuri et al. 2000) -
An analysis of microarray data is a search for genes that have similar, correlated patterns of expression.
This indicates that some of the data might contain redundant information.
For example, if a group of experiments were more closely related than we had expected,
we could ignore some of the redundant experiments, or use some average of them, without significant loss of
information. PCA (closely related to singular value decomposition) is a mathematical technique that exploits
this redundancy to pick out patterns in the data while reducing the effective dimensionality of gene-expression
space without significant loss of information. This technique can be applied to both genes and experiments as a
means of classification.
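To make the idea of redundant experiments concrete, the sketch below finds the first principal component of a small data set. It uses power iteration on the covariance matrix as a simple stand-in for a full singular value decomposition; that substitution, the sample data, and the helper names are assumptions for the example.

```python
import math

def center(data):
    """Subtract each experiment's mean so that every column has mean zero."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]

def covariance(centered):
    """Sample covariance matrix between experiments (columns)."""
    n = len(centered)
    dim = len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]

def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim))
    return v, eigenvalue

# Hypothetical data: rows are genes, columns are three experiments.
# Experiment 2 is exactly twice experiment 1, so the two are redundant.
data = [[1.0, 2.0, 0.5], [2.0, 4.0, 0.4], [3.0, 6.0, 0.6], [4.0, 8.0, 0.5]]
cov = covariance(center(data))
pc1, variance = first_component(cov)
explained = variance / sum(cov[i][i] for i in range(len(cov)))
```

Here a single component accounts for more than 99% of the total variance, so the three experiments could be summarised along one axis with very little loss of information.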
Support Vector Machines (Brown et al. 2000) -
Support Vector Machines (SVMs) are considered a supervised machine learning method because they exploit prior knowledge
of gene function to identify unknown genes of similar function from expression data.
SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and
self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis,
including their flexibility in choosing a similarity function, sparseness of solution when dealing with
large data sets, the ability to handle large feature spaces, and the ability to identify outliers.
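The supervised setting can be illustrated with a minimal linear SVM trained by subgradient descent on the hinge loss. This is a sketch under stated assumptions: a linear similarity function only (no kernels), hypothetical labels of +1 for genes known to share a function and -1 for genes known not to, and illustrative values for the regularisation strength, learning rate, and number of epochs.

```python
def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the regularised hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * (dot(w, x) + b) < 1.0:
                # Inside the margin or misclassified: move towards the sample and shrink w.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Safely classified: only the regularisation term shrinks w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 if x falls on the positive side of the separating plane, else -1."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical expression profiles of genes with known function (the prior knowledge).
samples = [[2.0, 2.1], [2.5, 1.9], [3.0, 2.4], [-2.0, -1.8], [-2.4, -2.2], [-1.9, -2.5]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)

# Classify two genes of unknown function from their expression profiles.
unknown_a = predict(w, b, [2.2, 2.0])
unknown_b = predict(w, b, [-2.1, -2.0])
```

Replacing `dot` with another similarity function is where the flexibility mentioned above would enter, although a kernelised SVM requires a different training formulation than this sketch.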
TIGR MEV is an open source bioinformatics system used for computational microarray analysis. Portions of
this software were developed by DataNaut Inc.; however, all rights and title in and to this software
are owned and retained by The Institute for Genomic Research. If you are interested in obtaining the
software visit the TIGR web site.
DataNaut provides software development consulting services and has extensive expertise in microarray
technologies. Organizations that are interested in using DataNaut consulting services or having
TIGR MEV customized for specific research applications can send email to info@datanaut.com.