
The best way to understand what we do is to learn what we’ve done for
other businesses and how we did it. |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19
contents |
back |
next
|
A Closer Look at the Relevance Network Algorithm
One new
exciting algorithm that TIGR MEV supports is Relevance Networks. This algorithm runs on the
TIGR MEV grid and locally. Developing this algorithm required an
understanding of biology and mathematics but the most difficult aspect of development was making
the algorithm operate in parallel mode. Below is a closer look at how this algorithm is
implemented and how it was modified to run on the grid.
Computation Aspects
First an overview of the mathematics behind the algorithm. Let's suppose that we have a set of vectors , where , n - the dimension of the space. The similarity between vectors in X can be estimated using Pearson metric
(correlation coefficient):
where and
is the i-th element of vector
and
respectively,
and,
A Relevance Network is an "unoriented" graph
where vertexes
and edges ,
is Cartesian of .
,
were T is threshold value ( ).
Simply stated a Relevance Network is a network in which nodes correspond to vectors and links corresponds
to similarity measures (weight) between nodes and all links in a Relevance Network have a weight greater or equal
to a threshold value. Therefore in order to build a Relevance Network on X we need to calculate all pairwise
distance between vectors in X, compare them with the given threshold value and keep those values
greater or equal to the supplied threshold.
The following is the pseudocode for the Relevance Network algorithm:
TempSet=0
For each x belong to X
TempSet=
TempSet U x For
each y belong to (X- TempSet) If
d(x,y) > =T then <generate link connecting x and y>
End for
End for
This is rather simple and consists of two nested loops only. This algorithm requires
calculations of function d. The problem is that if we have 32768 vectors in 100-dimensional
space the calculations will require a lot of processing time.
A Pentium III 600Mhz PC takes more than 2 hours to perform this calculation with only 100mb of data. In order
to reduce computational time a parallel implementation of Relevance Networks was created with the fundamental
idea to divide all the jobs into sub-jobs then distribute these sub-jobs to machines on the grid.
The first step is to find the formula to define a sub-job with size s. Let's suppose that X is an array of
vectors and X[i] - i-th vector in the array, Size(X) is the size of the array X. Therefore we can rewrite
the above algorithm:
For i=1 to Size(X)
For j=i-1 to 0
If d(X[i],X[j]) >= T then <generate link connecting x and y>
End For
End For
However in the general case we need to state the algorithm for any sub-job:
Sub RelNetSubjob( iStartOuter as integer,
iStartInner asinteger, iSubJobSize as
integer) iCounter= 0
For j =
iStartInner to 0 If d(X[iStartOuter], X[j])
> =T
then <generatelink connecting x
and y> iCounter = iCounter+1 if (iCounter> iSubJobSize) return
End For
For i= iStartInner+1
to Size(X)
For j=
i-1to 0 If
d(X[i],X[j]) > =T then <generate link connecting x and y>
iCounter= iCounter+1
if (iCounter> iSubJobSize) return
End For
End For
End Sub
Using this function we define iStartOuter, iStartInner and iSubJobSize parameters to calculate the desired sub-job.
Then we can define the algorithm to find these parameters for any sub-job. We can start from the
current <iStartOuter1, iStartInner1> and move forward on iOpCount1 values for which new cortege <iStartOuter2, iStartInner2, iOpCount2>
corresponds. After repeating this step we can get the sequence of < iStartOuter1, iStartInner1, iOpCount1>,
< iStartOuter2, iStartInner2, iOpCount2>,...,< iStartOuterk, iStartInnerk, iOpCountk> corteges which defining sub-jobs
covering the total job. This sequence defines decomposition of a Job into a set of sub-jobs. Each sub-job can be
assigned to a single host computer that uses the algorithm to handle the sub-job defined by the appropriate cortege.
The master computer defines the decomposition of job and assigns sub-jobs to the workers.
Lets state that the algorithm can jump from previous <iStartOuteri, iStartInneri, iOpCounti> to the
new <iStartOuteri+1, iStartInneri+1,iOpCounti+1>. Therefore iStartOuteri, iStartInneri, iOpCounti are given. iStartOuteri+1
and iStartInneri+1 needed to be found:
1.
2.
if s > 0 solve the equation:

Here is the solution:

3. Tune iStartInner i+1
Using this strategy we can jump from the current position in calculations to the new one. This algorithm is used
by the Master computer to distribute sub-jobs between worker computers. The iOpCounti value is defined using a
worker host calculation speed. The primary goal of the scheduling algorithm is a proportional workload across all workers.
Communication Aspects of the Algorithm
TIGR
MEV uses PVM as the distributed calculation environment where calculation hosts are connected via the LAN. The data
transfer speed is limited by LAN capacity and is why iOpCounti values should not be too small otherwise most of the
time will be wasted moving data between the master and workers on the grid. We can estimate the time needed for
sub-jobs execution using the following formula:
, where
and
are times respective of sending and receiving results and fltOpsPerSec is the time in seconds for performing one elementary
operation (For example an operation is the Relevance Network distance function calculation). Using this formula we can calculate iOpCounti
for the appropriate worker with fltOpsPerSec computation speed estimation. We also should take into account the fact that
. In
other words the computation time should be much greater than
communications time. It is critical to reduce the amount of
communication overhead because workers will be moving data
instead of crunching it.
|
Page 18 of 19
contents |
back |
next
TIGR MEV is an open source bioinformatics system used for computational microarray analysis. Portions of
this software were developed by DataNaut Inc.; however, all rights and title in and to this software
are owned and retained by The Institute for Genomic Research. If you are interested in obtaining the
software visit the TIGR web site.
DataNaut provides software development consulting services with extensive expertise with microarray
technologies. Organizations that are interested in using DataNaut consulting services or having
TIGR MEV customized for specific research applications can send email to info@datanaut.com.
|
|
|