DataNaut
     Company | Approach | Services | Careers | Contact | Sitemap | Home     
Services
Articles & Whitepapers
The best way to understand what we do is to learn what we’ve done for other businesses and how we did it.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19  contents | back | next


A Closer Look at the Relevance Network Algorithm

One new exciting algorithm that TIGR MEV supports is Relevance Networks. This algorithm runs on the TIGR MEV grid and locally. Developing this algorithm required an understanding of biology and mathematics but the most difficult aspect of development was making the algorithm operate in parallel mode. Below is a closer look at how this algorithm is implemented and how it was modified to run on the grid.

Computation Aspects

First an overview of the mathematics behind the algorithm. Let's suppose that we have a set of vectors , where , n - the dimension of the space. The similarity between vectors in X can be estimated using Pearson metric
(correlation coefficient):

where and is the i-th element of vector and respectively,
and,

A Relevance Network is an "unoriented" graph where vertexes and edges , is Cartesian of . , were T is threshold value ().

Simply stated a Relevance Network is a network in which nodes correspond to vectors and links corresponds to similarity measures (weight) between nodes and all links in a Relevance Network have a weight greater or equal to a threshold value. Therefore in order to build a Relevance Network on X we need to calculate all pairwise distance between vectors in X, compare them with the given threshold value and keep those values greater or equal to the supplied threshold.

The following is the pseudocode for the Relevance Network algorithm:

TempSet=0
For each x belong to X

TempSet=

    TempSet  U x For
	each y belong to (X- TempSet) If
		d(x,y) > =T then <generate link connecting x and y>
	End for
End for



This is rather simple and consists of two nested loops only. This algorithm requires calculations of function d. The problem is that if we have 32768 vectors in 100-dimensional space the calculations will require a lot of processing time.
A Pentium III 600Mhz PC takes more than 2 hours to perform this calculation with only 100mb of data. In order to reduce computational time a parallel implementation of Relevance Networks was created with the fundamental idea to divide all the jobs into sub-jobs then distribute these sub-jobs to machines on the grid.
The first step is to find the formula to define a sub-job with size s. Let's suppose that X is an array of vectors and X[i] - i-th vector in the array, Size(X) is the size of the array X. Therefore we can rewrite the above algorithm:

For i=1 to Size(X)
	For j=i-1 to 0
		If d(X[i],X[j]) >= T then  <generate link connecting x and y>
	End For
End For

However in the general case we need to state the algorithm for any sub-job:

Sub RelNetSubjob( iStartOuter as integer,
iStartInner asinteger, iSubJobSize as
	integer) iCounter=  0
		For j =





     iStartInner to 0 If d(X[iStartOuter], X[j])

     >  =T
     then <generatelink connecting x
          and y> iCounter =    iCounter+1 if (iCounter> iSubJobSize) return
          End For
          For i=  iStartInner+1
     to Size(X)

     For j=
            i-1to 0 If
                d(X[i],X[j]) > =T then <generate link connecting x and y>
                iCounter= iCounter+1
                if (iCounter> iSubJobSize) return
           End For
     End For
End Sub

Using this function we define iStartOuter, iStartInner and iSubJobSize parameters to calculate the desired sub-job. Then we can define the algorithm to find these parameters for any sub-job. We can start from the current <iStartOuter1, iStartInner1> and move forward on iOpCount1 values for which new cortege <iStartOuter2, iStartInner2, iOpCount2> corresponds. After repeating this step we can get the sequence of < iStartOuter1, iStartInner1, iOpCount1>, < iStartOuter2, iStartInner2, iOpCount2>,...,< iStartOuterk, iStartInnerk, iOpCountk> corteges which defining sub-jobs covering the total job. This sequence defines decomposition of a Job into a set of sub-jobs. Each sub-job can be assigned to a single host computer that uses the algorithm to handle the sub-job defined by the appropriate cortege. The master computer defines the decomposition of job and assigns sub-jobs to the workers.

Lets state that the algorithm can jump from previous <iStartOuteri, iStartInneri, iOpCounti> to the new <iStartOuteri+1, iStartInneri+1,iOpCounti+1>. Therefore iStartOuteri, iStartInneri, iOpCounti are given. iStartOuteri+1 and iStartInneri+1 needed to be found:

1.
2. if s > 0 solve the equation:

Here is the solution:

3. Tune iStartInner i+1

Using this strategy we can jump from the current position in calculations to the new one. This algorithm is used by the Master computer to distribute sub-jobs between worker computers. The iOpCounti value is defined using a worker host calculation speed. The primary goal of the scheduling algorithm is a proportional workload across all workers.

Communication Aspects of the Algorithm

TIGR MEV uses PVM as the distributed calculation environment where calculation hosts are connected via the LAN. The data transfer speed is limited by LAN capacity and is why iOpCounti values should not be too small otherwise most of the time will be wasted moving data between the master and workers on the grid. We can estimate the time needed for sub-jobs execution using the following formula:

, where and are times respective of sending and receiving results and fltOpsPerSec is the time in seconds for performing one elementary operation (For example an operation is the Relevance Network distance function calculation). Using this formula we can calculate iOpCounti for the appropriate worker with fltOpsPerSec computation speed estimation. We also should take into account the fact that . In other words the computation time should be much greater than communications time. It is critical to reduce the amount of communication overhead because workers will be moving data instead of crunching it.

Page 18 of 19 contents | back | next



TIGR MEV is an open source bioinformatics system used for computational microarray analysis. Portions of this software were developed by DataNaut Inc.; however, all rights and title in and to this software are owned and retained by The Institute for Genomic Research. If you are interested in obtaining the software visit the TIGR web site.

DataNaut provides software development consulting services with extensive expertise with microarray technologies. Organizations that are interested in using DataNaut consulting services or having TIGR MEV customized for specific research applications can send email to info@datanaut.com.

     Company | Approach | Services | Careers | Contact | Sitemap | Home   © 2012 Datanaut, Inc. All Rights Reserved.