As with any distributed system, a critical component is the underlying communications infrastructure used to
shuttle data between the client and back-end services. A researcher can send hundreds of megabytes of data to
the grid for processing, so this was a concern from the start. The answer was a barebones protocol consisting
of a transport layer, job control semantics, and calculation data semantics. In addition to the protocol, a proxy
was needed to serve as a gateway to the grid.
Microarray Gene Expression Markup Language (MAGE-ML) was adapted to govern the exchange of microarray data
between TIGR MEV and the grid. MAGE is an emerging standard used to describe and communicate information about
microarray-based experiments. MAGE is based on XML and can describe microarray designs, microarray manufacturing
information, microarray experiment setup and execution information, gene expression data, and data analysis
results. MAGE does an excellent job of describing microarray data; however, the grid does not need most of the
information MAGE defines. The TIGR MEV grid works with data in terms of math (floating-point numbers), not
biology, and treats data sets as vectors and matrices. Extracting and converting large amounts of data from a
MAGE data set required too much time and too many resources on the grid; therefore, the TIGR MEV client
application uses a subset of MAGE, which reduces conversion time.
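
To illustrate why a reduced payload is cheap to convert, here is a minimal sketch of turning a small XML document into a numeric matrix. The element and attribute names (ExpressionData, Row, gene, cols) are invented for this example and are not taken from the MAGE-ML schema or from the actual TIGR MEV subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical reduced payload. These element and attribute names are
# assumptions made for illustration; they are NOT real MAGE-ML.
SAMPLE = """<ExpressionData cols="3">
  <Row gene="g1">0.5 1.25 -2.0</Row>
  <Row gene="g2">3.0 0.0 4.75</Row>
</ExpressionData>"""


def parse_matrix(xml_text):
    """Return (gene_ids, matrix), where matrix is a list of float rows."""
    root = ET.fromstring(xml_text)
    n_cols = int(root.get("cols"))
    gene_ids, matrix = [], []
    for row in root.findall("Row"):
        # Each row carries whitespace-separated floating-point values.
        values = [float(token) for token in row.text.split()]
        if len(values) != n_cols:
            raise ValueError("row %r has %d values, expected %d"
                             % (row.get("gene"), len(values), n_cols))
        gene_ids.append(row.get("gene"))
        matrix.append(values)
    return gene_ids, matrix


if __name__ == "__main__":
    ids, matrix = parse_matrix(SAMPLE)
    print(ids, matrix)
```

With only identifiers and numbers in the payload, the receiving side does a single pass over the rows and never has to interpret the biological annotations.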
The job control layer of the protocol provides the necessary semantics to initiate a request, stop a request
in progress, and provide notifications about execution progress. This was implemented using XML. MAGE packets are
Base64 encoded and inserted into a job-control packet.
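
The wrapping step can be sketched as follows. This is an illustration under assumptions, not the actual TIGR MEV packet format: the Job and Data element names, the id and command attributes, and the "start" command value are all hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def build_job_packet(job_id, command, payload_xml):
    """Wrap an XML payload in a job-control packet (hypothetical layout)."""
    job = ET.Element("Job", {"id": job_id, "command": command})
    data = ET.SubElement(job, "Data", {"encoding": "base64"})
    # Base64 keeps the inner XML opaque to the outer XML parser.
    data.text = base64.b64encode(payload_xml.encode("utf-8")).decode("ascii")
    return ET.tostring(job, encoding="unicode")


def read_job_packet(packet_xml):
    """Return (job_id, command, payload_xml) from a job-control packet."""
    job = ET.fromstring(packet_xml)
    encoded = job.findtext("Data", default="")
    payload_xml = base64.b64decode(encoded).decode("utf-8")
    return job.get("id"), job.get("command"), payload_xml


if __name__ == "__main__":
    inner = '<ExpressionData cols="2"><Row gene="g1">1.0 2.0</Row></ExpressionData>'
    packet = build_job_packet("job-1", "start", inner)
    print(packet)
    print(read_job_packet(packet))
```

One likely benefit of this design is that the job-control parser can read the envelope without parsing or escaping the embedded data set. The cost is size: Base64 output is roughly one third larger than its input.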
The transport layer of the protocol provides the mechanism for transferring job control packets up to 100 MB in size
between client and server. HTTP was selected as the underlying transport because of its flexibility and industry support.
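
As a rough sketch of what carrying a job-control packet over HTTP could look like from the client side, the snippet below builds a POST request with Python's standard library. The gateway URL and the content type are assumptions; the source does not document the real endpoint or headers. The request is only constructed here, not sent.

```python
import urllib.request

# Hypothetical endpoint; the real gateway address is not given in the text.
GATEWAY_URL = "http://gateway.example.org/mev/gateway"


def build_post_request(packet_xml, url=GATEWAY_URL):
    """Build (but do not send) an HTTP POST carrying a job-control packet."""
    body = packet_xml.encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed type
        method="POST",
    )


if __name__ == "__main__":
    request = build_post_request('<Job id="job-1" command="start"/>')
    print(request.get_method(), request.full_url, len(request.data))
    # Sending would be: urllib.request.urlopen(request)
```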
In conjunction with the protocol, a communications gateway was developed to serve as a proxy between the TIGR MEV
client and the grid. All clients send data through the gateway and thus never communicate directly with the grid.
The gateway speaks the TIGR MEV protocol and knows how to handle each job, preparing it for execution and shuttling
the results and status back to the client. Isolating the grid in this manner has its advantages, but it can be
argued that using a gateway creates a bottleneck. This was a concern, but testing indicates that the gateway will
support the workload of the TIGR research team.
The TIGR MEV Communications Gateway is implemented as a servlet running on the Apache Tomcat 3.2 servlet engine.
The gateway leverages the HTTP session mechanism provided by the servlet engine to implement asynchronous HTTP
communications between the gateway and client. The client sends a job to the gateway, then polls for a result via
HTTP. After receiving a job, the gateway spawns a PVM process (the analysis algorithm) and redirects the job to it.
The PVM process parses the request, performs the calculations, and sends the result back to the gateway. The client
receives the result from the gateway during the next polling request.
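
The submit-then-poll pattern described above can be sketched with an in-memory stand-in for the gateway. This is a simulation under assumptions, not the servlet itself: the Gateway class, its submit and poll methods, and the status strings are hypothetical, a background thread stands in for the spawned PVM process, and a generated session identifier stands in for the HTTP session.

```python
import threading
import time
import uuid


class Gateway:
    """In-memory stand-in for the gateway; the real one sits behind HTTP."""

    def __init__(self, worker):
        self._worker = worker      # plays the role of the analysis algorithm
        self._jobs = {}            # session id -> (status, result)
        self._lock = threading.Lock()

    def submit(self, job):
        """Accept a job, start it in the background, return a session id."""
        session_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[session_id] = ("RUNNING", None)

        def run():
            result = self._worker(job)
            with self._lock:
                self._jobs[session_id] = ("DONE", result)

        threading.Thread(target=run, daemon=True).start()
        return session_id

    def poll(self, session_id):
        """Return the current (status, result) pair for a session."""
        with self._lock:
            return self._jobs[session_id]


def submit_and_wait(gateway, job, interval=0.01, timeout=5.0):
    """Client side: send the job, then poll until the result is ready."""
    session_id = gateway.submit(job)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, result = gateway.poll(session_id)
        if status == "DONE":
            return result
        time.sleep(interval)
    raise TimeoutError("job did not finish within %.1f seconds" % timeout)


if __name__ == "__main__":
    def row_means(rows):
        return [sum(row) / len(row) for row in rows]

    gateway = Gateway(worker=row_means)
    print(submit_and_wait(gateway, [[1.0, 3.0], [2.0, 4.0]]))
```

Polling keeps every exchange a short, client-initiated request, so a long calculation never has to hold a connection open. The trade-off is that the result arrives up to one polling interval after it is ready.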
TIGR MEV is an open source bioinformatics system used for computational microarray analysis. Portions of
this software were developed by DataNaut Inc.; however, all rights and title in and to this software
are owned and retained by The Institute for Genomic Research. If you are interested in obtaining the
software, visit the TIGR web site.
DataNaut provides software development consulting services and has extensive expertise in microarray
technologies. Organizations that are interested in using DataNaut consulting services or having
TIGR MEV customized for specific research applications can send email to info@datanaut.com.