Using XML to Mine Message Boards
This article discusses using Extensible Stylesheet Language (XSL) and several other XML technologies to extract valuable
data from message boards. It is intended for developers who have experience with XML.
Many online communities, such as Yahoo, IMDB and Raging Bull, offer message boards as a forum for online discussion.
Locked in these message boards is a goldmine of information on almost any imaginable topic, contributed by people
all over the world.
Many message boards are organized by topic, such as a movie, a company or a person. Each topic has a dedicated message board where
messages related to the topic are posted. For example, on the
Yahoo Finance Red Hat message board
there is a list of message hyperlinks in descending order by post date. This message board has over 80,000 posts, and some boards have
over 500,000 posts. Messages are usually organized into threads, where the first post initiates a thread and
subsequent messages (replies) refer to the first message. Messages and threads carry an identifier that is used to
arrange the posts in the correct order.
Drilling down to a
message on Yahoo, you see the message body along with the subject, the author's nickname, the post date and other
message identifiers. The message body can contain links, special characters and text. Yahoo also provides the
discussion thread as part of the message page; however, not all boards provide this handy list.
The problem is that to mine message boards for interesting information one must manually browse thousands of messages. Instead of doing
this by hand, it is possible to automate part of the process by using a crawler to collect and process messages. A message board crawler
needs to traverse all message list pages, extract each message link, then follow each link and collect the content of each message.
After collection, the data has to be extracted from the HTML of each message page and stored for subsequent analysis.
Listing 1 and Listing 2
illustrate data structures used to store the data from a message list page and a message page. Listing 3 then
illustrates the navigation strategy that a crawler can use to reach each message.
Listing 1. Message List Page Data Structure
<list>
  <next url="www.aaa.com?start=11122222"/>
  <prev url="www.aaa.com?start=11122233"/>
  <messages>
    <message date="1999-05-31T13:20:00-05:00" url="http://www.aaa.com?msgid=55"/>
    <message date="1999-05-31T13:20:00-05:00" url="http://www.aaa.com?msgid=56"/>
  </messages>
</list>
Note: If a page does not have a next or previous page, the corresponding <next> or <prev> tag is not present.
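As a rough illustration of how a crawler might consume a document shaped like Listing 1, the sketch below reads one with Python's standard xml.etree.ElementTree module. The names ListPage and parse_list_page are hypothetical and chosen only for this example; the sample document omits <prev> to show how a missing tag is handled.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ListPage:
    next_url: Optional[str]   # None when the page has no <next> element
    prev_url: Optional[str]   # None when the page has no <prev> element
    message_urls: List[str]


def parse_list_page(xml_text: str) -> ListPage:
    """Read a document shaped like Listing 1 (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    next_el = root.find("next")
    prev_el = root.find("prev")
    return ListPage(
        next_url=next_el.get("url") if next_el is not None else None,
        prev_url=prev_el.get("url") if prev_el is not None else None,
        message_urls=[m.get("url") for m in root.findall("messages/message")],
    )


# Sample input: the first list page of a board, so there is no <prev>.
SAMPLE_LIST = """<list>
  <next url="www.aaa.com?start=11122222"/>
  <messages>
    <message date="1999-05-31T13:20:00-05:00" url="http://www.aaa.com?msgid=55"/>
    <message date="1999-05-31T13:20:00-05:00" url="http://www.aaa.com?msgid=56"/>
  </messages>
</list>"""

page = parse_list_page(SAMPLE_LIST)
```

Representing a missing <next> or <prev> as None gives the crawler a simple stop condition when it reaches the last list page.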
Listing 2. Message Page Data Structure
<message id="Id" date="1999-05-31T13:20:00-05:00">
  <parent id="Id"/>
  <author id="nick"/>
  <subject>string</subject>
  <body>string</body>
</message>
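A message document shaped like Listing 2 can be loaded into a small record in the same way. The sketch below is one possible reading, not a fixed design: the Message class and parse_message function are hypothetical names, and treating <parent> as optional (for a post that starts a thread) is an assumption the article does not state.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    message_id: str
    date: str                  # kept as the original ISO 8601 string
    parent_id: Optional[str]   # assumption: None for a thread-starting post
    author: str
    subject: str
    body: str


def parse_message(xml_text: str) -> Message:
    """Read a document shaped like Listing 2 (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    parent = root.find("parent")
    return Message(
        message_id=root.get("id"),
        date=root.get("date"),
        parent_id=parent.get("id") if parent is not None else None,
        author=root.find("author").get("id"),
        subject=root.findtext("subject", default=""),
        body=root.findtext("body", default=""),
    )


SAMPLE_MESSAGE = """<message id="56" date="1999-05-31T13:20:00-05:00">
  <parent id="55"/>
  <author id="nick"/>
  <subject>Re: earnings</subject>
  <body>Agreed.</body>
</message>"""

msg = parse_message(SAMPLE_MESSAGE)
```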
Listing 3. Crawling Algorithm for a Message Board
current_page := Get_First_Board_Page;
while( true ) {
    // collect every message linked from this list page
    visit_each_message_page( current_page );
    // stop when there is no further list page
    if( !current_page.has_next )
        break;
    current_page := current_page.get_next;
}
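The loop in Listing 3 can be sketched in Python as follows. This is only an outline under stated assumptions: crawl_board, fetch_list_page and visit_message are hypothetical names, the "board" is an in-memory dictionary rather than a real site, and the seen_pages guard against looping links is an addition that Listing 3 does not include.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (message URLs on the page, URL of the next list page or None)
ListPageInfo = Tuple[List[str], Optional[str]]


def crawl_board(first_url: str,
                fetch_list_page: Callable[[str], ListPageInfo],
                visit_message: Callable[[str], None]) -> int:
    """Follow 'next' links from the first list page, visiting each message."""
    seen_pages = set()
    visited = 0
    url: Optional[str] = first_url
    while url is not None and url not in seen_pages:
        seen_pages.add(url)  # extra guard: stop if the links ever loop
        message_urls, next_url = fetch_list_page(url)
        for message_url in message_urls:
            visit_message(message_url)
            visited += 1
        url = next_url       # None on the last page ends the loop
    return visited


# Stand-in for a real board: two list pages holding three messages.
FAKE_BOARD: Dict[str, ListPageInfo] = {
    "page1": (["msg55", "msg56"], "page2"),
    "page2": (["msg57"], None),
}

collected: List[str] = []
count = crawl_board("page1", FAKE_BOARD.__getitem__, collected.append)
```

Passing the fetch and visit steps in as functions keeps the traversal separate from downloading and parsing, so the same loop can be exercised without network access.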
Developing a crawler is outside the scope of this article, but if you are interested in finding crawling code, check
SourceForge and search for crawler or spider. Instead, the focus
is on the process used to extract the data from the HTML that defines the message board pages. The process uses XHTML,
XSL, XSLT and XPath to get the job done. While this may seem like acronym soup, these technologies work very nicely together and
offer an excellent way to extract data from web pages.
XHTML, the Extensible Hypertext Markup Language, is a reformulation of
HTML 4 as an XML 1.0 application. Converting HTML to XHTML allows you to use the Document Object Model (DOM) to
reference specific elements in a document. The problem with converting HTML to XHTML is that most of the HTML on the
Internet today follows poor coding standards. Missing tags, incorrect usage, browser-specific hacks and other
oddities make conversion to XHTML difficult.
Rather than converting to XHTML manually,
HTML Tidy can be used. HTML Tidy is a
very helpful tool that is typically used to validate HTML code, checking for missing tags and other HTML coding errors.
HTML Tidy also provides configuration options that make it convert HTML to XHTML. The process can
be controlled from code, so the conversion to XHTML can be completely automated.
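One way to automate the conversion is to run the tidy command-line program from a script. The sketch below assumes a tidy executable is installed and on the PATH; tidy_command and html_to_xhtml are hypothetical helper names. The flags shown (-asxhtml, -numeric, -utf8, -quiet, --show-warnings, --force-output) are standard Tidy options, but check them against the version you have installed.

```python
import subprocess
from typing import List


def tidy_command(executable: str = "tidy") -> List[str]:
    """Build the argument list for an HTML-to-XHTML conversion."""
    return [
        executable,
        "-asxhtml",                # write XHTML instead of HTML
        "-numeric",                # numeric entities, so no HTML DTD is needed
        "-utf8",                   # read and write UTF-8
        "-quiet",                  # suppress non-essential output
        "--show-warnings", "no",   # keep stderr short
        "--force-output", "yes",   # still write output for badly broken pages
    ]


def html_to_xhtml(html: str, executable: str = "tidy") -> str:
    """Pipe HTML through tidy and return the XHTML it writes to stdout."""
    result = subprocess.run(
        tidy_command(executable),
        input=html,
        capture_output=True,
        encoding="utf-8",
    )
    # Tidy exits with 1 for warnings and 2 for errors, so a non-zero status
    # alone is not treated as failure; an empty result is.
    if not result.stdout:
        raise RuntimeError(f"tidy produced no output: {result.stderr.strip()}")
    return result.stdout
```

A crawler would call html_to_xhtml on each downloaded page before handing the result to an XML parser or stylesheet.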
XSL is a language for expressing stylesheets. It consists of a language for transforming XML
documents and an XML vocabulary
for specifying formatting semantics. XSL Transformations (XSLT) is the transformation language used to
implement the stylesheet. XSL specifies the styling of an XML document by using XSLT to describe how the
document is transformed into another XML document that uses the formatting vocabulary.
An XSL stylesheet is used to extract and format the data in the source HTML using XSLT. In the
stylesheet, XPath location paths define where to look for specific HTML elements.
The problem with XPath is that when the HTML changes, a location path may no longer point to the correct place in the markup. Therefore
any change in the source HTML could require changes to the stylesheet. However, maintaining
a stylesheet is easier than constantly tweaking parsing code, and maintenance can be performed by people with basic programming skills.
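The fragility of location paths is easy to demonstrate. The sketch below does not use XSLT; as a stand-in it uses the limited XPath subset built into Python's xml.etree.ElementTree. The page markup is invented for the example and is not the layout of any real board. A positional path and an attribute-based path both find the subject in the original page, but only the latter survives a small redesign.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Tidy's XHTML output places elements in the XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Invented page layout, for illustration only.
ORIGINAL = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><td class="subject">Earnings call</td></tr>
      <tr><td class="author">nick42</td></tr>
    </table>
  </body>
</html>"""

# The same page after the site adds a row above the subject.
REDESIGNED = ORIGINAL.replace(
    "<table>", "<table><tr><td>Sponsored link</td></tr>")

# Positional path: the first row of the table holds the subject.
BY_POSITION = "x:body/x:table/x:tr[1]/x:td"
# Attribute-based path: any cell marked with class="subject".
BY_CLASS = ".//x:td[@class='subject']"


def extract(xhtml: str, path: str) -> Optional[str]:
    """Return the text of the first element matching the path."""
    return ET.fromstring(xhtml).findtext(path, namespaces=XHTML_NS)


before_position = extract(ORIGINAL, BY_POSITION)
before_class = extract(ORIGINAL, BY_CLASS)
after_position = extract(REDESIGNED, BY_POSITION)   # now picks the wrong cell
after_class = extract(REDESIGNED, BY_CLASS)         # still finds the subject
```

Paths anchored on attributes or ids tend to need fewer repairs than purely positional ones, although any path can break if the site changes enough.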
Creating a stylesheet can be time-consuming, but using an XML authoring
tool can expedite development. A popular XML authoring tool is XML Spy from Altova. XML Spy is very
powerful, and using it makes it easier to develop, test and maintain XSL and other XML assets.
In conjunction with an XML authoring tool, a repeatable process is needed to create stylesheets. Listing 4 illustrates
this process. The process starts with downloading an HTML page from a message board. This page is then used to
define the XSL template. The developer processes the HTML with HTML Tidy to create an XHTML document. The XSL template is then
applied to the XHTML, and the results of the transformation are checked against a target schema and date formatting rules. Once a
template is tested, it can be used to extract data from any message on the board. See Listing 5 for a sample
stylesheet produced with XML Spy that transforms a Yahoo message board page.
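The checking step can itself be scripted. The sketch below is a hand-written check rather than a real schema validator: it assumes the target structure is the one in Listing 2 and that the date rule is an ISO 8601 timestamp with a UTC offset, as in the listings. The function name check_message is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import List


def check_message(xml_text: str) -> List[str]:
    """Return a list of problems found in a transformed message document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems: List[str] = []
    if root.tag != "message":
        problems.append(f"root element is <{root.tag}>, expected <message>")
    if not root.get("id"):
        problems.append("missing id attribute")

    date = root.get("date")
    if date is None:
        problems.append("missing date attribute")
    else:
        try:
            if datetime.fromisoformat(date).tzinfo is None:
                problems.append("date has no UTC offset")
        except ValueError:
            problems.append(f"date is not ISO 8601: {date!r}")

    author = root.find("author")
    if author is None or not author.get("id"):
        problems.append("missing <author id=...>")
    for name in ("subject", "body"):
        if root.find(name) is None:
            problems.append(f"missing <{name}>")
    return problems


GOOD = ('<message id="56" date="1999-05-31T13:20:00-05:00"><parent id="55"/>'
        '<author id="nick"/><subject>s</subject><body>b</body></message>')
BAD = ('<message id="56" date="5/31/99 1:20 pm">'
       '<author id="nick"/><subject>s</subject></message>')

good_problems = check_message(GOOD)
bad_problems = check_message(BAD)
```

Running a check like this over a handful of sample pages is a quick way to notice when a board's HTML has changed and the stylesheet needs attention.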
Each stylesheet needs to be integrated with a crawler, so there is still a lot left to do. In order to use
the stylesheet, the crawler needs to automate the process of transforming a web page and storing the result
somewhere. Many XML libraries can handle this task, and they are available for most popular programming
languages. In Java, Xerces and Xalan can be used to
handle the XML chores and are available for free.
In conclusion, XSL and supporting XML technologies offer an alternative approach to extracting data from message
board pages. While it might take several extra steps to build a stylesheet, in the end you will
have a structured and standardized approach to collecting online discussion. Hopefully one day all online communities will
take Google's approach and offer an API to query for messages and have the
results returned in XML. Unfortunately we are not there yet, but by using XML you will be ready.