Document Similarity w/ Hadoop
February 3, 2009. Posted by Andre Vellino in CISTI, Information retrieval, Open Source, Search.
For a while now, I have been wanting to calculate an exact “pairwise document similarity” measure for a large (~8M item) corpus of full-text scientific articles. With 8M documents that means on the order of n(n−1)/2 ≈ 3.2 × 10^13 candidate pairs, and I discovered that even with good caching strategies, the obvious sequential methods just weren’t feasible for such a large collection.
Since this problem is clearly parallelizable, I tried to do it with Hadoop – Apache’s Java implementation of Google’s MapReduce. As a starting point, Glen pointed me to a nice short paper on how to compute pairwise document similarity using MapReduce (by Elsayed, Lin and Oard). Here’s what I learned from the experience.
It is pretty easy to see just how useful Hadoop can be by installing it in standalone mode and executing the distributed “WordCount” example from the QuickStart guide. But doing anything more than that means learning not only how to translate your sequential algorithm into the MapReduce paradigm (often not a trivial exercise) but pretty much all the particulars of the Hadoop distributed file system (DFS) and the MapReduce API as well.
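For concreteness, here is roughly what the core of that WordCount example looks like, sketched against the newer org.apache.hadoop.mapreduce API (the QuickStart version may be written against the older org.apache.hadoop.mapred classes, so treat this as illustrative rather than verbatim):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the 1s emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

Once you can read that mapper/reducer pair, the rest of the learning curve is mostly about the DFS, job configuration, and input/output formats.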
It’s a complex system, but the net value is so high (fast, fault-tolerant distributed computing for large datasets – although Hadoop is not, it should be noted, “highly available”, since the NameNode is a single point of failure) that diving into the deep end is quite easy to justify. Nevertheless, for my first Hadoop application, I found it really helpful to have some extra documentation around: the knowledge you get from reading the project’s documentation and Wiki isn’t (yet) quite enough.
Fortunately, one of the committers to the Hadoop project, Tom White, is in the middle of writing an O’Reilly book, Hadoop: The Definitive Guide, which fills in many of the blanks. I predict that startups like the one he’s with, Cloudera, will have no shortage of customers.
After a lot of fiddling I got a functioning version of the implementation described in the Elsayed, Lin & Oard paper. I haven’t deployed MapReduce on multiple nodes yet (from reading the installation docs, I get the feeling this is non-trivial), but I did notice that you’d better have lots of disk space lying around – if running this in “pseudo-distributed mode” is any indication, anyway.
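To give a flavour of the algorithm: the paper’s approach first builds an inverted index (one MapReduce job that emits (term, (docId, weight)) postings) and then, in a second job, turns each term’s postings list into partial dot products between every pair of documents sharing that term. Below is a minimal sketch of that second phase, assuming the indexing job has already written lines of the form “term \t docId:weight docId:weight …” – the input format, parsing, and class names are my own simplifications, not the paper’s code:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairwiseSimilarity {

  // Map: for each term's postings list, emit a partial product
  // for every pair of documents that share the term.
  public static class PostingsMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Text pair = new Text();
    private final DoubleWritable product = new DoubleWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) return;
      String[] postings = fields[1].split(" ");
      for (int i = 0; i < postings.length; i++) {
        String[] a = postings[i].split(":");
        for (int j = i + 1; j < postings.length; j++) {
          String[] b = postings[j].split(":");
          // Canonical ordering so (A,B) and (B,A) land on the same key.
          String docPair = a[0].compareTo(b[0]) < 0
              ? a[0] + "|" + b[0] : b[0] + "|" + a[0];
          pair.set(docPair);
          product.set(Double.parseDouble(a[1]) * Double.parseDouble(b[1]));
          context.write(pair, product);
        }
      }
    }
  }

  // Reduce: sum the partial products -- the dot product of the
  // two documents' term vectors.
  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable v : values) sum += v.get();
      context.write(key, new DoubleWritable(sum));
    }
  }
}

Note that the mapper is quadratic in the length of each postings list, which foreshadows the disk-space surprise below.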
First I tested that the Hadooped algorithm generated the same results as the sequential one on a small collection of 25,000 articles, using a similarity measure based on the “keywords” field of each article (as opposed to the somewhat larger “abstract” or “full-text” fields). The Hadoop input file for this experiment was about 8 MB and contained about 23,000 terms. That run went pretty quickly (on the order of a few minutes on a commodity 2007 Dell server blade).
Then I tried it on the same collection of 25,000 articles, but used the terms in the “abstract” field as the term vectors for computing similarity. This yields a Hadoop input file of 112 MB containing about 92,000 terms.
No doubt because I didn’t do anything to change the default parameters for the MapReduce processes, the “intermediate” files generated by the “Map” phase took up more than 1,000x the size of the original input file (yes, 120 GB!).
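I haven’t verified this at scale, but two obvious knobs should shrink those intermediate files considerably: registering the reducer as a combiner (so partial products get summed on the map side before being spilled to disk) and compressing the map output. A hypothetical driver for the pairwise job sketched above might look like this – the configuration key is the name used in the 0.18/0.19-era releases, and the class names are the ones from my sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SimilarityDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output: trades some CPU for a lot of disk.
    conf.setBoolean("mapred.compress.map.output", true);

    Job job = new Job(conf, "pairwise similarity");
    job.setJarByClass(SimilarityDriver.class);
    job.setMapperClass(PairwiseSimilarity.PostingsMapper.class);
    // Using the reducer as a combiner pre-sums partial products on the map
    // side before they hit disk; legal because addition is associative
    // and commutative.
    job.setCombinerClass(PairwiseSimilarity.SumReducer.class);
    job.setReducerClass(PairwiseSimilarity.SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}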
In a nutshell: Hadoop is a pretty cool piece of software – albeit still “young”. I can only hope, for its future, that Yahoo’s support for it doesn’t bite the dust because of the recession.