jump to navigation

Document Similarity w/ Hadoop February 3, 2009

Posted by Andre Vellino in CISTI, Information retrieval, Open Source, Search.
trackback

hadoop-logoFor a while now, I have been wanting to calculate an exact “pairwise document similarity” measure for a large (~8M item) corpus of full-text scientific articles.  I tried a variety of obvious sequential methods and discovered that even with good caching strategies, this just wasn’t feasible for such a large collection.

Since this problem is clearly parallelizable, I tried to do it with Hadoop – Apache’s Java implementation of Google’s MapReduce.  As a starting point, Glen pointed me to a nice short paper on how to compute pairwise document similarity using MapReduce (by Elsayed, Lin and Oard). Here’s what I learned from the experience.

It is pretty easy to see just how useful Hadoop can be just by installing it in standalone mode and executing the distributed “WordCount” example from the QuickStart guide. But doing anything more than that means not only how to translate your sequential algorithm into the MapReduce paradigm (often not a trival excercise) but pretty much all the particulars of the Hadoop distributed file system (DFS) and MapReduce API as well.

It’s a complex system, but the net value is so high (fast, fault-tollerant distributed computing for large datasets – although Hadoop is not, it should be noted, “highly available” since there is a single point of failure) that diving into the deep end is quite easy to justify. Nevertheless, for my first Hadoop application, I found it really helpful to have some extra documentation around. The knowledge you get from reading the project’s documentation and Wiki isn’t (yet) quite enough.

Fortunately, one of the commiters tothe Hadoop project, Tom White, is in the middle of writing an O’Reilly book Hadoop: the Definitive Guide, which fills in many blanks. I predict that the startup companies like the one he’s with, Cloudera, will have no shortage of customers.

After a lot of fiddling I got a functioning version of the implementation described by Elsayed, Lin & Oard paper.  I haven’t deployed MapReduce on multiple nodes yet (from reading the installation docs, I’ve got the feeling this is non-trivial), but I did notice that you’d better have lots of disk space lying around – if running this in “pseudo distributed mode” is any indication, anyway.

First I tested that the hadooped algorithm generated the same results as the sequential one on a small collection of 25,000 articles using a similarity measure based on the “keywords” field for the article (as opposed to the somewhat larger “abstract” or “full-text” fields.) The Hadoop input file for this experiment was about 8Mb and contains about 23,000 terms. That went pretty quickly (on the order of a few minutes on a commodity 2007 Dell server blade.)

Then I tried it on the same collection of 25,000 articles but used the terms in the “abstract” field as the term-vectors to use for comparing similarity.  This yeilds an input file to Hadoop of 92,000 terms and an input file of 112 Mb.

No doubt because I didn’t do anything specific to change default parameters for the mapreduce processes, the “intermediate” files that are generated by the “Map” part took up more than 100x (yes, 120Gb!) the orginal input file!

In a nutshell: Hadoop a pretty cool piece of software – albeit still “young”.  I can only hope for its future that support for it from Yahoo doesn’t bite the dust because of the recession.

Comments»

1. Brett the James - February 25, 2009

So I’m curious if Hadoop provided useful data in terms of document similarity. Did you get that far with your test?

2. Andre Vellino - February 25, 2009

Thanks for the question Brett. I should have said something about that in t he original post.

The answer is yes – the doc-sim in Hadoop produces the right answers. I compared the Haddoop results with the equivalent sequential algorithm and they produced the same results on the test sets that I used. So it behaves correctly.

I didn’t go as far as using Haddoop on a large distributed cluster and for the 8.5M documents. That will be for another time.

3. Brett the James - February 26, 2009

I have a project where I need to search for duplicate documents among thousands of PDFs. I don’t know if Hadoop is the way to go for me, but it sounds like it has promise. Did you have to do a lot of coding to do the above test? If so, would you be amenable to sharing the code?

The bigger question would be: as someone starting from not even having Hadoop installed, how big a project was it to get where you are? Days? Weeks?

I sincerely appreciate any answers, as you are already saving me a lot of work.

- Brett

4. Andre Vellino - February 26, 2009

Sure I’d be happy to help. Contact me at andre.vellino@nrc.ca.

5. Cloud Computing, MapReduce y Hadoop » Blog Archive » Sistemas de recomendación con Hadoop - July 22, 2009

[...] paper “Pairwise Document Similarity in Large Collections with MapReduce“. Encontré una entrada en un blog detallando el uso de este algoritmo, y un tutorial que muestra cómo implementarlo usando Elastic [...]