jump to navigation

Document Similarity w/ Hadoop February 3, 2009

Posted by Andre Vellino in CISTI, Information retrieval, Open Source, Search.

hadoop-logoFor a while now, I have been wanting to calculate an exact “pairwise document similarity” measure for a large (~8M item) corpus of full-text scientific articles.  I tried a variety of obvious sequential methods and discovered that even with good caching strategies, this just wasn’t feasible for such a large collection.

Since this problem is clearly parallelizable, I tried to do it with Hadoop – Apache’s Java implementation of Google’s MapReduce.  As a starting point, Glen pointed me to a nice short paper on how to compute pairwise document similarity using MapReduce (by Elsayed, Lin and Oard). Here’s what I learned from the experience.

It is pretty easy to see just how useful Hadoop can be just by installing it in standalone mode and executing the distributed “WordCount” example from the QuickStart guide. But doing anything more than that means not only how to translate your sequential algorithm into the MapReduce paradigm (often not a trival excercise) but pretty much all the particulars of the Hadoop distributed file system (DFS) and MapReduce API as well.

It’s a complex system, but the net value is so high (fast, fault-tollerant distributed computing for large datasets – although Hadoop is not, it should be noted, “highly available” since there is a single point of failure) that diving into the deep end is quite easy to justify. Nevertheless, for my first Hadoop application, I found it really helpful to have some extra documentation around. The knowledge you get from reading the project’s documentation and Wiki isn’t (yet) quite enough.

Fortunately, one of the commiters tothe Hadoop project, Tom White, is in the middle of writing an O’Reilly book Hadoop: the Definitive Guide, which fills in many blanks. I predict that the startup companies like the one he’s with, Cloudera, will have no shortage of customers.

After a lot of fiddling I got a functioning version of the implementation described by Elsayed, Lin & Oard paper.  I haven’t deployed MapReduce on multiple nodes yet (from reading the installation docs, I’ve got the feeling this is non-trivial), but I did notice that you’d better have lots of disk space lying around – if running this in “pseudo distributed mode” is any indication, anyway.

First I tested that the hadooped algorithm generated the same results as the sequential one on a small collection of 25,000 articles using a similarity measure based on the “keywords” field for the article (as opposed to the somewhat larger “abstract” or “full-text” fields.) The Hadoop input file for this experiment was about 8Mb and contains about 23,000 terms. That went pretty quickly (on the order of a few minutes on a commodity 2007 Dell server blade.)

Then I tried it on the same collection of 25,000 articles but used the terms in the “abstract” field as the term-vectors to use for comparing similarity.  This yeilds an input file to Hadoop of 92,000 terms and an input file of 112 Mb.

No doubt because I didn’t do anything specific to change default parameters for the mapreduce processes, the “intermediate” files that are generated by the “Map” part took up more than 100x (yes, 120Gb!) the orginal input file!

In a nutshell: Hadoop a pretty cool piece of software – albeit still “young”.  I can only hope for its future that support for it from Yahoo doesn’t bite the dust because of the recession.


1. Brett the James - February 25, 2009

So I’m curious if Hadoop provided useful data in terms of document similarity. Did you get that far with your test?

2. Andre Vellino - February 25, 2009

Thanks for the question Brett. I should have said something about that in t he original post.

The answer is yes – the doc-sim in Hadoop produces the right answers. I compared the Haddoop results with the equivalent sequential algorithm and they produced the same results on the test sets that I used. So it behaves correctly.

I didn’t go as far as using Haddoop on a large distributed cluster and for the 8.5M documents. That will be for another time.

3. Brett the James - February 26, 2009

I have a project where I need to search for duplicate documents among thousands of PDFs. I don’t know if Hadoop is the way to go for me, but it sounds like it has promise. Did you have to do a lot of coding to do the above test? If so, would you be amenable to sharing the code?

The bigger question would be: as someone starting from not even having Hadoop installed, how big a project was it to get where you are? Days? Weeks?

I sincerely appreciate any answers, as you are already saving me a lot of work.

– Brett

4. Andre Vellino - February 26, 2009

Sure I’d be happy to help. Contact me at andre.vellino@nrc.ca.

5. Cloud Computing, MapReduce y Hadoop » Blog Archive » Sistemas de recomendación con Hadoop - July 22, 2009

[…] paper “Pairwise Document Similarity in Large Collections with MapReduce“. Encontré una entrada en un blog detallando el uso de este algoritmo, y un tutorial que muestra cómo implementarlo usando Elastic […]

6. Andreas - March 15, 2010

This seems to be very good work.
Can you please send me the code you used?

Thank you.

7. dg - April 23, 2010

I did try to implement the *doc-to-doc similarity measurement* (in my case, it was a text mining job where I had to measure similarities between categories (each category has >1000 documents and so on) . It runs pretty fine for a smaller corpus. The problem starts occurring when I started experiments with a bigger (typically > 100k categories) corpus. The *intermediate* files literally filled up the server place! I surmise the mappers output these files (before the sorting step perhaps?) but I don’t have any clues how to overcome this. As Andre commented – yes, the intermediate files take more than 100 times the input-corpus. Anyone has quick ideas/pointers/solutions?

8. Andre Vellino - April 23, 2010

Wish I could help with that, but I can’t. I ended up giving up on the Hadoop approach for the problem and looked to alternatives. You might consider “Semantic Vectors”:

9. Christian - August 12, 2010

Hadoop does allow for compressing its map output file. You can also compress the input files themselves, requiring less space and making things run a bit more quickly on clusters since some of the time is spent on transferring the file blocks to the appropriate nodes.

10. sudheer1313 - October 23, 2013

thanks for this valuble information and itis useful for us .Hadoop online trainings also provides the best Hadoop online training classes in India.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: