jump to navigation

Review: “Mahout in Action” December 22, 2011

Posted by Andre Vellino in Book Review, Collaborative filtering, Data Mining, Java, Open Source, Recommender service.
trackback

In early September 2010 (I’m embarassed to count many months ago that was!) I received an Early Access (PDF) copy of “Mahout in Action” (MIA) from Manning Publications and asked to write a review. There have been 4 major updates to the book (now no longer “early access”!) since then and although it is too late to fulfill their purpose in giving me an early access to review (no doubt a supportive quote for the dust jacket or web site), I thought I’d nevertheless post my belated notes.

Mahout is an Apache project that develops scalable machine learning libraries for recommendation, clustering and classification. Like many other such software-documentation “in Action” books for Apache projects (Lucene / Hadoop / Hibernate / Ajax, etc.), the primary purpose of MIA is to complement the existing software documentation with both an explanatory guide for how to use these libraries and some practical examples of how they would be deployed.

First I want to ask: “how does one go about reviewing such a book”? Is it possible to dissassociate one’s opinion about the book itself from one’s opinion of the software? If the software is missing an important algorithm, does this impugn the book in any way?

The answers to these questions are, I think, “yes” and “no” respectively. Hence, the following comments assess the book on its own merits and in relation to the software that it documents, not in relation to the machine learning literature at large. Indeed, the fact that this book is not a textbook on or an authoritative source for machine learning is made quite explicit at the beginning of the book and the authors make no claim at being experts in the field of Machine Learning.

It’s important to understand that Mahout came about in part as a refactoring excercise in the Apache Lucene project, since several modules in Lucene use information retrieval techniques such as vector based models for document semantics (see the survey paper by Peter Turney and Patrick Pantel “From Frequency to Meaning: Vector Space Models of Semantics“). The amalgamation of those modules with the open source collaborative filtering system (formerly called Taste) by co-author Sean Owen yielded the foundation for Mahout.

Thus, if  there are gaps in Mahout software it is an accident of history more than a design flaw.  Like most software – especially open-source software – Mahout is still “under construction”, as evidenced by its current version number (“0.5”). Even though many element are quite mature there are also several missing elements and whatever lacunae there are should be considered as an opportunity to contribute and improve this library rather than to criticize it.

One obvious source for comparison is Weka – also an open-source machine learning library in Java. The book associated with this library – Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian H. Witten, Eibe Frank – was published in 2005 and has a much more pedagogical purpose than Mahout in Action. In contrast with MIA, “Data Mining” is much more of an academic book, published by academic researchers, whose purpose is to teach readers about Machine Learning.  In that way, these two books are complimentary, particularly as there are no algorithms devoted to recommendations in Weka and many more varieties of classification and clustering algorithms in Weka than in Mahout.

The Mahout algorithms that are discussed in MIA include the following.

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier

The integration of Mahout with Apache’s implementation of MapReduce – Hadoop – is no doubt the unique characteristic of this software. If you want to use a distributed computing platform to implement these kinds of algorithms, Mahout and MAI is the place to start.

On its own terms, then, how does the book fare? It is fair to say – for the quotable extract – that Mahout in Action is an indispensible guide to Mahout! I wish I had had this book 5 years ago when I was getting to grips with open source collaborative filtering recommenders!

P.S. This book fits clearly in the business model for open source Apache software – write great and useful software for free, but make the users pay for the documentation!  Which is only fair, I think, since $20 or so is not much at all for such a wealth of well-written software! The same can be said for Weka, whose 303 pages of software documentation still requires the book to be useful.

Comments»

1. Gareth Palidwor - January 3, 2012

I hadn’t realized that Mahout is so closely related to Lucene. I’ve tried unsuccessfully to adjust the vector based similarity models in the Lucene internals. I’ll see if it’s possible in Mahout.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: