Review: “Mahout in Action”
December 22, 2011. Posted by Andre Vellino in Book Review, Collaborative filtering, Data Mining, Java, Open Source, Recommender service.
In early September 2010 (I’m embarrassed to count how many months ago that was!) I received an Early Access (PDF) copy of “Mahout in Action” (MIA) from Manning Publications and was asked to write a review. There have been 4 major updates to the book (now no longer “early access”!) since then, and although it is too late to fulfill their purpose in giving me early access for review (no doubt a supportive quote for the dust jacket or web site), I thought I’d nevertheless post my belated notes.
Mahout is an Apache project that develops scalable machine learning libraries for recommendation, clustering and classification. Like many other such software-documentation “in Action” books for Apache projects (Lucene / Hadoop / Hibernate / Ajax, etc.), the primary purpose of MIA is to complement the existing software documentation with both an explanatory guide for how to use these libraries and some practical examples of how they would be deployed.
First I want to ask: “how does one go about reviewing such a book?” Is it possible to dissociate one’s opinion about the book itself from one’s opinion of the software? If the software is missing an important algorithm, does this impugn the book in any way?
The answers to these questions are, I think, “yes” and “no” respectively. Hence, the following comments assess the book on its own merits and in relation to the software that it documents, not in relation to the machine learning literature at large. Indeed, the fact that this book is not a textbook on or an authoritative source for machine learning is made quite explicit at the beginning of the book and the authors make no claim at being experts in the field of Machine Learning.
It’s important to understand that Mahout came about in part as a refactoring exercise in the Apache Lucene project, since several modules in Lucene use information retrieval techniques such as vector-based models for document semantics (see the survey paper by Peter Turney and Patrick Pantel, “From Frequency to Meaning: Vector Space Models of Semantics”). The amalgamation of those modules with the open source collaborative filtering system (formerly called Taste) by co-author Sean Owen yielded the foundation for Mahout.
Thus, if there are gaps in the Mahout software, it is an accident of history more than a design flaw. Like most software – especially open-source software – Mahout is still “under construction”, as evidenced by its current version number (“0.5”). Even though many elements are quite mature, several others are missing, and whatever lacunae there are should be considered an opportunity to contribute to and improve this library rather than a reason to criticize it.
One obvious source for comparison is Weka – also an open-source machine learning library in Java. The book associated with this library – Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian H. Witten and Eibe Frank – was published in 2005 and has a much more pedagogical purpose than Mahout in Action. In contrast with MIA, “Data Mining” is much more of an academic book, written by academic researchers, whose purpose is to teach readers about Machine Learning. In that way, these two books are complementary, particularly as there are no algorithms devoted to recommendations in Weka and many more varieties of classification and clustering algorithms in Weka than in Mahout.
The Mahout algorithms that are discussed in MIA include the following.
- Collaborative Filtering
- User and Item based recommenders
- K-Means, Fuzzy K-Means clustering
- Mean Shift clustering
- Dirichlet process clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Parallel Frequent Pattern mining
- Complementary Naive Bayes classifier
- Random forest decision tree based classifier
The integration of Mahout with Apache’s implementation of MapReduce – Hadoop – is no doubt the unique characteristic of this software. If you want to use a distributed computing platform to implement these kinds of algorithms, Mahout and MIA are the place to start.
On its own terms, then, how does the book fare? It is fair to say – for the quotable extract – that Mahout in Action is an indispensable guide to Mahout! I wish I had had this book 5 years ago when I was getting to grips with open source collaborative filtering recommenders!
P.S. This book fits clearly in the business model for open source Apache software – write great and useful software for free, but make the users pay for the documentation! Which is only fair, I think, since $20 or so is not much at all for such a wealth of well-written software! The same can be said for Weka, whose 303 pages of software documentation still require the book to be useful.
Mendeley Data vs. Netflix Data
November 2, 2010. Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.
Mendeley, the on-line reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender System Conference in Barcelona.
The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.
I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of nearly 3 orders of magnitude (a factor of about 440)!
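The sparsity measure just described is easy to reproduce. The following is a minimal sketch that uses only the rounded figures quoted in these posts (about 4.8M references, 50,000 users and 3.6M unique articles for Mendeley; about 100M ratings, 480K users and 17k titles for Netflix), so the results only approximate the values given above. The function name `density` is my own label for the measure.

```python
import math

def density(num_edges, num_users, num_items):
    """Edges present in the bipartite user-item graph, as a fraction of all possible edges."""
    return num_edges / (num_users * num_items)

# Rounded figures quoted in the posts (assumptions, not exact dataset counts).
mendeley = density(4.8e6, 50_000, 3.6e6)
netflix = density(100e6, 480_000, 17_000)

print(f"Mendeley density: {mendeley:.2e}")
print(f"Netflix density:  {netflix:.2e}")
print(f"Orders of magnitude apart: {math.log10(netflix / mendeley):.1f}")
```

With exact counts instead of rounded ones, the two densities should land closer to the figures quoted in the text.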
But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e. the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).
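One way to make this notion of connectedness concrete is to count the fraction of edges whose removal splits the graph (its "bridges"). The sketch below is a brute-force illustration on two toy graphs of my own invention, not a computation on the real datasets, and it would be far too slow for them.

```python
from collections import defaultdict

def count_components(nodes, edges):
    """Count connected components of an undirected graph by depth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components

def bridge_fraction(edges):
    """Fraction of edges whose deletion increases the number of components."""
    nodes = {node for edge in edges for node in edge}
    baseline = count_components(nodes, edges)
    bridges = 0
    for i in range(len(edges)):
        remaining = edges[:i] + edges[i + 1:]
        if count_components(nodes, remaining) > baseline:
            bridges += 1
    return bridges / len(edges)

# Toy "Netflix-like" graph: every user has rated every movie.
dense_edges = [(u, m) for u in ("u1", "u2", "u3") for m in ("m1", "m2")]
# Toy "Mendeley-like" graph: every article has exactly one reader.
sparse_edges = [("u1", "a1"), ("u1", "a2"), ("u1", "a3"), ("u2", "a4"), ("u2", "a5")]

print(bridge_fraction(dense_edges))
print(bridge_fraction(sparse_edges))
```

In the dense toy graph no single deletion disconnects anything, whereas in the sparse one every deletion strands an article.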
In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.6%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Fewer than 6% of the articles referenced were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]
By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17k titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)
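For anyone who wants to compute this kind of distribution from a list of (user, article) pairs, here is a minimal sketch. The toy edge list and the function name are assumptions for illustration; the last few lines simply redo the arithmetic on the counts quoted above.

```python
from collections import Counter

def readers_per_item_shares(edges):
    """Map 'number of distinct readers' to the share of items with that many readers."""
    readers = Counter(item for _user, item in set(edges))  # set() drops duplicate pairs
    histogram = Counter(readers.values())
    total_items = len(readers)
    return {k: histogram[k] / total_items for k in sorted(histogram)}

# Toy library: article a2 has three readers, the others one each.
toy_edges = [("u1", "a1"), ("u1", "a2"), ("u2", "a2"),
             ("u2", "a3"), ("u3", "a4"), ("u3", "a2")]
toy_shares = readers_per_item_shares(toy_edges)
print(toy_shares)

# Arithmetic check on the Mendeley counts quoted in the post.
total, one_reader, two_readers = 3_652_286, 3_055_546, 378_114
three_or_more = total - one_reader - two_readers
print(f"1 reader:   {one_reader / total:.1%}")
print(f"2 readers:  {two_readers / total:.1%}")
print(f"3+ readers: {three_or_more / total:.1%}")
```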
I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in Mendeley data. Some additional information such as article citation data or some content attribute such as the categories to which the articles belong is going to be needed to get any kind of reasonable accuracy from a recommender system.
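To see why, consider a set-overlap similarity such as the Jaccard index, which is one common choice for boolean (read / unread) data; it is used here purely as an illustration, not because Mendeley or anyone else is known to use it. When nearly every article has a single reader, almost every pair of users has an empty intersection and the similarity is zero. The libraries below are toy data.

```python
from itertools import combinations

def jaccard(items_a, items_b):
    """Jaccard similarity of two item sets: |A & B| / |A | B| (0.0 if both are empty)."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy "long tail" libraries: no article is shared between any two users.
long_tail = {"u1": {"a1", "a2"}, "u2": {"a3"}, "u3": {"a4", "a5"}}
# Toy "head-heavy" catalogue: users overlap on popular items.
head_heavy = {"u1": {"m1", "m2", "m3"}, "u2": {"m2", "m3", "m4"}}

for name, libraries in (("long tail", long_tail), ("head heavy", head_heavy)):
    for (ua, a), (ub, b) in combinations(sorted(libraries.items()), 2):
        print(name, ua, ub, jaccard(a, b))
```

With no non-zero similarities there are no neighbours to recommend from, which is where citation or category data would have to fill in.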
Or, it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems” published in the Proceedings of the National Academy of Sciences (PNAS) could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.
Are User-Based Recommenders Biased by Search Engine Ranking?
September 28, 2010. Posted by Andre Vellino in Collaborative filtering, Recommender, Recommender service, Search, Semantics.
I have a hypothesis (first emitted here) that I would like to test with data from query logs: user-based recommenders – such as the ‘bX’ recommender for journal articles – are biased by search-engine language models and ranking algorithms.
Let’s say you are looking for “multiple sclerosis” and you enter those terms as a search query. Some of the articles presented to you in the search results will likely be relevant, and you download a few of them during your session. This may be followed by another, semantically germane query that yields more article downloads. As a consequence, the usage-log (e.g. the SFX log used by ‘bX’) is going to register these articles as having been “co-downloaded”. Which is natural enough.
But if this happens a lot, then a collaborative filtering recommender is going to generate recommendations that are biased by the ranking algorithm and language model that produced the search-result ranking: even by PageRank, if you’re using Google.
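The mechanism can be sketched in a few lines. This is only an illustration of co-download counting in general: the session format, the article identifiers and the function names are hypothetical, and nothing here is claimed about how ‘bX’ actually processes SFX logs.

```python
from collections import Counter
from itertools import combinations

def co_download_counts(sessions):
    """Count how often each unordered pair of articles appears in the same session."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return counts

def recommend(article, counts, top_n=3):
    """Rank other articles by how often they were co-downloaded with `article`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == article:
            scores[b] += n
        elif b == article:
            scores[a] += n
    return [item for item, _count in scores.most_common(top_n)]

# Hypothetical sessions: the first two follow the same top-ranked search results.
sessions = [
    ["ms-review", "ms-mri", "ms-therapy"],
    ["ms-review", "ms-mri"],
    ["ms-mri", "vitamin-d"],
]
counts = co_download_counts(sessions)
print(recommend("ms-review", counts))
```

Whatever the search engine ranks first gets downloaded together most often, and so it comes back out as the top recommendation.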
In contrast, a citation-based (i.e. author-centric) recommender (such as Sarkanto) will likely yield more semantically diverse recommendations because co-citations will have (we hope!) originated from deeper semantic relations (i.e. non-obvious but meaningful connections between the items cited in the bibliography).
Sarkanto Scientific Search
September 13, 2010. Posted by Andre Vellino in Collaborative filtering, Digital library, Information retrieval, Recommender, Recommender service, Search.
A few weeks ago I finished deploying a version of a collaborative recommender system that uses only article citations as a basis for recommending journal articles. This tool allows you to search ~7 million STM (Scientific, Technical and Medical) articles up to Dec. 2009 and to compare citation-based recommendations (using the Synthese recommender) with recommendations generated by ‘bX’ (a user-based collaborative recommender from Ex Libris). You can try the Sarkanto demo and read more about how ‘bX’ and Sarkanto compare.
Note that I’m also using this implementation to experiment with the Google Translate API and the Microsoft Translator, both to expand queries into the other Canadian Official Language and to translate various bibliographic fields upon returning search results.
Visualizing Netflix Rental Patterns
January 10, 2010. Posted by Andre Vellino in Recommender service, User Interface, Visualization.
The recent NY Times mashup of Netflix rental data with geographical data based on postal-codes illustrates just how informative such visualizations can be.
Take for instance the distribution of rentals in Washington DC of the movie Milk – based on the true story of Harvey Milk, the American gay activist who fought for gay rights and became California’s first openly gay elected official…
… and compare that with the distribution of rentals for The Proposal – a (straight) romantic comedy.
I think you could be forgiven for concluding that residents in the downtown core of Washington DC are more socially liberal than in its residential suburbs (or, of course, that downtown residents prefer serious historical dramas to fictional comedies – or both).
Imagine if you could do the same thing with labeled Bayesian or LSA models that characterize classes or intersections of classes of Netflix users (e.g. class types that might be labeled something like “highly-educated-and-well-paid-government-employee” vs. “unemployed-manufacturing-blue-collar-worker”). That could form the basis of a nice explanation interface to a movie recommender system.
Nothing is “Miscellaneous”
October 17, 2009. Posted by Andre Vellino in Classification, Collaborative filtering, Information retrieval, Recommender service.
I think I now understand why David Weinberger’s book “Everything is Miscellaneous” is so provocative and sometimes enraging. It often sounds like he’s claiming that there is no point at all in classifying / categorizing information. No matter what you do, you’re going to get the category “wrong” because there is no such thing as a “right” category. Ergo, don’t even try – everything belongs in the category “Misc”.
I think Weinberger’s emperor has no clothes – in fact, he is asserting that nothing is “Miscellaneous”. Everything belongs to some category for someone; it’s just that it may not be the same category for everyone. A banana is likely to be a fruit for most people, but also a weapon for John Cleese. The point is: a banana is always a kind of something in every context.
So isn’t there a middle ground between banishing the Dewey decimal system (or indeed any other library classification system) and dumping every digital object into an undifferentiated pile? Indeed, there’s a lot to be said for a thoroughly well-understood standard system of classification, even a dated or bad one: at the very least, it is predictable. If you know how the meta-data for a given item was generated (e.g. call-number, subject category, keywords), you’ll be better able to retrieve it.
Furthermore, I expect there are some unforeseen problems with the democratization of knowledge generated by social tagging and recommender systems. Who’s doing the tagging? Who’s doing the bookmarking? High school students?
This is of particular concern to me in the context of scholarly articles. Are the numbers of co-downloads in a digital library primarily due to professors’ undergraduate course syllabi? Would professors’ syllabi be influenced by scholarly recommender systems? I expect that the recommender effect studied in Daniel Fleder’s “Blockbuster Culture’s Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity”, which shows that recommenders decrease aggregate diversity, would be an especially acute problem when sources of co-download behaviour are (relatively) few (e.g. professors’ course syllabi).
Conclusion? I think it matters what population you are drawing from for your metadata – be it social tagging or collaborative filtering recommendations. There is a point in relying on experts and big thinkers. They are more knowledgeable and credible than even the collective intelligence of the masses.