
Review: “Mahout in Action” December 22, 2011

Posted by Andre Vellino in Book Review, Collaborative filtering, Data Mining, Java, Open Source, Recommender service.
1 comment so far

In early September 2010 (I’m embarrassed to count how many months ago that was!) I received an Early Access (PDF) copy of “Mahout in Action” (MIA) from Manning Publications and was asked to write a review. There have been 4 major updates to the book (now no longer “early access”!) since then, and although it is too late to fulfill their purpose in giving me early access for review (no doubt a supportive quote for the dust jacket or web site), I thought I’d nevertheless post my belated notes.

Mahout is an Apache project that develops scalable machine learning libraries for recommendation, clustering and classification. Like many other such software-documentation “in Action” books for Apache projects (Lucene / Hadoop / Hibernate / Ajax, etc.), the primary purpose of MIA is to complement the existing software documentation with both an explanatory guide for how to use these libraries and some practical examples of how they would be deployed.

First I want to ask: “how does one go about reviewing such a book?” Is it possible to dissociate one’s opinion about the book itself from one’s opinion of the software? If the software is missing an important algorithm, does this impugn the book in any way?

The answers to these questions are, I think, “yes” and “no” respectively. Hence, the following comments assess the book on its own merits and in relation to the software that it documents, not in relation to the machine learning literature at large. Indeed, the fact that this book is neither a textbook on nor an authoritative source for machine learning is made quite explicit at the beginning of the book, and the authors make no claim to being experts in the field of Machine Learning.

It’s important to understand that Mahout came about in part as a refactoring exercise in the Apache Lucene project, since several modules in Lucene use information retrieval techniques such as vector-based models for document semantics (see the survey paper by Peter Turney and Patrick Pantel, “From Frequency to Meaning: Vector Space Models of Semantics”). The amalgamation of those modules with the open source collaborative filtering system (formerly called Taste) by co-author Sean Owen yielded the foundation for Mahout.

Thus, if there are gaps in the Mahout software, it is an accident of history more than a design flaw. Like most software – especially open-source software – Mahout is still “under construction”, as evidenced by its current version number (“0.5”). Even though many elements are quite mature, there are also several missing elements, and whatever lacunae there are should be considered an opportunity to contribute to and improve this library rather than to criticize it.

One obvious source for comparison is Weka – also an open-source machine learning library in Java. The book associated with this library – Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian H. Witten and Eibe Frank – was published in 2005 and has a much more pedagogical purpose than Mahout in Action. In contrast with MIA, “Data Mining” is much more of an academic book, written by academic researchers, whose purpose is to teach readers about Machine Learning. In that way, these two books are complementary, particularly as there are no algorithms devoted to recommendations in Weka and many more varieties of classification and clustering algorithms in Weka than in Mahout.

The Mahout algorithms that are discussed in MIA include the following.

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier

The integration of Mahout with Apache’s implementation of MapReduce – Hadoop – is no doubt the unique characteristic of this software. If you want to use a distributed computing platform to implement these kinds of algorithms, Mahout and MIA are the place to start.

On its own terms, then, how does the book fare? It is fair to say – for the quotable extract – that Mahout in Action is an indispensable guide to Mahout! I wish I had had this book 5 years ago when I was getting to grips with open source collaborative filtering recommenders!

P.S. This book fits clearly in the business model for open source Apache software – write great and useful software for free, but make the users pay for the documentation! Which is only fair, I think, since $20 or so is not much at all for such a wealth of well-written software! The same can be said for Weka, whose 303 pages of software documentation still require the book to be useful.

CISTI Sciverse Gadget App December 13, 2011

Posted by Andre Vellino in CISTI, Digital library, General, Information retrieval, Open Access.

Betwixt the jigs and the reels, and with the help of several people at CISTI and Elsevier, I developed a (beta) Sciverse gadget that gives searchers and researchers a window on CISTI’s electronic collection by taking the search term entered in Elsevier Hub and providing them with CISTI’s search results from a database of over 20 million journal articles.

Next year, I plan to follow up with another Sciverse gadget for my citation-based recommender that uses the full power of Elsevier’s API into its collection content.

I want to commend all and sundry at Sciverse Applications for this initiative. Opening up bibliographic data and providing developers with a developer platform (a customized version of Google’s OpenSocial platform) is exactly the right kind of thing to do, both to benefit third parties (they get access to an otherwise closed and proprietary data) and to enhance their own search and discovery environment.

There are already several advanced and interesting applications on Sciverse. My favourites are: Altmetric (winner of the Science Challenge prize – see YouTube demo video below), NextBio’s Prolific Authors and Elsevier’s Table Download.

And there will be more to come. An open marketplace like this where the principles of variation and natural selection can operate will, I predict, make for a richer diversity of useful search and discovery tools than any single organization can develop on its own.

What is ‘Data’? June 14, 2011

Posted by Andre Vellino in Data, Data Mining, Information retrieval.
3 comments

“What does ‘data’ mean to you?” I innocently asked various participants at JCDL 2011 today. I had just come out of a very interesting panel discussion entitled “Big Data, Big Deal?” at which most of the discussion was about large amounts of proprietary text at http://www.hathitrust.org/ (some of the discussion was also about large amounts of music in the SALAMI project at McGill).

Now I am very interested in text, text retrieval (and music IR too) and I found the panel discussion most rewarding. But it wasn’t about what I had been expecting it to be about (from the title) and I was perplexed by this use of the term “data” in this context. After all, the subtitle of the JCDL 2011 conference is “Bringing Together Scholars, Scholarship and Research Data”. So the context for “data” was (for me) “research data”, in a sense of the term that is pretty much the same as in the first 3 sentences of the Wikipedia entry for Data:

The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of “datum”) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived.

So I was somewhat taken aback by the argument that ensued. Everyone, it seems (except me), was quite happy to speak of “Big data” and “large amounts of text” as synonymous.  As though the streams of bytes that are common to readings from an NMR spectrometer, digital music and electronic journal articles were in all significant respects indistinguishable.

Of course, large volumes of byte-sequences share some kinds of problems like storage, preservation and search. But “text data” is a different kind of beast, isn’t it? For one thing, text typically has meaning – cognitive content that is different from, say, music or images or spreadsheets of temperature variations in Glasgow over the past 500 years. It has more structure too, as evidenced by how efficiently it compresses and how (relatively) easy it is to search.

I’m happy to speak of data about text that is inferred by the act of mining text.  Word frequencies, ngrams, term clusters, sentiment categories etc. fit the definition of “data” above. Even the textual “meta-data” about text is data of a certain kind. But the text itself just doesn’t seem to be that kind of thing (qualitative or quantitative attributes of a variable).

Learning from Watson February 19, 2011

Posted by Andre Vellino in Artificial Intelligence, Information retrieval, Search, Semantics, Statistical Semantics.
2 comments

Now that Watson has convincingly demonstrated that machines can perform some natural language tasks more effectively than humans can (see a rerun of part of Day 1 of the Jeopardy contest), what is the proper conclusion to be drawn from it?

Should we join hands with “confederates” like Brian Christian and rally against the invasion of smart machines? (See his recent piece in the Atlantic and listen to his recent radio interview on CBC.)

Or do we conclude that machines are now (or soon will be) sentient and deserve to be spoken to with respect for their moral standing (see Peter Singer’s article “Rights for Robots”)? Or should we, like NSERC Gold Medal Award winner Geoffrey Hinton, be scared about the social consequences (in the long term) of intelligent robots designed to replace soldiers (listen to his interview on the future of AI machines on CBC’s Quirks and Quarks)?

Before coming to any definite conclusion about how “like” us machines can be, I think we should consider how these machines do what they do.  The survey paper in AI Magazine about the design of “DeepQA” by the Watson team gives some indications of the general approach:

DeepQA is a massively parallel, probabilistic evidence-based architecture. For the Jeopardy Challenge, we use more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses….

The overarching principles in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and integration of shallow and deep knowledge.

Is this the right model for creating artificial cognition? Probably not. As Maarten van Emden and I argue in a recent paper on the Chinese Room argument and the “Human Window”, the question of whether a computer is simulating cognition cannot be decided by how effectively a computer solves a chess puzzle (for instance) but rather by the mechanism that it uses to achieve the end.

In this instance DeepQA uses and combines a number of different techniques from NLP, machine learning, distributed processing and decision theory – which is not likely to be an accurate representation of what humans actually do but it is undeniably successful at that task (see this talk on YouTube about how IBM addressed the Jeopardy problem).

Geoff Hinton (in the radio interview mentioned above) speculates that Watson is a feat of special-purpose engineering but that the general-purpose solution – a large neural network that simulates the learning abilities of the brain – is what the project of AI is really about.

What we suggest in our Human Window paper is that one criterion we can use to determine whether machines are performing adequate simulations of what humans do is whether or not humans are able to follow the steps that the machine is undertaking. On that criterion, I think it’s safe to say that Watson – although very impressive – isn’t quite there yet.

P.S. If you have the patience, I recommend watching a BBC debate from 1973 between Sir James Lighthill, John McCarthy and Donald Michie about whether AI is possible. The context of this video is the “Lighthill Affair” in 1972, recently chronicled on van Emden’s blog (note that the audio on this thumbnail video is rather out of synch!).

It’s amazing how spectacularly wrong an amateur in artificial intelligence (Prof. Lighthill was an applied mathematician specializing in fluid dynamics) can be about the possibility of machines simulating intelligent behaviour. It is a real tragedy that Sir Lighthill’s ideological biases had such disastrous consequences for AI research funding in the UK. The attitude of Sir Lighthill reminds me of Samuel Wilberforce’s objections to Darwin’s theory of evolution. I find it astonishing that this BBC debate was so civilized in its demeanour.

Mendeley Data vs. Netflix Data November 2, 2010

Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.
8 comments

Mendeley, the on-line reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender System Conference in Barcelona.

The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.

I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of almost 3 orders of magnitude!
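As a rough check on those figures, here is a minimal Python sketch of the edges-over-possible-edges measure. It assumes that the ~4.8M article references are the edges of the Mendeley graph and that the ~100M Netflix ratings are the edges of the Netflix graph; the function name is mine, and the inputs are the rounded counts quoted in these posts, so the results are only approximate.

```python
def bipartite_density(num_edges, num_users, num_items):
    """Fraction of all possible user-item edges that are actually present."""
    return num_edges / (num_users * num_items)


# Rounded counts quoted in the post (assumption: one reference/rating = one edge).
mendeley = bipartite_density(num_edges=4.8e6, num_users=50_000, num_items=3.6e6)
netflix = bipartite_density(num_edges=100e6, num_users=480_000, num_items=17_000)

print(f"Mendeley density: {mendeley:.2e}")
print(f"Netflix density:  {netflix:.2e}")
print(f"Netflix / Mendeley: {netflix / mendeley:.0f}x")
```

With these rounded inputs the two densities land close to the values quoted above, and the ratio between them is a few hundred.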

But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e.  the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).

In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.6%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Fewer than 6% of the articles referenced were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]
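The shares follow directly from the three counts. A small Python sketch of the arithmetic (the variable and function names are mine):

```python
total_articles = 3_652_286   # unique articles in the dataset
one_reader = 3_055_546       # referenced by exactly 1 user
two_readers = 378_114        # referenced by exactly 2 users

# Whatever is left must have been referenced by 3 or more users.
three_or_more = total_articles - one_reader - two_readers


def share(count, total=total_articles):
    """Percentage of all unique articles represented by `count`."""
    return 100.0 * count / total


for label, count in [("1 user", one_reader),
                     ("2 users", two_readers),
                     ("3+ users", three_or_more)]:
    print(f"{label:>8}: {count:>9,} articles ({share(count):.2f}%)")
```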

By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17K titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)

I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in Mendeley data. Some additional information such as article citation data or some content attribute such as the categories to which the articles belong is going to be needed to get any kind of reasonable accuracy from a recommender system.
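To make the worry concrete, here is a toy Python sketch. The four user libraries are invented for illustration, and Jaccard overlap of reader sets is my choice of similarity measure (one common option for boolean data), not anything prescribed by the Mendeley paper. An article with a single reader can only be similar to other articles in that same reader's library; its similarity to everything else is zero.

```python
from itertools import combinations

# Hypothetical toy data: user -> set of article ids in their library.
libraries = {
    "u1": {"a", "b"},
    "u2": {"a", "c"},
    "u3": {"d"},
    "u4": {"e"},
}


def readers_by_article(libraries):
    """Invert user -> articles into article -> set of users who hold it."""
    readers = {}
    for user, articles in libraries.items():
        for article in articles:
            readers.setdefault(article, set()).add(user)
    return readers


def jaccard(x, y):
    """Size of the intersection over size of the union of two sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


readers = readers_by_article(libraries)

# Item-item similarity for every unordered pair of articles.
similar = {
    (i, j): jaccard(readers[i], readers[j])
    for i, j in combinations(sorted(readers), 2)
}

# Articles with no non-zero similarity to any other article.
isolated = [
    article for article in sorted(readers)
    if all(score == 0.0 for pair, score in similar.items() if article in pair)
]

nonzero = sum(1 for score in similar.values() if score > 0.0)
print(f"{nonzero} of {len(similar)} article pairs have non-zero similarity")
print("articles with no neighbours at all:", isolated)
```

In this toy graph only the article shared by two users links anything together; the rest of the similarity matrix is empty, which is the situation most of the Mendeley articles would be in.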

Or, it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems”, published in the Proceedings of the National Academy of Sciences (PNAS), could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.

Ex Libris ‘bX’ Recommender Promo Video October 5, 2010

Posted by Andre Vellino in Collaborative filtering, Recommender.
2 comments

I stumbled across this Ex Libris promo video for its ‘bX’ recommender yesterday. Having done quite a few of these use-case demo scenarios to “show the value”, I appreciate how hard it is to pitch a relatively complex idea in straightforward terms. I think it does a pretty good job too, notwithstanding the slightly over-the-top-happiness tenor of the whole thing.

At the risk of repeating myself, though, there’s one thing that the video glosses over. SFX logs are, effectively, click-logs, and clicks have two sources: search engine results and ‘bX’ recommendations themselves. Hence ‘bX’ recommendations are more likely to be “semantically homogeneous” (although less so than pure search results) because the data they derive from is biased by search-engine ranking. The proportion of SFX traffic that is generated by the recommender itself further narrows the semantic diversity of recommendations.
