
Is Clippy the Future? February 8, 2013

Posted by Andre Vellino in Artificial Intelligence, Collaborative filtering, Data Mining.

The student-led Information without Borders conference that I attended at Dalhousie yesterday was truly excellent – as much for its organization (all by students!) as for its diverse topics: the future of libraries, cloud computing, recommender systems, SciVerse apps and the foundations for innovation.

At the panel discussion in which I participated, I suggested that to predict the future one need only look at the past. To predict the iPad one needed only look at the Apple Newton (which died in 1998). What was the analog, I wondered, for an information retrieval tool, now dead and buried, that might still evolve into something we all want in the field of information management?

I proposed that the future of information retrieval might be something like an evolved Office Assistant (affectionately nicknamed “Clippy”) – the infamous, now deceased Microsoft Paperclip that assisted you in understanding and navigating Microsoft products.

My vision for a next generation Clippy was clearly not well articulated since it prompted the following tweet from Stephen Abram:


I think that Siri (about which I posted a few years ago) belongs to the old Clippy style of annoying, in-the-way-of-what-I-want-to-do applications. I am surprised it has survived so long and was promoted by Apple so strongly. I predict it will join Clippy, Google Wave and Google Glasses on the growing heap of unwanted technologies that were not ready for prime time.

Watson (who is now going to medical school, and about which I also posted a couple of years ago) is, however, just the sort of Natural Language Understanding component technology that I have in mind for an interactive, personal information assistant. When a computer that now costs three million dollars with 15 terabytes of RAM can fit in your pocket and cost $500, a Watson-like system that understands natural language queries will be an important component of Clippy++.

What neither Watson nor Siri has – and what I foresee in my crystal ball as the most significant attribute of “Clippy++” – is personalization and autonomy. What will make true personalization possible with “Clippy++” is our collective willingness to accept the intrusion of a mechanical supervisor that learns from our behaviour about what we want, need and expect.

This culture-shift is happening right now – we gladly and willingly disclose our information consumption habits to supervisory software and data-analytics engines in exchange for entertainment and social networking. It won’t be long before we’re willing to do that for serious, personalized information management purposes as well.

The key, though, is going to be the interaction – the dialog that we have with Clippy++ – and it will have to have explanations for its actions and recommendations. That’s going to be the hallmark of its evolution to Machina Sapiens.

Marissa Mayer Wants to Read Your Mind August 14, 2012

Posted by Andre Vellino in Collaborative filtering, Digital Identity, Personal identity.

At about minute 3 of Charlie Rose’s Green Room interview with Marissa Mayer, the newly minted CEO of Yahoo offers a vision of the mobile future and asks “How do we create a search without search? Can we figure out the information you need before you even have to ask?” And, she says excitedly, “that’s really like mind reading technology!”

The inference? Be prepared for Yahoo to read your mind!

I have been a proponent of personalization since 2000, when I worked on developing “Personal Identity Management” services at Nortel. The idea at the time was (for a telecom company) to enable IP devices (routers / gateways) to track / manage / control your on-line identity and provide identity services (single sign-on, personalization of news services, etc.) to the user.

This was conceived at about the time that Microsoft Hailstorm was being launched. The only fundamental difference was – which service provider – “network access” vs. “operating system” vs. “third party service” – would be the trusted source for managing your identity.

From a public relations point of view, Hailstorm and its successors, Microsoft Passport and Wallet, were a disaster. Invasion of privacy, identity theft, all the usual public anxiety buttons were pressed and Microsoft dropped a lot of these products – or at least gave them a makeover.

Yet, a few internet generations later, these ideas persist.  Google didn’t make a big PR campaign of it, but everything at Google is about personalization and localization as illustrated most graphically by the (dystopic?) Google Glasses video.

But – fortunately, I might add – I am noticing a (small) swing of the pendulum away from machine-learning, Netflix-style personalization towards a “how do you want it?” style of personalization.

For instance, Google News used to be fully and automatically biased towards your location. Since the summer of 2011, Google has given the end-user a great deal more control.

Marissa Mayer may want to read your mind, but I know that most people don’t want to have their minds read by machines. I think the trend towards greater user control will eventually spread to more personalization and recommender services. I hope so anyway.

Review: “Mahout in Action” December 22, 2011

Posted by Andre Vellino in Book Review, Collaborative filtering, Data Mining, Java, Open Source, Recommender service.

In early September 2010 (I’m embarrassed to count how many months ago that was!) I received an Early Access (PDF) copy of “Mahout in Action” (MIA) from Manning Publications and was asked to write a review. There have been 4 major updates to the book (now no longer “early access”!) since then and, although it is too late to fulfill their purpose in giving me early access to review it (no doubt a supportive quote for the dust jacket or web site), I thought I’d nevertheless post my belated notes.

Mahout is an Apache project that develops scalable machine learning libraries for recommendation, clustering and classification. Like many other such software-documentation “in Action” books for Apache projects (Lucene / Hadoop / Hibernate / Ajax, etc.), the primary purpose of MIA is to complement the existing software documentation with both an explanatory guide for how to use these libraries and some practical examples of how they would be deployed.

First I want to ask: “how does one go about reviewing such a book?” Is it possible to dissociate one’s opinion about the book itself from one’s opinion of the software? If the software is missing an important algorithm, does this impugn the book in any way?

The answers to these questions are, I think, “yes” and “no” respectively. Hence, the following comments assess the book on its own merits and in relation to the software that it documents, not in relation to the machine learning literature at large. Indeed, the fact that this book is not a textbook on or an authoritative source for machine learning is made quite explicit at the beginning of the book and the authors make no claim at being experts in the field of Machine Learning.

It’s important to understand that Mahout came about in part as a refactoring exercise in the Apache Lucene project, since several modules in Lucene use information retrieval techniques such as vector-based models for document semantics (see the survey paper by Peter Turney and Patrick Pantel, “From Frequency to Meaning: Vector Space Models of Semantics”). The amalgamation of those modules with the open source collaborative filtering system (formerly called Taste) by co-author Sean Owen yielded the foundation for Mahout.

Thus, if there are gaps in the Mahout software, it is an accident of history more than a design flaw. Like most software – especially open-source software – Mahout is still “under construction”, as evidenced by its current version number (“0.5”). Even though many elements are quite mature, there are also several missing elements, and whatever lacunae there are should be considered an opportunity to contribute to and improve this library rather than to criticize it.

One obvious source for comparison is Weka – also an open-source machine learning library in Java. The book associated with this library – Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian H. Witten and Eibe Frank – was published in 2005 and has a much more pedagogical purpose than Mahout in Action. In contrast with MIA, “Data Mining” is much more of an academic book, written by academic researchers, whose purpose is to teach readers about Machine Learning. In that way, these two books are complementary, particularly as there are no algorithms devoted to recommendations in Weka and many more varieties of classification and clustering algorithms in Weka than in Mahout.

The Mahout algorithms that are discussed in MIA include the following.

  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier

The integration of Mahout with Apache’s implementation of MapReduce – Hadoop – is no doubt the unique characteristic of this software. If you want to use a distributed computing platform to implement these kinds of algorithms, Mahout and MIA are the place to start.

On its own terms, then, how does the book fare? It is fair to say – for the quotable extract – that Mahout in Action is an indispensable guide to Mahout! I wish I had had this book 5 years ago when I was getting to grips with open source collaborative filtering recommenders!

P.S. This book fits clearly in the business model for open source Apache software – write great and useful software for free, but make the users pay for the documentation! Which is only fair, I think, since $20 or so is not much at all for such a wealth of well-written software! The same can be said for Weka, whose 303 pages of software documentation still require the book to be useful.

Mendeley Data vs. Netflix Data November 2, 2010

Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.

Mendeley, the on-line reference management software and social networking site for science researchers has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender System Conference in Barcelona.

The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.

I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of nearly 3 orders of magnitude!
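As a rough check of this measure, here is a minimal Python sketch that divides the number of observed user-item edges by the number of possible edges. It assumes the rounded counts quoted for the two datasets (about 50,000 users, 4.8M references and 3.6M unique articles for Mendeley; roughly 100M ratings, 480K users and 17K titles for Netflix), so the results only approximate the figures above. The function and variable names are illustrative.

```python
import math


def edge_density(num_edges, num_users, num_items):
    """Observed user-item edges divided by all possible user-item edges."""
    return num_edges / (num_users * num_items)


# Rounded counts taken from the text; these are approximations, not exact figures.
mendeley = edge_density(num_edges=4.8e6, num_users=50_000, num_items=3.6e6)
netflix = edge_density(num_edges=100e6, num_users=480_000, num_items=17_000)

# Gap between the two datasets, expressed in orders of magnitude (base 10).
orders = math.log10(netflix / mendeley)

print(f"Mendeley: {mendeley:.2e}")
print(f"Netflix:  {netflix:.2e}")
print(f"Gap:      {orders:.2f} orders of magnitude")
```

With these rounded inputs the gap comes out between two and three orders of magnitude.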

But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e.  the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).

In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.6%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Fewer than 6% of the articles were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]
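The shares above follow directly from the three counts. A short Python sketch of the arithmetic, using only the article counts quoted in this post (the variable names are illustrative):

```python
# Article counts quoted in the post for the Mendeley dataset.
total_articles = 3_652_286
one_reader = 3_055_546
two_readers = 378_114

# Whatever is left was referenced by three or more users.
three_plus_readers = total_articles - one_reader - two_readers

share_one = one_reader / total_articles
share_two = two_readers / total_articles
share_three_plus = three_plus_readers / total_articles

print(f"1 reader:   {share_one:.1%}")
print(f"2 readers:  {share_two:.1%}")
print(f"3+ readers: {share_three_plus:.1%}")
```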

By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17k titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)

I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in Mendeley data. Some additional information such as article citation data or some content attribute such as the categories to which the articles belong is going to be needed to get any kind of reasonable accuracy from a recommender system.
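To illustrate why, here is a minimal Python sketch on a hypothetical toy dataset (the user and article identifiers are invented). It computes an item-item Jaccard similarity from boolean "in my library" data, which is one common choice for unrated data and not necessarily what any particular recommender uses. Articles held by a single user who shares nothing with anyone else have no overlap with any other article, so an item-based recommender has nothing to work with for them.

```python
from itertools import combinations

# Hypothetical toy data: each user's library is a set of article ids.
libraries = {
    "u1": {"a1", "a2"},
    "u2": {"a1", "a3"},
    "u3": {"a4"},
    "u4": {"a5"},
}


def readers_by_article(libraries):
    """Invert user -> articles into article -> set of users."""
    readers = {}
    for user, articles in libraries.items():
        for article in articles:
            readers.setdefault(article, set()).add(user)
    return readers


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


readers = readers_by_article(libraries)

# Keep only article pairs that share at least one reader.
similar_pairs = {
    (x, y): jaccard(readers[x], readers[y])
    for x, y in combinations(sorted(readers), 2)
    if readers[x] & readers[y]
}

# Articles that never appear in a pair with non-zero similarity.
connected = {article for pair in similar_pairs for article in pair}
isolated = sorted(set(readers) - connected)

print(similar_pairs)
print(isolated)
```

In this toy example only the one article with two readers produces any non-zero similarities; the single-reader articles in otherwise unconnected libraries end up isolated.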

Or, it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems” published in the Proceedings of the National Academy of Sciences (PNAS) could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.

Ex Libris ‘bX’ Recommender Promo Video October 5, 2010

Posted by Andre Vellino in Collaborative filtering, Recommender.

I stumbled across this Ex Libris promo video for its ‘bX’ recommender yesterday. Having done quite a few of these use-case demo scenarios to “show the value”, I appreciate how hard it is to pitch a relatively complex idea in straightforward terms. I think it does a pretty good job too, notwithstanding the slightly over-the-top-happiness tenor of the whole thing.

At the risk of repeating myself, though, there’s one thing that the video glosses over. SFX logs are, effectively, click-logs and clicks have two sources: search engine results and ‘bX’ recommendations themselves. Hence ‘bX’ recommendations are more likely to be “semantically homogeneous” (although less so than pure search results) because the data they derive from is biased by search-engine ranking. The proportion of SFX traffic that is generated by the recommender itself further narrows the semantic diversity of recommendations.

Are User-Based Recommenders Biased by Search Engine Ranking? September 28, 2010

Posted by Andre Vellino in Collaborative filtering, Recommender, Recommender service, Search, Semantics.

I have a hypothesis (first emitted here) that I would like to test with data from query logs: user-based recommenders – such as the ‘bX’ recommender for journal articles – are biased by search-engine language models and ranking algorithms.

Let’s say you are looking for “multiple sclerosis” and you enter those terms as a search query. Some of the articles presented to you in the search results will likely be relevant and you download a few of them during your session. This may be followed by another, semantically germane query that yields more article downloads. As a consequence, the usage log (e.g. the SFX log used by ‘bX’) is going to register these articles as having been “co-downloaded”. Which is natural enough.

But if this happens a lot, then a collaborative filtering recommender is going to generate recommendations that are biased by the ranking algorithm and language model that produced the search-result ranking: even by PageRank, if you’re using Google.
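To make the mechanism concrete, here is a minimal Python sketch of a co-download recommender. It is an illustration under stated assumptions, not a description of how ‘bX’ actually works: the session data and article ids are hypothetical, and the scoring is a plain co-occurrence count. The point is that the recommender only ever sees pairs of articles that a search session surfaced together, so whatever the ranking algorithm favours is what gets counted.

```python
from collections import Counter
from itertools import combinations

# Hypothetical usage log: articles downloaded within each search session.
sessions = [
    ["ms-review", "ms-trial", "ms-imaging"],
    ["ms-review", "ms-trial"],
    ["ms-review", "vitamin-d"],
]


def co_download_counts(sessions):
    """Count how often each unordered pair of articles shares a session."""
    counts = Counter()
    for downloads in sessions:
        for pair in combinations(sorted(set(downloads)), 2):
            counts[pair] += 1
    return counts


def recommend(article, counts, top_n=3):
    """Rank other articles by how often they were co-downloaded with `article`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == article:
            scores[b] += n
        elif b == article:
            scores[a] += n
    return [item for item, _ in scores.most_common(top_n)]


counts = co_download_counts(sessions)
recommendations = recommend("ms-review", counts)
print(recommendations)
```

Here the article most often returned alongside “ms-review” in the same sessions comes out on top, regardless of whether a deeper semantic connection exists.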

In contrast, a citation-based (i.e. author-centric) recommender (such as Sarkanto) will likely yield more semantically diverse recommendations because co-citations will have (we hope!) originated from deeper semantic relations (i.e. non-obvious but meaningful connections between the items cited in the bibliography).