Is Clippy the Future? February 8, 2013
Posted by Andre Vellino in Artificial Intelligence, Collaborative filtering, Data Mining.
The student-led Information without Borders conference that I attended at Dalhousie yesterday was truly excellent – as much for its organization (all by students!) as for its diverse topics: the future of libraries, cloud computing, recommender systems, SciVerse apps and the foundations for innovation.
At the panel discussion in which I participated, I suggested that to predict the future one need only look at the past. To predict the iPad one needed only look at the Apple Newton (which died in 1998). What was the analog, I wondered, for an information retrieval tool, now dead and buried, that might still evolve into something we all want in the field of information management?
I proposed that the future of information retrieval might be something like an evolved Office Assistant (affectionately nicknamed “Clippy”) – the infamous, now deceased Microsoft Paperclip that assisted you in understanding and navigating Microsoft products.
My vision for a next generation Clippy was clearly not well articulated since it prompted the following tweet from Stephen Abram:
I think that Siri (about which I posted a few years ago) belongs to the old Clippy style of annoying and in-the-way-of-what-I-want-to-do applications. I am surprised it has survived so long and was promoted by Apple so strongly. I predict it will join Clippy, Google Wave and Google Glass on the growing heap of unwanted technologies that were not ready for prime-time.
Watson (who is now going to medical school, and about which I also posted a couple of years ago) is, however, just the sort of Natural Language Understanding component technology that I have in mind for an interactive, personal information assistant. When a computer that now costs three million dollars with 15 terabytes of RAM can fit in your pocket and cost $500, a Watson-like system that understands natural language queries will be an important component of Clippy++.
What neither Watson nor Siri has – and what I foresee in my crystal ball as the most significant attribute of “Clippy++” – is personalization and autonomy. What will make true personalization possible with “Clippy++” is our collective willingness to accept the intrusion of a mechanical supervisor that learns from our behaviour about what we want, need and expect.
This culture-shift is happening right now – we gladly and willingly disclose our information consumption habits to supervisory software and data-analytics engines in exchange for entertainment and social networking. It won’t be long before we’re willing to do that for serious, personalized information management purposes as well.
The key, though, is going to be the interaction – the dialog that we have with Clippy++ – and it will have to have explanations for its actions and recommendations. That’s going to be the hallmark of its evolution to Machina Sapiens.
Review: “Mahout in Action” December 22, 2011
Posted by Andre Vellino in Book Review, Collaborative filtering, Data Mining, Java, Open Source, Recommender service.
In early September 2010 (I’m embarrassed to count how many months ago that was!) I received an Early Access (PDF) copy of “Mahout in Action” (MIA) from Manning Publications and was asked to write a review. There have been 4 major updates to the book (now no longer “early access”!) since then and although it is too late to fulfill their purpose in giving me early access to review (no doubt a supportive quote for the dust jacket or web site), I thought I’d nevertheless post my belated notes.
Mahout is an Apache project that develops scalable machine learning libraries for recommendation, clustering and classification. Like many other such software-documentation “in Action” books for Apache projects (Lucene / Hadoop / Hibernate / Ajax, etc.), the primary purpose of MIA is to complement the existing software documentation with both an explanatory guide for how to use these libraries and some practical examples of how they would be deployed.
First I want to ask: “how does one go about reviewing such a book?” Is it possible to dissociate one’s opinion about the book itself from one’s opinion of the software? If the software is missing an important algorithm, does this impugn the book in any way?
The answers to these questions are, I think, “yes” and “no” respectively. Hence, the following comments assess the book on its own merits and in relation to the software that it documents, not in relation to the machine learning literature at large. Indeed, the fact that this book is not a textbook on or an authoritative source for machine learning is made quite explicit at the beginning of the book and the authors make no claim at being experts in the field of Machine Learning.
It’s important to understand that Mahout came about in part as a refactoring exercise in the Apache Lucene project, since several modules in Lucene use information retrieval techniques such as vector based models for document semantics (see the survey paper by Peter Turney and Patrick Pantel “From Frequency to Meaning: Vector Space Models of Semantics”). The amalgamation of those modules with the open source collaborative filtering system (formerly called Taste) by co-author Sean Owen yielded the foundation for Mahout.
Thus, if there are gaps in Mahout software it is an accident of history more than a design flaw. Like most software – especially open-source software – Mahout is still “under construction”, as evidenced by its current version number (“0.5”). Even though many elements are quite mature there are also several missing elements and whatever lacunae there are should be considered as an opportunity to contribute and improve this library rather than to criticize it.
One obvious source for comparison is Weka – also an open-source machine learning library in Java. The book associated with this library – Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) by Ian H. Witten and Eibe Frank – was published in 2005 and has a much more pedagogical purpose than Mahout in Action. In contrast with MIA, “Data Mining” is much more of an academic book, published by academic researchers, whose purpose is to teach readers about Machine Learning. In that way, these two books are complementary, particularly as there are no algorithms devoted to recommendations in Weka and many more varieties of classification and clustering algorithms in Weka than in Mahout.
The Mahout algorithms that are discussed in MIA include the following.
- Collaborative Filtering
- User and Item based recommenders
- K-Means, Fuzzy K-Means clustering
- Mean Shift clustering
- Dirichlet process clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Parallel Frequent Pattern mining
- Complementary Naive Bayes classifier
- Random forest decision tree based classifier
The integration of Mahout with Apache’s implementation of MapReduce – Hadoop – is no doubt the unique characteristic of this software. If you want to use a distributed computing platform to implement these kinds of algorithms, Mahout and MIA are the place to start.
On its own terms, then, how does the book fare? It is fair to say – for the quotable extract – that Mahout in Action is an indispensable guide to Mahout! I wish I had had this book 5 years ago when I was getting to grips with open source collaborative filtering recommenders!
P.S. This book fits clearly in the business model for open source Apache software – write great and useful software for free, but make the users pay for the documentation! Which is only fair, I think, since $20 or so is not much at all for such a wealth of well-written software! The same can be said for Weka, whose 303 pages of software documentation still require the book to be useful.
What is ‘Data’? June 14, 2011
Posted by Andre Vellino in Data, Data Mining, Information retrieval.
“What does ‘data’ mean to you?” I asked innocently to various participants at JCDL 2011 today. I had just come out of a very interesting panel discussion entitled “Big Data, Big Deal?” at which most of the discussion was about large amounts of proprietary text at http://www.hathitrust.org/ (some of the discussion was also about large amounts of music in the SALAMI project at McGill).
Now I am very interested in text, text retrieval (and music IR too) and I found the panel discussion most rewarding. But it wasn’t about what I had been expecting it to be about (from the title) and I was perplexed by this use of the term “data” in this context. After all, the subtitle of the JCDL 2011 conference is “Bringing Together Scholars, Scholarship and Research Data”. So the context for “data” was (for me) “research data” in the sense of the term that is pretty much the same as the first 3 sentences of the Wikipedia entry for Data:
The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data (plural of “datum”) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and then knowledge are derived.
So I was somewhat taken aback by the argument that ensued. Everyone, it seems (except me), was quite happy to speak of “Big data” and “large amounts of text” as synonymous. As though the streams of bytes that are common to readings from an NMR spectrometer, digital music and electronic journal articles were in all significant respects indistinguishable.
Of course, large volumes of byte-sequences share some kinds of problems like storage, preservation and search. But “text data” is a different kind of beast, isn’t it? For one thing, text typically has meaning – cognitive content that is different from, say, music or images or spreadsheets of temperature variations in Glasgow over the past 500 years. It has more structure too, as evidenced by how efficiently it compresses and how (relatively) easy it is to search.
I’m happy to speak of data about text that is inferred by the act of mining text. Word frequencies, n-grams, term clusters, sentiment categories etc. fit the definition of “data” above. Even the textual “meta-data” about text is data of a certain kind. But the text itself just doesn’t seem to be that kind of thing (qualitative or quantitative attributes of a variable).
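To make the distinction concrete, here is a minimal Python sketch of deriving “data about text” in the sense above: word frequencies and bigrams computed from a short passage. The naive tokenizer and the function names are my own illustrative assumptions, not part of any particular text-mining toolkit.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# An invented sample passage, used only for illustration.
text = "Data about text is data. The text itself is not data about text."
tokens = tokenize(text)

# Quantitative attributes of the variable "term" and the variable "bigram".
word_freq = Counter(tokens)
bigram_freq = Counter(ngrams(tokens, 2))

print(word_freq.most_common(3))
print(bigram_freq.most_common(2))
```

The counters are measurements of the text; the passage itself remains a string that has to be read to be understood.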
Mendeley Data vs. Netflix Data November 2, 2010
Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.
Mendeley, the on-line reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forth at the 2010 ACM Recommender Systems Conference in Barcelona.
The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read / unread or starred / unstarred) for about 50,000 (anonymized) users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.
I was gratified to note that this is almost exactly the user-item ratio (1:100) that I indicated in my poster at ASIS&T2010 was typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix – 1.18E-02 – that’s a difference of almost 3 orders of magnitude (a factor of roughly 440)!
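The sparsity measure just described can be sketched in a few lines of Python. The Mendeley counts are the rounded figures quoted in this post; the Netflix counts are the commonly published Netflix Prize totals, which I am assuming here because this post only gives them approximately. The function name is mine.

```python
import math


def edge_density(num_edges, num_users, num_items):
    """Edges present in the bipartite user-item graph divided by all possible edges."""
    return num_edges / (num_users * num_items)


# Rounded figures from the Mendeley paper as quoted in this post.
mendeley = edge_density(4_800_000, 50_000, 3_600_000)

# Assumed Netflix Prize totals: ratings, users, titles.
netflix = edge_density(100_480_507, 480_189, 17_770)

ratio = netflix / mendeley
print(f"Mendeley: {mendeley:.2E}")
print(f"Netflix:  {netflix:.2E}")
print(f"Netflix is {ratio:.0f} times denser ({math.log10(ratio):.1f} orders of magnitude)")
```

Because the Mendeley inputs are rounded, the result differs slightly in the last digit from the 2.66E-05 computed from the exact counts.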
But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e. the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).
In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.7%) were referenced by only 1 user and 378,114 were referenced by only 2 users. Fewer than 6% of the articles referenced were referenced by 3 or more users. [The most frequently referenced article was referenced 19,450 times!]
By comparison, in the Netflix dataset (which contains ~100M ratings from ~480K users on ~17K titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on Netflix data.)
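The arithmetic behind these shares is easy to check. The sketch below uses only the article counts quoted above; the variable names are mine, and the last figure (articles with at least two readers) is simply the portion of the catalogue that can produce any overlap between two users’ libraries.

```python
unique_articles = 3_652_286
one_reader = 3_055_546
two_readers = 378_114

# Everything that is left has three or more readers.
three_or_more = unique_articles - one_reader - two_readers


def share(count, total=unique_articles):
    """Fraction of all unique articles represented by `count`."""
    return count / total


print(f"1 reader:   {share(one_reader):.1%}")
print(f"2 readers:  {share(two_readers):.1%}")
print(f"3+ readers: {share(three_or_more):.1%}")
print(f"2+ readers: {share(two_readers + three_or_more):.1%}")
```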
I think that user or item similarity measures aren’t going to work well with the kind of distribution we find in Mendeley data. Some additional information such as article citation data or some content attribute such as the categories to which the articles belong is going to be needed to get any kind of reasonable accuracy from a recommender system.
Or, it could be that some method like the heat-dissipation technique introduced by physicists in the paper “Solving the apparent diversity-accuracy dilemma of recommender systems” published in the Proceedings of the National Academy of Sciences (PNAS) could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs (with no ratings information). We’ll have to try and see.
Scientific Data is Interpreted September 26, 2010
Posted by Andre Vellino in Data, Data Mining, Epistemology.
It must be a truism by now that there is no such thing as theory-free observation. Scientific data are necessarily tied up with the theories that are required to interpret them and which led to their discovery.
By analogy I would argue that scientific data sets are useless unless they are interpreted. There is no such thing as a useful “raw” data set.
Consider for instance the data on leap seconds from the National Research Council. It’s a simple enough table: there are three columns (Date, UTC Leap Seconds, and MJD) and only a few dozen rows. Here are two such rows:
DATE                       UTC Leap Seconds    MJD
2006-01-01 - 2009-01-01    33                  53 736 - 54 832
1999-01-01 - 2006-01-01    32                  51 179 - 53 736
The first question for the uninitiated in the measurement of time: what is a “UTC Leap Second”? It’s easy enough to look up and learn that UTC is
a time standard based on International Atomic Time (TAI) with leap seconds added at irregular intervals to compensate for the Earth’s slowing rotation.
Ah, so this was news to me: the earth’s rotation is slowing down! (“the solar day becomes 1.7 ms longer every century due mainly to tidal friction (2.3 ms/cy, reduced by 0.6 ms/cy due to glacial rebound)”).
The (implicit) frame of reference for (exact) time with respect to which the earth is slowing down is the atomic (cesium) clock, which requires an understanding of the highly theoretical processes of quantum mechanics to interpret correctly.
So now we have an inkling of what the data mean. They give us the difference between two time measurements – those from atomic clocks and those from the earth’s rotation. A first attempt at interpreting the first row in the table is: it took 3 years between 2006-01-01 and 2009-01-01 to add one leap-second to the calendar date.
A little “Binging” (I’ve all but abandoned “Googling” since Google became “instant” – not because it can’t be turned off, but to make a statement to Google) yields “Modified Julian Day” for MJD. So the third column is primarily a conversion of the first column into a standard, though not without its own theoretical reasons for being the preferred measure.
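As a small illustration of how much interpretation even the third column needs, here is a Python sketch that recomputes the MJD values from the calendar dates in the two rows quoted above. It relies on the standard definition of the Modified Julian Day as a count of days since midnight on 17 November 1858; the function and variable names are mine.

```python
from datetime import date

# MJD 0 begins at midnight on 1858-11-17 (MJD = Julian Date - 2400000.5).
MJD_EPOCH = date(1858, 11, 17)


def to_mjd(day):
    """Modified Julian Day number for midnight at the start of `day`."""
    return (day - MJD_EPOCH).days


# (start date, end date, leap-second value) from the two rows quoted above.
rows = [
    (date(2006, 1, 1), date(2009, 1, 1), 33),
    (date(1999, 1, 1), date(2006, 1, 1), 32),
]

for start, end, leap_seconds in rows:
    print(start, end, leap_seconds, to_mjd(start), to_mjd(end))
```

The recomputed day numbers agree with the MJD column of the table, which confirms that the third column carries the same information as the first.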
All this to say – repositories of datasets without (substantial) amounts of textual metadata, not to mention software and tools designed for their interpretation and navigation, are going to be (at best) not very useful.
The Future of Data Information Retrieval April 29, 2010
Posted by Andre Vellino in Data Mining, Information retrieval, Knowledge Representation.
It seems like the “Open Data” movement is at last getting some traction and the floodgates are opening. [Thanks to Daniel Lemire, Richard Akerman, Peter Turney and Paul Gilbert for all the pointers and helpful suggestions!]
For instance, just the other day the World Bank opened up its data vaults to add to the already voluminous quantities of “social science” data. It probably won’t take long for Google to add some of those data sets to its growing collection of public data that it can display with the Google Motion Charts API. This is in parallel to “Google Fusion Tables” (read “Google Docs for Data”), for which visualizations are also available.
I am not sure about Google’s commercial motives – they probably don’t know either – “build it and the money will come” seems to work for Google, somehow. But Amazon’s motive for providing 30 or so significant data sets (between 20 and 300 GB each) is more transparent: to sell their cloud-computing services for those who want to data-mine this information. What a great honey-pot for data-miners who need to chew up CPU cycles! My favourites are
- U.S. Census Data (1980, 1990, 2000)
- Daily Weather (1929-2009) curated from National Climatic Data Center Data Sets
- Sloan Digital Sky Survey (DR6)
- GenBank (the NIH Gene DataBase)
- PubChem (another database from NIH)
- Ensembl (Human and other animal Genomes)
There are stacks of other datasets as well, lovingly cared for by people for whom this data really matters.
- Astronomy Data Set
- BioMedical Informatics Network
- NRCAN Geo Science Data Repository
- NIST Data Gateway
The list is much too large to provide a comprehensive catalogue, although CISTI has begun developing such a list of Canadian scientific datasets at the Gateway to Scientific Data.
In sum, we have lots of scientists contributing lots of data to databases they care about. Now what?
Here’s the analogy with the web that Tim Berners-Lee makes in this TED video. In ~ 1990 we already had lots and lots of electronic documents on PC hard-disks, not to mention mainframes and even file-servers. Then came his wonderful, awesome idea – do to files on the internet what Macintosh HyperCard did to “Cards”. Brilliant! But hypertext alone wasn’t enough. First requirement: a “universal locator scheme” for linking documents to one another. Second requirement: a harvesting / indexing method to make the content accessible via a search engine.
The corresponding ideas are now migrating to the data world. For instance Berners-Lee is spearheading the Linked Data movement. The idea of Data URIs (addressable via HTTP) is an essential first step but the corresponding second step – a data-harvesting / indexing method – hasn’t (yet) been taken. Not by conventional internet search engines anyway.
We need to be able to search for and find data, but unlike text, data itself can’t be indexed! What can we do about that? One suggestion – for scientific datasets anyway – is to make the link between scientific documents (i.e. published articles) and the datasets that they depend on and use the text of the scientific articles as “metadata” for the datasets.
Fortunately, thanks to DataCite, scientists can now obtain Digital Object Identifiers (DOIs) for data and cite datasets in their publications. They may even get credit for developing and publishing datasets in academic peer-review. In the not too distant future scientists in Canada will be going to CISTI to obtain those DOIs.
What might a future in which data is properly indexed and discoverable look like? A little like WolframAlpha, I expect. This “computational knowledge engine” – note the absence of “search” in this descriptor – already relies on lots of “databases” to return “search results” and compute [possibly relevant] related facts. It’s also already got a pretty cool iPad app for navigating and visualizing results. This is where the puck is going and we ought to be skating there.
Currently the state of the art for cataloging datasets is to annotate the metadata with some bibliographic standard like Dublin Core. And to “crawl” the metadata for datasets, we could use Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) as a starting point.
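To give a flavour of what “crawling” dataset metadata with OAI-PMH involves, here is a minimal Python sketch. It builds a standard `ListRecords` request for Dublin Core (`oai_dc`) records and pulls identifiers and titles out of a response. The repository URL is hypothetical, and the response is a hand-written, abbreviated sample rather than output from a real repository, so that the sketch runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the base URL of a real OAI-PMH repository.
BASE_URL = "https://repository.example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response, abbreviated for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:dataset-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Daily temperature readings</dc:title>
          <dc:type>Dataset</dc:type>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def harvest_titles(xml_text):
    """Return (identifier, title) pairs for every record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


print(list_records_url(BASE_URL))
print(harvest_titles(SAMPLE_RESPONSE))
```

A real harvester would also have to follow resumption tokens and handle deleted records; this sketch only shows the shape of the request and of the Dublin Core payload.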
But we are going to need something more than just “metadata” (which is, in fact “meta-metadata”, since the “metadata” for a dataset should really be the Data Schema or XMLSchema that it conforms to). Remember David Weinberger’s advice in Everything is Miscellaneous: “The solution to overabundance of information is more information”.

So where can we find more information about the data, above and beyond how it is linked to other data and how it is referenced in the published literature? How about putting data schemas and XML Schemas to work and mining them for meaning? The idea is still a bit vague in my mind, but something like an automated way of extracting or reverse-engineering the entities and relationships from the data schema and using them to index the data elements. This could enable even more links between data elements in one dataset and elements in another.
This is perhaps where the social sciences have a leg-up on the physical sciences – there is quite a rich metadata standard for social sciences: Data Documentation Initiative (DDI) (which is a misnomer, really – it should be called “Social Sciences Data Documentation Initiative”). The value of this standard has been to provide a framework for developing tools for navigating datasets and statistical tools for analysis.
There are some such toolsets in some scientific fields: NetCDF (Network Common Data Form), for instance, is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
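Here is one rough way the schema-mining idea might start, sketched in Python: walk an XML Schema, treat every element with a complex type as an “entity”, and record its child elements and their types as attributes or nested entities. The schema is a hypothetical one invented for illustration, and the entity/attribute heuristic is my own simplification that ignores named types, references and XML attributes.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema for a weather-station dataset, invented for illustration.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="station">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="latitude" type="xs:decimal"/>
        <xs:element name="reading" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="date" type="xs:date"/>
              <xs:element name="temperature" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def extract_entities(xsd_text):
    """Map each complex element (entity) to its child elements and their types."""
    root = ET.fromstring(xsd_text)
    entities = {}
    for element in root.iter(XS + "element"):
        complex_type = element.find(XS + "complexType")
        if complex_type is None:
            continue  # a simple element is an attribute of an entity, not an entity
        children = complex_type.findall(f"{XS}sequence/{XS}element")
        entities[element.get("name")] = [
            # Children without a declared type are nested entities themselves.
            (child.get("name"), child.get("type", "complexType"))
            for child in children
        ]
    return entities


for entity, fields in extract_entities(SAMPLE_XSD).items():
    print(entity, fields)
```

The resulting entity names and typed fields are candidate index terms, and the nesting (a station has many readings) is a relationship that could be linked to similar entities in other datasets.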
Perhaps if each scientific discipline can establish a similar set of standards the future of scientific dataset discovery will look as compelling as the WWW does today.



