
Mendeley Data vs. Netflix Data November 2, 2010

Posted by Andre Vellino in Citation, Collaborative filtering, Data, Data Mining, Digital library, Recommender, Recommender service.

Mendeley, the online reference management software and social networking site for science researchers, has generously offered up a reference dataset with which developers and researchers can conduct experiments on recommender systems. This release of data is their reply to the DataTel Challenge put forward at the 2010 ACM Recommender Systems Conference in Barcelona.

The paper published by computer scientists at Mendeley, which accompanies the dataset (bibliographic reference and full PDF), describes the dataset as containing boolean ratings (read/unread or starred/unstarred) for about 50,000 anonymized users and references to about 4.8M articles (also anonymized), 3.6M of which are unique.

I was gratified to note that this is almost exactly the user-item ratio (roughly 1:100) that I indicated in my poster at ASIS&T 2010 is typically the cause of the data sparsity problem for recommenders in digital libraries. If we measure the sparseness of a dataset by the number of edges in the bipartite user-item graph divided by the total number of possible edges, Mendeley gives 2.66E-05. Compared with the sparsity of Netflix (1.18E-02), that's a difference of nearly three orders of magnitude!
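The back-of-envelope arithmetic is simple enough to write down (the figures are the approximate counts quoted in this post, not exact values from either dataset):

```python
def sparsity(num_ratings, num_users, num_items):
    """Fraction of possible user-item edges actually present."""
    return num_ratings / (num_users * num_items)

# Mendeley: ~4.8M references, ~50K users, ~3.6M unique articles
mendeley = sparsity(4_800_000, 50_000, 3_600_000)
# Netflix: ~100M ratings, ~480K users, ~17.7K titles
netflix = sparsity(100_000_000, 480_000, 17_700)

print(f"Mendeley: {mendeley:.2e}")   # ~2.67e-05
print(f"Netflix:  {netflix:.2e}")    # ~1.18e-02
print(f"Netflix is ~{netflix / mendeley:.0f}x denser")
```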

But raw sparsity is not all that matters. The number of users per movie is much more evenly distributed in Netflix than the number of readers per article in Mendeley, i.e., the user-item graph in Netflix is more connected (in the sense that the probability of creating a disconnected graph by deleting a random edge is much lower).

In the Mendeley data, out of the 3,652,286 unique articles, 3,055,546 (83.7%) were referenced by only one user and 378,114 by only two users. Fewer than 6% of the referenced articles were referenced by three or more users. (The most frequently referenced article was referenced 19,450 times!)

By contrast, in the Netflix dataset (roughly 100M ratings from ~480K users on ~17K titles), over 89% of the movies had been rated by 20 or more users. (See this blog post for more aggregate statistics on the Netflix data.)

I don't think user- or item-similarity measures are going to work well with the kind of distribution we find in the Mendeley data. Some additional information, such as article citation data, or some content attribute, such as the categories to which the articles belong, is going to be needed to get any kind of reasonable accuracy from a recommender system.

Or it could be that a method like the heat-dissipation technique introduced by physicists in the paper "Solving the apparent diversity-accuracy dilemma of recommender systems", published in the Proceedings of the National Academy of Sciences (PNAS), could work on such a sparse and loosely connected dataset. The authors claim that this approach works especially well for sparse bipartite graphs with no ratings information. We'll have to try it and see.
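As I read that paper, the method is a hybrid of "probabilistic spreading" and "heat spreading" on the bipartite user-item graph, blended by a single parameter λ. A rough NumPy sketch follows; the tiny interaction matrix and λ = 0.5 are illustrative assumptions, not values from the paper:

```python
import numpy as np

# A[u, i] = 1 if user u has item i in their library (binary, no ratings).
# Toy data: 3 users, 4 items (an assumption for illustration only).
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
], dtype=float)

lam = 0.5                    # hybrid parameter: 1.0 = pure ProbS, 0.0 = pure HeatS
k_user = A.sum(axis=1)       # user degrees
k_item = A.sum(axis=0)       # item degrees

# Item-to-item transfer matrix:
#   W[i, j] = (1 / (k_i^(1-lam) * k_j^lam)) * sum_u A[u,i] * A[u,j] / k_user[u]
co = (A / k_user[:, None]).T @ A
W = co / (k_item[:, None] ** (1 - lam) * k_item[None, :] ** lam)

# Score items for user 0 by diffusing from their library, then mask owned items.
scores = W @ A[0]
scores[A[0] > 0] = -np.inf
order = np.argsort(-scores)
print("recommendation order for user 0:", order)
```

With λ = 1 this degenerates to pure mass diffusion (accuracy-oriented); with λ = 0 it becomes pure heat spreading, which favors low-degree (long-tail) items — exactly the regime that dominates the Mendeley data.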


1. Daniel Lemire - November 2, 2010

Any chance to turn this into a user-author problem?

I would be interested in a tool recommending authors rather than papers, and it might help the sparsity.

If you want to chat about it, let’s email.

2. Rod Page - November 3, 2010

My initial reaction to this is that the Mendeley dataset lacks the most obvious source of useful information, namely citation data, which is perhaps a more explicit form of user data (I liked this paper so much I cited it). But then you could argue that users’ libraries will mainly comprise papers they have cited when writing their own papers.

Would be interesting to compare properties of Mendeley data with typical citation networks. If most papers aren’t cited, but some are massively cited, then perhaps sparseness of Mendeley graph is only to be expected.

3. Andre Vellino - November 3, 2010

Thanks for your comments.
@Daniel. Yes, I agree that author recommendations could be quite interesting, perhaps more so than paper recommendations.
@Rod. I have some citation data. I'll try to compare them in the same way and see if something interesting emerges.

4. Daniel Lemire - November 3, 2010

I couldn’t help but come back and plug an old post of mine:

Toward author-centric science

In everything else (music, blogs, books…) people are author-centric. In science, we are, effectively, also author-centric. You read such-and-such a paper because it was written by D. Knuth.

But somehow, the system got organized around publication venues… which is very unnatural. Who actually reads one by one the papers in a given journal or conference? The journal and conference is useful to jump from author to author, but the true alignment remains the authors…

What I want to learn about are new authors I should care about who do research similar to mine.

I was astounded this year to learn of two other Canadian researchers who do work very closely related to my interests. They have been around for years… but they publish in slightly different venues… I found them by accident… in one case because the fellow cited a paper of mine, and in the other because I was asked to review a grant application.

This means that I am quite bad at being aware of related researchers… I’m sure I’m not alone.

Yes, yes, we all know about the super stars…. this is easy… but there is a long tail at work here… most researchers are not highly prolific… yet, they can be very important nonetheless… especially because there are many of them…

5. Dinesh Vadhia - November 18, 2010

The Mendeley data set doesn't include links to the articles. How would you test a recommender?

6. Andre Vellino - November 18, 2010

You could test *some* aspect of a recommender even without the references to the articles themselves: Precision at Rank N, for example. You would select a random user, remove a reference from that user at random and see whether the recommender produces the removed reference as a recommendation as Top-1 (first recommendation), Top-5 (within the top 5), Top-10 (within the top 10) etc.
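That leave-one-out protocol can be sketched in a few lines; the popularity-based recommender and the toy libraries below are stand-ins (my assumption) for whatever recommender and dataset are actually under test:

```python
import random

def hit_at_n(recommend, libraries, n=10, trials=100, seed=42):
    """Leave-one-out test: hide one random reference per sampled user and
    check whether the recommender returns it among its top-n suggestions.
    `recommend(library, n)` -> ranked item ids; `libraries` -> {user: set(items)}."""
    rng = random.Random(seed)
    users = [u for u, lib in libraries.items() if len(lib) > 1]
    hits = 0
    for _ in range(trials):
        u = rng.choice(users)
        held_out = rng.choice(sorted(libraries[u]))
        remaining = libraries[u] - {held_out}
        if held_out in recommend(remaining, n):
            hits += 1
    return hits / trials

# Toy check with a naive popularity recommender (an illustrative baseline):
libraries = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}

def popular(lib, n):
    counts = {}
    for items in libraries.values():
        for it in items - lib:
            counts[it] = counts.get(it, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:n]

print("hit rate at top-1:", hit_at_n(popular, libraries, n=1, trials=20))
```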

Whether that measures anything meaningful about an article recommender is another story…

7. Dinesh Vadhia - November 18, 2010

Thanks Andre. Maybe I'm grousing unfairly, but to me the data set has almost zero utility for building a demonstrable recommender. It seems as if the data was released without any thought about the people building recommenders. Pity!

8. Yicong Liang - January 18, 2011

I think this dataset should also provide the citation network among papers, rather than only the "user-collects-papers" relation.

I am also interested in the paper recommendation problem (or related problems still centered on papers). However, I am still frustrated because few existing datasets are suitable for conducting experiments.

9. Jimme - February 14, 2012

Rod – I have been fascinated reading your blog posts. Well done and thank you!

Just as a point of interest: in Qiqqa, with just over 2 million documents, we are seeing almost exactly the same trend in “shared readership”.

Readers   Share
1         87.39%
2          7.75%
3          3.08%
4          1.47%
5          0.11%
6          0.11%
7          0.05%
8          0.04%
9          0.00%
10         0.00%

The poor performance of the recommendations from these online tools suggests the failure of the simplistic "if we share docs in common, then you will like my other docs" collaborative filtering algorithms. I agree with Daniel that some higher-level context needs to be involved in the recommendation aggregation layer, author being a great one. In Qiqqa we are heading down the route of recommending based on language models built from the papers themselves.

I will certainly be coming back to your blog to see what thoughts you have on taming the scientific literature.

