Paper vs. Bytes February 24, 2009
Posted by Andre Vellino in CISTI, Digital library, General.
Until just last week, water-cooler conversations in our library sometimes went to the question of whether a paper collection has value in the 21st century. The universal consensus seems to be that books and paper journals are out and that the future is digital. After all, paper is expensive to produce, transport and store. It also takes up space and can’t be searched or retrieved without meta-data and catalogues. In short, paper collections are less preferable in every way to digital ones. So went film cameras and paper photography, after all.
Ever the contrarian, I sometimes argue the case for paper, at the behest of my bookish spouse (as you might expect from a professor of English literature).
Here then are some arguments for paper.
- Once produced, paper requires no further technology to access – no electricity, no computers, no software.
- The fact that paper takes up space and is expensive to store obliges the “stewards of content” (aka librarians) to be selective about what they accept and keep in their collections. The high cost of publishing is only justifiable if the quality is high, hence an expensive storage vehicle increases the likelihood that what is preserved in libraries is high quality.
- Print has prestige (perhaps because of the previous point). [See the January '09 CBC Spark Podcast on how newspapers are making a comeback]
- At 167 ppi (the current resolution of the Amazon Kindle book reader), screens are still coarser than print, so paper remains easier on the human eyes for which content is (still, mostly) intended.
- Computers contribute to our individual and collective distraction. Should we really be enhancing our tendency to juggle so many things at a time?
Of course, each of these arguments has counter-arguments too. Paper rots easily, and it requires computer technology to access it and buildings to protect it. Also, perhaps librarians shouldn’t be the arbiters of what is collected – it contradicts the (now popular) idea underlying “Everything is Miscellaneous”, wherein pretty much everything has “value”, depending on who you are and what your perspective is.
Still, I do think there’s a place for paper. I don’t think the phrase “digital library” will join the ranks of oxymorons like “paperless office”, but I do think (hope) that not all libraries of the future will be (entirely) digital. Each dimension of the library has its niche, I think.
The Mechanical Librarian February 19, 2009
Posted by Andre Vellino in CISTI, Collaborative filtering, General, Recommender, Recommender service.
I was scheduled to give a CISTI Seminar yesterday, entitled “The Mechanical Librarian” but it was pre-empted by an address from NRC’s president.
I was primed to give the talk, though, because recommenders for scholarly digital libraries are coming of age and there’s lots to say about them.
The presentation (which you can view here) covers a lot of ground at quite a high level, including a brief screen-shot demo of the Synthese Recommender: it was intended for a general audience, mostly of librarians and information specialists.
One chart lists some of the digital library recommenders that have been either developed or studied in the last 7 years:
Techlens (University of Minnesota) (2002)
- Uses ACM DL, full text; mixed hybrid (CF – CBF)
BibTip (University of Karlsruhe) (2003)
- Uses OPAC (Library Catalog) usage data for collaborative filtering
IngentaConnect (2007)
- Uses Baynote (SaaS) customer tracking
DSpace (2008)
- Content-based recommender based on user-bookmarks
CiteULike (academic experiment 2008)
- Collaborative filtering on user bookmarks from CiteULike
“bX” system from Ex Libris (2009)
- Uses SFX resolver logs
NextBio (to be announced in March 2009)
- Life sciences search engine that uses collaborative filtering + ontologies to suggest new content (trials / abstracts / data)
Let me know if I’ve missed something.
NextBio Recommender February 15, 2009
Posted by Andre Vellino in CISTI, Information retrieval, Recommender, Recommender service, Search.
The biosciences search portal NextBio is interesting for several reasons. In this interview, the VP of Engineering, Satnam Alag (also the author of Collective Intelligence in Action), says NextBio will shortly be introducing an article recommender:
The key point about this particular recommendation engine is its strong use of an ontology, similar in concept to tags, to develop a common vocabulary for items and users. The system then makes use of profile information and user interactions, both short- and long-term, to provide recommendations. The system leverages both item- and user-based approaches.
I am a little too jaded to (completely) believe the enthusiastic assertion that article recommenders will be the next killer app, but I do hope this prediction comes true. Recommenders are basically just a feature in a portal, and they depend on a lot of other things in it – user tracking, content, collections, ratings. They are killer apps for Amazon and Netflix, but only because everything else they do is also done well. It will take a perfect storm to get everything right for a scientific article recommender.
In addition to a recommender, it appears that NextBio also has a feature that Glen Newton came up with for query refinement: “drill clouds”.
The major difference is that, in Glen’s drill clouds, clicking on a term adds it to the conjunction of terms in the query and narrows the search to the subset of documents that contains that conjunction. In NextBio, the tag cloud changes the class of things that the original search term applies to, i.e. it narrows the context for the query rather than adding terms to the query. This is a little counter-intuitive once you’ve tried Glen’s method (you can experiment with it on CISTI Lab).
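To make the conjunction-narrowing behaviour concrete, here is a minimal Python sketch of the idea. It is not Glen’s implementation: the toy corpus and the function names (`search`, `drill_cloud`, `drill`) are made up for illustration, and the cloud is assumed to be a simple count of the other terms that appear in the current result set.

```python
from collections import Counter

# Toy corpus: document id -> set of index terms (made-up data for illustration).
DOCS = {
    "d1": {"protein", "folding", "yeast"},
    "d2": {"protein", "folding", "simulation"},
    "d3": {"protein", "yeast", "genome"},
    "d4": {"genome", "sequencing"},
}

def search(query_terms):
    """Return ids of documents containing every query term (a conjunction)."""
    return {doc_id for doc_id, terms in DOCS.items() if query_terms <= terms}

def drill_cloud(query_terms):
    """Count the other terms occurring in the current result set."""
    counts = Counter()
    for doc_id in search(query_terms):
        counts.update(DOCS[doc_id] - query_terms)
    return counts

def drill(query_terms, clicked_term):
    """Clicking a cloud term adds it to the conjunction."""
    return query_terms | {clicked_term}

query = {"protein"}
print(sorted(search(query)))       # documents matching the original query
print(drill_cloud(query))          # candidate terms to drill into
query = drill(query, "folding")
print(sorted(search(query)))       # a subset of the previous results
```

Because each click only ever adds a term to the conjunction, every drill step returns a subset of the previous result set, which is what makes the narrowing easy to predict.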
I think NextBio – which also includes scientific datasets, clinical trials and news – is a science portal to keep tabs on.
Document Similarity w/ Hadoop February 3, 2009
Posted by Andre Vellino in CISTI, Information retrieval, Open Source, Search.
For a while now, I have been wanting to calculate an exact “pairwise document similarity” measure for a large (~8M item) corpus of full-text scientific articles. I tried a variety of obvious sequential methods and discovered that even with good caching strategies, this just wasn’t feasible for such a large collection.
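To give a sense of the scale, here is a rough Python sketch of the naive sequential approach. The similarity measure is an assumption on my part (cosine similarity over raw term counts, chosen purely for illustration), and the three-document corpus and function names are made up.

```python
import math
from collections import Counter
from itertools import combinations

def term_vector(text):
    """Bag-of-words term counts (assumed representation, whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pairwise_similarity(docs):
    """Compare every unordered pair of documents exactly once."""
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(sorted(vectors), 2)
    }

def pair_count(n):
    """Number of unordered pairs among n documents: n(n-1)/2."""
    return n * (n - 1) // 2

docs = {
    "a": "protein folding in yeast",
    "b": "protein folding simulation",
    "c": "genome sequencing",
}
sims = pairwise_similarity(docs)
print(sims)
print(pair_count(8_000_000))  # 31999996000000, roughly 3.2e13 comparisons
```

Each pair is scored independently of every other pair, so the work can be split up freely, but at about 3.2e13 comparisons a single sequential loop is hopeless however well the vectors are cached.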
Since this problem is clearly parallelizable, I tried to do it with Hadoop – Apache’s Java implementation of Google’s MapReduce. As a starting point, Glen pointed me to a nice short paper on how to compute pairwise document similarity using MapReduce (by Elsayed, Lin and Oard). Here’s what I learned from the experience.