2007 Top 10 December 20, 2007

Posted by Andre Vellino in Java, Search, User Interface, Visualization.
1 comment so far

Here’s a list of the top 10 neat things (I would have said “cool” but I’m from a different generation :-) ) that I stumbled across this year:

  1. Google’s Charting API – an http REST API for generating charts
  2. IntelliJ IDEA – a Java IDE to rival Eclipse
  3. Hibernate – an object-relational DB mapping framework for Java
  4. Freebase – a database commons
  5. Firebug addon for Firefox – lets you view HTML / CSS / JavaScript / DOM of web pages
  6. Toyota’s robot – that plays the violin
  7. Holographic Reduced Representations – as a way of modeling memory
  8. Quintura – most promising UI for search refinement
  9. PowerSet – interesting NLP-based search engine
  10. KeyPass – useful cross-platform password generation / keyring
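
For the curious, here is a rough idea of what item 1 looks like in practice. This is only a sketch: the endpoint and the `cht` / `chs` / `chd` / `chl` parameters are as I understand the Chart API documentation, and the `chart_url` helper and its sample data are my own invention for illustration. It only builds the URL; it does not fetch anything.

```python
from urllib.parse import urlencode

# Endpoint as I understand it from the Chart API docs; treat it as an assumption.
CHART_ENDPOINT = "http://chart.apis.google.com/chart"


def chart_url(values, labels, size="250x100", chart_type="p3"):
    """Build a chart URL (hypothetical helper, not part of any Google library)."""
    params = {
        "cht": chart_type,                               # chart type, p3 = 3D pie
        "chs": size,                                     # width x height in pixels
        "chd": "t:" + ",".join(str(v) for v in values),  # data in text encoding
        "chl": "|".join(labels),                         # one label per data point
    }
    # Keep ':', ',' and '|' readable instead of percent-encoding them.
    return CHART_ENDPOINT + "?" + urlencode(params, safe=":,|")


print(chart_url([60, 40], ["Hello", "World"]))
```

The whole chart is described in the query string, so the URL can be dropped straight into an image tag.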

Eclipse & IntelliJ IDEA December 16, 2007

Posted by Andre Vellino in Java, Linux.

If there were a 12-step program for IDE addicts I might begin my weekly session with “Hello, my name is Andre and I’m addicted to Eclipse”. The truth is, I depend on Eclipse’s wizard dialogs, code-completion popups, and quick-fixes for broken code. And yet I envy people who can use Emacs and “java” from the command line. I even envy those who are addicted to other IDEs such as IntelliJ IDEA from JetBrains.

So I thought I’d try IDEA for a few days. There are some really outstanding things about this IDE – its InspectionGadgets code analyser is one. Built-in support for Hibernate and the Google Web Toolkit is another. It even has a built-in IM client for chatting with other IDEA users. But overall, it is the clarity and simplicity of the UI in combination with a rich and coherent set of features that makes it shine. Everything is right there where you expect it and it behaves consistently.

Comparing IDEA and Eclipse feature for feature would probably not be a fruitful exercise. I don’t doubt for a second that you can make Eclipse do anything that IDEA does and vice versa – I bet there’s even an Eclipse IM client plugin somewhere.

But the major strength of Eclipse (one reason I’m an addict) is also its greatest weakness – flexibility and modularity. It’s such a ubiquitous development environment that there are plugins for just about everything you may care for: J2ME, J2EE with Hibernate, Profiling & Code Analysis, Testing Platforms, Swing UI … you name it. Eclipse has become like Linux – everyone picks it up and configures it with different plugins / defaults / splash screens – it has become everything to everyone. And in the process, like Linux I think, it has lost (some of) its identity.

But, despite a compelling alternative, I’m still stuck on Eclipse. Perhaps it’s because I enjoy being mollycoddled.

Attention Profiling – APML December 10, 2007

Posted by Andre Vellino in Citation, Collaborative filtering, Digital library, Recommender, Recommender service.

Paul Lamere’s blog post on APML (Attention Profiling Markup Language) points us to a Beginners Guide, which in turn offers a basic survey. The main APML web site offers this definition:

APML allows users to share their own personal Attention Profile in much the same way that OPML allows the exchange of reading lists between News Readers.

For a science library user, this kind of profile definition may be possible, but likely only useful in combination with a library social bookmarking site such as CiteULike or Elsevier’s 2Collab (see also Richard Akerman’s recent preview of new features in 2Collab).

PowerSet December 5, 2007

Posted by Andre Vellino in Information retrieval, Search.
3 comments

I like what I’m seeing in PowerSet. I obtained a login ID the other day and did a few informal experiments on the PowerLabs site. I know PowerSet is getting a lot of pre-launch buzz – and I am as allergic to hype as the next person, but I’m optimistic about PowerSet.

Both Hakia – whose marketing assertions I expressed some doubt about in a previous post – and PowerSet are mentioned in the recent roundup of “Semantic Apps to Watch” on Read/Write web.

But consider the differences – even just at the level of marketing blurbs from their respective web sites. Hakia claims that:

OntoSem offers an advanced methodology and technology for natural language processing, the only one of its kind, so far, to access the full meaning of the text it handles.

PowerSet, on the other hand, makes a somewhat more modest claim:

Our unique innovations in search are rooted in breakthrough technologies that take advantage of the structure and nuances of natural language. Using these advanced techniques, Powerset is building a large-scale search engine that breaks the confines of keyword search.

Hakia claims to “access the full meaning of the text” and PowerSet merely to “break the confines of keyword search”. One could argue that Google does that too, of course, with PageRank, if nothing else.

From my unscientific survey of sample queries, I’d say PowerSet will live up to their claims when they go live. The demos that I tried are partitioned into structured queries such as “‘What did’ X ‘say about’ Y?” where X and Y are your favourite noun phrases. Other demos on Sports and Art and Business are structured in the same way. However, the index is limited to the text content of Wikipedia. As Daniel Lemire pointed out the other day, you might as well restrict your Google queries to Wikipedia anyway, since ~27% of top search results come from there.

Consider, for example, the question “What did someone say about the Canadian Dollar?”. Powerset’s top result is:

He has been the Premier of Manitoba since 1999, leading a New Democratic Party government…. Doer encouraged the Bank of Canada to lower its rates in late 2003, saying that the rising strength of the Canadian dollar in relation to the American dollar was causing increased unemployment.

Compare that with the following query on “The other guys” web site:

site:http://en.wikipedia.org/ what did someone say about the Canadian Dollar

Google’s 1st result is:

I can’t work out where the black box comes from – did someone change CSS? ….. the shortest version, which according to the Canadian dollar page is “C$”.

These demo query templates are rigged against Google, naturally. Even some surface NLP on the query, which Google doesn’t seem to do, will give you better results. But the PowerSet index does some (maybe quite a bit of) NLP on the content as well – named entity extraction for instance and possibly some anaphora resolution.

I’m giving an encouraging review of PowerSet not just because I worry about a search-engine monoculture. (It’s true that I worry about Google’s dominance, but it’s for the same reason that I worry about the monopolies of Microsoft, Intel and Chiquita Bananas – species diversity is good for the eco-system.) I also find I often want answers to questions about things, which requires the ability to differentiate between “sense” and “reference”. For instance, I often want to read reviews of {books, digital equipment, etc.} rather than references to the items themselves, and I have to twist myself into pretzels to formulate a Google query with quotes (for the item) and synonyms for “review” / “opinion” etc. that are likely to occur in the “about” items I’m looking for.

I think PowerSet might find its niche with users who want a particular kind of question-answering engine. And I don’t think the relative business failure of Ask.com should deter them from seeking that niche.

Quintura Search December 2, 2007

Posted by Andre Vellino in Collaborative filtering.
4 comments

Richard pointed me (again) to something of interest. Yet-another-search-engine Quintura tries to help the user visualize the search space and refine the search results. Quintura appears to combine unsupervised categorization + keyphrase extraction + the tag-cloud paradigm to suggest possible query refinements. Previews of result-sets appear when you mouse over the tags, and you can save your search states and share them with others.

It’s all a bit unusual, but that’s what makes it interesting, if flawed. The company background is unusual too. The keyphrases are: small startup, Russian, neural networks, psychology Ph.D. students in Moscow.

We’re not quite there yet, IMO, but it’s good to see that not all hope has been lost on breaking the now all too familiar mold of the search-box + linear list of results.

Baynote CF for IngentaConnect December 1, 2007

Posted by Andre Vellino in Collaborative filtering, Digital library, Recommender, Recommender service.
2 comments

It’s validating to see that there are commercial offerings for recommender systems in digital libraries. My colleague Richard Akerman, blogger extraordinaire for all things digital library, pointed me to this announcement by IngentaConnect declaring that they are using Baynote to provide collaborative filtering recommendations for scholarly publications. The announcement reads:

… a new partnership with content guidance pioneer Baynote will see IngentaConnect providing “more like this” article recommendations based on both the current context but also, more unusually in scholarly publishing, the user’s previous behaviour. Articles reviewed or acquired by users with similar interests and behaviour will be recommended for consideration and potential purchase.

The “more about how it works” section reads:

… context and behaviour are combined to determine the user’s intent, which is then analysed for relevance to that of the site’s other users; patterns that emerge from this analysis are used to recommend additional content which is more likely to be of interest and relevance to the user than regular, contextual recommendations. Sophisticated behavioural analysis monitors not simply clicks and page views, but also the length of time that a user spends on the page and the type of activities that they carry out there.

I’m not quite sure what “regular, contextual recommendations” means exactly – probably TF*IDF-based content-similarity “more like this” – but I think the overall claim from Baynote is that clickstream data holds the secret to harvesting implicit user-ratings for collaborative filtering recommendations.
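
To make the guess concrete, here is a minimal sketch of what a TF*IDF “more like this” might look like. It is my own illustration, not anything Baynote or IngentaConnect has published: the whitespace tokenizer, the raw-TF times log-IDF weighting, and the sample article titles are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (raw TF times log IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(index, docs):
    """Rank the other documents by similarity to docs[index], best first."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[index], v), i)
              for i, v in enumerate(vectors) if i != index]
    return [i for _, i in sorted(scores, reverse=True)]


# Hypothetical article titles, invented for illustration.
ARTICLES = [
    "toxicity of polonium in mammals",
    "polonium toxicity and radiation dose",
    "collaborative filtering for digital libraries",
]

print(more_like_this(0, ARTICLES))
```

Note that nothing here depends on who is reading: the ranking comes from the text alone, which is the contrast with behaviour-based recommendations.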

The marketing blurb on the Baynote web site says:

By silently observing more than twenty different user actions on your site, Baynote identifies virtual communities of like-minded visitors who have similar intent.

….

Baynote identifies emerging patterns among the visitors which represents the collective wisdom of the crowd. These emerging patterns represents the true intent of the visitors.

Embed a couple of Baynote JavaScript tags (one to monitor users and one to display Baynote’s recommendations) on your web site and, voilà, you have yourself a recommender service.

Sounds too simple to be true, doesn’t it? However good your implicit rating scheme is, there has to be a lot more going on behind the scenes. If Baynote does the CF recommendations for the customer (library), then it must at least have a catalogue of the customer’s offerings (to make the item recommendations) as well as user-browsing data to do the collaborative filtering. Unless, of course, the items that are being recommended are advertisers’ items, in which case the catalogue isn’t the library’s but the advertisers’.
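
Here is a toy sketch of the two ingredients just mentioned: turning browsing data into implicit ratings, and using those ratings for user-based collaborative filtering. Everything in it is an assumption of mine for illustration (the event format, the dwell-time heuristic, the cosine-weighted scoring); Baynote’s actual method is not public as far as I know.

```python
import math
from collections import defaultdict

# Hypothetical clickstream events: (user, item, seconds_on_page, downloaded).
EVENTS = [
    ("u1", "paperA", 120, True),
    ("u1", "paperB", 15, False),
    ("u2", "paperA", 90, True),
    ("u2", "paperC", 200, True),
    ("u3", "paperB", 30, False),
]


def implicit_rating(seconds, downloaded):
    """Assumed heuristic: dwell time capped at 3 minutes, plus a download bonus."""
    return min(seconds, 180) / 180 + (1.0 if downloaded else 0.0)


def build_ratings(events):
    """Aggregate events into {user: {item: rating}}, keeping the strongest signal."""
    ratings = defaultdict(dict)
    for user, item, seconds, downloaded in events:
        score = implicit_rating(seconds, downloaded)
        ratings[user][item] = max(score, ratings[user].get(item, 0.0))
    return dict(ratings)


def similarity(a, b):
    """Cosine similarity between two users' rating dicts."""
    dot = sum(r * b[i] for i, r in a.items() if i in b)
    norm = (math.sqrt(sum(r * r for r in a.values()))
            * math.sqrt(sum(r * r for r in b.values())))
    return dot / norm if norm else 0.0


def recommend(user, ratings):
    """Rank unseen items by similarity-weighted ratings from other users."""
    seen = ratings.get(user, {})
    scores = defaultdict(float)
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(seen, other_ratings)
        for item, rating in other_ratings.items():
            if item not in seen:
                scores[item] += sim * rating
    return sorted(scores, key=scores.get, reverse=True)


RATINGS = build_ratings(EVENTS)
print(recommend("u1", RATINGS))
```

Even in this toy version, the recommender can only suggest items that appear in someone’s browsing history, which is why the catalogue and the user data both have to live somewhere.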

This kind of approach will no doubt be better than plain content analysis for advertisers. But my hunch is that this isn’t likely to work that well for end-users of a digital library.

For one thing, there’s the privacy issue with sparse, anonymized data-sets. Nobody seems to mind Google Analytics on e-commerce web sites and blogs, but what the end-user is searching and browsing in a digital library could be highly confidential. Imagine a forensic pathologist investigating the death of Alexander Litvinenko and searching for scientific data on the toxicity of Polonium 210. The browsing behaviour of such a session might not be something that should be used to provide recommendations for other users.

Also, if Baynote are following the DL recommender research from the GroupLens team, the recommender service needs more than just the items in the catalogue; it needs citation metadata as well to seed its ratings matrices – the way TechLens does. Yet unless the collection’s catalogue is highly homogeneous or has a large number of well-referenced entries, this may not be a feasible strategy because of the sparsity of references that have entries in the collection’s catalogue.
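
The sparsity worry is easy to see with a small sketch. Assume, as I understand the citation-seeding idea, that each paper plays the role of a “user” and each work it cites counts as a positively rated “item”. The catalogue, the reference lists, and the two helper functions below are all made up for illustration.

```python
# Hypothetical catalogue and reference lists, invented for illustration.
CATALOGUE = {"p1", "p2", "p3", "p4"}
REFERENCES = {
    "p1": ["p2", "x9", "x7"],  # x* are cited works outside the catalogue
    "p2": ["p3"],
    "p3": ["x1", "x2"],
    "p4": [],
}


def citation_matrix(catalogue, references):
    """Build {citing paper: set of cited papers}, keeping in-catalogue citations only."""
    return {
        paper: {ref for ref in references.get(paper, [])
                if ref in catalogue and ref != paper}
        for paper in sorted(catalogue)
    }


def density(matrix):
    """Fraction of possible (citing, cited) cells that are filled."""
    n = len(matrix)
    filled = sum(len(cited) for cited in matrix.values())
    possible = n * (n - 1)  # a paper does not cite itself
    return filled / possible if possible else 0.0


MATRIX = citation_matrix(CATALOGUE, REFERENCES)
print(MATRIX)
print(density(MATRIX))
```

Most of the references in this toy example point outside the catalogue and are thrown away, so only 2 of the 12 possible cells get filled. A heterogeneous collection would leave the matrix thinner still.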

At any rate, this is an interesting development and I’m looking forward to finding out more about how this approach works.