
The Identity of Objects March 14, 2008

Posted by Andre Vellino in Digital Identity, Epistemology, Semantics.

I was listening to my colleague Richard Ackerman give a preview of his upcoming keynote address at the National Information Standards Organization (NISO) forum when Brian Cantwell Smith’s book On The Origin of Objects popped into mind (I wrote a short review of it many moons ago, and I’m a big fan). Brian is now Dean of the Faculty of Information Studies at the University of Toronto, and those of us who have enjoyed The Origin have been patiently waiting for the publication of “The Age of Significance”, a 7-volume series that fleshes out some details.

Brian’s book came to mind because of the point Richard makes in his presentation that computers love unique identifiers for objects – books, articles, authors – and that we don’t really have good standards for identifying things. Even if you take into account efforts like Digital Object Identifiers (DOI), the task of providing unique references to persistent digital objects presents significant hurdles, such as dealing with versions.

What is the proper identity of a digital object? Is it “the instance” or “the work”? Richard’s example is “The Philosopher’s Stone” and “The Sorcerer’s Stone”, books whose full texts are almost (but not quite) identical yet have the “same author” and are, in some sense, “the same work”. But even if the plot is the same and most of the text is the same, might it not be useful, in some contexts, to describe one of them as “the British book” and the other as “the U.S. book”?

Suppose I wanted to search for all the speeches that the present Queen of Canada had given. I wouldn’t want to search for all of Queen Elizabeth the Second’s speeches, or even all the speeches by “the present Queen of England”. Does this sound familiar? (See Bertrand Russell’s “On Denoting”.)

I think the fundamental problem with the identity of digital objects isn’t our lack of standards or a lack of willingness to define them: it is intrinsic to the problem of naming and reference. My view (I expect Brian Smith might agree) is that it is futile to seek identity in objects. Names change, references change, objects change, and what constitutes an “immutable”, identifiable object depends on the context and on the point of view. To that degree, I subscribe to David Weinberger’s thesis in Everything is Miscellaneous. A file on a file system can be an atomic element – from the user’s point of view – or, from the operating system’s point of view, it could be the i-node. The atom of identity could be the physical book in a bricks and mortar library or it could be “the work” – that which is protected by intellectual property rights, for example.

So I go back to Peter Turney’s early blog post on Attributes and Relations in which he argues that relations are primary and attributes are secondary. Perhaps what is important for managing / searching / finding digital objects isn’t so much a way of providing definite descriptors or names for objects, but some flexible way of expressing relations between them (versions / variants / semantic relations with other objects.)

It might be right to provide a unique identifier (e.g. a DOI) to an on-line article today, but it could be that a few years from now, the appropriate unit of reference is the paragraph or the sentence. Who remembers the speech in which JFK said “ask not what your country can do for you–ask what you can do for your country”? It is the sentence that persists.

Comments»

1. Peter Turney - March 14, 2008

Version control systems often store an original file and differences from the original, instead of full copies of the later versions. A difference is a relation between the original and the revised version. The “diffs” are what distinguish versions. (The “diff” file comparison utility has been a standard part of Unix for about 30 years.)
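
As a minimal sketch of that idea (not how any particular version control system actually stores history), the snippet below keeps an original list of lines plus a delta, and rebuilds the revised version from the two. The helper names `make_delta` and `apply_delta` and the sample lines are made up for illustration; the matching itself comes from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Record only what is needed to turn `old` into `new` (hypothetical helper)."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: refer back to the original instead of copying it.
            delta.append(("copy", i1, i2))
        elif j2 > j1:
            # Replaced or inserted run: store the new lines themselves.
            delta.append(("insert", new[j1:j2]))
        # A pure deletion needs no entry: the deleted lines are simply never copied.
    return delta

def apply_delta(old, delta):
    """Rebuild the revised version from the original plus the delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

# Illustrative data only.
original = ["The Philosopher's Stone", "Chapter One", "The Boy Who Lived"]
revised = ["The Sorcerer's Stone", "Chapter One", "The Boy Who Lived"]

delta = make_delta(original, revised)
rebuilt = apply_delta(original, delta)
```

Here the delta holds one new line and one reference to the two unchanged lines, so the revised version exists only as a relation to the original.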

A key part of a good technical paper is the discussion of related work, which situates the paper in the context of past work. It is the “diffs” from past work that define the contribution of the paper to the body of knowledge on the subject. The “diffs” are the relation between the paper and the preceding state of the art in the field.

When introducing my son to some novels, albums, and movies that I thought were important, I discovered that he often couldn’t see why I liked them, because their lessons were absorbed into the mainstream. I remember when they were fresh, and the “diff” between them and what came before was (it seemed to me) large. He sees them now when they are no longer fresh, and the “diff” between them and what came after is (it seems to him) small.

It’s all about relations.

2. Daniel Lemire - March 14, 2008

Unique identifiers work for a limited domain over a limited time period. They are most certainly useful. It is only when you try to extend the domain that things fall apart.

I wrote on my blog about the fact that we cannot even define the equality between two numbers:

http://www.daniel-lemire.com/blog/archives/2007/12/05/formal-definitions-are-less-useful-than-you-think/

That’s not to say that you can’t get some mileage out of comparing numbers in computer science!

As for versions, people sometimes think of a linear sequence. You have version 1, then 2, then 3… but of course, it is more complicated than that in real life. The versions of a piece of work will form a graph in general. You can have several different versions coexisting, then merging, and then splitting again. If you live in a distributed world, you may not even be able to enumerate all possible versions!
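
To make the graph picture concrete, here is a small sketch in which each version records its parent versions rather than a single predecessor. The version names and the `parents` mapping are invented for illustration; the point is only that "what came before" and "what is current" become graph queries instead of a position in a list.

```python
# Hypothetical version history: each version maps to the versions it derives from.
parents = {
    "v1": [],
    "v2-uk": ["v1"],                    # two variants split from v1 ...
    "v2-us": ["v1"],
    "v3-merged": ["v2-uk", "v2-us"],    # ... are merged ...
    "v4-a": ["v3-merged"],              # ... and then split again
    "v4-b": ["v3-merged"],
}

def ancestors(version, parents):
    """Every version that `version` descends from, directly or indirectly."""
    seen = set()
    stack = list(parents[version])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen

def heads(parents):
    """Versions with no descendants: all of them are 'the latest' at once."""
    has_child = {p for ps in parents.values() for p in ps}
    return sorted(v for v in parents if v not in has_child)

merged_history = ancestors("v3-merged", parents)
current_versions = heads(parents)
```

With this history there are two current versions and no single "version 4", which is exactly where a linear numbering scheme stops working.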

3. Gareth - March 20, 2008

I’m currently struggling with an aspect of this problem: for a set of proteins, I am trying to find the relationship between an identifier I have (SwissTrembl/UniParc) and one I want (GI). The list is not even two years old, but many of the identifiers are deprecated, having been deleted or folded into other identifiers. As Daniel says above, the relationship of things over time isn’t a linear sequence but a graph, constantly evolving.
You can solve some of the identifier problems for data by using convenient identifiers based, say, on hash functions of the data itself, or by choosing some element of the thing to use as an identifier (see DNA barcodes for taxonomic identifiers). Unfortunately those identifiers don’t always capture the “thingness” of the thing, the element that makes it a distinct, identifiable entity. What happens if another protein is sequenced that differs by one residue? The hash changes completely, but is it a different protein? Depends what you mean by different… What happens if your original protein is just a fragment of another one, or a splice variant? What if it’s an identical sequence, but sequenced in a different organism? Or the same organism but we get a different sequence? What if it was an error?
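
A small sketch of the one-residue case, assuming SHA-256 as the hash function (the comment does not name one). The two sequences and the helper names `content_id` and `residue_differences` are made up for illustration.

```python
import hashlib

def content_id(sequence: str) -> str:
    """Identifier derived purely from the data (hypothetical scheme: truncated SHA-256)."""
    return hashlib.sha256(sequence.encode("ascii")).hexdigest()[:16]

def residue_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

# Invented sequences that differ only in the final residue.
seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
seq_b = "MKTAYIAKQRQISFVKSHFSRK"

id_a = content_id(seq_a)
id_b = content_id(seq_b)
differences = residue_differences(seq_a, seq_b)
```

The two identifiers share nothing useful even though the data differ at a single position: a content hash answers "is this exactly the same data?" and nothing more.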
Taxonomists have worked long and hard on this sort of problem and have a rich vocabulary for the evolution and relationship of identifiers (see Holotypes; http://en.wikipedia.org/wiki/Holotype). A nice example of such evolution is the Anomalocaris fossil type identifier relationships:
http://en.wikipedia.org/wiki/Anomalocaris
Classical taxonomy required quite a bit of manual curation. As high throughput data creation mechanisms become pervasive, that’s not going to be possible. I’m solving my problem by doing some sequence similarity searches (i.e. searching by similarity of the data, ignoring identifiers). I think that’s going to be more and more common in many fields.
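
A toy sketch of searching by similarity rather than by identifier. It uses Python's generic `difflib.SequenceMatcher` as a stand-in; real protein searches use dedicated alignment tools, and the record names, sequences, and the 0.8 threshold here are all invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a stand-in for a real alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def most_similar(query, database, threshold=0.8):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = [(similarity(query, seq), name) for name, seq in database.items()]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)

# Hypothetical records; the names stand for whatever identifiers a database uses.
database = {
    "record-1": "MKTAYIAKQRQISFVKSHFSRQ",
    "record-2": "MKTAYIAKQRQISFVKSHFSRK",   # one residue away from record-1
    "record-3": "GGSGGSLVPRGSHMASWSHPQF",   # unrelated
}

query = "MKTAYIAKQRQISFVKSHFSRQ"
hits = most_similar(query, database)
hit_names = [name for _, name in hits]
```

The query finds both the exact record and its one-residue neighbour without consulting any identifier mapping, at the cost of having to decide how similar is similar enough.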

