Scientific Data is Interpreted
September 26, 2010
Posted by Andre Vellino in Data, Data Mining, Epistemology.
It must be a truism by now that there is no such thing as theory-free observation. Scientific data is necessarily tied up with the theories that are required to interpret it and that led to its discovery.
By analogy I would argue that scientific data sets are useless unless they are interpreted. There is no such thing as a useful “raw” data set.
Consider for instance the data on leap seconds from the National Research Council. It’s a simple enough table: there are three columns (Date, UTC Leap Seconds, and MJD) and only a few dozen rows. Here are two such rows:
DATE                      UTC Leap Seconds   MJD
2006-01-01 - 2009-01-01   33                 53 736 - 54 832
1999-01-01 - 2006-01-01   32                 51 179 - 53 736
The first question for the uninitiated in the measurement of time: what is a “UTC Leap Second”? It’s easy enough to look up and learn that UTC is
“a time standard based on International Atomic Time (TAI) with leap seconds added at irregular intervals to compensate for the Earth’s slowing rotation.”
Ah, so this was news to me: the earth’s rotation is slowing down! (“the solar day becomes 1.7 ms longer every century due mainly to tidal friction (2.3 ms/cy, reduced by 0.6 ms/cy due to glacial rebound)”).
The (implicit) frame of reference for (exact) time with respect to which the earth is slowing down is the atomic (cesium) clock, which requires an understanding of the highly theoretical processes of quantum mechanics to interpret correctly.
So now we have an inkling of what the data means. It gives us the accumulated difference between two time measurements: those from atomic clocks and those from the earth’s rotation. A first attempt at interpreting the first row in the table is: it took 3 years, between 2006-01-01 and 2009-01-01, to add one leap second to the calendar date.
A little “Binging” (I’ve all but abandoned “Googling” since Google became “instant” – not because it can’t be turned off, but to make a statement to Google) yields “Modified Julian Day” for MJD. So the third column is primarily a conversion of the first column into a standard, though not without its own theoretical reasons for being the preferred measure.
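To make the relationship between the first and third columns concrete, here is a minimal sketch in Python that recomputes the MJD values from the calendar dates in the two rows quoted above. It assumes the usual definition of the Modified Julian Day as a count of days from 1858-11-17; the names `MJD_EPOCH`, `to_mjd` and `rows` are my own, chosen for illustration, and are not part of the NRC table.

```python
from datetime import date

# Assumption: MJD 0 falls on 1858-11-17, the conventional MJD epoch.
MJD_EPOCH = date(1858, 11, 17)


def to_mjd(d: date) -> int:
    """Return the Modified Julian Day number for a calendar date."""
    return (d - MJD_EPOCH).days


# The two rows quoted from the table:
# (start date, end date, leap seconds, MJD start, MJD end)
rows = [
    (date(2006, 1, 1), date(2009, 1, 1), 33, 53736, 54832),
    (date(1999, 1, 1), date(2006, 1, 1), 32, 51179, 53736),
]

for start, end, leap_seconds, mjd_start, mjd_end in rows:
    # Does the third column agree with the first?
    matches = (to_mjd(start), to_mjd(end)) == (mjd_start, mjd_end)
    # Length of the interval in days, read off the MJD column.
    span_days = mjd_end - mjd_start
    print(start, end, leap_seconds, matches, span_days)
```

Under that assumption the recomputed values agree with the table, and the first row spans 1096 days, which is the "3 years" read off the dates. Note that even this small check only works once one knows what the column means and which epoch the count starts from.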
All this to say – repositories of datasets without (substantial) amounts of textual metadata, not to mention software and tools designed for their interpretation and navigation, are going to be (at best) not very useful.