2009-04-16

Identifier interoperability

This is of course a month too late, but I thought I'd put down some thoughts about identifier interoperability.

Digital Identifiers Out There exist in a variety of schemes—(HTTP) URI, DOI, Handle, PURL, XRI. ARK, if only it was actually implemented more widely. Plus the large assortment of national bibliographic schemes, only some of which are caged in at Info-URI. ISBN, which is an identifier websites know how to do things with digitally. And so forth.

Confronted with a variety of schemes, users would rather one unified scheme. Or failing that, interoperability between schemes. Now, this makes intuitive sense when we're talking about services like search, with well defined interfaces and messages. The problem is that an identifier is not a service (despite the conflation of identifier and service in HTTP): it is a linguistic sign. In essence (as we have argued in the PILIN project), it is just a string, associated with some thing. You work out, from the string, what the thing is, through a service like resolution (though that is not the only possible service associated with an identifier). You get from the string to the thing through a service like retrieval (which is *not* necessarily the same as resolution—although URLs historically conflated the two.) But the identifier is the argument for the resolution or retrieval service; it's not the service itself.

And in a trivial way, if we ignore resolution and just concentrate on identifying things, pure strings are plenty interoperable. I can use an ISBN string like 978-1413304541 anywhere I want, whether on a napkin, or Wikipedia's Book Sources service, or LookUpByISBN.com, or an Access database. So what's the problem? That ASCII string can get used in multiple services, therefore it's interoperable.

That's the trivial way, of identifier string interoperability. (In PILIN, we referred to "labels" as more generic than strings.) And of course, that's not really what people mean by interoperable identifiers. What they mean is identifier service interoperability after all: some mechanism of resolution, which can deal with more than one identifier scheme. So http:// deals with resolving HTTP URIs and PURLs, and http://hdl.handle.net deals with resolving Handles, and a Name Mapping Authority like http://ark.cdlib.org deals with resolving ARKs. What people would like is a single resolver, which takes an identifier and a name for an identifier scheme, and gives you the resolution (or retrieval) for that identifier.

There's a couple of reasons why a universal resolver is harder than it looks. For one, different schemes have different associated metadata, and services to access that metadata: that is part of the reason they are different. So ARK has its ? and ?? operators; Handle has its association of an identifier with arbitrary metadata fields; XRI has its resource Descriptor; HTTP has its HTTP 303 vs HTTP 100 status code, differentiating (belatedly) between resolution and retrieval (getting the resource vs. getting the description of the resource). A single universal resolver would have to come up with some sort of superschema to represent access to all these various kinds of metadata, or else forego accessing them. If it did give up on accessing all of them—the ARK ?? , the Handle Description, the XRI Resource Descriptor—then you're only left with one kind of resolution: get the resource itself. So you'd have a universal retriever (download a document given any identifier scheme), but not the more abstract notion of a universal resolver (get the various kinds of available metadata, given any identifier scheme).

The second reason, related to the first, is that different identifier schemes can allow different services to be associated with their identifiers. In fact those different services depend on the different kinds of metadata that the schemes expose. But if the service is idiosyncratic to an identifier scheme, then getting it to interoperate with a different identifier scheme will require lowest common denominator interchange of data that may get clunky, and will end up discarding much of the idiosyncracy. A persistence guarantee service from ARK may not make sense applied to Handles. A checksum or a linkrot service applied across identifiers would end up falling back on the lowest common denominator service—that is, the universal retriever, which only knows about downloading resources.

On the other hand, the default universal retriever does already exist. The internet now has a universal protocol in HTTP, and a universal way of dereferencing HTTP references. As we argued in Using URIs as Persistent Identifiers, if an identifier scheme is to get any traction now on the internet, it has to be exposed through HTTP: that is, it has to be accessed as an HTTP URI. That makes HTTP URI resolvers the universal retriever: http://hdl.handle.net/ prefixed to Handles, http://ark.cdlib.org/ prefixed to ARKs, http://xri.net/ prefixed to XRIs. In the W3C's way of thinking, this means that HTTP URIs are the universal identifier, and there's no point in having anything else; to the extent that other identifier schemes exist, they are merely subsets of HTTP URIs (as XRI ended up going with, to address W3C's nix).

Despite the Semantic Web's intent of universality, I don't think that any URI has supplanted my name or my passport number: identifiers (and more to the point, linguistic signs) exist and are maintained independently, and are exposed through services and mechanisms of the system's choosing, whether they are exposed as URIs or not. A Handle can be maintained in the Handle system as a Handle, independently of how it is exposed as an HTTP URI; and exposing it as an HTTP URI does not preclude exposing it in different protocols (like UDP). But there are excellent reasons for any identifier used in the context of the web to be resolvable through the web—that is, dereferenced through HTTP. That's why the identifier schemes all end up inside HTTP URIs. What you end up with as a result of HTTP GET on that URI may be a resolution or a retrieval. The HTTP protocol distinguishes the two through status codes, but most people ignore the distinction, and they treat the splash page they get from http://arxiv.org/abs/cmp-lg/9609008 as Just Another Representation of Mark Lauer's thesis, rather than as a resolution distinct from retrieving the thesis. So HTTP GET is the Universal Retriever.

But again, retrieval is not all you can do with identifiers. You can just identify things with identifiers. And you can reason about what you have identified: in particular, whether two identifiers are identifying the same thing, and if not, how those two things are related. When the Identifier Interoperability stream of the UKOLN respository workshop sat down to work out what we could do about identifier interoperability, we did not pursue cross-scheme resolvers or universal metadata schemas: if we thought about that at all, we thought it would be too large an undertaking for a year's horizon, and probably too late, given the realities in repository land.

Instead, all we committed to was a service for informing users about whether two identifiers, which could be from different schemes, identified the same file. And for that, you don't need identifier service interoperability: you don't need to actually resolve the identifier live to work it out. Like all metadata, this assertion of equivalence is a claim that a particular authority is making. And like any claim, you can merely represent that assertion in something like RDF, with the identifier strings as arguments. So all you need for the claim "Handle 102.100.272/T9G74WJQH is equivalent to URI https://www.pilin.net.au/Project_Documents/PILIN_Ontology/PILIN_Ontology_Summary.htm" is identifier string interoperability—the fact you can insert identifiers from two different schemes in the same assertion. The same holds if you go further, and start modelling different kinds of relations between identifier referents, such as are covered in FRBR. And because any authority can make claims about anything, we opened up the prospect of not just a central equivalence service, but a decentralised network of hubs of authorities: each making their own assertions about identifiers to match their own purposes, and each available to be consumed by the outside world—subject to how much those authorities are trusted.

Defaulting from identifier service interoperability—i.e. interoperability as we know it—back to identifier string interoperability may seem retrograde. Saying things about strings certainly doesn't seem very interoperablish, when you don't seem to actually be doing anything with those strings. Put differently, if the identifier isn't being dereferenced, there does not seem to be an identifier operation at all, so there doesn't seem to be anything to interoperate with. But such thinking is falling back into the trap of conflating the identifier with clicking the identifier. Identifiers aren't just network locations, and they aren't just resolution requests—something everyone now agrees with, including the W3C. They exist as names for things, in addition to any dereferencing to get to those things. And because they exist as names for things, reasoning about how such names relate to each other is part of their core functionality, and is not tied up with live dereferencing of the names. (RDF would not work if they did.)

So this is less than interoperability as we know it; but in a way, it is more interoperable than any service. You don't even need a deployed resolver service in place, to get useful equivalence assertions about identifiers. Nothing prevents you making assertions about URNs, after all...

No comments: