Interoppo Research: April 2009

2009-04-16

Identifier interoperability

This is of course a month too late, but I thought I'd put down some thoughts about identifier interoperability.

Digital Identifiers Out There exist in a variety of schemes—(HTTP) URI, DOI, Handle, PURL, XRI. ARK, if only it was actually implemented more widely. Plus the large assortment of national bibliographic schemes, only some of which are caged in at Info-URI. ISBN, which is an identifier websites know how to do things with digitally. And so forth.

Confronted with a variety of schemes, users would rather have one unified scheme. Or failing that, interoperability between schemes. Now, this makes intuitive sense when we're talking about services like search, with well defined interfaces and messages. The problem is that an identifier is not a service (despite the conflation of identifier and service in HTTP): it is a linguistic sign. In essence (as we have argued in the PILIN project), it is just a string, associated with some thing. You work out, from the string, what the thing is, through a service like resolution (though that is not the only possible service associated with an identifier). You get from the string to the thing through a service like retrieval (which is *not* necessarily the same as resolution, although URLs historically conflated the two). But the identifier is the argument for the resolution or retrieval service; it's not the service itself.

And in a trivial way, if we ignore resolution and just concentrate on identifying things, pure strings are plenty interoperable. I can use an ISBN string like 978-1413304541 anywhere I want, whether on a napkin, or Wikipedia's Book Sources service, or LookUpByISBN.com, or an Access database. So what's the problem? That ASCII string can get used in multiple services, therefore it's interoperable.

That's the trivial way, of identifier string interoperability. (In PILIN, we referred to "labels" as more generic than strings.) And of course, that's not really what people mean by interoperable identifiers. What they mean is identifier service interoperability after all: some mechanism of resolution, which can deal with more than one identifier scheme. So http:// deals with resolving HTTP URIs and PURLs, and http://hdl.handle.net deals with resolving Handles, and a Name Mapping Authority like http://ark.cdlib.org deals with resolving ARKs. What people would like is a single resolver, which takes an identifier and a name for an identifier scheme, and gives you the resolution (or retrieval) for that identifier.

There are a couple of reasons why a universal resolver is harder than it looks. For one, different schemes have different associated metadata, and services to access that metadata: that is part of the reason they are different. So ARK has its ? and ?? operators; Handle has its association of an identifier with arbitrary metadata fields; XRI has its Resource Descriptor; HTTP has its HTTP 303 vs HTTP 200 status codes, differentiating (belatedly) between resolution and retrieval (getting the description of the resource vs. getting the resource). A single universal resolver would have to come up with some sort of superschema to represent access to all these various kinds of metadata, or else forgo accessing them. If it did give up on accessing all of them (the ARK ??, the Handle Description, the XRI Resource Descriptor), then you're only left with one kind of resolution: get the resource itself. So you'd have a universal retriever (download a document given any identifier scheme), but not the more abstract notion of a universal resolver (get the various kinds of available metadata, given any identifier scheme).

The second reason, related to the first, is that different identifier schemes can allow different services to be associated with their identifiers. In fact those different services depend on the different kinds of metadata that the schemes expose. But if the service is idiosyncratic to an identifier scheme, then getting it to interoperate with a different identifier scheme will require lowest common denominator interchange of data that may get clunky, and will end up discarding much of the idiosyncrasy. A persistence guarantee service from ARK may not make sense applied to Handles. A checksum or a linkrot service applied across identifiers would end up falling back on the lowest common denominator service: that is, the universal retriever, which only knows about downloading resources.

On the other hand, the default universal retriever does already exist. The internet now has a universal protocol in HTTP, and a universal way of dereferencing HTTP references. As we argued in Using URIs as Persistent Identifiers, if an identifier scheme is to get any traction now on the internet, it has to be exposed through HTTP: that is, it has to be accessed as an HTTP URI. That makes HTTP URI resolvers the universal retriever: http://hdl.handle.net/ prefixed to Handles, http://ark.cdlib.org/ prefixed to ARKs, http://xri.net/ prefixed to XRIs. In the W3C's way of thinking, this means that HTTP URIs are the universal identifier, and there's no point in having anything else; to the extent that other identifier schemes exist, they are merely subsets of HTTP URIs (as XRI ended up going with, to address W3C's nix).
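
As a rough sketch of what "exposed through HTTP" amounts to, the prefixing can be written as a lookup from scheme name to proxy. The three prefixes are the ones named above; the table name, the function name and the scheme labels are assumptions made for illustration, not part of any of these systems.

```python
# Proxy prefixes as named in the post; the table and function names are hypothetical.
HTTP_PROXIES = {
    "handle": "http://hdl.handle.net/",
    "ark": "http://ark.cdlib.org/",
    "xri": "http://xri.net/",
}


def to_http_uri(scheme: str, identifier: str) -> str:
    """Return an HTTP URI for the identifier by prefixing its scheme's proxy."""
    key = scheme.lower()
    if key in ("http", "https"):
        # Already an HTTP URI: nothing to prefix.
        return identifier
    if key not in HTTP_PROXIES:
        raise ValueError(f"no HTTP proxy known for scheme {scheme!r}")
    return HTTP_PROXIES[key] + identifier


print(to_http_uri("handle", "102.100.272/T9G74WJQH"))
# http://hdl.handle.net/102.100.272/T9G74WJQH
```

Note that nothing here resolves anything: the function only rewrites a string, which is the whole of what the "universal retriever" asks of the other schemes.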

Despite the Semantic Web's intent of universality, I don't think that any URI has supplanted my name or my passport number: identifiers (and more to the point, linguistic signs) exist and are maintained independently, and are exposed through services and mechanisms of the system's choosing, whether they are exposed as URIs or not. A Handle can be maintained in the Handle system as a Handle, independently of how it is exposed as an HTTP URI; and exposing it as an HTTP URI does not preclude exposing it in different protocols (like UDP). But there are excellent reasons for any identifier used in the context of the web to be resolvable through the web—that is, dereferenced through HTTP. That's why the identifier schemes all end up inside HTTP URIs. What you end up with as a result of HTTP GET on that URI may be a resolution or a retrieval. The HTTP protocol distinguishes the two through status codes, but most people ignore the distinction, and they treat the splash page they get from http://arxiv.org/abs/cmp-lg/9609008 as Just Another Representation of Mark Lauer's thesis, rather than as a resolution distinct from retrieving the thesis. So HTTP GET is the Universal Retriever.
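
The status code distinction can be sketched as a small classifier. This assumes the common reading in which a 303 See Other points to a description of the thing identified, and a 200 returns a representation of the resource itself; the function name and the labels it returns are my own shorthand for the post's terms.

```python
def classify_get(status: int) -> str:
    """Map the status of an HTTP GET onto the resolution/retrieval distinction."""
    if status == 303:
        # See Other: the server points to a description elsewhere.
        return "resolution"
    if status == 200:
        # OK: the body is a representation of the resource itself.
        return "retrieval"
    if 300 <= status < 400:
        # Other redirects say nothing about the distinction by themselves.
        return "redirect"
    return "other"


print(classify_get(303))  # resolution
print(classify_get(200))  # retrieval
```

A client that follows redirects silently, as most do, never sees the 303 and so treats everything as retrieval, which is the behaviour described above.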

But again, retrieval is not all you can do with identifiers. You can just identify things with identifiers. And you can reason about what you have identified: in particular, whether two identifiers are identifying the same thing, and if not, how those two things are related. When the Identifier Interoperability stream of the UKOLN repository workshop sat down to work out what we could do about identifier interoperability, we did not pursue cross-scheme resolvers or universal metadata schemas: if we thought about that at all, we thought it would be too large an undertaking for a year's horizon, and probably too late, given the realities in repository land.

Instead, all we committed to was a service for informing users about whether two identifiers, which could be from different schemes, identified the same file. And for that, you don't need identifier service interoperability: you don't need to actually resolve the identifier live to work it out. Like all metadata, this assertion of equivalence is a claim that a particular authority is making. And like any claim, you can merely represent that assertion in something like RDF, with the identifier strings as arguments. So all you need for the claim "Handle 102.100.272/T9G74WJQH is equivalent to URI https://www.pilin.net.au/Project_Documents/PILIN_Ontology/PILIN_Ontology_Summary.htm" is identifier string interoperability—the fact you can insert identifiers from two different schemes in the same assertion. The same holds if you go further, and start modelling different kinds of relations between identifier referents, such as are covered in FRBR. And because any authority can make claims about anything, we opened up the prospect of not just a central equivalence service, but a decentralised network of hubs of authorities: each making their own assertions about identifiers to match their own purposes, and each available to be consumed by the outside world—subject to how much those authorities are trusted.
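
A minimal sketch of such an equivalence service follows, assuming assertions are stored as plain tuples rather than in an RDF store. The `Assertion` shape, the `equivalentTo` predicate, the `hdl:` prefix on the Handle and the authority names are all hypothetical; treating equivalence as symmetric and transitive is also an assumption that a real authority might not want to make.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    authority: str  # who is making the claim
    subject: str    # identifier string, from any scheme
    predicate: str  # "equivalentTo" in this sketch (hypothetical name)
    obj: str        # identifier string, from any scheme


def equivalence_classes(assertions, trusted):
    """Group identifiers that trusted authorities claim are equivalent."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in assertions:
        if a.authority in trusted and a.predicate == "equivalentTo":
            parent[find(a.subject)] = find(a.obj)

    groups = defaultdict(set)
    for ident in list(parent):
        groups[find(ident)].add(ident)
    return list(groups.values())


def same_thing(a, b, assertions, trusted):
    """True if some trusted chain of claims links identifier a to identifier b."""
    return any(a in g and b in g for g in equivalence_classes(assertions, trusted))


HANDLE = "hdl:102.100.272/T9G74WJQH"
URL = "https://www.pilin.net.au/Project_Documents/PILIN_Ontology/PILIN_Ontology_Summary.htm"
claims = [Assertion("hub-a", HANDLE, "equivalentTo", URL)]

print(same_thing(HANDLE, URL, claims, trusted={"hub-a"}))  # True
print(same_thing(HANDLE, URL, claims, trusted=set()))      # False
```

Neither identifier is dereferenced at any point; the answer depends only on the strings and on which authorities the consumer chooses to trust.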

Defaulting from identifier service interoperability (i.e. interoperability as we know it) back to identifier string interoperability may seem retrograde. Saying things about strings certainly doesn't seem very interoperablish, when you don't seem to actually be doing anything with those strings. Put differently, if the identifier isn't being dereferenced, there does not seem to be an identifier operation at all, so there doesn't seem to be anything to interoperate with. But such thinking is falling back into the trap of conflating the identifier with clicking the identifier. Identifiers aren't just network locations, and they aren't just resolution requests: something everyone now agrees with, including the W3C. They exist as names for things, in addition to any dereferencing to get to those things. And because they exist as names for things, reasoning about how such names relate to each other is part of their core functionality, and is not tied up with live dereferencing of the names. (RDF would not work if it were.)

So this is less than interoperability as we know it; but in a way, it is more interoperable than any service. You don't even need a deployed resolver service in place, to get useful equivalence assertions about identifiers. Nothing prevents you making assertions about URNs, after all...

2009-04-01

Visit to European Schoolnet

Somewhat belatedly (because some work came up when I returned to Australia), this is the writeup of my visit to European Schoolnet, Brussels, on the 18th of March.

As background: European Schoolnet are a partnership of European ministries of education, who are developing common e-learning infrastructure for use in schools throughout Europe. EUNet are involved in the ASPECT project, constructing an e-learning repository network for use in schools in multiple countries in Europe, in partnership with commercial content developers. (See summary.) The network involves adding resource descriptions and collection descriptions to central registries. The network being constructed is currently a closed version of the LRE (Learning Resource Exchange), which is under development.

Link Affiliates are following the progress of the ASPECT project, to see how its learnings can apply to the Digital Education Revolution initiative in Australia.

Link Affiliates (for DEEWR) are also participating with European Schoolnet on the IMS LODE (Learning Object Discovery and Exchange) Project Group, which is formulating common specifications for registering and exchanging e-learning objects between repositories. Link Affiliates is doing some software development to test out the specifications being developed at LODE, and was looking for more elaboration on the requirements that ASPECT in particular would like met.

Identifiers

EUNet are interested in exploring identifier issues for resources further. EUNet are dealing with 24 content providers (including 16 Ministries of Education), with each one identifying resources however it sees fit, and no preexisting coordination in how they identify resources through identifiers. EUNet never know, when they get a resource from a provider, whether they already have it registered or it is new.

EUNet are working on a comparator to guess whether resources deposited with them are identical, based on both attributes and content of the resource. People change the identifiers for objects within institutions; if that did not happen, a comparator would not be needed. Some contributors manage referatories, so they will have both different metadata and different identifiers for the same resource. The comparator service is becoming cleverer. ASPECT plans to promote Handle and persistent identifiers. Even if they are used correctly, they will not eliminate all problems; but they will deal with some resources better than what is happening now.
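
I did not see how the comparator is implemented, so the following is only a guess at the general shape of such a check, not EUNet's method: identical content digests count as a match, and otherwise a fuzzy comparison of a couple of attributes decides. The record fields, the threshold and the function names are all assumptions.

```python
import hashlib
from difflib import SequenceMatcher


def digest(content: bytes) -> str:
    """Fingerprint the content so that byte-identical resources compare equal."""
    return hashlib.sha256(content).hexdigest()


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace before comparing attribute values."""
    return " ".join(text.lower().split())


def probably_same(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Guess whether two deposited records describe the same resource."""
    if digest(a["content"]) == digest(b["content"]):
        return True
    similarity = SequenceMatcher(
        None, normalise(a["title"]), normalise(b["title"])
    ).ratio()
    return a["language"] == b["language"] and similarity >= threshold


original = {"id": "es:123", "title": "Photosynthesis  Basics",
            "language": "en", "content": b"<html>leaf</html>"}
renamed = {"id": "uk:9", "title": "Plants and light",
           "language": "en", "content": b"<html>leaf</html>"}
print(probably_same(original, renamed))  # True: same bytes, different identifier
```

The identifiers play no part in the comparison, which matches the situation described: they cannot be relied on to stay the same.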

Metadata transformation & translation

ASPECT is setting up registries of application profiles and vocabulary banks. They aim to automatically transform metadata for learning resources between vocabularies and profiles. Vocabularies are the major challenge. ASPECT have promised to deliver 200 vocabularies, but that includes language translations: at a minimum ASPECT needs to support the 22 languages of the EU, and 10 or 12 LOM vocabularies in their application profile. The content providers are prepared to adopt the LRE vocabularies and application profile; the content providers transform their metadata vocabularies into the LRE European norm from any national vocabularies, as a compliance requirement. EUN use Systran for translating free text, but that is restricted to titles, descriptions and keywords. The vocabulary bank is used to translate controlled vocabulary entries.

Transformations between metadata schemas, such as DC to LOM, or LRE to and from MARC, will happen much later. The Swiss are making attempts in that direction; but the mappings are very complicated. EUN avoid the problem by sticking to the LRE application profile in-house; they would eventually want LRE to be able to acquire resources from cultural heritage institutions, which will require crosswalking MARC or DC to LOM.

The vocabulary bank will eventually map between distinct vocabularies; e.g. a national vocabulary and mapping will be uploaded centrally, to enable transformation to the LRE norm. One can do metadata transformation by mapping to a common spine, as is done in the UK (e.g. 2002 discussion paper). But the current agreed way is by allowing different degrees of equivalence in translation, and by allowing a single term to map to a Boolean conjunction of terms. Because LOM cannot have boolean conjunctions for its values, this approach cannot be used in static transformations, or in harvest; but federated search can expand out the Boolean conjunctions into multiple search terms. Harvested transformations can still fall back on notions of degrees of equivalence. The different possible mappings are described in:
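
To make the difference between federated search and static transformation concrete, here is a small sketch. The terms, the degree labels and the function names are invented for illustration and are not taken from CWA 15453 or from the LRE vocabulary bank.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Mapping:
    targets: Tuple[str, ...]  # target terms, read as a Boolean conjunction (AND)
    degree: str               # hypothetical labels: "exact", "broader", "narrower"


# Hypothetical national-vocabulary terms mapped to a hypothetical common norm.
MAPPINGS = {
    "natural sciences": Mapping(("science",), "exact"),
    "marine biology": Mapping(("biology", "oceans"), "exact"),
    "pond life": Mapping(("biology",), "broader"),
}


def expand_for_search(term: str) -> list:
    """Federated search: expand a term into the search terms to be ANDed."""
    return list(MAPPINGS[term].targets)


def static_value(term: str) -> Optional[Tuple[str, str]]:
    """Static transformation or harvest: a single value plus its degree of
    equivalence, or None when only a conjunction is available."""
    mapping = MAPPINGS.get(term)
    if mapping is None or len(mapping.targets) != 1:
        return None
    return mapping.targets[0], mapping.degree


print(expand_for_search("marine biology"))  # ['biology', 'oceans']
print(static_value("marine biology"))       # None
print(static_value("pond life"))            # ('biology', 'broader')
```

The `None` case is the limitation described above: a single-valued metadata field has nowhere to put a conjunction, so a harvested record has to fall back on a looser single-term mapping if one exists.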

F. Van Assche, S. Hartinger, A. Harvey, D. Massart, K. Synytsya, A. Wanniart, & M. Willem. 2005. Harmonisation of vocabularies for elearning, CEN Workshop Agreement (CWA 15453). November.

ASPECT work with LODE

IMS LODE is working on ILOX (Information for Learning Object Exchange), as an information model. ILOX includes a notion of abstract classes of resources, akin to FRBR's manifestations, expressions, and works. ASPECT is currently working on a new version of the LRE metadata application profile of ILOX + LOM, v.4: this corrects errors, adds new vocabularies, and does some tweaks including tweaks to identifier formatting. The profile also includes an information model akin to FRBR, as profiled for LRE under LODE/ILOX.

The ILOX schema that has already been made available to Link Affiliates for development work is stable: it will not change as a result of the current editing of the application profile. The application profile should be ready by the end of March. ASPECT will then ask content providers to format their metadata according to the new application profile, with the new binding based on ILOX. By the end of May ASPECT want to have infrastructure in place, to disseminate metadata following the profile.

The content in the LRE is currently restricted to what can be rendered in a browser, i.e. online resources. After May, ASPECT will add SCORM, Common Cartridge and other such packaged content to their scope: they will seek to describe them also with ILOX, and to see whether packaging information can be reused in searches, in order to select the right format for content delivery. This would capitalise on the added value of ILOX metadata, to deal with content in multiple formats.

Transformation services will be put in place to transform content. Most packaged content will be available in several (FRBR) manifestations. The first tests of this infrastructure will be by the end of September 2009; by February 2010 ASPECT aim to have sufficient experience to have the infrastructure running smoothly, and supporting pilot projects. EUN does not know yet if it will adopt this packaging infrastructure for the whole of LRE, or just ASPECT: this depends on the results of the pilots. There will be a mix of schemas in content delivery: content in ASPECT will use ILOX, while content in LRE will continue to use LOM. This should not present a major problem; ASPECT will provide XSL transforms from LRE metadata to ILOX on the first release of their metadata transformation service.

Within ASPECT, EUN have been working with KU Leuven on creating a tool to extract metadata straight out of a SCORM or Common Cartridge package, and generating ILOX metadata directly. KU Leuven have indicated that this should already be working, but they are now waiting for the application profile for testing. When the LRE is opened up to the outside world, it will offer both metadata formats, LRE LOM and LRE ILOX, so they can engage with other LODE partners who have indicated interest—particularly Canada and Australia.

The binding of the Registry information model in LODE is proceeding, using IMS tools. ASPECT want an IMS-compatible binding. The registry work will proceed based on that binding. The registry work is intended for use not just in ASPECT, but as an open source project for wider community feedback and contribution. The Canadians involved in LODE will contribute resources, as will Australia. The registry project is intended to start work in the coming weeks. Development in ASPECT will mostly be undertaken by EUN and KU Leuven. Two instances of the registry will be set up as running and talking to each other for testing. There may be different instances of registries run internationally to register content, and possibly a peer to peer network of registries to exchange information about learning resources. For example, a K-12 resources registry in Australia, run by education.au, could now talk to EUNet's registry.

There has not yet been a decision on what kind of open source license the registry project will use. They are currently inclined to the GNU Lesser General Public License (LGPL), as it allows both open source and commercial development. Suggestions are welcome.

The LRE architecture is presented at Slideshare, with a more complete description underway.

Abstract hierarchies of resources

ASPECT is using abstract hierarchies of learning resources as being modelled in ILOX, and derived from the abstract hierarchies of FRBR. ASPECT would like to display information on the (FRBR) Expression when a user does discovery, and then to automatically select the (FRBR) Manifestation of the object to deliver. Link Affiliates had proposed testing faceted search returning the different available expressions or manifestations of search items. ASPECT were not going to go all the way to facet-based discovery, and are not intending to expose manifestations directly to users: they prefer to have the search interface navigate through abstractions intelligently to end up at the most appropriate manifestation. Still, they are curious to see what facet-based discovery of resources might look like. Several parties are developing portals to LRE, and creating unexpected interfaces and uses of the LRE that they are interested in seeing.

The current test search interface is available online.

ASPECT would like to reuse the ILOX FRBR-ised schema for its collection descriptions. The ILOX schema takes different chunks of metadata, and groups them together according to what level of abstraction they apply to. (Some fields, such as "title", apply to all resources belonging to the same Work; some fields, such as "technical format", would be shared only by all resources belonging to the same Manifestation.) A collection description can also be broken down in this way, since different elements of the collection description correspond to different levels of abstraction: e.g. the protocol for a collection is at Manifestation level, while the target service for the collection is at Item level.
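
The grouping by level of abstraction can be pictured with a plain nested structure. This is not the ILOX binding, only a sketch of the idea: the dictionary layout, the field names other than title and technical format, and the example values are assumptions.

```python
# FRBR levels, from most abstract to most concrete.
LEVELS = ("work", "expression", "manifestation", "item")


def effective_metadata(record: dict) -> dict:
    """Flatten a levelled record: more concrete levels override more abstract ones."""
    merged = {}
    for level in LEVELS:
        merged.update(record.get(level, {}))
    return merged


def fields_at(record: dict, level: str) -> dict:
    """Return only the chunk of metadata recorded at one level of abstraction."""
    return dict(record.get(level, {}))


record = {
    "work": {"title": "Introduction to Photosynthesis"},
    "expression": {"language": "en"},                   # assumed field
    "manifestation": {"technical_format": "text/html"},
    "item": {"location": "http://example.org/lo/42"},   # assumed field
}

print(fields_at(record, "manifestation"))  # {'technical_format': 'text/html'}
print(effective_metadata(record)["title"])  # Introduction to Photosynthesis
```

Two Items of the same Manifestation would share everything except the `item` chunk, which is what makes the grouping worth having for resources delivered in several formats.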

Promoting consistency of schemata across LODE is desirable, and would motivate schema reuse, leading to the same API for all usages; but motivating use cases are needed to work out how to populate such a schema, with different levels of abstraction, for a collection description. Collating different collection descriptions at different levels of abstraction is such a use case ("give me all collections supporting SRU search" vs. "give me all collections supporting any kind of search"). How this would be carried through can be fleshed out in testing.

Registering content and collections in registries

ASPECT wanted to use OAI-PMH as just a synchronisation mechanism for content between different registries. The repository–to–registry ingest would occur through push (deposit), not through pull. OAI-PMH is overkill for the context of learning object registries, and the domain does not have well-defined federations of participants, which could be driven by OAI-PMH: any relevant party can push content into the learning object registries. SPI would also be overkill for this purpose: the detailed workflows SPI supports for managing publishing objects, and binding objects to metadata, are appropriate for Ariadne, but are too much for this context, as ASPECT is just circulating metadata, and not content objects. SWORD would be the likely protocol for content deposit.

Adding repositories to the registry is an activity that needs a use case to be formulated. ASPECT envisages a web page (or something of that sort) to self-register repositories, following the LODE repository description schema. Once the repository is registered, harvesting and associated actions can then happen. People could describe their collections as well as their repositories on the same web page, as a single act of registration. That does not deal with the case of a collection spanning multiple repositories. But the description of a collection is publicly accessible, and need not be bound to a single repository; it can reside at the registry, to span across the participating repositories.

The anticipated model for repository discovery is that one repository has its description pushed into a network, and then the rest of the network discovers it: so this is automatic discovery, not automatic registration. A discovery service like UDDI would not work, because they are not using WSDL/SOAP services.

Collections use cases

Not all collection descriptions would reside in a learning object repository. There are clear use cases for ad hoc collections, built out of existing collections, with their description objects hosted at a local registry level instead (e.g. Wales hosts an ad hoc collection including science collections from Spain and Britain). Such an ad hoc collection description would be prepared by the registry provider, not individual teachers. Being ad hoc, the collection has to be stored in the registry and not a single source repository. There could be a widget built for repositories, so that repository managers could deploy it wherever they want, and enable the repository users to add in collection level descriptions where needed.

Collections use cases being considered at ASPECT are also of interest to GLOBE and LODE. Use cases need to detail:

How to create collections, where.
How to define what objects belong to a collection, intensionally or extensionally (by property or by enumeration).
Describe collection.
Edit description of collection.
Combine collections through any set operation (will mostly be Set Union).
Expose collection (manual or automated).
Discover collection, at registry or client level (VLE, portal).
Evaluate collection, undertaken by user, on behalf of themselves or a community: this depends on the collection description made available, but also can involve viewing items from the collection.
If a commercial collection is involved, there is a Procurement use case as well.
Disaggregate collection and Reaggregate collection: users may want to see the components/contributors of a virtual collection.
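
A few of the use cases above (defining membership by enumeration or by property, combining by set union, and disaggregating again) can be sketched together. The class, the catalogue layout and the collection names are hypothetical and are not drawn from the LODE or ASPECT specifications.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Optional, Set


@dataclass(frozen=True)
class Collection:
    name: str
    members: FrozenSet[str] = frozenset()          # extensional: enumerated ids
    rule: Optional[Callable[[dict], bool]] = None  # intensional: membership test

    def resolve(self, catalogue: Dict[str, dict]) -> Set[str]:
        """Return the ids of all objects that belong to this collection."""
        ids = set(self.members)
        if self.rule is not None:
            ids |= {oid for oid, md in catalogue.items() if self.rule(md)}
        return ids


def union_with_sources(collections: Iterable[Collection],
                       catalogue: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Set union of collections, remembering which collection contributed each
    object so that the combined collection can be disaggregated later."""
    sources: Dict[str, Set[str]] = {}
    for collection in collections:
        for oid in collection.resolve(catalogue):
            sources.setdefault(oid, set()).add(collection.name)
    return sources


catalogue = {
    "es-1": {"subject": "science", "country": "ES"},
    "uk-1": {"subject": "science", "country": "UK"},
    "uk-2": {"subject": "history", "country": "UK"},
}
spain = Collection("Spanish science", members=frozenset({"es-1"}))
britain = Collection(
    "British science",
    rule=lambda md: md["country"] == "UK" and md["subject"] == "science",
)

combined = union_with_sources([spain, britain], catalogue)
print(sorted(combined))  # ['es-1', 'uk-1']
print(combined["uk-1"])  # {'British science'}
```

The mapping from object to contributing collections is what lets a user of the combined collection see its components again.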

Some use cases specific to content also involve the registry:

Describe learning object extensionally, to indicate to what collection it belongs.
Discover learning objects: the collection objects can be used to limit searches.
Evaluate learning object, with respect to a collection (i.e. according to the collection's goals, or drawing on information specific to the collection). E.g. what quality assurance was used for the object, based on metadata that has been recorded only at collection level.

Further work

The LRE application profile registry may feed into the Standards and Application Profiles registry work being proposed by Link Affiliates. BECTA have a profile registry running. At the moment it is limited to human readable descriptions, namely profiles of LOM. LRE will be offering access to application profiles as a service available for external consumption.

OMAR and OCKHAM are two existing registries of learning/repository content. OMAR is in ebXML. ASPECT would like to incorporate content from such registries, and repackage their content to their ends as exemplars of implementations, and potential sources of reusable code. The synchronisation protocols of these registries in particular may be an improvement over OAI-PMH.
