Using UML Component diagrams to embed e-Framework Service Usage Models

Given the background of what embedding SUMs in other SUMs can mean, I'm going to model what that embedding can look like from a systems POV, using UML component diagrams. The tool is somewhat awkward to the task, but I was rather taken with the ball-and-socket representation of interfaces in UML 2.0—even if I have to abandon that notation where it counts. I'm also using this as an opportunity to explore specifying the data sources for embedded SUMs—which may not be the same as the data sources for the embedding SUM.

The task I set myself here is to model, using embedded SUMs, functionality for searching for entries in a collection, annotating those entries, and syndicating the annotations (but not the collection entries themselves).

We can represent what needs to happen in an Activity diagram, which captures the fact that entries and annotations involve two different systems. (We'll model them as distinct data stores):

We can go from that Activity diagram to a simple SUM diagram, capturing the use of four services and two data sources:

But as indicated in the previous post, we want to capitalise on the existence of SUMs describing aspects of collection functionality, and modularise out service descriptions already given in those SUMs (along with the context those SUMs set). So:

where a "Searchable Collection" is a service usage model on searching and reading elements in a collection, and "Shareable Collection" is a service usage model on syndicating and harvesting elements in a collection—and all those services modelled may be part of the same system. We are making an important distinction here: the embedded searchable and shareable collection SUMs are generic, and can be used to expose any number of data sources. We nominate two distinct data sources, and align a different data source to each embedded SUM. So we are making the entries data source searchable, but the annotations data source shareable; and we are not relying on the embedded SUMs to tell us what data sources they talk to, when we do this orchestration.

Which is all very well, but what does embedding a SUM actually look like from a running application? I'm going to try to answer that through ball-and-socket. The collection SUM models a software component, which exposes several services for other systems and users to invoke. That software component may be a standalone application, or it may be integrated with other components to build something greater; that flexibility is of course the point of Service Oriented Architecture (and Approaches). The software component exposes a number of services, which can be treated as ports into the component:

And an external component can interface through one or more of those exposed services, giving software integration:

Each service defines its own interface, and the interface to a port is modelled in UML as a realisation of a component (hollow arrowhead): it's the face of the component that the outside world sees:

And outside components that use a port depend on that interface (dashed arrow): the integration cannot happen without that dependency being resolved, so the component using our services depends on our interface:

Exposed services have their interfaces documented in the SUM: that is part of the point of a SUM. But a SUM may not document the interface of just one exposed service, but of several. By default, it documents all exposed services. But if we allow a SUM to model only part of a system's functionality, then we can have different SUMs capturing only subsets of the exposed functionality of a system. By setting up simple, searchable and shareable collections, we're doing just that.

Now a SUM is much more than just an interface definition. But if a single SUM includes the interface definitions for all of Add Read Replace Remove and Search, then we can conflate the interfaces for all those services into a single reference to the searchable collection SUM—where all the interfaces are detailed. We can also have both the simple and the searchable collection SUMs as alternate interfaces into our collection: one gives you search, the other doesn't. (Moreover, we could have two distinct protocols into the collection, so that the distinction may not just be theoretical.)
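To make the bundling concrete, here is a minimal Python sketch, assuming each SUM can be reduced to a named bundle of service interfaces. The class and method names are hypothetical, chosen for illustration; nothing in the e-Framework prescribes them.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SimpleCollection(Protocol):
    """Hypothetical bundle for the simple collection SUM."""

    def add(self, key: str, entry: str) -> None: ...
    def read(self, key: str) -> str: ...
    def replace(self, key: str, entry: str) -> None: ...
    def remove(self, key: str) -> None: ...


@runtime_checkable
class SearchableCollection(SimpleCollection, Protocol):
    """Hypothetical bundle for the searchable collection SUM: adds Search."""

    def search(self, text: str) -> list[str]: ...


class Collection:
    """One component, realising both bundles over the same store."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def add(self, key: str, entry: str) -> None:
        self._entries[key] = entry

    def read(self, key: str) -> str:
        return self._entries[key]

    def replace(self, key: str, entry: str) -> None:
        if key not in self._entries:
            raise KeyError(key)
        self._entries[key] = entry

    def remove(self, key: str) -> None:
        del self._entries[key]

    def search(self, text: str) -> list[str]:
        # Naive substring match, standing in for a real search service.
        return sorted(k for k, v in self._entries.items() if text in v)


collection = Collection()
collection.add("e1", "ball-and-socket notation")
# The same component is reachable through either bundle of interfaces.
simple_view: SimpleCollection = collection
searchable_view: SearchableCollection = collection
```

A caller that only knows `SimpleCollection` never sees `search`, even though the component behind it is the same one.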

This is not a well-formed UML diagram, on purpose: the dependency arrows are left hanging, as a reminder that each interface (a SUM) defines several service endpoints into the component. The reason that's not quite right is that the UML interface is specific to a port (each port has its own interface instance); so a more correct notation would have been to preserve the distinct interface boxes, and use meta-notation to bundle them together into SUMs. Still, the very act of embedding SUMs glosses over the details of which services are being consumed from the embed. So independently of the multiple incoming (and one outgoing) arrows per interface, this diagram is telling us the story we need to tell: a SUM defines bundles of interfaces into a system, and a system may have its interfaces bundled in more than one way.

Let's return to our initial task; we want to search for entries in a collection, annotate those entries, and syndicate the annotations. We can model this with component diagrams, ignoring for now the specifics of the interfaces: we want the functionality identified in the first SUM diagram, of search, read, annotate, and syndicate. In a component diagram, what we want looks like this:

The Entries component exposes search and read services; the Annotations component (however it ends up realised) consumes them. The Annotations component exposes an annotate service to end users, and a syndicate service to other components (wherever they may be).
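The same wiring can be sketched in Python, as a rough illustration rather than a definitive design. The class names, method names and sample data are all assumptions; the only thing taken from the diagram is which component exposes or consumes which service.

```python
from dataclasses import dataclass, field


@dataclass
class EntriesComponent:
    """Exposes the search and read services over the Entries data source."""

    entries: dict[str, str] = field(default_factory=dict)

    def search(self, text: str) -> list[str]:
        # Case-insensitive substring match, standing in for real search.
        return sorted(
            key for key, value in self.entries.items()
            if text.lower() in value.lower()
        )

    def read(self, entry_id: str) -> str:
        return self.entries[entry_id]


@dataclass
class AnnotationsComponent:
    """Consumes search and read; exposes annotate and syndicate."""

    collection: EntriesComponent  # the dependency that has to be resolved
    annotations: list[tuple[str, str]] = field(default_factory=list)

    def annotate(self, query: str, note: str) -> list[str]:
        """Attach a note to every entry matching the query."""
        hits = self.collection.search(query)
        for entry_id in hits:
            self.collection.read(entry_id)  # raises KeyError if unreadable
            self.annotations.append((entry_id, note))
        return hits

    def syndicate(self) -> list[dict[str, str]]:
        """Feed of annotations only; entry content is never included."""
        return [{"entry": e, "note": n} for e, n in self.annotations]


# Hypothetical sample data.
entries = EntriesComponent({"e1": "First sample entry", "e2": "Second one"})
annotator = AnnotationsComponent(collection=entries)
annotator.annotate("sample", "check the dating")
feed = annotator.syndicate()
```

The annotations live in their own store, and the feed carries entry identifiers but not the entries themselves, which matches the task of syndicating annotations without syndicating the collection.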

That's the functionality needed; but we already know that SUMs exist to describe that functionality, and we can use those SUMs to define the needed interfaces:

The Entries collection exposes search and read services through a Searchable Collections SUM, which targets the Entries data source. The Annotations collection exposes syndicate services through a Shareable Collections SUM, which targets the Annotations data source.

Now, in the original component diagram, Annotate was something you did on the metal, directly interfacing with the Entries component:

Expanding it out as we have, we're now saying that realising that Annotate port involves orchestration with a distinct Annotation data source, and consumes search and read services. So we map a port to a systems component realising the port:

Slotting the Annotate port onto a collection is equivalent to slotting that collection into the Search and Read service dependency of the Annotate system:

So we have modelled the dependency between the Entries and Annotate components. But with interfaces, the services they expose, and data sources as proxies for components, we have enough to map this component diagram back to a SUM, with the interface-bundling SUMs embedded:

The embedded SUMs bundle and modularise away functionality. Notice that they do not necessarily define functionality as being external, and so they do not only describe "other systems". The shareable SUM exposes the annotations, and the searchable SUM exposes the entries: their functionality could easily reside on the same repository, and we can't think of both the Entries and the Annotations as "external" data—if we did, we'd have no internal data left. The embedded SUMs are simply building blocks for system functionality—again, independently of where the functionality is provided from.

What does anchor the embedded SUMs and the services alike are the data sources they interact with. An Annotations data source can talk to a single Annotate service in the SUM as readily as it can to a Syndicate service modularised into Shareable Collection, because an embedded SUM can be anchored to one of "our" data sources, just like a standalone service can. That means that, if a SUM will be embedded within another SUM, it's important to know whether the embedded SUM's data sources are cordoned off, or are shared with the invoking context.

An authentication SUM will have its own data sources for users and credentials, and no other service should know about them except through the appropriate authorisation and authentication services. But a Shareable Collections SUM needs to know what data source it's syndicating—in this case, the same data source we're putting our annotations into. So the SUM diagram needs to identify the embedded SUM data source with its own Annotations data source. If data sources in a SUM can be accessed through external services, then embedding that SUM means working out the mapping between the embedding and embedded data sources—as the dashed "Entries" box shows, two diagrams up.
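One way to picture that mapping is as an explicit binding table, checked when the SUM is embedded. The following Python sketch is only an illustration under that assumption: the `Sum` class, the `cordoned` field and the SUM names are hypothetical, not e-Framework constructs.

```python
from dataclasses import dataclass, field


@dataclass
class Sum:
    name: str
    data_sources: set[str]
    # Data sources no embedding SUM may touch directly (hypothetical field).
    cordoned: set[str] = field(default_factory=set)


def binding_problems(host: Sum, embedded: Sum, bindings: dict[str, str]) -> list[str]:
    """Check a map from the embedded SUM's data sources to the host's."""
    problems = []
    for inner, outer in sorted(bindings.items()):
        if inner not in embedded.data_sources:
            problems.append(f"{embedded.name} has no data source {inner!r}")
        elif inner in embedded.cordoned:
            problems.append(f"{inner!r} is cordoned off inside {embedded.name}")
        elif outer not in host.data_sources:
            problems.append(f"{host.name} has no data source {outer!r}")
    # Every shared (non-cordoned) data source needs a counterpart in the host.
    shared = embedded.data_sources - embedded.cordoned
    for inner in sorted(shared - set(bindings)):
        problems.append(
            f"{inner!r} in {embedded.name} is not mapped to a host data source"
        )
    return problems


host = Sum("Annotate Entries", {"Entries", "Annotations"})
shareable = Sum("Shareable Collection", {"Collection"})
identity = Sum("Identity", {"Users", "Credentials"},
               cordoned={"Users", "Credentials"})

ok = binding_problems(host, shareable, {"Collection": "Annotations"})
missing = binding_problems(host, shareable, {})
```

Here the Shareable Collection SUM has to be told which host data source it syndicates, while the Identity SUM needs no bindings at all because its data sources are cordoned off.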

SUM diagrams are very useful for sketching out a range of functionality, and modularisation helps keep things tractable, but eventually you will want to insert slot A into tab B; if you're using embedded SUMs, you will need to say where the tabs are.

Embedding e-Framework SUMs

I've already posted on using UML sequence diagrams to derive e-Framework Service Usage Models (SUMs). SUMs can be used to model applications in terms of their component services. That includes the business requirements, workflows, implementation constraints and policy decisions that are in place for an application, as well as the services themselves and their interfaces.

However, in strict Service Oriented Architecture, the application is not a well-bounded box, sitting on a single server: any number of different services from different domains can be brought together to realise some functionality: the only thing binding these services together is the particular business goal they are realising. We can go even further with this uncoupling of application from service: a service usage model, properly, is just that: a model for the usage of certain services for a particular goal. It need not describe just what a single application does; and it need not exhaustively describe what a single application does. If a business goal only requires some of the functionality of an application, the SUM will model only that much functionality. And since an application can be applied to multiple business problems, there can be multiple SUMs used to describe what a given application does (or will do).

This issue has come up in modelling work that Link Affiliates has been doing around Project Bamboo, and on core SUMs dealing with collections. The e-Framework has already defined SUMs for simple collections, with CRUD functionality, and for searchable collections, which offer CRUD functionality plus search. The searchable collection SUM includes all the functionality of the simple collection SUM, so the simple collection SUM is embedded in the searchable collection SUM:

The e-Framework already has notation for embedding one SUM within another:

And in fact, the embedded SUMs are already in the diagram for the searchable collection: they are the nested rectangles around "Provision {Collection}" and "Manage {Collection}".

Embedding a SUM means that the functionality required is not described in this SUM, but in another. There is a separate SUM intended for managing a collection. That does not mean that the embedded SUM functionality is sourced from another application: the functionality for adding content, searching for content, and managing the content may well be provided by a single system. Then again, it may not: because the SUM presents a service-oriented approach, the functionality is described primarily through services, and the systems they may be provided through are a matter of deployment. But that means that the simple collection SUM, the searchable collection SUM, and the manage collection SUM can all be describing different bundles of functionality of the same system.

Embedding SUMs has been allowed in the e-Framework for quite a while, and has been a handy device to modularise out functionality we don't want to detail, particularly when it is only of secondary importance. Authentication & Authorisation, for instance, are required for most processes in most SUMs; but because SUMs are typically used as thumbnail sketches of functionality, they are often outsourced to an "Identity" SUM.

That modularisation does not mean that the OpenURL SUM shares all its business requirements or design constraints with the Identity SUM. After all, the Identity functionality may reside on a completely different system on the bus. Nor does it mean that every service of the Identity SUM is used by the OpenURL SUM—not even every service exposed to external users. The Identity SUM may offer Authentication, Authorisation, Accounting, Auditing, and Credentials Update, but OpenURL may use only a subset of those exposed services. In fact, the point of embedding the SUM is not to go into the details of which services will be used how from the embedded SUM: embedding the SUM is declining to detail it further, at least in the SUM diagram.

On the other hand, embedding the Identity SUM, as opposed to merely adding individual authentication & authorisation services to the SUM-

—lets us appeal to the embedded SUM for specifics of data models, protocols, implementation, or orchestration, which can also be modularised out of the current SUM.


Google Wave

Yeah, Me Too:

It's hard not to echo the YouTube commenter who said: "I love you google!! I can't wait for you to take over the world!!"

Some quick reax:
  • The special genius of Google is that the interface is not revolutionary: it's all notions we've seen elsewhere brought together, so people can immediately get the metaphor used and engage with it. I found myself annoyed that the developers were applauding so much at what were obvious inventions—and just as often smiling at the sprezzatura of it all.
  • But once everything becomes a Wave Object, and dynamic and negotiated and hooked in, it does destabilise the notion of what a document is massively. Then again, so did wikis.
  • Not everything will become a Wave Object. For reasons both sociological and technical. One of the more important gadgets to hook into this thing for e-scholarship, when it shows up on our browsers, is an annotation gadget for found, static documents (and their components). In fact, we have that even now elsewhere—Diigo for instance. But hooking that up to the Google eye candy, yes, that is A Good Thing.
  • All your base are belong to the Cloud. And of course, what the man said on the Cloud. This may be where the world is heading—all our intellectual output a bunch of sand mandalas, to sweep away with the next electromagnetic bomb or solar flare. One more reason why not everything should become a Wave Object; but you would still obviously want Wave objects to talk to anything online.
  • The eye candy matters, but the highlight for me was at 1h04, with the Wave Robot client communicating updates to the Bug Tracker. That's real service-driven interoperability, with agents translating status live into other systems' internal state. That, you can go a very long way on.
  • The metaphor is unnerving, and deliberately so: the agents are elevated to the same rank as humans, are christened robots, have their own agency in the text you are crafting. The spellchecker is not a tool, it is a conversation participant. But then, isn't that what futurists thought AI realisation would end up looking like anyway? Agents with deep understanding of limited domains, interacting with humans in a task. The metaphor is going to colour how people interact with computers though: just that icon of a parrot will make people think of the gadget as a participant and not an instrument.
  • OK, so Lars moves around the stage; I found that endearing more than anything else.
  • The machine translation demo? Dunno if it was worth *that* much applause; the Unix Terminal demo actually communicated more profoundly than it did. The Translate Widget in OSX has given us live translation for years (with appallingly crap speed, and as my colleague Steve has pointed out, speed of performance in the real world will be the true test of all of this). That said, the fact that the translation was not quite correct was as important to the demo as the speed at which it translated character by character. It's something that will happen with the other robot interactions, I suspect: realising their limitations, so you interact with them in a more realistic way. The stochastic spellchecker is a welcome improvement, but users will still have to realise that it remains fallible. I know people who refuse to use predictive text on their mobiles for that reason, and people will have different thresholds of how much gadget intervention they'll accept. Word's intervention in Auto-Correct has not gained universal welcome.
  • There's going to be some workflow issues, like that the live update stuff can get really distracting quickly (and they realise this with their own use); Microsoft Word's track change functionality gets unusable over a certain number of changes.
  • Google Docs has not delivered massively more functionality than Word, and the motivation to use it has been somewhat abstract; that doesn't lead to mass adoption outside ideologues and specific circumstances. In my day job, we still fling Word Docs with track changes around; colleagues have tried to push us cloud-ward, unsuccessfully. (Partly that's a generational mistrust of the Cloud. Partly it isn't, because the colleague trying to push us cloud-ward is one generation older.) But the combination of Google Docs plus Google Wave for collaborative documents should make Microsoft nervous.
  • Microsoft. Remember them? :-)


Identifier interoperability

This is of course a month too late, but I thought I'd put down some thoughts about identifier interoperability.

Digital Identifiers Out There exist in a variety of schemes—(HTTP) URI, DOI, Handle, PURL, XRI. ARK, if only it was actually implemented more widely. Plus the large assortment of national bibliographic schemes, only some of which are caged in at Info-URI. ISBN, which is an identifier websites know how to do things with digitally. And so forth.

Confronted with a variety of schemes, users would rather one unified scheme. Or failing that, interoperability between schemes. Now, this makes intuitive sense when we're talking about services like search, with well defined interfaces and messages. The problem is that an identifier is not a service (despite the conflation of identifier and service in HTTP): it is a linguistic sign. In essence (as we have argued in the PILIN project), it is just a string, associated with some thing. You work out, from the string, what the thing is, through a service like resolution (though that is not the only possible service associated with an identifier). You get from the string to the thing through a service like retrieval (which is *not* necessarily the same as resolution—although URLs historically conflated the two.) But the identifier is the argument for the resolution or retrieval service; it's not the service itself.

And in a trivial way, if we ignore resolution and just concentrate on identifying things, pure strings are plenty interoperable. I can use an ISBN string like 978-1413304541 anywhere I want, whether on a napkin, or Wikipedia's Book Sources service, or LookUpByISBN.com, or an Access database. So what's the problem? That ASCII string can get used in multiple services, therefore it's interoperable.

That's the trivial way, of identifier string interoperability. (In PILIN, we referred to "labels" as more generic than strings.) And of course, that's not really what people mean by interoperable identifiers. What they mean is identifier service interoperability after all: some mechanism of resolution, which can deal with more than one identifier scheme. So http:// deals with resolving HTTP URIs and PURLs, and http://hdl.handle.net deals with resolving Handles, and a Name Mapping Authority like http://ark.cdlib.org deals with resolving ARKs. What people would like is a single resolver, which takes an identifier and a name for an identifier scheme, and gives you the resolution (or retrieval) for that identifier.

There's a couple of reasons why a universal resolver is harder than it looks. For one, different schemes have different associated metadata, and services to access that metadata: that is part of the reason they are different. So ARK has its ? and ?? operators; Handle has its association of an identifier with arbitrary metadata fields; XRI has its Resource Descriptor; HTTP has its HTTP 303 vs HTTP 200 status codes, differentiating (belatedly) between resolution and retrieval (getting the description of the resource vs. getting the resource). A single universal resolver would have to come up with some sort of superschema to represent access to all these various kinds of metadata, or else forgo accessing them. If it did give up on accessing all of them (the ARK ??, the Handle Description, the XRI Resource Descriptor), then you're only left with one kind of resolution: get the resource itself. So you'd have a universal retriever (download a document given any identifier scheme), but not the more abstract notion of a universal resolver (get the various kinds of available metadata, given any identifier scheme).

The second reason, related to the first, is that different identifier schemes can allow different services to be associated with their identifiers. In fact those different services depend on the different kinds of metadata that the schemes expose. But if the service is idiosyncratic to an identifier scheme, then getting it to interoperate with a different identifier scheme will require lowest common denominator interchange of data that may get clunky, and will end up discarding much of the idiosyncrasy. A persistence guarantee service from ARK may not make sense applied to Handles. A checksum or a linkrot service applied across identifiers would end up falling back on the lowest common denominator service: that is, the universal retriever, which only knows about downloading resources.

On the other hand, the default universal retriever does already exist. The internet now has a universal protocol in HTTP, and a universal way of dereferencing HTTP references. As we argued in Using URIs as Persistent Identifiers, if an identifier scheme is to get any traction now on the internet, it has to be exposed through HTTP: that is, it has to be accessed as an HTTP URI. That makes HTTP URI resolvers the universal retriever: http://hdl.handle.net/ prefixed to Handles, http://ark.cdlib.org/ prefixed to ARKs, http://xri.net/ prefixed to XRIs. In the W3C's way of thinking, this means that HTTP URIs are the universal identifier, and there's no point in having anything else; to the extent that other identifier schemes exist, they are merely subsets of HTTP URIs (as XRI ended up going with, to address W3C's nix).
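The prefixing described above is simple enough to sketch in a few lines of Python. The resolver prefixes are the ones named in this post; the function name and the scheme labels are my own assumptions, and the sketch does no escaping or validation of the identifier string.

```python
# Resolver prefixes as named above; the scheme labels are hypothetical keys.
RESOLVER_PREFIXES = {
    "handle": "http://hdl.handle.net/",
    "ark": "http://ark.cdlib.org/",
    "xri": "http://xri.net/",
}


def to_http_uri(scheme: str, identifier: str) -> str:
    """Expose an identifier from another scheme as an HTTP URI."""
    scheme = scheme.lower()
    if scheme == "http":
        return identifier  # already an HTTP URI; nothing to prefix
    try:
        prefix = RESOLVER_PREFIXES[scheme]
    except KeyError:
        raise ValueError(f"no HTTP resolver known for scheme {scheme!r}") from None
    return prefix + identifier


handle_uri = to_http_uri("handle", "102.100.272/T9G74WJQH")
```

Note what the function throws away: once the Handle is wrapped in a URI, nothing downstream needs to know it was a Handle, which is exactly the W3C position.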

Despite the Semantic Web's intent of universality, I don't think that any URI has supplanted my name or my passport number: identifiers (and more to the point, linguistic signs) exist and are maintained independently, and are exposed through services and mechanisms of the system's choosing, whether they are exposed as URIs or not. A Handle can be maintained in the Handle system as a Handle, independently of how it is exposed as an HTTP URI; and exposing it as an HTTP URI does not preclude exposing it in different protocols (like UDP). But there are excellent reasons for any identifier used in the context of the web to be resolvable through the web—that is, dereferenced through HTTP. That's why the identifier schemes all end up inside HTTP URIs. What you end up with as a result of HTTP GET on that URI may be a resolution or a retrieval. The HTTP protocol distinguishes the two through status codes, but most people ignore the distinction, and they treat the splash page they get from http://arxiv.org/abs/cmp-lg/9609008 as Just Another Representation of Mark Lauer's thesis, rather than as a resolution distinct from retrieving the thesis. So HTTP GET is the Universal Retriever.
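For what it's worth, the distinction most clients ignore can be read straight off the status code. This is a minimal sketch, assuming the convention that a 303 See Other response points to a description of the thing, while a 2xx response carries a representation of the resource itself. The category labels are this sketch's own, not HTTP terminology.

```python
def classify_response(status: int) -> str:
    """Label an HTTP GET outcome as resolution or retrieval."""
    if status == 303:
        # See Other: the server points elsewhere for a description.
        return "resolution"
    if 200 <= status < 300:
        # Success: a representation of the resource came back.
        return "retrieval"
    if status in (301, 302, 307, 308):
        # Plain redirects say nothing either way; follow and classify again.
        return "redirect"
    return "unresolved"
```

A client that treats every outcome as "I got a page" is collapsing the first two branches into one, which is the conflation described above.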

But again, retrieval is not all you can do with identifiers. You can just identify things with identifiers. And you can reason about what you have identified: in particular, whether two identifiers are identifying the same thing, and if not, how those two things are related. When the Identifier Interoperability stream of the UKOLN repository workshop sat down to work out what we could do about identifier interoperability, we did not pursue cross-scheme resolvers or universal metadata schemas: if we thought about that at all, we thought it would be too large an undertaking for a year's horizon, and probably too late, given the realities in repository land.

Instead, all we committed to was a service for informing users about whether two identifiers, which could be from different schemes, identified the same file. And for that, you don't need identifier service interoperability: you don't need to actually resolve the identifier live to work it out. Like all metadata, this assertion of equivalence is a claim that a particular authority is making. And like any claim, you can merely represent that assertion in something like RDF, with the identifier strings as arguments. So all you need for the claim "Handle 102.100.272/T9G74WJQH is equivalent to URI https://www.pilin.net.au/Project_Documents/PILIN_Ontology/PILIN_Ontology_Summary.htm" is identifier string interoperability—the fact you can insert identifiers from two different schemes in the same assertion. The same holds if you go further, and start modelling different kinds of relations between identifier referents, such as are covered in FRBR. And because any authority can make claims about anything, we opened up the prospect of not just a central equivalence service, but a decentralised network of hubs of authorities: each making their own assertions about identifiers to match their own purposes, and each available to be consumed by the outside world—subject to how much those authorities are trusted.
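A minimal sketch of such a hub network in Python, using plain tuples rather than a real RDF library. The hub names, the `equivalentTo` relation label and the URN are hypothetical; the Handle and the URI are the pair quoted above. Treating equivalence as symmetric and transitive is also an assumption of the sketch.

```python
from collections import deque
from dataclasses import dataclass

Identifier = tuple[str, str]  # (scheme, identifier string)


@dataclass(frozen=True)
class Assertion:
    authority: str       # who is making the claim
    subject: Identifier
    relation: str        # e.g. "equivalentTo" (hypothetical label)
    obj: Identifier


def equivalents(start: Identifier, assertions: list[Assertion],
                trusted: set[str]) -> set[Identifier]:
    """Identifiers claimed equivalent to `start` by trusted authorities."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for a in assertions:
            if a.authority not in trusted or a.relation != "equivalentTo":
                continue
            if a.subject == current:
                neighbour = a.obj
            elif a.obj == current:
                neighbour = a.subject
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


handle = ("handle", "102.100.272/T9G74WJQH")
uri = ("uri", "https://www.pilin.net.au/Project_Documents/PILIN_Ontology/"
              "PILIN_Ontology_Summary.htm")
urn = ("urn", "urn:example:pilin-ontology-summary")  # hypothetical URN

claims = [
    Assertion("hub-a", handle, "equivalentTo", uri),
    Assertion("hub-b", uri, "equivalentTo", urn),
]
```

Nothing here dereferences anything: `equivalents(handle, claims, {"hub-a"})` only ever compares strings, and widening the trusted set to include `hub-b` is what brings the URN into the answer.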

Defaulting from identifier service interoperability—i.e. interoperability as we know it—back to identifier string interoperability may seem retrograde. Saying things about strings certainly doesn't seem very interoperablish, when you don't seem to actually be doing anything with those strings. Put differently, if the identifier isn't being dereferenced, there does not seem to be an identifier operation at all, so there doesn't seem to be anything to interoperate with. But such thinking is falling back into the trap of conflating the identifier with clicking the identifier. Identifiers aren't just network locations, and they aren't just resolution requests—something everyone now agrees with, including the W3C. They exist as names for things, in addition to any dereferencing to get to those things. And because they exist as names for things, reasoning about how such names relate to each other is part of their core functionality, and is not tied up with live dereferencing of the names. (RDF would not work if they did.)

So this is less than interoperability as we know it; but in a way, it is more interoperable than any service. You don't even need a deployed resolver service in place, to get useful equivalence assertions about identifiers. Nothing prevents you making assertions about URNs, after all...


Visit to European Schoolnet

Somewhat belatedly (because some work came up when I returned to Australia), this is the writeup of my visit to European Schoolnet, Brussels, on the 18th of March.

As background: European Schoolnet are a partnership of European ministries of education, who are developing common e-learning infrastructure for use in schools throughout Europe. EUNet are involved in the ASPECT project, constructing an e-learning repository network for use in schools in multiple countries in Europe, in partnership with commercial content developers. (See summary.) The network involves adding resource descriptions and collection descriptions to central registries. The network being constructed is currently a closed version of the LRE (Learning Resource Exchange), which is under development.

Link Affiliates are following the progress of the ASPECT project, to see how its learnings can apply to the Digital Education Revolution initiative in Australia.

Link Affiliates (for DEEWR) are also participating with European Schoolnet on the IMS LODE (Learning Object Discovery and Exchange) Project Group, which is formulating common specifications for registering and exchanging e-learning objects between repositories. Link Affiliates is doing some software development to test out the specifications being developed at LODE, and was looking for more elaboration on the requirements that ASPECT in particular would like met.


EUNet are interested in exploring identifier issues for resources further. EUNet are dealing with 24 content providers (including 16 Ministries of Education), with each one identifying resources however it sees fit, and no preexisting coordination in how they identify resources through identifiers. EUNet never know, when they get a resource from a provider, whether they already have it registered or it is new.

EUNet are working on a comparator to guess whether resources deposited with them are identical, based on both attributes and content of the resource. People change the identifiers for objects within institutions; if that did not happen, a comparator would not be needed. Some contributors manage referatories, so they will have both different metadata and different identifiers for the same resource. The comparator service is becoming cleverer. ASPECT plans to promote Handle and persistent identifiers. If they are used correctly, they will not eliminate all problems; but they will deal with some resources better than what is happening now.
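The notes do not say how the comparator works internally. As a rough idea of what comparing on both attributes and content could look like, here is a Python sketch that ignores identifiers altogether. The record fields, the 0.9 weighting and the sample records are assumptions made for illustration, not ASPECT's method.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(content: bytes) -> str:
    """Content hash: identical bytes give identical fingerprints."""
    return hashlib.sha256(content).hexdigest()


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace before comparing attributes."""
    return " ".join(text.lower().split())


def same_resource_score(a: dict, b: dict) -> float:
    """Score in [0, 1]; identifiers are deliberately not consulted."""
    if fingerprint(a["content"]) == fingerprint(b["content"]):
        return 1.0
    title_similarity = SequenceMatcher(
        None, normalise(a["title"]), normalise(b["title"])
    ).ratio()
    # An attribute-only match is capped below a content match (arbitrary cap).
    return 0.9 * title_similarity


# Hypothetical records from two providers.
r1 = {"identifier": "provider-a:123", "title": "Photosynthesis Basics",
      "content": b"<html>leaf</html>"}
r2 = {"identifier": "provider-b:xyz", "title": "Basics of photosynthesis",
      "content": b"<html>leaf</html>"}
r3 = {"identifier": "provider-b:abc", "title": "photosynthesis  basics",
      "content": b"<html>leaf v2</html>"}
r4 = {"identifier": "provider-c:9", "title": "Medieval trade routes",
      "content": b"<html>map</html>"}
```

Here `r1` and `r2` score as the same resource despite different identifiers and titles, while `r3` only scores highly on its title.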

Metadata transformation & translation

ASPECT is setting up registries of application profiles and vocabulary banks. They aim to automatically transform metadata for learning resources between vocabularies and profiles. Vocabularies are the major challenge. ASPECT have promised to deliver 200 vocabularies, but that includes language translations: at a minimum ASPECT needs to support the 22 languages of the EU, and 10 or 12 LOM vocabularies in their application profile. The content providers are prepared to adopt the LRE vocabularies and application profile; the content providers transform their metadata vocabularies into the LRE European norm from any national vocabularies, as a compliance requirement. EUN use Systran for translating free text, but that is restricted to titles, descriptions and keywords. The vocabulary bank is used to translate controlled vocabulary entries.

Transformations between metadata schemas, such as DC to LOM, or LRE to and from MARC, will happen much later. The Swiss are making attempts in that direction; but the mappings are very complicated. EUN avoid the problem by sticking to the LRE application profile in-house; they would eventually want LRE to be able to acquire resources from cultural heritage institutions, which will require crosswalking MARC or DC to LOM.

The vocabulary bank will eventually map between distinct vocabularies; e.g. a national vocabulary and mapping will be uploaded centrally, to enable transformation to the LRE norm. One can do metadata transformation by mapping to a common spine, as is done in the UK (e.g. 2002 discussion paper). But the current agreed way is by allowing different degrees of equivalence in translation, and by allowing a single term to map to a Boolean conjunction of terms. Because LOM cannot have Boolean conjunctions for its values, this approach cannot be used in static transformations, or in harvest; but federated search can expand out the Boolean conjunctions into multiple search terms. Harvested transformations can still fall back on notions of degrees of equivalence. The different possible mappings are described in:

F. Van Assche, S. Hartinger, A. Harvey, D. Massart, K. Synytsya, A. Wanniart, & M. Willem. 2005. Harmonisation of vocabularies for elearning, CEN Workshop Agreement (CWA 15453). November.
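To make the contrast between federated search and harvest concrete, here is a minimal sketch in Python. The vocabulary terms, the equivalence labels and their ranking are all invented for illustration (CWA 15453 defines the real mapping types), and the fallback rule for harvest is my own reading of "degrees of equivalence", not ASPECT's implementation.

```python
from dataclasses import dataclass

# Hypothetical equivalence degrees, best first; the real ones are in CWA 15453.
RANK = {"exact": 0, "narrower": 1, "broader": 2, "related": 3}


@dataclass(frozen=True)
class Mapping:
    targets: tuple     # one target term, or several read as a Boolean AND
    equivalence: str   # a key of RANK


# Hypothetical national-vocabulary terms mapped to hypothetical LRE-norm terms.
MAPPINGS = {
    "national:env-science": [
        Mapping(("lre:environment", "lre:science"), "exact"),
        Mapping(("lre:science",), "broader"),
    ],
    "national:physics": [Mapping(("lre:physics",), "exact")],
}


def search_terms(term):
    """Federated search: take the best mapping, expanding a conjunction
    into several search terms that the query then ANDs together."""
    candidates = MAPPINGS.get(term, [])
    if not candidates:
        return [term]  # no mapping known: pass the term through unchanged
    best = min(candidates, key=lambda m: RANK[m.equivalence])
    return list(best.targets)


def harvest_value(term):
    """Static transformation: a LOM value holds a single term, so only
    single-term mappings qualify; pick the closest one and keep its degree."""
    singles = [m for m in MAPPINGS.get(term, []) if len(m.targets) == 1]
    if not singles:
        return None
    best = min(singles, key=lambda m: RANK[m.equivalence])
    return best.targets[0], best.equivalence
```

With this data, a search on "national:env-science" expands to two terms, while harvesting the same term falls back to the looser single-term mapping and records that it is only a "broader" match.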

ASPECT work with LODE

IMS LODE is working on ILOX (Information for Learning Object Exchange), as an information model. ILOX includes a notion of abstract classes of resources, akin to FRBR's manifestations, expressions, and works. ASPECT is currently working on a new version of the LRE metadata application profile of ILOX + LOM, v.4: this corrects errors, adds new vocabularies, and does some tweaks including tweaks to identifier formatting. The profile also includes an information model akin to FRBR, as profiled for LRE under LODE/ILOX.

The ILOX schema that has already been made available to Link Affiliates for development work is stable: it will not change as a result of the current editing of the application profile. The application profile should be ready by the end of March. ASPECT will then ask content providers to format their metadata according to the new application profile, with the new binding based on ILOX. By the end of May ASPECT want to have infrastructure in place, to disseminate metadata following the profile.

The content in the LRE is currently restricted to what can be rendered in a browser, i.e. online resources. After May, ASPECT will add SCORM, Common Cartridge and other such packaged content to their scope: they will seek to describe them also with ILOX, and to see whether packaging information can be reused in searches, in order to select the right format for content delivery. This would capitalise on the added value of ILOX metadata, to deal with content in multiple formats.

Transformation services will be put in place to transform content. Most packaged content will be available in several (FRBR) manifestations. The first tests of this infrastructure will be by the end of September 2009; by February 2010 ASPECT aim to have sufficient experience to have the infrastructure running smoothly, and supporting pilot projects. EUN does not know yet if it will adopt this packaging infrastructure for the whole of LRE, or just ASPECT: this depends on the results of the pilots. There will be a mix of schemas in content delivery: content in ASPECT will use ILOX, while content in LRE will continue to use LOM. This should not present a major problem; ASPECT will provide XSL transforms from LRE metadata to ILOX on the first release of their metadata transformation service.

Within ASPECT, EUN have been working with KU Leuven on creating a tool to extract metadata straight out of a SCORM or Common Cartridge package, and generating ILOX metadata directly. KU Leuven have indicated that this should already be working, but they are now waiting for the application profile for testing. When the LRE is opened up to the outside world, it will offer both metadata formats, LRE LOM and LRE ILOX, so they can engage with other LODE partners who have indicated interest—particularly Canada and Australia.

The binding of the Registry information model in LODE is proceeding, using IMS tools. ASPECT want an IMS-compatible binding. The registry work will proceed based on that binding. The registry work is intended for use not just in ASPECT, but as an open source project for wider community feedback and contribution. The Canadians involved in LODE will contribute resources, as will Australia. The registry project is intended to start work in the coming weeks. Development in ASPECT will mostly be undertaken by EUN and KU Leuven. Two instances of the registry will be set up as running and talking to each other for testing. There may be different instances of registries run internationally to register content, and possibly a peer to peer network of registries to exchange information about learning resources. For example, a K-12 resources registry in Australia, run by education.au, could now talk to EUNet's registry.

There has not yet been a decision on what kind of open source license the registry project will use. They are currently inclined to the GNU Lesser General Public License (LGPL), as it allows both open source and commercial development. Suggestions are welcome.

The LRE architecture is presented at Slideshare, with a more complete description underway.

Abstract hierarchies of resources

ASPECT is using abstract hierarchies of learning resources as modelled in ILOX, and derived from the abstract hierarchies of FRBR. ASPECT would like to display information on the (FRBR) Expression when a user does discovery, and then to automatically select the (FRBR) Manifestation of the object to deliver. Link Affiliates had proposed testing faceted search returning the different available expressions or manifestations of search items. ASPECT were not going to go all the way to facet-based discovery, and are not intending to expose manifestations directly to users: they prefer to have the search interface navigate through abstractions intelligently to end up at the most appropriate manifestation. Still, they are curious to see what facet-based discovery of resources might look like. Several parties are developing portals to LRE, and creating unexpected interfaces and uses of the LRE that they are interested in seeing.
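A rough sketch of what "show the Expression, deliver the Manifestation" could look like, in Python. The class names, fields and the selection rule (first media type the client accepts, in the client's order of preference) are assumptions of mine for illustration; they are not taken from the ILOX schema.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    identifier: str   # hypothetical identifier for this format of the content
    media_type: str   # e.g. "text/html" or "application/zip"


@dataclass
class Expression:
    title: str
    language: str
    manifestations: list = field(default_factory=list)


def select_manifestation(expression, accepted_types):
    """Return the manifestation matching the client's most preferred
    media type, or None if the client accepts none of those on offer."""
    for wanted in accepted_types:
        for manifestation in expression.manifestations:
            if manifestation.media_type == wanted:
                return manifestation
    return None


# The user sees only the expression in search results ...
lesson = Expression(
    title="Photosynthesis",
    language="en",
    manifestations=[
        Manifestation("m-html", "text/html"),
        Manifestation("m-package", "application/zip"),
    ],
)
# ... and the delivery step picks the format: a VLE that prefers packages,
# versus a plain browser.
for_vle = select_manifestation(lesson, ["application/zip", "text/html"])
for_browser = select_manifestation(lesson, ["text/html"])
```

The user never chooses between manifestations here; the client's capabilities do, which is the behaviour ASPECT say they prefer over exposing manifestations as facets.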

The current test search interface is available online.

ASPECT would like to reuse the ILOX FRBR-ised schema for its collection descriptions. The ILOX schema takes different chunks of metadata, and groups them together according to what level of abstraction they apply to. (Some fields, such as "title", apply to all resources belonging to the same Work; some fields, such as "technical format", would be shared only by all resources belonging to the same Manifestation.) A collection description can also be broken down in this way, since different elements of the content description correspond to different levels of abstraction: e.g. the protocol for a collection is at Manifestation level, while the target service for the collection is at Item level.

Promoting consistency of schemata across LODE is desirable, and would motivate schema reuse, leading to the same API for all usages; but motivating use cases are needed to work out how to populate such a schema, with different levels of abstraction, for a collection description. Collating different collection descriptions at different levels of abstraction is such a use case ("give me all collections supporting SRU search" vs. "give me all collections supporting any kind of search"). How this would be carried through can be fleshed out in testing.
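The collation use case can be sketched as follows, in Python. Placing the protocol at Manifestation level and the target service at Item level follows the discussion above; putting a generic "access" kind at Expression level is my own assumption, as are all the field names and example collections.

```python
# Hypothetical collection descriptions, with fields grouped by FRBR-like level.
COLLECTIONS = [
    {
        "name": "A",
        "expression": {"access": "search"},
        "manifestation": {"protocol": "SRU"},
        "item": {"target": "https://a.example.org/sru"},
    },
    {
        "name": "B",
        "expression": {"access": "search"},
        "manifestation": {"protocol": "OpenSearch"},
        "item": {"target": "https://b.example.org/search"},
    },
    {
        "name": "C",
        "expression": {"access": "harvest"},
        "manifestation": {"protocol": "OAI-PMH"},
        "item": {"target": "https://c.example.org/oai"},
    },
]


def collate(collections, level, field_name, value):
    """Names of the collections whose description has the given value
    for the given field at the given level of abstraction."""
    return [
        c["name"]
        for c in collections
        if c.get(level, {}).get(field_name) == value
    ]


# "Give me all collections supporting SRU search" (Manifestation level) ...
sru_only = collate(COLLECTIONS, "manifestation", "protocol", "SRU")
# ... versus "all collections supporting any kind of search" (a coarser level).
any_search = collate(COLLECTIONS, "expression", "access", "search")
```

The same query function serves both questions; only the level of abstraction it is pointed at changes, which is the argument for reusing one schema across levels.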

Registering content and collections in registries

ASPECT wanted to use OAI-PMH as just a synchronisation mechanism for content between different registries. The repository–to–registry ingest would occur through push (deposit), not through pull. OAI-PMH is overkill for the context of learning object registries, and the domain does not have well-defined federations of participants, which could be driven by OAI-PMH: any relevant party can push content into the learning object registries. SPI would also be overkill for this purpose: the detailed workflows SPI supports for managing publishing objects, and binding objects to metadata, are appropriate for Ariadne, but are too much for this context, as ASPECT is just circulating metadata, and not content objects. SWORD would be the likely protocol for content deposit.
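For the registry-to-registry synchronisation role, the OAI-PMH side amounts to periodic incremental `ListRecords` requests. Below is a minimal Python sketch that builds such a request URL. The `verb`, `metadataPrefix`, `from` and `set` parameters are standard OAI-PMH; the base URL is hypothetical, and `oai_dc` is used only because every OAI-PMH server must support it (a registry would presumably expose a LOM or ILOX based prefix instead, which is an assumption on my part).

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request for an incremental sync."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # only records changed since this date
    if set_spec:
        params["set"] = set_spec     # optionally restrict to one set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical peer registry, synchronising everything changed since 1 March.
url = list_records_url("https://registry.example.org/oai",
                       from_date="2009-03-01")
```

Deposit from repository to registry would not go through this path at all: it is a push, for which SWORD is the likely candidate.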

Adding repositories to the registry is an activity that needs a use case to be formulated. ASPECT envisages a web page (or something of that sort) to self-register repositories, following the LODE repository description schema. Once the repository is registered, harvesting and associated actions can then happen. People could describe their collections as well as their repositories on the same web page, as a single act of registration. That does not deal with the case of a collection spanning multiple repositories. But the description of a collection is publicly accessible, and need not be bound to a single repository; it can reside at the registry, to span across the participating repositories.

The anticipated model for repository discovery is that one repository has its description pushed into a network, and then the rest of the network discovers it: so this is automatic discovery, not automatic registration. A discovery service like UDDI would not work, because they are not using WSDL SOAP services.

Collections use cases

Not all collection descriptions would reside in a learning object repository. There are clear use cases for ad hoc collections, built out of existing collections, with their description objects hosted at a local registry level instead (e.g. Wales hosts an ad hoc collection including science collections from Spain and Britain). Such an ad hoc collection description would be prepared by the registry provider, not individual teachers. Being ad hoc, the collection has to be stored in the registry and not a single source repository. There could be a widget built for repositories, so that repository managers could deploy it wherever they want, and enable the repository users to add in collection level descriptions where needed.

Collections use cases being considered at ASPECT are also of interest to GLOBE and LODE. Use cases need to detail:

  • How to create collections, and where.
  • How to define what objects belong to a collection, intensionally or extensionally (by enumeration or by property).
  • Describe collection.
  • Edit description of collection.
  • Combine collections through any set operation (will mostly be Set Union).
  • Expose collection (manual or automated).
  • Discover collection, at registry or client level (VLE, portal).
  • Evaluate collection, undertaken by user, on behalf of themselves or a community: this depends on the collection description made available, but also can involve viewing items from the collection.
  • If a commercial collection is involved, there is a Procurement use case as well.
  • Disaggregate collection and Reaggregate collection: users may want to see the components/contributors of a virtual collection.

Some use cases specific to content also involve the registry:
  • Describe learning object extensionally, to indicate to what collection it belongs.
  • Discover learning objects: the collection objects can be used to limit searches.
  • Evaluate learning object, with respect to a collection (i.e. according to the collection's goals, or drawing on information specific to the collection). E.g. what quality assurance was used for the object, based on metadata that has been recorded only at collection level.
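Several of these use cases (defining membership by enumeration or by property, combining collections, and limiting searches to a collection) fit in a few lines of Python. This is only a sketch: the catalogue, identifiers and metadata fields are made up, and a real registry would work from collection descriptions rather than in-memory functions.

```python
# Hypothetical catalogue: learning object identifier -> metadata.
CATALOGUE = {
    "lo:1": {"subject": "science", "country": "ES"},
    "lo:2": {"subject": "science", "country": "GB"},
    "lo:3": {"subject": "history", "country": "GB"},
}


def extensional(ids):
    """A collection defined by enumerating its members."""
    wanted = frozenset(ids)
    return lambda catalogue: {i for i in catalogue if i in wanted}


def intensional(**properties):
    """A collection defined by a property its members must share."""
    return lambda catalogue: {
        i for i, metadata in catalogue.items()
        if all(metadata.get(k) == v for k, v in properties.items())
    }


def union(*collections):
    """An ad hoc collection built as the set union of existing ones."""
    return lambda catalogue: set().union(*(c(catalogue) for c in collections))


def search(catalogue, collection, **criteria):
    """Discover learning objects, limiting the search to one collection."""
    scope = collection(catalogue)
    return sorted(
        i for i in scope
        if all(catalogue[i].get(k) == v for k, v in criteria.items())
    )


spanish_science = intensional(subject="science", country="ES")
british_picks = extensional(["lo:2"])
ad_hoc = union(spanish_science, british_picks)
```

Because the ad hoc collection keeps references to its components rather than a flattened member list, the Disaggregate use case stays possible: the contributors can still be listed separately.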

Further work

The LRE application profile registry may feed into the Standards and Application Profiles registry work being proposed by Link Affiliates. BECTA have a profile registry running. At the moment it is limited to human readable descriptions, namely profiles of LOM. LRE will be offering access to application profiles as a service available for external consumption.

OMAR and OCKHAM are two existing registries of learning/repository content. OMAR is in ebXML. ASPECT would like to incorporate content from such registries, and repackage their content for its own ends, as exemplars of implementations, and potential sources of reusable code. The synchronisation protocols of these registries in particular may be an improvement over OAI-PMH.


URN NBN Resolver Demonstration

Demonstrated by Maurice Vanderfeesten.

Actually his very cool Prezi presentation will be more cogent than my notes: URN NBN Resolver Presentation. [EDIT: Moreover, he included his own notes in his discussion of the identifier workshop session.]

A few notes to supplement this:

  • The system uses URNs based on National Bibliography Numbers (URN-NBN) as its persistent identifiers.
  • So it's a well-established bibliographic identification scheme, which can certainly be expanded to the research repository world. (The German National Library already covers research data.)
  • The pilot was coded at the start of 2009.
  • They are using John Kunze's Name-To-Thing resolver as their HTTP URI infrastructure for making their URNs resolvable.
  • Tim Berners-Lee might be surprised to see his Linked Data advocacy brought up in this presentation in the context of URNs. But as long as things can also be expressed as HTTP URIs, it does not matter.
    The blood on the blade of the W3C TAG URN finding is still fresh, I know.

  • Lots of EU countries are queuing up to use this as persistent identifier infrastructure.
  • The Firefox plugin works on resolving these URNs with predictable smoothness. :-)
  • They are working through what their granularity of referents will be, and what the long term sustainability expectations are for their components (the persistence guarantees, in the terms of the PILIN project).
  • They would like to update RFC 2141 on URNs, and already have in place RFC 3188 on NBNs.
  • They now need to convince the community of the urgency and benefits of persistent identifiers and of this particular approach, and to get community buy-in.
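The point about URNs also being expressible as HTTP URIs can be illustrated with a small Python sketch. The parsing follows the `urn:<NID>:<NSS>` shape of RFC 2141 only roughly; the resolver base URL and the example NBN are hypothetical, and this is not a description of how the demonstrated resolver or the Name-To-Thing service actually builds its URIs.

```python
import re
from urllib.parse import quote

# Rough RFC 2141 shape: "urn:" + namespace identifier + ":" + namespace-specific string.
URN_PATTERN = re.compile(r"^urn:([a-z0-9][a-z0-9-]{1,31}):(.+)$", re.IGNORECASE)


def parse_urn(urn):
    """Split a URN into (namespace identifier, namespace-specific string)."""
    match = URN_PATTERN.match(urn)
    if not match:
        raise ValueError(f"not a URN: {urn!r}")
    nid, nss = match.groups()
    return nid.lower(), nss   # the namespace identifier is case-insensitive


def to_http_uri(urn, resolver_base="https://resolver.example.org/"):
    """Express the URN as an HTTP URI under a (hypothetical) resolver."""
    nid, nss = parse_urn(urn)
    return resolver_base + quote(f"urn:{nid}:{nss}", safe=":-")


# A made-up NBN, written with an upper-case prefix to show normalisation.
uri = to_http_uri("URN:NBN:nl:ui:12-345678")
```

The identifier itself stays a URN; only the access path is HTTP, which is why the Linked Data argument still applies.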

UKOLN International Repository Workshop: Identifier Interoperability

[EDIT: Maurice Vanderfeesten has a fuller summary of the outcomes.]

First Report:

  • Many resonances with what was already said in other streams: support for scholarly cycle, recognition of range of solutions, disagreement on scope, needing to work with more than traditional repositories.
  • Identifying: objects (not just data), institutions, and people in limited roles.
  • Will model relations between identifiers; there are both implicit and explicit information models involved.
  • Temporal change needs to be modelled; there are lots of challenges.
  • Not trying to build the one identifier system, but loose coupling of identifier services with already extant identifier systems.
  • Start with small sets of functionality and then expand.
  • Identifiers are created for defined periods and purposes, based on distinguishing attributes of things.

Second Report:

  • We can't avoid the "more research needed" phase of work: need to work out workflows and use cases to support the identifier services, though the infrastructure will be invisible to some users.
  • Need rapid prototyping of services, not waterfall.
  • The mindmaps provided by the workshop organisers of parties involved in the repository space [will be published soon] are useful, and need to be kept up to date through the lifetime of the project.
  • There may not be much to do internationally for object identification, since repositories are doing this already; but we likely need identifiers for repositories.
  • Author identifiers: repositories should not be acting as naming authorities, but import that authority from outside.
  • There are different levels of trust for naming authorities; assertions about authors change across time.
  • An interoperability service will allow authors to bind multiple identities together, and give authors the control to prevent their private identities being included with their public personas.

Third Report:

  • The group has been pragmatic in its reduction of scope.
  • There will be identifiers for: Organisations, repositories, people, objects.
  • Identifiers are not names: we are not building a name registry, and name registries have their own distinct authority.
  • Organisations:
    • Identifiers for these should be built on top of existing systems (which is a general principle for this work).
    • There could usefully be a collection of organisation identifiers, maintained as a federated system, and including temporal change in its model.
    • The organisation registry can be tackled by geographical region, and start on existing lists, e.g. DNS.

  • Repositories:
    • There shall be a registry for repositories. There shall be rules and vetting for getting on the registry, sanity checks. Here too there are temporal concerns to model: repositories come into and out of existence.
    • The registry shall be a self-populating system, building on existing systems like OpenDOAR. It should also offer depopulation (a repository is pinged, and found no longer to be live.)
    • There is a many-to-many relation of repositories to institutions.
    • The registry shall not be restricted to open access repositories.

  • Objects:
    • We are not proposing to do a new identifier scheme.
    • We are avoiding detailed information models such as FRBR for now.
    • We propose to create an equivalence service at FRBR Manifestation level between two identifiers: e.g. a query on whether this ARK and this Handle are pointing to the same bitstream of data, though possibly at different locations.
    • Later on we could build a Same FRBR Expression service (do these two identifiers point to digital objects with the same content).
    • The equivalence service would be independent of identifier scheme [and would likely be realised in RDF].

  • People:
    • A people identification service could be federated or central.
    • People have multiple identities: we would offer an equivalence service and a non-equivalence service between multiple identities.
    • The non-equivalence service is needed because this is not a closed-world set: people may assert that two identities are the same, or are not the same.
    • The service would rely on self-assertions by the user being identified.
    • The user would select identities, out of a possibly prepopulated list.
    • People may want to leave identities out of their assertions of equivalence (i.e. keep them private).
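The two equivalence services proposed above, for objects and for people, can be sketched in Python as follows. Everything here is an assumption for illustration: the workshop did not specify an implementation (and expected RDF rather than code like this), the `fetch` function stands in for whatever resolution each identifier scheme provides, and the identifiers are made up.

```python
import hashlib


def same_manifestation(id_a, id_b, fetch):
    """Object equivalence at Manifestation level: do two identifiers,
    of any scheme, resolve to the same bitstream? `fetch` is a
    hypothetical resolver returning the bytes behind an identifier."""
    digest_a = hashlib.sha256(fetch(id_a)).digest()
    digest_b = hashlib.sha256(fetch(id_b)).digest()
    return digest_a == digest_b


class IdentityAssertions:
    """Self-asserted statements that two identities are, or are not,
    the same person. Open world: no statement means 'unknown'."""

    def __init__(self):
        self._same = set()
        self._different = set()

    def assert_same(self, a, b):
        self._same.add(frozenset((a, b)))

    def assert_different(self, a, b):
        self._different.add(frozenset((a, b)))

    def equivalent(self, a, b):
        """True, False, or None where nothing has been asserted."""
        pair = frozenset((a, b))
        if pair in self._same:
            return True
        if pair in self._different:
            return False
        return None


# Made-up identifiers and content, standing in for real resolution.
store = {"ark:example-1": b"%PDF-data", "hdl:example-2": b"%PDF-data"}
claims = IdentityAssertions()
claims.assert_same("staff:jsmith", "author:smith-j")
claims.assert_different("staff:jsmith", "author:smith-john-1950")
```

An identity the user keeps private simply never appears in an assertion, so queries about it return `None` rather than `False`. The sketch does not infer transitive equivalences, which a real service would have to decide on.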

UKOLN International Repository Workshop: Repository Organisation

First Report:

  • Aim: to support repository concepts with a common purpose.
  • To support the professional peer group, with bottom-up demand.
  • To support interoperability, assuring data quality.
  • To formulate guidelines, supporting national cooperation, to help recruit new repositories, to enable international interoperability.
  • The activity can be compared to the international collaboration behind Dublin Core.
  • The confederation would have a strategic role, providing support outside national boundaries to repository development.
  • It would provide a locus for interaction with other communities: researchers, publishers.
  • It will be driven by improving the scholarly process, and not just by repositories as an aim in themselves.

Second Report:

  • The group needed to define the nature of the organisation to work towards: finding a common point of departure was difficult.
  • Need to articulate benefits to stakeholders:
    • a forum for information exchange,
    • promoting repository management as a profession,
    • reflecting community needs,
    • channelling demands for new software.

  • The relations underlying the confederation are in place already, but the types of relations will be worked out tomorrow. The group has to establish evidence of need for the confederation.
  • The roles of the organisation will be worked through tomorrow: they will involve service to repositories and to researchers.
  • The workshop discussants have split into an advisory group, an investigatory group, and a visionary group.

Third Report:

  • The organisation goal is to enhance the scholarly process through a federation of open access repositories.
  • They will approach funding agencies. The organisation must be independent, bottom-up, funded through membership.
  • Sustainability, political authority, visibility.
  • The organisation's core concepts will be formed around stakeholder needs and activities. These are varied; they need:
    • clarity of roles,
    • strong governance,
    • network of expertise,
    • carry through of interoperability issues;
    • help in setting up repositories and repository advocacy;
    • certification & quality assurance.

  • Groups identified the contributions they could bring: money, expertise, ambassadors, suitable workflows.
  • Deliverables & outcomes: e.g. hold meetings, sessions in conferences, make visible the repository manager profession; lobbying, websites, potentially helpdesk.
  • Governance model: organisational membership, partnership with software providers.
  • Timeframe: proof of concept to circulate April, formal model of confederation May, letter of request of participation June.

UKOLN International Repository Workshop: Repository Handshake

First Report:

  • An attempt to rationalise the service requirements: working on PUT, not GET or KEEP
  • The aim is to populate repositories; support authors & friends (funders or institutions) making their research material available through open access
  • Have ingest support services that repositories will use downstream.
  • Focus on research papers, although that may scope more widely.
  • Balance of priorities between improving existing workflows vs. recruiting content from new depositors.
  • What information to be collected at point of ingest? —question unresolved. The group is scoping potential conflicts.
  • Machine-to-machine interoperability vs. computer-assisted human-mediated deposit: these form a continuum.
  • Workflow agreed on as the target of the group's work; the reification of "workflow" took three directions: e-research workflow; e-publication workflow; repository management.

Second Report:

  • Over the past ten years people's expectations have not been realised.
  • People have had stabs at different services.
  • Need to identify what is the sweet spot between useful services for the community [lots of metadata on ingest], and not imposing difficult requirements on author [little metadata on ingest].
  • [I lost track here I'm afraid.]

Third Report:

  • Deposit is the focus of this activity.
  • Handshake has two parts: PUT from the client, and BEG from the server. [i.e. recruit content].
  • Use cases: these are deposit opportunities, and range outside the boundary of the repository. Repositories communicating with each other is only one such use case.
  • Key words: more, better quality [of metadata], easier [remove obstacles to deposit], rewarding [for depositor]. Handshake must involve social contract of reward.
  • Plan, multiphase.

    • Phase 1: rapid engagement internationally. Some nations have national leverage, but not all do. An international framework is still needed.
    • Eight deposit opportunities have been identified; 2-3 to focus on in workplan Phase 1, over 6 months. For example:
      • Multi authored paper, several institutions and countries—what does deposit look like, and how does it become once-only? (Will not be rich but minimally sufficient)
      • Use institutionally motivated deposit;
      • Communication between institutional and discipline repositories;
      • Publisher of journal offers open access service to author.

    • Seek real life description of those focus use cases, and exemplars already in use on the ground.
    • Output of this focussed activity is descriptions of what practice is, not code or prototypes.
    • Then gap analysis.
    • Overall 2-3 year time horizon, but not planning out so far yet.

UKOLN International Repository Workshop: Citation Services

First Report:

  • Currently a small number of commercial service providers is dominant in this field. Are we evolving repository services [to accommodate the existing systems], or revolutionising them?

  • Since citations drive national funding, systems need to be trusted, auditable and open.

  • Citations relate authors and ideas, and help connect concepts together; they provide literature ranking, and larger scale analytic services across literature.

  • International coordination: existing infrastructure of loosely coupled repositories can be foundation of robust scalable solution.

Second Report:

  • The group is producing no large plan and manifesto, but is going back to basics.

  • "Handshake" meant different things to different people; there are limitations to the metaphor.

  • There will be group activity, with two foci: business and technological.

  • Recruitment of content needs to happen outside repository established space, including through desktop bibliographic tools such as Zotero.

Third Report:

  • There is a huge variety of presentations of citations, and there are partial solutions specific to communities.

  • Model how to deal with citations: Isolate references from papers, and then extract reference data, and interpret it, from varying citation schemes.

  • For the repository to be active in this without overconsuming resources, the repository shall be made responsible for handing on to external services the list of references extracted from its items (papers).

  • Plan of action:

    • Establish test bed of references, out of what repositories find interesting.
    • Create repository API, repository plugin, OAI-PMH profile.
    • JISC developer competition to develop toolkits.
    • Then liaise with e.g. Crossref and establish collaboration: the commercial bodies already have such services.
    • Then create a reference item processor as an external service, decomposing references into constituent data.
    • Then build services like Citeseer and Google Scholar—or use those existing services, if they will collaborate.
    • Then build exemplar GUI end user services, e.g. trackbacks, visualisations.
    • Liaising with publishers is important but not a dependency for remaining tasks.
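The three-step model in this report (isolate the references from a paper, extract the individual references, interpret them) can be sketched naively in Python. This is a toy under strong assumptions: it expects a plain-text paper with a "References" heading and entries numbered like [1], and it recognises only a year and a DOI. The sample paper is invented, and real citation schemes vary far more than this handles, which is why the plan calls for a test bed.

```python
import re


def isolate_references(paper_text):
    """Step 1: return the text after the last 'References' heading."""
    parts = re.split(r"(?im)^[ \t]*references[ \t]*$", paper_text)
    return parts[-1].strip() if len(parts) > 1 else ""


def split_references(block):
    """Step 2: split the block on entry markers such as [1] at line start,
    joining wrapped lines into one string per reference."""
    entries = re.split(r"(?m)^\[\d+\]\s*", block)
    return [" ".join(entry.split()) for entry in entries if entry.strip()]


def interpret(reference):
    """Step 3: decompose a reference into (a few) constituent data."""
    year = re.search(r"\b(1[89]\d{2}|20\d{2})\b", reference)
    doi = re.search(r"\b10\.\d{4,9}/\S+", reference)
    return {
        "raw": reference,
        "year": int(year.group(1)) if year else None,
        "doi": doi.group(0).rstrip(".,;") if doi else None,
    }


# An invented paper, for illustration only.
PAPER = """An Example Paper

Body text citing [1] and [2].

References
[1] A. Author. 2005. An example title. Example Journal 1(2).
[2] B. Writer. 2008. Another example.
    doi:10.1234/example.5678
"""

references = split_references(isolate_references(PAPER))
parsed = [interpret(r) for r in references]
```

In the division of labour proposed above, the repository would stop after the second step and hand the list of reference strings to an external service for the third.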

UKOLN International Repository Workshop: Introductory remarks

From Norbert Lossau of DRIVER

  • The Vision underlying the workshop is the Berlin 2003 declaration: free & unrestricted access to human knowledge.

  • Need infrastructure to complete the research cycle: discovery > reuse > storage and preservation, for data as well as papers, at an international access level. Establishment of online reputation for researchers is critical.

  • Researchers have their existing discovery procedures; these are to be harmonised, not supplanted.

  • We are already advanced in Global harvesting, preservation of papers, repository storage.

  • A global network of repository infrastructure hubs, rather than one centralised infrastructure.

UKOLN International Repository Workshop

Have just finished at the UKOLN International Repository Workshop, twittered at #repinf09. The workshop was a joint JISC/DRIVER event; it had international scope, but there were only a couple of East and South Asian participants, and Andrew Treloar and myself from Oceania.

The intention of the workshop was to formulate action plans which would make sense to fund for international infrastructure for repositories—in the first instance, research publication repositories. I took part in the identifier infrastructure workshop, and I have been cited publicly (though anonymously) as saying that it was "surprisingly pragmatic". The information superstructures that can be imposed over identifiers—and what they identify—can get quite open-ended and intellectually satisfying; but our business was to formulate something concrete, fundable, and realisable over the next year or so. What you put on top of it later is for another workshop.

There were four streams to the workshop: four different kinds of infrastructure that could be put in place. The four streams were:

  1. Repository Citation Services: Improving the ways in which citation data relating to open access research papers is shared. Citation data may be forwards or backwards citation. Includes the ability to recognise citations in repositories and the open web.
  2. Repository Handshake: Improving ways in which repositories can be populated with research papers, including authors, other repositories, publishers and research management systems. The "handshake" involves negotiation between a depositing agent and a repository, building on SWORD.
  3. Repository Interoperable Identification Infrastructure: Improve identifying entities in repositories and making connections across repositories, and provide useful services to do so.
  4. Repository Organisation: Provide international organisational support to enable research repositories to work together to meet the objectives of Open Access and eResearch through a confederation of repositories.

I'll post:

  • summaries of what these streams reported back on the three summary get-togethers in the workshop: a couple of streams really changed direction through the workshop.
  • Then, some notes on the first session of the identifier stream (which were behind the first report-back). We did not change tack as drastically as some streams, so they will still help inform what the stream eventually came up with.
  • A summary of the SURF demonstration of their persistent identifier work and their enhanced document work.
  • And finally (if I get to be so bold), my own take on what the identifier stream came up with.