tag:blogger.com,1999:blog-22031249751962578452024-02-21T03:32:13.102+11:00Interoppo ResearchBlog about interoperability, repositories, persistent identifiers, and whatnotopoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.comBlogger70125tag:blogger.com,1999:blog-2203124975196257845.post-37358334275934605722010-11-30T17:27:00.001+11:002010-11-30T17:27:43.893+11:00NeCTAR Melbourne Town HallNeCTAR Townhall, 2010-11-26<br /><br />NeCTAR: <a href="http://www.nectar.org.au">National eResearch Collaboration Tools and Resources</a><br /><br />$47 million of funding, 2010-2014. Build electronic collaboration infrastructure for the national research community. UniMelb is the lead agent.<br /><br />Aims to enhance research collaboration & outcomes, and to support the connected researcher at the desktop/benchtop. Aims to deploy national research infrastructure and services not otherwise available, in order to enable research collaboration and more rapid outcomes.<br /><br />Board to approve the final project plan and submit it to DIISR by Mar 31 2011. Townhall meetings over the next two months.<br /><br /><a href="http://nectar.unimelb.edu.au/docs/NeCTAR_Consultation_Paper_October_2010_FINAL.pdf">Consultation paper</a> [PDF] circulated, 60+ responses received, <a href="http://nectar.unimelb.edu.au/talk_to_us/consultation_and_roadshows/consultation_and_roadshows">responses available</a>.<br /><br />Response themes:<br />* avoid duplicating existing offerings<br />* needs to be researcher-driven<br />* questions on how to leverage institutional investments<br />* need coherent outcomes across NeCTAR<br />* need to focus on service delivery<br />* need to establish sustainability<br /><br /><a href="http://nectar.unimelb.edu.au/docs/NeCTAR_Interim_Project_Plan_Attachment_A.pdf">Interim Project plan</a> [PDF] available:<br />NeCTAR is funding four strands of activity. 
Two are discipline-specific, two are generic and overlaid on the discipline-specific strands.<br />* Research Tools (discipline-specific, may eventually generalise)<br />* Virtual labs (resources, not just instruments, available from desktop; the emphasis on resources is to stop virtual labs from applying only to instrument-based science). <br />* Research cloud (general or multi-disciplinary applications and services, plus a framework for using them)<br />* National server programme (core services, authentication, collaboration, data management services). <br />NeCTAR will clear up their use of terminology in future communications.<br /><br />NeCTAR is meant to be serving Research Communities: these are defined as being discipline-based, and range across institutions. e-Research facilitates remote access to shared resources from the desktop, in order to enhance collaboration for Research Communities (making them Virtual Research Communities).<br /><br />NeCTAR will remain lightweight, to respond to generic and discipline-specific research community needs. Infrastructure is to be built through NeCTAR subprojects. The lead agent UniMelb will subcontract other organisations; some outcomes may be sourced from outside the research community. NeCTAR may start with early adopter groups who already have lots of infrastructure, and NeCTAR may take up existing groupware solutions from these. NeCTAR can only fund infrastructure and not operational services, as it is funded through EIF. Sustainability (as always) is entrusted to the disciplines; NeCTAR will cease in 2014.<br /><br />Expert panels from across the community are to advise the NeCTAR board on allocating subcontracts, as NeCTAR places a premium on transparency. 
Subcontracts must demonstrate a competitive co-investment model for what NeCTAR can't fund: these will take the form of matching funds, likely in-kind, to cover maintenance and support as well as development.<br />Expert panels will include both researchers, and e-research experts who are familiar with what infrastructure already exists.<br /><br />There will be a staged model for NeCTAR issuing subcontracts. In 2011 NeCTAR are funding early positive outcomes, in order to give slower-adopting communities more time to develop their proposals. Review of progress and plan for next stage in late 2011.<br /><br />Research Communities will define the customised solutions they need; these will be delivered through Research Tools & Virtual Labs. NeCTAR will reserve funds from subcontractors to fund research communities directly, to bring them into Virtual mode.<br /><br />The resourcing, scale, timeframe, etc. of target Virtual Research Communities will inform NeCTAR's priorities on what to fund.<br /><br />With regard to the Research Cloud, NeCTAR is funded to deploy resources onto Cloud nodes, but not to create the nodes themselves. NeCTAR will work with existing cloud nodes, e.g. from <a href="https://www.pfc.org.au/bin/view/Main/DataStorage">Research Data Storage Infrastructure (RDSI)</a>. Some Research Cloud nodes and RDSI nodes will coexist—but more will be known once the RDSI lead agent has been announced. The consultation responses show a desire for a consistent user experience, which requires a consistent framework for service provision, based on international best practice. (This encompasses virtual machines, data store access, application migration, security, licensing, etc.) The framework for the Research Cloud will be developed in parallel with the early projects.<br /><br />The National Server Program (NSP) will provide core services relevant to all disciplines, e.g. interfaces out of AAF, ANDS, RDSI. 
The underlying NSP infrastructure will be reliable enough to use as a foundation for more innovative services. The prospect of database hosting has been under much discussion. The National Server Program Allocation Committee is to recommend services for hosting to the NeCTAR board.<br /><br />Contrast between the National Server Program and the Research Cloud:<br />* NSP supports core services (EVO, SharePoint), Research Cloud supports discipline-specific services built on top of the core. (These can include: data analysis, visualisation, collaboration, security, data access, development environments, portals.)<br />* NSP runs for years, Research Cloud services may only run for minutes.<br />* NSP provides 24/7 support, Research Cloud provides 9-5 support.<br />* NSP has strict entry, security, maintenance criteria; Research Cloud less so.<br /><br />UniMelb is delivering the NSP basic access phase: 50-100 virtual machines, at no charge in 2011, located at UniMelb. This is the first stage of deployment: there will be nodes elsewhere, and virtual machine numbers will ramp up.<br /><br />Many universities are already delivering virtual machines, but they can use NeCTAR infrastructure as leverage. Virtual machine distribution is increasingly used for application release, e.g. with TARDIS.<br /><br />International exemplars for NeCTAR infrastructure: <a href="http://www.ngs.ac.uk/">National Grid Service</a> (UK): <a href="http://www.eucalyptus.com/">Eucalyptus</a>; <a href="http://www.nasa.gov/">NASA</a> (US): <a href="http://www.opennebula.org/">OpenNebula</a>. NeCTAR will run an expert workshop early next year, inviting international experts and all potential research cloud nodes.<br /><br />Discussion (from the Twitsphere: <a href="http://twitter.com/search?q=%23NeCTAR#">#NeCTAR</a>)<br /><br />* Will the existing ARCS data fabric be maintained? NeCTAR is not able to answer that, since the question is outside NeCTAR's remit. 
DIISR is in discussions with ARCS on the future of the Data Fabric as well as EVO.opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com1tag:blogger.com,1999:blog-2203124975196257845.post-2631564649505282192010-04-27T21:27:00.001+10:002010-04-27T21:29:11.208+10:00ADLRR2010: My summaryAt an unsafe distance, I have posted <a href="http://blog.linkaffiliates.net.au/2010/04/27/adl-registries-and-repositories-summit-report/">my summary of what was discussed at the ADLRR2010 summit</a> on the Link Affiliates group blog.opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-5707343359629406312010-04-15T05:15:00.002+10:002010-04-20T10:55:13.541+10:00ADLRR2010: Wrap-upDan Rehak:<br />Strongest trend in the meeting's discussions: What is the problem we're trying to solve, who is the target community, and how do we engage them?<br />Also: Sustainability, success, return on investment<br />Good consensus on No More Specs, figure out how to make what we already have work<br />Still seeing a spectrum of solutions and different ecosystems, and don't know yet where to align along that spectrum<br />We should not focus on what we build, but on what requirements we satisfy<br />Learning is special, not because of the registry/repository architecture, but because we have particular requirements<br />We are technically mature, but socially immature in our solutions<br />Throwing it all away and starting from scratch has to be an option; cannot be captive to past approaches<br />Followup meeting in London next week<br /><br />Paul Jesukiewicz:<br />Are soul-searching with the Administration on the way forward (under time constraint of the next two years: want to leave their mark in Education)<br />Govts worldwide are reprioritising their repository infrastructure<br />ADL is putting in recommendations, and govt wants to tackle the Dark 
Webopoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-63922195527535584772010-04-15T05:10:00.000+10:002010-04-20T10:55:13.542+10:00ADLRR2010: Breakout Groups: Future DirectionsChallenge: Where should ADL spend 10 mill bucks now on repository stuff? (Or 1 mill bucks; instructions to groups varied, and spending some money on a cruise was an option.)<br /><br />Group 1:<br />We're back to herding cats<br />* Do we understand the problem statement clearly yet? Lots of discussion these past 2 days, all over the place. Need to work out the grand challenge with the community still<br />* Need awareness of the problem space; lots of terms used loosely (repository vs registry), ppl don't know what they're trying to solve yet. What's happening in other domains?<br />* Harmonise, explore what's going on in the space, work with what you've got instead of reinventing solutions<br />* More infrastructure support: if solutions are out there, what we're missing is the market for them. What good is a highway system without any cars to go on it?<br /><br />Group 2:<br />* Understand business drivers and requirements, formally (40%)<br />* Models for embedding and highly usable systems (25%) (widgets, allowing people to adapt their current systems of choice; don't end up in competition with grassroots systems that are more usable)<br />* Create mechanisms to establish trust and authority, in terms of functionality and content (15%) (clearinghouse? means of rating? 
something like SourceForge?)—this is where the value of a content repository is<br />* Virtualise content repositories and registries (Google) (10%)—ref what Nick Nicholas was talking about: allow grassroots-generated content to be a layer below Google for discovery: middleware API, Web Services, Cloud Computing: essentially data mining<br />* Study systems (Content Repositories, LCMSs) that already work (10%)<br /><br />Group 3:<br />Most time spent on defining what the problem is<br />* Some thought there is no single problem, depends on what is being asked<br />* Some thought there is no problem at all, we'll take the money and go on holiday<br /><br />There are two engineering problems, and one cultural problem:<br />* Cultural: Incentives for parties creating and maintaining content are diversifying, and are at odds with each other (e.g. more vs less duplicate content)<br />* Engineering 1: Need to discover repositories: registry-of-repositories (ASPECT). Sharing is the driver: without sharing, no reuse, and content is underutilised. Repositories are not discoverable by Google (Dark Web). Also need evaluation of repositories, what their content is, and how to get access to them. Second, need to make services coming out of R&D sustainable, including identifying business models. Third, need to capitalise on untapped feedback from users, and exchange.<br />* Engineering 2: Teaching the wrong thing because of lack of connection between teaching, and learning content. Learning content has much context; need to disambiguate this information to improve the match between content and users. Portfolio of assets must be aligned with student needs: need all information in one place. Don't want learning resources to be out of date or inaccurate. <br />* If you have 1 mill, get 100 users together and observe them, and find out what their requirements are. 
Don't just survey people, they'll say "Sure", and then not use the new repository because it has the wrong workflows.<br /><br /><br />Group 4:<br />* Research. Biggest thing needed is federating the various repository options out there<br />* Analysis of end user needs: both consumers and producers of content; both need refinement over what is currently available to them<br />* Systems interfaces to exchange content in multiple systems and multiple ways, including unanticipated uses: open-ended content exchange<br />* User feedback mechanisms: easier and faster collecting of metadata<br /><br /><br />Group 5: (My group)<br />* User Anonymity and security, for<br />* User Context to <br />* Drive discovery, which<br />* Needs data model for user context: typology, taxonomy<br />Crucial insight is: we're no longer doing discovery, we're going to push out the right learning content to users (suggestion systems), based on all the user behaviour data we and others are gathering—and aggregating it. The Amazon approach on recommending books, finding similar behaviours from other users. Becomes an issue of which social leader or expert to follow in the recommendations. (This is what repositories are *not* already doing: it's the next step forward)<br />Balanced against security concerns—stuff in firewalls, stuff outside, less reliable stuff in Cloud, etc<br /><br />Group 6:<br />* Not everyone needs a repository: what is it for?<br />* Life cycle maintenance of content: don't focus on just the publishing stage<br />* Rethink metadata: too much focus on formal metadata, there's a lot of useful informal user feedback/paradata, much can be inferred<br />* Rethink search; leverage existing search capabilities: The Web itself is an information environment, explore deeper context-based searches (e.g. driven by competencies)<br />* What will motivate people to work together? (business drivers)<br />* Standards: how to minimise the set? 
(not all are appropriate to all communities)<br />* Exposing content as a service (e.g. sharing catalogues—good alternative to a focus on registries, which is premature)<br />* Focus on domain-specific communities of practice (DoD business model not applicable to academic, constraints on reuse)<br />* Look at existing research on Web 2.0 + Repository integrationopoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-42169493518107853062010-04-15T02:14:00.002+10:002010-04-27T18:24:33.556+10:00ADLRR2010: Panel: VendorsGary Sikes, <a href="http://www.giuntilabs.com/">Giunti Labs</a><br /><br />Publishers are restricted from the repository, can't see what their content is getting there. ADL can have one repository for publishers outside the firewall, and one to publish into within the firewall.<br />More middleware use in repositories, web services and some APIs<br />User-based tagging (folksonomies) and ratings<br />Corporate education: providing access to digital markets, making content commercially reusable (resell)<br />Collaborationware and workflow tools, e.g. version comparison, shared workspaces<br />Workflows including project management roles and reviewing<br />Content access reporting: who is viewing, what versions are being viewed<br />Varying interface to repository by role<br />Challenges: security (publishers outside the firewall, users within the firewall). Defining role-based interfaces. Interoperability. One-Stop Shops being asked for by clients. For new implementations: how metadata deals with legacy data.<br />Standards also important for future-proofing content<br /><br /><br /><br />John Alonso, <a href="http://www.outstart.com/">OutStart</a><br /><br />They provide tools, not knowledge. <br />Confusion from vendors: what counts as a repository? 
Google isn't one (referatory/repository confusion)<br />If we build it, they will not come; they will only come if it is important to them and has value. If there is too much cost and no return on getting to the content, they will go elsewhere<br />The clients are not telling him they want their stuff in the repository exposed and searchable<br />Some great successes within the confines of the firewalls --- McDonald's corporate info is exposed to McDonald's local franchises well, motivated by cost efficiency and not mandates<br />We welcome standards—that people want to use: they lower the cost of entry. Vendors should not be driving the definitions of standards, they just want the business requirements. The buyers don't understand the standards themselves, they just treat them as checkboxes: buyers should articulate the business value of why they are requiring the standard in the first place: otherwise there is no business value to implementing the standard, so it never gets verified—or used. <br />Repositories vs registries: ppl use the terms interchangeably, hence the confusion. Trend is to abstract search, so that back end repositories can be swapped out. 
But I shouldn't have to write 10 different custom plugins to do so!<br /><br /><br /><br />Ben Graff, <a href="http://www.k12.com/">K12 Inc.</a><br /><br />Big problem space, many ways of both defining and slicing the problems<br />It's expensive to do this right, even if you do agree on the problem space: content design, rights management, content formatting & chunking, metadata creation, distribution strategy<br />The Return On Investment isn't always immediate<br />Teachers' & Profs' needs: Applicability (content at the right size for the right context), Discoverability (find it quickly), Utility (I can make it work in my environment: teachers are pragmatists), Community (peer recommendations, feeding into peers), Satisfaction (best available), Quality (proven, authoritative, innovative)<br />Students' needs: Relevance (interesting & engaging), Applicability (need help finding the right thing right now -- though I may not admit it, and I don't know what I don't know: I'm a novice)<br />Everyone's needs: Simplicity (if it's not easy, I'll walk)<br /><br />Support & respect content wherever it comes from: better exposure of content, greater availability helps society<br />Improve discovery through author-supplied metadata, ratings, and patterns of efficacy across an ecosystem of use—what we know by analysing usage.<br />Demonstrate and educate about ROI at multiple levels: government, business, educator, student<br />Not everyone will need to, want to, or be able to play along for years to come: keep breaking down barriers<br />Please have *a* standard, not a different standard for each client! 
Content creation and publishing both become bad experiences: each standard becomes its own requirement set<br /><br /><br />Eric Shepherd, <a href="http://www.questionmark.com">Questionmark</a><br /><br />Author once, schedule once, single results set from a student, deliver anywhere, distributed authoring, management of multilingual translations; blended delivery—paper, secure browsers, regular browsers, transformation on the fly: autosense the target platform for delivery. Often embed Questionmark assessment in portals, blogs, wikis, Facebook.<br />Need to defer to accessibility experts to see if they got accessibility right.<br />Analytics, once data is anonymised, to establish quality and reliability of the software<br />*A* standard is utopian: different standards are targeted at different problems<br />Driver should not be the standard but the business need; but a vendor cannot survive without standardsopoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-45173112321616276412010-04-15T00:36:00.001+10:002010-04-20T10:55:13.543+10:00ADLRR2010: Panel: Social Media, Alternative TechnologiesSusan van Gundy, NSDL<br /><br />Cooperation with NSF & OSTP: STEM Exchange<br />NSDL has existed for a decade; digital library functionality, central metadata repository, community of grantees and resource providers, R&D for new tech and strategies<br />Aim now is to understand the educational impact of these materials; hence the STEM Exchange: what are educators doing with NSDL resources? What knowledge do educators add to the resources, which can be fed back to resource developers?<br />This is about contextualisation & dissemination. 
Teachers know what NSDL is by now; now they want to know what other teachers are doing with it<br />Metadata is limited: labour intensive, expensive; of limited use to end users beyond search & browse, though it is still important for content management: metadata is essential but not sufficient<br />"The evolving power of context": capture the context of use of resources<br />Web service, open API: datastreams from NSDL straight into local repositories; teachers assemble resources online on the local repositories, generating resource profiles; this is <i>paradata</i> being fed back into NSDL (including favouriting of resources)<br /><i>METANOTE: kudos for correct use of para- in <a href="http://en.wikipedia.org/wiki/Paradata">paradata</a> meaning "supporting context, circumstances"; cf. <a href="http://en.wikipedia.org/wiki/Paratext">paratext</a></i><br />Generates data feeds of paradata: what others think and do with the resource. Akin to the use of hashtags in capturing usage.<br />Applies to the open-reuse subset of NSDL; will integrate into current social networking tools (e.g. RSS)<br />Now establishing working groups on how this will work<br />Are looking at folksonomies and pushing those back into NSDL formal metadata<br />People don't volunteer this data, need incentives: there will be automatic capture of the paradata in aggregate<br /><br /><br /><br /><br /><br />Jon Phipps, Jes & Co<br /><br />Interop isn't about using each other's interfaces any more—profusion of standards! 
Now we need to *understand* each other's interfaces<br />Linked Data: opportunity to share understanding, semantics of metadata<br /><a href="http://www.w3.org/DesignIssues/LinkedData.html">The 4 principles of Linked Data from Tim Berners-Lee</a><br />Jes & Co are making tools to support Master Data: central authoritative open data used throughout a system—in this case, the entire learning community<br />(Tends to be RDF, but doesn't have to be)<br />Given that, can start developing relationships between URIs; map understanding across boundaries<br />This enhances discoverability: ppl agree in advance on the vocabulary, more usefully and more ubiquitously—can aggregate data from disparate sources more effectively (semantic web)<br />e.g. map US learning objectives to <a href="http://blog.linkaffiliates.net.au/2009/07/20/national-curriculum-machine-readable/">AUS learning objectives</a> for engineering learning resources. Not a common set of standards, but a commonly understood set of standards<br />RDF: there's More Than One Way To Do It: that's chaos, but not necessarily a bad thing<br /> <br /> <br />Me<br /><br />Can't really liveblog myself talking; I'm going through <a href="http://blog.linkaffiliates.net.au/2010/04/07/position-paper-adl-learning-content-registries-and-repositories-summit/">my position paper</a>, and I've recorded myself (<a href="http://www.tlg.uci.edu/~opoudjis/dist/NickADL.mp3">19.7 MB MP3</a>, 21 mins).<br /><br />Sarah Currier, consultant<br /><br />"Nick said everything I wanted to say" :-) <br />Others have been high-level, big strategic initiatives. This is a microcosm education community, addressing a compelling need of its own.<br />14 months of a purely Web 2.0 based repository with community, "faux-pository", ad hoc repository<br />How do edu communities best use both formal repositories and Web 2.0 to share resources? How can repository developers support them using Web 2.0?<br />Is a Diigo group a repository? 
Netvibes is: http://www.netvibes.com/Employability<br />Community wanted a website powered by a repository (whatever that is, they weren't techo); £40k budget. They went Web 2.0: though repositories were being built that were similar to that need, nothing the community could just jump in and use. (And the repositories that were built don't provide RSS!)<br />"Must not be driven by traditional project reporting outputs": more important to develop a good site than a project report!<br />Ppl needed both private and public comms spaces, and freely available software.<br />Paedagogical, social, and organisational aspects of communities have not been addressed in repository development, and are the major barriers now.<br />Everyone thinks their repository is somewhere everyone goes to. You're competing with Email, Google, Facebook: no, the repository is not the one-stop shop; push content to where people actually do go<br />There is a SWORD widget for Netvibes, but it's still rudimentary<br />Put edu communities at the heart of your requirements gathering and ongoing planning!<br />You *must* support newsfeeds, including recommendations and commentary, and make sure they work on all platforms<br />Need easy deposit tools, which can work from desktop and Web 2.0 tools<br />Allow ppl to save this resource to Web 2.0 tools like Facebook; don't make your own Facebookopoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-72904863531947461602010-04-14T07:20:00.001+10:002010-04-20T10:55:13.543+10:00ADLRR2010: Breakout Groups: What are the problems we've identified so far with repositories?Understanding users and user needs: reuse is not as simple as hitting a button on iTunes. 
<br />Mindshare: how do you get enough resources to compete with Google, esp. as Google are defining user expectations: standards end up as the tail wagged by the vendor dogs.<br />Complexity of systems, metadata and policy.<br />Lack of high-quality metadata and tools.<br /><br />Discussion mostly on organisational and social issues.<br />Need for ways for authors to connect to repositories, reuse at the point of authoring.<br />Parochial development --- "not developed here", a barrier to reuse.<br />Difficult to get ppl to create metadata.<br />Network enclaves, access restrictions<br />Organisational inertia<br /><br />Security: identity management<br />Scale: scaling up<br />Building repositories: is that the right answer? (what is a repository anyway?) What would repository standards cover?<br />Are repositories solving yesterday's problems? Do we need more? We don't know yet.<br />Connectivity between repositories -- virtual silos<br />User-centric view of repositories<br />Is reuse a relevant driver? Is there authority to reuse? Is content authoritative?<br />Optimising repositories to kinds of content<br />Manual metadata is too expensive<br />Getting discovered content: too hard, too costly<br />Sharing is a key driver for repositories<br /><br />More incentives needed for using a repository, rather than more standards. Developing app profiles is just as dangerous as developing more standards: they are very time-consuming, and difficult to maintain. <br />Trust: single sign-on. 
Security of data: needs trust, a common security model.<br />Need common terminology, still stumbling on repositories vs registries<br />Quality assurance and validity of control.<br />Must focus on community and user requirements before looking at technology or content procurement; this has been a wrong focus.<br /><br />Organisations may have bought a repository but be unaware of the investment; need a registry of repositories.<br />Every agency builds a silo; need a mandate to unify repositories.<br />Holy Grail is reusable data, reusable at every level. Many business challenges to that: how to learn from failed efforts? Outside schoolhouses, difficult to get right, and much harder than it seems.<br />Search needs to be brokered. <br />What apps are needed for easy registering?<br />What models will incentivise innovation but not impede progress?<br />Bottom-up approach makes it difficult to get a shared vision.<br />Difficult to set up a repository technically. Could use turnkey repositories.<br />Lack of best practice guides for leveraging repositories, or to get answers to questions from the community on how best to do things.<br /><br />Searching is not finding: may want different granularities, content can be pushed out according to curriculum structures.<br />Should search be exact or sloppy? Sloppy search is very useful for developing paedagogy.<br />Process of metadata generation is iterative: the user perspective can be tapped to inform subsequent attempts to search.<br />User-generated and computer-generated metadata is better than none.<br />Interoperability is a problem across repositories (application profiles, granularity). The interoperability layer of a repository is more important than its underlying technology.<br /><br />Conclusion:<br /><br />We're missing the users as a constituency in this summit! 
Hard to draw conclusions without them.<br />We're also missing the big social networking players like Google & Facebook: they're not interested in engaging, despite multiple attempts.<br />We're missing the publishers. Some had been invited...<br />Repositories' relation to the web: repositories must not be closed off from the web; growing realisation over the past 8 years that the Web is the knowledge environment.<br /><br />No one wants more specs here<br />There is no great new saviour tech, but some new techs are interesting and ready for prime time<br />"What's special about learning" in this? How do we differentiate, esp. if we go down the path of social media?<br />Have we addressed the business models of how to make this all work?<br />When do we have enough metadata? Google works because page ranking is complex, *and* it has a critical mass of users. If we could gather all our analytic data from all our repositories and share it, could we supplant metadata and start on machine learning? Open question<br />Building our own competitor to Google & Facebook, our own social tool: is it such a good idea?<br />Open Source drives innovation, but the highest-profile innovation recently has been the closed-source iPhone. Are things moving towards closed source after all? If so, how do repositories play in the Apple-based world?opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-35071702870731214172010-04-14T04:58:00.000+10:002010-04-20T10:55:13.543+10:00ADLRR2010: Tech, Interop, Models Panels:Joel Thierstein, Rice Uni<br /><br />Connexions: Rice repository platform. 
16k modules, 1k collections<br />Started as elec eng content local to Rice; now K-12, community college, lifelong learning, all disciplines<br />Modularised structure: all content can be reused; more freedom at board of studies level, building on a common core<br />Modules make for more efficient updating<br />"Lenses": social software to do peer review of content<br />Permanent versioning -- there will be multiple answers given by the source<br />CC-BY licensing; can buy a hard copy as well as getting an online PDF or EPUB.<br />Can be customised as a platform: local branding; K-12 can zone off content for their own purposes<br />Want to make it available to the world<br /><br /><br />David Massart, EU Schoolnet<br /><br />Federation of 30 learning object repositories in the EU<br />Move content to the user, not the user to content: so very hard to control user access<br />Driven by metadata, to access content and arrange access to services<br />Tech interop: most components are in place -- search protocols, harvest and push protocols, metadata profiles; still need a repository of repositories to discover repositories, with associated registry descriptions, including autoconfiguration of service access. At most need to establish best practice.<br />The problem now is semantic interop: meaningful queries. <br />Though theoretically everything is LOM, there are lots of application profiles, so need repositories of application profiles as well. 
With that done, can start recording each profile's controlled vocabularies, then crosswalks between the vocabularies, then transformations from one application profile to another.<br />ASPECT project is trying to do this all now: vocabulary bank, crosswalk, transformation service; trying to work out what would go into an application profile registry.<br />Dramatic knowledge building: some national repositories were not even up on LOM at the start<br /><br /><br />Valerie Smothers, MedBiquitous<br /><br />MedEdPortal: not just learning objects, but learning plans, curricula: structure.<br />They routinely partner with other repositories. This has had blockers: no standard for packaging content (IMS not applicable to them).<br />Peer review, and professional credit for submissions; but this means reviewers need to access files, different downloads every week.<br />Taxonomies are big in medicine, but don't cover medical education well.<br />They need federated search into MedEdPortal from other collections; they are reluctant to import other collections or refer out to them, because of how stringent they are.<br />LOM is profiled. Tracking reuse, and identifying reasons for reuse. Off-the-shelf products don't support profiles.<br />Interest in harnessing social networking, and Friend Of A Friend information.<br /><br /><br />Christophe Blanchi, CNRI<br /><br />Identifiers are key to digital infrastructure. IDs have to be usable by systems as well as humans, and provide clients what they need in different contexts.<br />Identifiers are often not interoperable. Syntactic interoperability: has been addressed with standards; the problem is now different communities using different, non-native identifiers. Semantic interoperability: how to tell whether they mean the same thing? Functional interoperability: what can I do with the identifier? You don't always know what you'll get when you act on the identifier. Community interoperability: policy, site of the most silo'ing of identifiers.
Persistence interoperability with the future.<br />Want to provide users with granular access. Recommendation: identifiers should provide the user a glimpse of what the resource is. Identifiers resolving to self-defining descriptions. Identifiers must be polymorphic. Identifiers must be mapped to their intrinsic behaviours (typing, cf. MIME).opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-43396460690599014042010-04-14T02:31:00.000+10:002010-04-20T10:55:13.544+10:00ADLRR2010: Repository InitiativesDan Rehak<br /><br />Registries and repositories. <br /><br />Dan and others have been drawing pictures of what systems for content discovery should look like.<br />So what? People don't understand what these diagrams communicate.<br />Underlying all this are: models. User workflow models. Business models. Service models. Data models. Technical models. The models interact.<br />Try to constrain the vocabularies in each of the models.<br />Needs: provide discovery, access, delivery, management; support user expectations, diverse collections, policies, diverse tech, scaling.<br />Do we want the single Google of learning? Do we want portals? Do we want to search (and sift through), or rather to discover relevant content? Social paradigm: pushes content out.<br />How to get there? People do things the Web 2.0 way. (Iditarod illustration of the embrace of Web 2.0.)<br /><br />Panel: Initiatives.<br /><br />Larry Lannom, CNRI.<br /><br />ADL did interoperability by coming up with SCORM. Registry to encourage reuse of content within DoD: content stays in place, persistent identification, searchable.<br />ADL works, although policy took a lot of negotiation. The tech has been taken up in other projects: <a href="http://www.geni.net/">GENI</a>, <a href="http://www.cnri.reston.va.us/papers/M-FASR_2009_Final.pdf">M-FASR</a>, commercial product currently embargoed.<br />Problems: limited adoption.
Not clear short-term gain, metadata is hard and expensive, reuse presupposes the right granularity of what to reuse.<br />Tech challenges: quality metadata: tools to map to required schemas, create metadata as close to creating content as possible. Federation across heterogeneous data sets, including vocabulary mapping -- intractable as there are always different ways of thinking about the world, so need balance between system interop and semantic incompatibility. Lots of tech, but still no coherence.<br />Future: Need transparent middleware to ingest content. Need a default repository service for those who don't have one. Gaming & virtual worlds registry. Internationalisation. Simple metadata for more general use. Need a turnkey registry for easier deployment. Need to revisit CORDRA.<br />Difference between push and pull is an implementation detail, should be transparent to the user.<br /><br /><br />Frans van Assche, Ariadne Foundation.<br /><br />GLOBE: largest group of repositories in the world.<br /><a href="http://www.ariadne-eu.org">Ariadne</a>: federation. Services: harvest, auto metadata generation, ranking. Six expert centres count as a success. Lots of providers in federation.<br />Problems: exchange between GLOBE partners (there are 15). n<sup>2</sup> connection matrix. Language problems. Need a central collection registry, rather than have everyone connect to everyone.<br />Ariadne is a broker between providers; still need to engage end users.<br />Tech Challenges: scaling up across all of GLOBE. Ministries had been disclosing very small amounts of resources, now deluging them.<br />Need to serve users better, with performant discovery mechanisms, dealing with broken links and duplicates and ranking in a federation particularly.
Alt knowledge sources such as Slideshare and iTunes U: you can't get away from federated search.<br />Need social metadata, but will have to wait until basic infrastructure is in place.<br />Ultimately want discovery uniquely tailored to user needs.<br />Multilingual issues pressing in Europe, need mappings between vocabularies: managing 23 languages is difficult.<br /><br /><br />Sarah Currier, consultancy.<br /><br />UK Higher ed repositories, CETIS. Policy, and community analysis around repositories.<br />First time they reached the broad community, not just the early adopters: reflects what they needed, and their sense of community, got non-techie users from a Web 1.0 to a Web 2.0 mindset on how to use and reuse resources. None of the funding went into tech (which is good).<br />Their success is the end users; but often the repository content could not be exposed via NetVibes or widgets, which shocked her. Lots of work by a small group of people, so Tragedy of the Commons; hard to retain engagement with some users -- though tech this time was not the barrier.<br />"Fly under the radar": IP, metadata profiles, tech -- got quick outcomes because they didn't have to bother with that; the cost is, no influence on repository policy to get them to play along.<br />Still need to start from users (wide range); what we currently have online in Web 2.0 is very user-friendly. They are mostly interoperable and backuppable, so sustainability is not as much an issue as it used to be. Lack of interop to Web 2.0 from repositories is still major trouble; until DuraSpace gives Web 2.0 feeds, can't build.<br />This is not creating our own Facebook on top of Fedora: this is about using existing tools on top of Fedora.<br /><br />Thornton Staples, DuraSpace<br /><br />Durability goes hand in hand with distribution.
DuraCloud is their move into the cloud, providing trust there.<br />Fedora, DSpace, Mulgara triplestore.<br />Fedora is used around the world, now including govt agencies with open data. Now using Fedora in interesting ways, not just as archives, but as a graph of interrelated resources, relating also to external resources.<br />Fedora is no longer grant-funded, but an open-source self-standing project. <br />Problem: communication of what Fedora is and is intended to do, so people just expected their own shiny object out of it. Fedora is a complicated product. Fedora sits between the library timescale and the IT timescale; should have put out a user-oriented app much earlier than the base infrastructure, this took much longer to happen (only the past couple of years), and blocked adoption.<br />Tech Challenge: scaling. How many objects are in the repository affects access, discovery, etc. Size of objects also affects this. Data Conservancy is pushing the limits of Fedora: they are adding new kinds of data streams to deal with such data more effectively.<br /><br />Jim Martino, Johns Hopkins Data Conservancy<br /><br />NSF-funded. Data curation as a means to address challenges in science.<br />Came about from astronomers wanting to offload data curation onto the library. Has broadened in coverage and use.<br />Driven by science's complex needs, disparate data sets. Will do analysis on how data is used, including when not to preserve data.<br />Data is getting more sizeable.opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-21061147134435911462010-04-13T23:58:00.002+10:002010-04-20T10:55:13.544+10:00ADLRR2010: US Govt Perspectives<h4>Paul Jesukiewicz, ADL</h4><br /><br />Lots of tech, but not a lot of uptake. There are lots of approaches out there to take stock of. Administration: we still don't have good ways of finding content across govt portals.
Need systems that can work for everyone and for varied needs, which is difficult.<br />Previous administration, not a lot of inter-agency collaboration; that is now happening again.<br />White House wants to know where things are up to; lots of money for content development & assessment. "Why not an Amazon/iTunes/Google experience?"<br />Technically this is more feasible than on the policy side. Push to transparent government, so open. Must support both closed and open content.<br />Will have to have a system of systems, each system dealing with different kinds of requirements.<br /><br /><h4>Karen Cator, Dept of Ed</h4><br /><br /><a href="http://www.ed.gov/technology/netp-2010">National Edu Tech Plan</a>. Move to digital content. <br />* Learning: largest area, creating engaging and ubiquitous content.<br />* Assessment: embedded, multiple kinds including simulations; needs context such as "what's next", discoverable, should be ultimately pushable to the student.<br />* Teaching: how to make teachers more effective, making sure they're connected to data and experts.<br />* Infrastructure: broadband everywhere, mobile access.<br />* Productivity: cost efficiencies day to day. Personalised learning is very participatory.<br /><br />States are collaborating on standards; this is a microcosm of what is possible.<br /><br />Bonus section: R&D. What more needs to be invented? Textbooks addressing the full range of standards, not just the easy-to-test ones. Content interoperability and aggregation.<br /><br />Others are working on student data interoperability, including data anonymisation; but content interop is the expedient priority for them now.<br /><br />Open Source: the world is using it so we have to.<br /><br />Teacher portals are all ad hoc; priority to get content interop there.
New business models can arise given interoperable content, but this needs open models.<br /><br />Content will have to come from everywhere—globally.<br /><br /><h4>Frank Olken, NSF</h4><br /><br />Works on: knowledge integration, semantic web, data mining.<br /><a href="http://nsdl.org/">National Science Digital Library</a>: long-term program. Now built on Fedora, over RDF, <a href="http://www.mulgara.org/">Mulgara</a> triplestore.<br />RDF enables faceted search, because multiple hierarchies are possible over the same resource.<br />Big vocabs (esp in the medical field) are happening through description logics, OWL. NSF not currently using it. RDF has been maturing quickly; the description logic engines and the rule systems are less mature, but the most important part of all of them is the conceptual map.<br />Most work on the semantic web is in Europe through EU support; some US work is being commercialised, but not much US support for logic-based approaches.<br /><br />Can users contribute to the taxonomy (= folksonomy)? They are doing research on turning folksonomies into rigorous taxonomies: open research over the past two years, but no smashing success so far. NSDL metadata registry project.<br />Mappings between taxonomies: need order-preservation to keep hierarchies internally consistent, active research.opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-35396442507721223352010-04-13T23:30:00.005+10:002010-04-20T11:11:48.902+10:00ADLRR2010: Notes from ADL Learning Content Registries and Repositories SummitThe following posts are notes from the <a href="http://groups.google.com/group/registry-and-repository-summit">ADL Learning Content Registries and Repositories Summit</a>, Alexandria VA, 2010-04-13–2010-04-14.
(ADLRR2010) <br /><br />[EDIT: <a href="http://interopporesearch.blogspot.com/search/label/ADLRR2010">ADLRR2010</a> series of posts]opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-33031823797806759192009-06-14T13:04:00.003+10:002009-06-14T15:16:08.392+10:00Using UML Component diagrams to embed e-Framework Service Usage ModelsGiven the background of what <a href="http://interopporesearch.blogspot.com/2009/06/embedding-e-framework-sums.html">embedding SUMs in other SUMs</a> can mean, I'm going to model what that embedding can look like from a systems POV, using UML <a href="http://en.wikipedia.org/wiki/Component_diagram">component diagrams</a>. The tool is somewhat awkward for the task, but I was rather taken with the ball-and-socket representation of interfaces in UML 2.0—even if I have to abandon that notation where it counts. I'm also using this as an opportunity to explore specifying the data sources for embedded SUMs—which may not be the same as the data sources for the embedding SUM.<br /><br />The task I set myself here is to model, using embedded SUMs, functionality for searching for entries in a collection, annotating those entries, and syndicating the annotations (but not the collection entries themselves).<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM1.gif" width="80%"/><br />We can represent what needs to happen in an <a href="http://en.wikipedia.org/wiki/Activity_diagram">Activity diagram</a>, which captures the fact that entries and annotations involve two different systems.
(We'll model them as distinct data stores):<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM2.gif" width="80%"/><br />We can go from that Activity diagram to a simple SUM diagram, capturing the use of four services and two data sources:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM3.gif" width="80%"/><br />But as indicated in the previous post, we want to capitalise on the existence of SUMs describing aspects of collection functionality, and modularise out service descriptions already given in those SUMs (along with the context those SUMs set). So:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM4.gif" width="80%"/><br />where a "Searchable Collection" is a service usage model on searching and reading elements in a collection, and "Shareable Collection" is a service usage model on syndicating and harvesting elements in a collection—and all those services modelled may be part of the same system. We are making an important distinction here: the embedded searchable and shareable collection SUMs are generic, and can be used to expose any number of data sources. We nominate two distinct data sources, and align a different data source to each embedded SUM. So we are making the entries data source searchable, but the annotations data source shareable; and we are not relying on the embedded SUMs to tell us what data sources they talk to, when we do this orchestration.<br /><br />Which is all very well, but what does embedding a SUM actually look like from a running application? I'm going to try to answer that through ball-and-socket. The collection SUM models a software component, which exposes several services for other systems and users to invoke. That software component may be a standalone application, or it may be integrated with other components to build something greater; that flexibility is of course the point of Service Oriented Architecture (and Approaches). 
The software component exposes a number of services, which can be treated as ports into the component:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM5.gif" width="50%"/><br />And an external component can interface through one or more of those exposed services, giving software integration:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM6.gif" width="80%"/><br />Each service defines its own interface, and the interface to a port is modelled in UML as a realisation of a component (hollow arrowhead): it's the face of the component that the outside world sees:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM7.gif" width="80%"/><br />And outside components that use a port depend on that interface (dashed arrow): the integration cannot happen without that dependency being resolved, so the component using our services depends on our interface:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM8.gif" width="80%"/><br />Exposed services have their interfaces documented in the SUM: that is part of the point of a SUM. But a SUM may document the interfaces not just of one exposed service, but of several. By default, it documents all exposed services. But if we allow a SUM to model only part of a system's functionality, then we can have different SUMs capturing only subsets of the exposed functionality of a system. By setting up simple, searchable and shareable collections, we're doing just that. <br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM9.gif" width="80%"/><br />Now a SUM is much more than just an interface definition. But if a single SUM includes the interface definitions for all of Add, Read, Replace, Remove and Search, then we can conflate the interfaces for all those services into a single reference to the searchable collection SUM—where all the interfaces are detailed.
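As a loose code analogue of this interface bundling (a sketch only: the class and method names below are invented for illustration from this post's example, not taken from the e-Framework), a structural interface can stand in for the SUM. One reference to it bundles the interfaces of all five services, and a consuming component depends on the bundle rather than on five separate interface boxes:

```python
# Illustrative sketch only: a Protocol bundling five service interfaces,
# the way a searchable-collection SUM bundles Add/Read/Replace/Remove/Search.
# None of these names come from the e-Framework itself.
from typing import Protocol


class SearchableCollection(Protocol):
    """The bundle: one reference stands in for all five service interfaces."""
    def add(self, key: str, item: str) -> None: ...
    def read(self, key: str) -> str: ...
    def replace(self, key: str, item: str) -> None: ...
    def remove(self, key: str) -> None: ...
    def search(self, term: str) -> list[str]: ...


class Repository:
    """A component realising the bundled interface (the hollow-arrow realisation)."""
    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def add(self, key, item):
        self._items[key] = item

    def read(self, key):
        return self._items[key]

    def replace(self, key, item):
        self._items[key] = item

    def remove(self, key):
        del self._items[key]

    def search(self, term):
        return [k for k, v in self._items.items() if term in v]


def consumer(collection: SearchableCollection) -> list[str]:
    """A dependent component: depends on the bundle, not on five interfaces."""
    collection.add("e1", "annotated grammar")
    return collection.search("grammar")
```

Because the typing is structural, a `Repository` satisfies the protocol without inheriting from it, which mirrors the point that the SUM bundles a component's interfaces without owning the component.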
We can also have both the simple and the searchable collection SUMs as alternate interfaces into our collection: one gives you search, the other doesn't. (Moreover, we could have two distinct protocols into the collection, so that the distinction may not just be theoretical.)<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM10.gif" width="80%"/><br />This is not a well-formed UML diagram, on purpose: the dependency arrows are left hanging, as a reminder that each interface (a SUM) defines several service endpoints into the component. The reason that's not quite right is that the UML interface is specific to a port—each port has its own interface instance; so a more correct notation would have been to preserve the distinct interface boxes, and use meta-notation to bundle them together into SUMs. Still, the very act of embedding SUMs glosses over the details of which services are being consumed from the embed. So independently of the multiple incoming (and one outgoing) arrows per interface, this diagram is telling us the story we need to tell: a SUM defines bundles of interfaces into a system, and a system may have its interfaces bundled in more than one way.<br /><br />Let's return to our initial task; we want to search for entries in a collection, annotate those entries, and syndicate the annotations. We can model this with component diagrams, ignoring for now the specifics of the interfaces: we want the functionality identified in the first SUM diagram, of search, read, annotate, and syndicate. In a component diagram, what we want looks like this:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM11.gif" width="50%"/><br />The Entries component exposes search and read services; the Annotations component (however it ends up realised) consumes them.
The Annotations component exposes an annotate service to end users, and a syndicate service to other components (wherever they may be).<br /><br />That's the functionality needed; but we already know that SUMs exist to describe that functionality, and we can use those SUMs to define the needed interfaces:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM12.gif" width="60%"/><br />The Entries collection exposes search and read services through a Searchable Collections SUM, which targets the Entries data source. The Annotations collection exposes syndicate services through a Shareable Collections SUM, which targets the Annotations data source.<br /><br />Now, in the original component diagram, Annotate was something you did on the metal, directly interfacing with the Entries component:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM5.gif" width="50%"/><br />Expanding it out as we have, we're now saying that realising that Annotate port involves orchestration with a distinct Annotation data source, and consumes search and read services. So we map a port to a systems component realising the port:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM13.gif" width="80%"/><br />Slotting the Annotate port onto a collection is equivalent to slotting that collection into the Search and Read service dependency of the Annotate system:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM15.gif" width="80%"/><br />So we have modelled the dependency between the Entries and Annotate components. But with interfaces, the services they expose, and data sources as proxies for components, we have enough to map this component diagram back to a SUM, with the interface-bundling SUMs embedded:<br /><img src="http://www.tlg.uci.edu/~opoudjis/nicjpgs/annotateSUM/annotateSUM14.gif" width="80%"/><br />The embedded SUMs bundle and modularise away functionality.
Notice that they do not necessarily define functionality as being external, and so they do not only describe "other systems". The shareable SUM exposes the annotations, and the searchable SUM exposes the entries: their functionality could easily reside on the same repository, and we can't think of both the Entries and the Annotations as "external" data—if we did, we'd have no internal data left. The embedded SUMs are simply building blocks for system functionality—again, independently of where the functionality is provided from. <br /><br />What anchors the embedded SUMs and the services alike is the data sources they interact with. An Annotations data source can talk to a single Annotate service in the SUM, as readily as it can to a Syndicate service modularised into Shareable Collection. Because an embedded SUM can be anchored to one of "our" data sources, just like a standalone service can. That means that, if a SUM will be embedded within another SUM, it's important to know whether the embedded SUM's data sources are cordoned off, or are shared with the invoking context. <br /><br />An authentication SUM will have its own data sources for users and credentials, and no other service should know about them except through the appropriate authorisation and authentication services. But a Shareable Collections SUM needs to know what data source it's syndicating—in this case, the same data source we're putting our annotations into. So the SUM diagram needs to identify the embedded SUM data source with its own Annotations data source. If data sources in a SUM can be accessed through external services, then embedding that SUM means working out the mapping between the embedding and embedded data sources—as the dashed "Entries" box shows, two diagrams up.
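As a rough sketch of that data-source anchoring (illustrative names only, with Python dicts standing in for real data sources; none of this is e-Framework notation), the embedded SUMs can be modelled as generic modules that are handed the embedding SUM's own data stores when they are slotted in:

```python
# Illustrative sketch: embedded SUMs as generic modules, each bound at
# embedding time to one of the embedding SUM's own data sources.

class SearchableCollection:
    """Embedded SUM: search and read over whatever store it is bound to."""
    def __init__(self, store: dict):
        self.store = store

    def search(self, term: str) -> list[str]:
        return [key for key, value in self.store.items() if term in value]

    def read(self, key: str) -> str:
        return self.store[key]


class ShareableCollection:
    """Embedded SUM: syndicates whatever store it is bound to."""
    def __init__(self, store: dict):
        self.store = store

    def syndicate(self) -> list[str]:
        return sorted(self.store.values())


# The embedding SUM's own data sources:
entries = {"e1": "grammar of Greek", "e2": "grammar of Latin"}
annotations: dict = {}

# The orchestration: the entries store is shared with the searchable SUM,
# the annotations store with the shareable SUM; neither is cordoned off.
searchable = SearchableCollection(entries)
shareable = ShareableCollection(annotations)

def annotate(term: str, note: str) -> None:
    """The Annotate service: consumes search, writes to the annotations store."""
    for key in searchable.search(term):
        annotations[key] = note
```

The point of the sketch is the constructor argument: the embedded module does not know which store it syndicates until the embedding context hands it one, which is exactly the mapping between embedding and embedded data sources described above.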
<br /><br />SUM diagrams are very useful for sketching out a range of functionality, and modularisation helps keep things tractable, but eventually you will want to insert tab A into slot B; if you're using embedded SUMs, you will need to say where the slots are.opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-55837387774132078362009-06-14T11:47:00.004+10:002009-06-14T13:04:08.960+10:00Embedding e-Framework SUMsI've already <a href="http://interopporesearch.blogspot.com/2008/11/using-uml-sequence-diagrams-to-derive-e.html">posted</a> on using UML sequence diagrams to derive <a href="http://www.e-framework.org/Default.aspx?tabid=607">e-Framework Service Usage Models</a> (SUMs). SUMs can be used to model applications in terms of their component services. That includes the business requirements, workflows, implementation constraints and policy decisions that are in place for an application, as well as the services themselves and their interfaces.<br /><br />However, in strict Service Oriented Architecture, the application is not a well-bounded box, sitting on a single server: any number of different services from different domains can be brought together to realise some functionality: the only thing binding these services together is the particular business goal they are realising. We can go even further with this uncoupling of application from service: a service usage model, properly, is just that: a model for the usage of certain services for a particular goal. It need not describe just what a single application does; and it need not exhaustively describe what a single application does. If a business goal only requires some of the functionality of an application, the SUM will model only that much functionality.
And since an application can be applied to multiple business problems, there can be multiple SUMs used to describe what a given application does (or will do).<br /><br />This issue has come up in modelling work that <a href="http://www.linkaffiliates.net.au/">Link Affiliates</a> has been doing around <a href="http://projectbamboo.org/">Project Bamboo</a>, and on core SUMs dealing with collections. The e-Framework has already defined a SUM for <a href="http://www.e-framework.org/Default.aspx?tabid=1002">simple collections</a>, with CRUD functionality, and <a href="http://www.e-framework.org/Default.aspx?tabid=1002">searchable collections</a>, which offer CRUD functionality plus search. The searchable collection SUM includes all the functionality of the simple collection SUM, so the simple collection SUM is embedded in the searchable collection SUM:<br /><img src="http://www.e-framework.org/Portals/9/Images/SUMs/Collection/searchable_collection_SUMDiag.gif" width="80%"/><br />The e-Framework already has notation for embedding one SUM within another:<br /><img src="http://www.e-framework.org/Portals/9/Images/GenericSUM20070526.gif" width="80%"/><br />And in fact, the embedded SUMs are already in the diagram for the searchable collection: they are the nested rectangles around "Provision {Collection}" and "Manage {Collection}".<br /><br />Embedding a SUM means that the functionality required is not described in this SUM, but in another. There is a separate SUM intended for managing a collection. That does not mean that the embedded SUM functionality is sourced from another application: the functionality for adding content, searching for content, and managing the content may well be provided by a single system. Then again, it may not: because the SUM presents a service-oriented approach, the functionality is described primarily through services, and the systems they may be provided through are a matter of deployment.
But that means that the simple collection SUM, the searchable collection SUM, and the manage collection SUM can all be describing different bundles of functionality of the same system.<br /><br />Embedding SUMs has been allowed in the e-Framework for quite a while, and has been a handy device to modularise out functionality we don't want to detail, particularly when it is only of secondary importance. Authentication & Authorisation, for instance, are required for most processes in most SUMs; but because SUMs are typically used as thumbnail sketches of functionality, they are often outsourced to an "Identity" SUM. <br /><img src="http://www.e-framework.org/Portals/9/Images/SUMs/OpenURL/FRED_OpenURL_SumDiag.gif" width="80%" /><br />That modularisation does not mean that the <a href="http://www.e-framework.org/Default.aspx?tabid=996">OpenURL SUM</a> shares all its business requirements or design constraints with the Identity SUM. After all, the Identity functionality may reside on a completely different system on the bus. Nor does it mean that every service of the Identity SUM is used by the OpenURL SUM—not even every service exposed to external users. The Identity SUM may offer Authentication, Authorisation, Accounting, Auditing, and Credentials Update, but OpenURL may use only a subset of those exposed services. 
In fact, the point of embedding the SUM is not to go into the details of which services will be used how from the embedded SUM: embedding the SUM is declining to detail it further, at least in the SUM diagram.<br /><br />On the other hand, embedding the Identity SUM, as opposed to merely adding individual authentication & authorisation services to the SUM—<br /><img src="http://www.e-framework.org/Portals/9/Images/SUMs/Blog/blog001.gif" width="80%"/><br />—lets us appeal to the embedded SUM for specifics of data models, protocols, implementation, or orchestration, which can also be modularised out of the current SUM.opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-50227490109233087672009-05-29T17:32:00.002+10:002009-05-29T17:57:03.398+10:00Google WaveYeah, Me Too:<br /><br /><object width="560" height="340"><param name="movie" value="http://www.youtube.com/v/v_UyVmITiYQ&hl=en&fs=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/v_UyVmITiYQ&hl=en&fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="560" height="340"></embed></object><br /><br />It's hard not to echo the YouTube commenter who said: "I love you google!! I can't wait for you to take over the world!!"<br /><br />Some quick reax:<br /><ul><li>The special genius of Google is that the interface is not revolutionary: it's all notions we've seen elsewhere brought together, so people can immediately get the metaphor used and engage with it. I found myself annoyed that the developers were applauding so much at what were obvious inventions—and just as often smiling at the sprezzatura of it all.<br /><li>But once everything becomes a Wave Object, and dynamic and negotiated and hooked in, it does massively destabilise the notion of what a document is.
Then again, so did wikis.<br /><li>Not everything will become a Wave Object. For reasons both sociological and technical. One of the more important gadgets to hook into this thing for e-scholarship, when it shows up on our browsers, is an annotation gadget for found, static documents (and their components). In fact, we have that even now elsewhere—<a href="http://www.diigo.com/">Diigo</a> for instance. But hooking that up to the Google eye candy, yes, that is A Good Thing.<br /><li>All your base are belong to the Cloud. And of course, <a href="http://ascii.textfiles.com/archives/1717">what the man said on the Cloud</a>. This may be where the world is heading—all our intellectual output a bunch of sand mandalas, to sweep away with the next electromagnetic bomb or solar flare. One more reason why not everything should become a Wave Object; but you would still obviously want Wave objects to talk to anything online.<br /><li>The eye candy matters, but the highlight for me was at 1h04, with the Wave Robot client communicating updates to the Bug Tracker. That's real service-driven interoperability, with agents translating status live into other systems' internal state. That, you can go a very long way on.<br /><li>The metaphor is unnerving, and deliberately so: the agents are elevated to the same rank as humans, are christened robots, have their own agency in the text you are crafting. The spellchecker is not a tool, it is a conversation participant. But then, isn't that what futurists thought AI realisation would end up looking like anyway? Agents with deep understanding of limited domains, interacting with humans in a task. The metaphor is going to colour how people interact with computers though: just that icon of a parrot will make people think of the gadget as a participant and not an instrument.<br /><li>OK, so Lars moves around the stage; I found that endearing more than anything else.<br /><li>The machine translation demo?
Dunno if it was worth *that* much applause; the Unix Terminal demo actually communicated more profoundly than it did. The Translate Widget in OSX has given us live translation for years (with appallingly crap speed, and as my colleague Steve has pointed out, speed of performance in the real world will be the true test of all of this). That said, the fact that the translation was not quite correct was as important to the demo as the speed at which it translated character by character. It's something that will happen with the other robot interactions, I suspect: realising their limitations, so you interact with them in a more realistic way. The stochastic spellchecker is a welcome improvement, but users will still have to realise that it remains fallible. I know people who refuse to use predictive text on their mobiles for that reason, and people will have different thresholds of how much gadget intervention they'll accept. Word's intervention in Auto-Correct has not gained universal welcome.<br /><li>There are going to be some workflow issues: the live update stuff can get really distracting quickly (and they realise this with their own use); Microsoft Word's track-changes functionality gets unusable over a certain number of changes.<br /><li>Google Docs has not delivered massively more functionality than Word, and the motivation to use it has been somewhat abstract, so it hasn't led to mass adoption outside ideologues and specific circumstances. In my day job, we still fling Word Docs with track changes around; colleagues have tried to push us cloud-ward, unsuccessfully. (Partly that's a generational mistrust of the Cloud. Partly it isn't, because the colleague trying to push us cloud-ward is one generation older.) But the combination of Google Docs plus Google Wave for collaborative documents should make Microsoft nervous.<br /><li>Microsoft. Remember them?
:-)</ul>opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-6947781647834847952009-04-16T11:27:00.003+10:002009-04-20T15:35:46.656+10:00Identifier interoperabilityThis is of course a month too late, but I thought I'd put down some thoughts about identifier interoperability.<br /><br />Digital Identifiers Out There exist in a variety of schemes—(HTTP) URI, DOI, Handle, PURL, XRI. ARK, if only it were actually implemented more widely. Plus the large assortment of national bibliographic schemes, only some of which are caged in at Info-URI. ISBN, which is an identifier websites know how to do things with digitally. And so forth.<br /><br />Confronted with a variety of schemes, users would prefer one unified scheme. Or failing that, interoperability between schemes. Now, this makes intuitive sense when we're talking about services like search, with well-defined interfaces and messages. The problem is that an identifier is not a service (despite the conflation of identifier and service in HTTP): it is a linguistic sign. In essence (as we have <a href="http://resolver.net.au/hdl/102.100.272/T9G74WJQH">argued</a> in the PILIN project), it is just a string, associated with some thing. You work out, from the string, what the thing is, through a service like resolution (though that is not the only possible service associated with an identifier). You get from the string to the thing through a service like retrieval (which is *not* necessarily the same as resolution—although URLs historically conflated the two.) But the identifier is the argument for the resolution or retrieval service; it's not the service itself.<br /><br />And in a trivial way, if we ignore resolution and just concentrate on identifying things, pure strings are plenty interoperable.
I can use an ISBN string like 978-1413304541 anywhere I want, whether on a napkin, or Wikipedia's <a href="http://en.wikipedia.org/wiki/Special:BookSources/9781413304541">Book Sources</a> service, or <a href="http://www.lookupbyisbn.com/search.aspx?type=Books&page=1&key=9781413304541">LookUpByISBN.com</a>, or an Access database. So what's the problem? That ASCII string can get used in multiple services, therefore it's interoperable.<br /><br />That's the trivial way, of <b>identifier string interoperability</b>. (In PILIN, we referred to "labels" as more generic than strings.) And of course, that's not really what people mean by interoperable identifiers. What they mean is <b>identifier service interoperability</b> after all: some mechanism of resolution, which can deal with more than one identifier scheme. So http:// deals with resolving HTTP URIs and PURLs, and http://hdl.handle.net deals with resolving Handles, and a Name Mapping Authority like http://ark.cdlib.org deals with resolving ARKs. What people would like is a single resolver, which takes an identifier and a name for an identifier scheme, and gives you the resolution (or retrieval) for that identifier. <br /><br />There are a couple of reasons why a universal resolver is harder than it looks. For one, different schemes have different associated metadata, and services to access that metadata: that is part of the reason they are different. So ARK has its ? and ?? operators; Handle has its association of an identifier with arbitrary metadata fields; XRI has its Resource Descriptor; HTTP has its HTTP 303 vs HTTP 200 status codes, <a href="http://www.w3.org/TR/cooluris/#r303gendocument">differentiating</a> (belatedly) between resolution and retrieval (getting the description of the resource vs. getting the resource itself). A single universal resolver would have to come up with some sort of superschema to represent access to all these various kinds of metadata, or else forgo accessing them.
If it did give up on accessing all of them—the ARK ??, the Handle Description, the XRI Resource Descriptor—then you're left with only one kind of resolution: get the resource itself. So you'd have a universal retriever (download a document given any identifier scheme), but not the more abstract notion of a universal resolver (get the various kinds of available metadata, given any identifier scheme).<br /><br />The second reason, related to the first, is that different identifier schemes can allow different services to be associated with their identifiers. In fact those different services depend on the different kinds of metadata that the schemes expose. But if the service is idiosyncratic to an identifier scheme, then getting it to interoperate with a different identifier scheme will require lowest common denominator interchange of data that may get clunky, and will end up discarding much of the idiosyncrasy. A persistence guarantee service from ARK may not make sense applied to Handles. A checksum or a linkrot service applied across identifiers would end up falling back on the lowest common denominator service—that is, the universal retriever, which only knows about downloading resources.<br /><br />On the other hand, the default universal retriever does already exist. The internet now has a universal protocol in HTTP, and a universal way of dereferencing HTTP references. As we argued in <a href="http://resolver.net.au/hdl/102.100.272/DMGVQKNQH">Using URIs as Persistent Identifiers</a>, if an identifier scheme is to get any traction now on the internet, it has to be exposed through HTTP: that is, it has to be accessed as an HTTP URI. That makes HTTP URI resolvers the universal retriever: http://hdl.handle.net/ prefixed to Handles, http://ark.cdlib.org/ prefixed to ARKs, http://xri.net/ prefixed to XRIs.
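Mechanically, that universal retriever is nothing more than string prefixing. A minimal sketch (the prefix table just restates the resolvers named above; the function name is my own, not any real API):

```python
# Sketch of the "universal retriever": exposing any scheme's identifiers
# as HTTP URIs by prefixing. Table entries restate the resolvers named
# in the text; the function name is illustrative.

HTTP_PREFIXES = {
    "handle": "http://hdl.handle.net/",
    "ark": "http://ark.cdlib.org/",
    "xri": "http://xri.net/",
    "http": "",  # HTTP URIs (and PURLs) are already dereferenceable as-is
}

def as_http_uri(scheme: str, identifier: str) -> str:
    """Expose an identifier from a known scheme as a dereferenceable HTTP URI."""
    if scheme not in HTTP_PREFIXES:
        raise ValueError(f"no known HTTP gateway for scheme {scheme!r}")
    return HTTP_PREFIXES[scheme] + identifier

# as_http_uri("handle", "102.100.272/T9G74WJQH")
# -> "http://hdl.handle.net/102.100.272/T9G74WJQH"
```

Anything beyond this (the ARK ??, the Handle metadata fields) is exactly what such a retriever gives up.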
In the <a href="http://www.w3.org/2001/tag/doc/URNsAndRegistries-50">W3C's way of thinking</a>, this means that HTTP URIs are the universal identifier, and there's no point in having anything else; to the extent that other identifier schemes exist, they are merely subsets of HTTP URIs (as <a href="http://wiki.oasis-open.org/xri/XriAsRelativeUri">XRI</a> ended up doing, to address the W3C's nix). <br /><br />Despite the Semantic Web's intent of universality, I don't think that any URI has supplanted my name or my passport number: identifiers (and more to the point, linguistic signs) exist and are maintained independently, and are exposed through services and mechanisms of the system's choosing, whether they are exposed as URIs or not. A Handle can be maintained in the Handle system as a Handle, independently of how it is exposed as an HTTP URI; and exposing it as an HTTP URI does not preclude exposing it in different protocols (like UDP). But there are excellent reasons for any identifier used in the context of the web to be resolvable through the web—that is, dereferenced through HTTP. That's why the identifier schemes all end up inside HTTP URIs. What you end up with as a result of HTTP GET on that URI may be a resolution or a retrieval. The HTTP protocol distinguishes the two through status codes, but most people ignore the distinction, and they treat the splash page they get from <a href="http://arxiv.org/abs/cmp-lg/9609008">http://arxiv.org/abs/cmp-lg/9609008</a> as Just Another Representation of Mark Lauer's thesis, rather than as a resolution distinct from retrieving the thesis. So HTTP GET is the Universal Retriever.<br /><br />But again, retrieval is not all you can do with identifiers. You can just identify things with identifiers. And you can reason about what you have identified: in particular, whether two identifiers are identifying the same thing, and if not, how those two things are related.
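Such sameness reasoning needs only the identifier strings themselves, with no resolver in sight. A toy sketch of an equivalence store (class and method names are mine; a real service would also record which authority made each claim):

```python
# Toy equivalence store over identifier *strings*: union-find, no resolution.
# Class and method names are illustrative, not any deployed service's API.

class IdentifierEquivalence:
    def __init__(self):
        self._parent = {}  # identifier string -> representative string

    def _find(self, ident: str) -> str:
        self._parent.setdefault(ident, ident)
        while self._parent[ident] != ident:
            self._parent[ident] = self._parent[self._parent[ident]]  # path halving
            ident = self._parent[ident]
        return ident

    def assert_same(self, a: str, b: str) -> None:
        """Record an authority's claim that strings a and b identify the same thing."""
        self._parent[self._find(a)] = self._find(b)

    def same(self, a: str, b: str) -> bool:
        """Does any chain of recorded claims connect a and b?"""
        return self._find(a) == self._find(b)

eq = IdentifierEquivalence()
eq.assert_same("102.100.272/T9G74WJQH",
               "https://www.pilin.net.au/Project_Documents/PILIN_Ontology/PILIN_Ontology_Summary.htm")
# eq.same(...) is now True for that pair, without dereferencing either string.
```

Equivalence is only the simplest such relation; FRBR-style relations between referents would need labelled edges rather than a partition, but the point stands: the arguments are strings, not live resolutions.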
When the <a href="http://interopporesearch.blogspot.com/2009/03/ukoln-international-repository-workshop_9653.html">Identifier Interoperability</a> stream of the UKOLN repository workshop sat down to work out what we could do about identifier interoperability, we did not pursue cross-scheme resolvers or universal metadata schemas: if we thought about that at all, we thought it would be too large an undertaking for a year's horizon, and probably too late, given the realities in repository land. <br /><br />Instead, all we committed to was a service for informing users about whether two identifiers, which could be from different schemes, identified the same file. And for that, you don't need identifier service interoperability: you don't need to actually resolve the identifier live to work it out. Like all metadata, this assertion of equivalence is a claim that a particular authority is making. And like any claim, you can merely represent that assertion in something like RDF, with the identifier strings as arguments. So all you need for the claim "Handle 102.100.272/T9G74WJQH is equivalent to URI https://www.pilin.net.au/Project_Documents/PILIN_Ontology/PILIN_Ontology_Summary.htm" is identifier string interoperability—the fact you can insert identifiers from two different schemes in the same assertion. The same holds if you go further, and start modelling different kinds of relations between identifier referents, such as are covered in FRBR. And because any authority can make claims about anything, we opened up the prospect of not just a central equivalence service, but a decentralised network of hubs of authorities: each making their own assertions about identifiers to match their own purposes, and each available to be consumed by the outside world—subject to how much those authorities are trusted.<br /><br />Defaulting from identifier service interoperability—i.e. interoperability as we know it—back to identifier string interoperability may seem retrograde.
Saying things about strings certainly doesn't seem very interoperablish, when you don't seem to actually be doing anything with those strings. Put differently, if the identifier isn't being dereferenced, there does not seem to be an identifier operation at all, so there doesn't seem to be anything to interoperate with. But such thinking is falling back into the trap of conflating the identifier with clicking the identifier. Identifiers aren't just network locations, and they aren't just resolution requests—something everyone now agrees with, including the W3C. They exist as names for things, in addition to any dereferencing to get to those things. And because they exist as names for things, reasoning about how such names relate to each other is part of their core functionality, and is not tied up with live dereferencing of the names. (RDF would not work if it were.)<br /><br />So this is less than interoperability as we know it; but in a way, it is more interoperable than any service. You don't even need a deployed resolver service in place, to get useful equivalence assertions about identifiers. Nothing prevents you making assertions about URNs, after all...opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-36855051928093613412009-04-01T17:40:00.006+11:002009-04-03T17:32:39.118+11:00Visit to European SchoolnetSomewhat belatedly (because some work came up when I returned to Australia), this is the writeup of my visit to <a href="http://www.eun.org">European Schoolnet</a>, Brussels, on the 18th of March.<br /><br /><blockquote>As background: European Schoolnet are a partnership of European ministries of education, who are developing common e-learning infrastructure for use in schools throughout Europe.
EUNet are involved in the <a href="http://aspect-project.org/">ASPECT project</a>, constructing an e-learning repository network for use in schools in multiple countries in Europe, in partnership with commercial content developers. (<a href="http://aspect-project.org/node/2">See summary</a>.) The network involves adding resource descriptions and collection descriptions to central registries. The network being constructed is currently a closed version of the <a href="http://lre.eun.org/node/1">LRE (Learning Resource Exchange)</a>, which is under development. <br /><br /><a href="http://www.linkaffiliates.net.au">Link Affiliates</a> are following the progress of the ASPECT project, to see how its learnings can apply to the <a href="http://www.digitaleducationrevolution.gov.au/">Digital Education Revolution</a> initiative in Australia. <br /><br />Link Affiliates (for DEEWR) are also participating with European Schoolnet on the <a href="http://www.imsproject.org/lode.html">IMS LODE</a> (Learning Object Discovery and Exchange) Project Group, which is formulating common specifications for registering and exchanging e-learning objects between repositories. Link Affiliates is doing some software development to test out the specifications being developed at LODE, and was looking for more elaboration on the requirements that ASPECT in particular would like met.</blockquote><br /><br /><h3>Identifiers</h3><br /><br />EUNet are interested in exploring identifier issues for resources further. EUNet are dealing with 24 content providers (including 16 Ministries of Education), with each one identifying resources however it sees fit, and no preexisting coordination in how they assign identifiers to resources. EUNet never know, when they get a resource from a provider, whether they already have it registered or it is new. <br /><br />EUNet are working on a comparator to guess whether resources deposited with them are identical, based on both attributes and content of the resource.
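Such a comparator might plausibly combine an exact content hash with fuzzy attribute matching. A sketch only, with hypothetical record fields and threshold, not ASPECT's actual algorithm:

```python
import difflib
import hashlib

# Sketch of a resource comparator: exact content hash first, then fuzzy
# title matching. The `content`/`title` fields and the 0.85 threshold are
# hypothetical, not the ASPECT comparator's actual logic.

def probably_same_resource(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Guess whether two deposited records describe the same resource."""
    same_bytes = (hashlib.sha256(a["content"]).digest()
                  == hashlib.sha256(b["content"]).digest())
    if same_bytes:
        return True  # byte-identical content decides immediately
    # Otherwise fall back on similarity of a descriptive attribute
    ratio = difflib.SequenceMatcher(
        None, a["title"].lower(), b["title"].lower()).ratio()
    return ratio >= threshold
```

A real service would weigh more attributes (format, size, language) and return a confidence score rather than a Boolean.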
People change the identifiers for objects within institutions; if that did not happen, a comparator would not be needed. Some contributors manage <a href="http://en.wikipedia.org/wiki/Referatory">referatories</a>, so they will have both different metadata and different identifiers for the same resource. The comparator service is becoming cleverer. ASPECT plans to promote <a href="http://www.handle.net">Handle</a> and persistent identifiers. If they are used correctly, they will not eliminate all problems; but they will deal with some resources better than what is happening now.<br /><br /><h3>Metadata transformation & translation</h3><br /><br />ASPECT is setting up <a href="http://aspect-project.org/sites/default/files/docs/ASPECT_D2p2.pdf">registries of application profiles</a> and <a href="http://aspect-project.org/sites/default/files/docs/ASPECT_D2p3.pdf">vocabulary banks</a>. They aim to automatically transform metadata for learning resources between vocabularies and profiles. Vocabularies are the major challenge. ASPECT have promised to deliver 200 vocabularies, but that includes language translations: at a minimum ASPECT needs to support the 22 languages of the EU, and 10 or 12 LOM vocabularies in their application profile. The content providers are prepared to adopt the <a href="http://lre.eun.org/node/6">LRE vocabularies and application profile</a>; the content providers transform their metadata vocabularies into the LRE European norm from any national vocabularies, as a compliance requirement. EUN use Systran for translating free text, but that is restricted to titles, descriptions and keywords. The vocabulary bank is used to translate controlled vocabulary entries.<br /><br />Transformations between metadata schemas, such as DC to LOM, or LRE to and from MARC, will happen much later. The Swiss are making attempts in that direction; but the mappings are very complicated. 
EUN avoid the problem by sticking to the LRE application profile in-house; they would eventually want LRE to be able to acquire resources from cultural heritage institutions, which will require crosswalking MARC or DC to LOM.<br /><br />The vocabulary bank will eventually map between distinct vocabularies; e.g. a national vocabulary and mapping will be uploaded centrally, to enable transformation to the LRE norm. One can do metadata transformation by mapping to a common spine, as is done in the UK (e.g. <a href="http://www.ukoln.ac.uk/metadata/education/meetings/agendas/2002-04-18/duncan.pdf">2002 discussion paper</a>). But the current agreed way is by allowing different degrees of equivalence in translation, and by allowing a single term to map to a Boolean conjunction of terms. Because LOM cannot have Boolean conjunctions for its values, this approach cannot be used in static transformations, or in harvest; but federated search can expand out the Boolean conjunctions into multiple search terms. Harvested transformations can still fall back on notions of degrees of equivalence. The different possible mappings are described in:<br /><br /><blockquote>F. Van Assche, S. Hartinger, A. Harvey, D. Massart, K. Synytsya, A. Wanniart, & M. Willem. 2005. <a href="ftp://ftp.cenorm.be/PUBLIC/CWAs/e-Europe/WS-LT/cwa15453-00-2005-Nov.pdf"><i>Harmonisation of vocabularies for elearning</i></a>, CEN Workshop Agreement (CWA 15453). November.</blockquote><br /><br /><h3>ASPECT work with LODE</h3><br /><br />IMS LODE is working on <a href="http://www.slideshare.net/dmassart/ilox20090305">ILOX (Information for Learning Object Exchange)</a>, as an information model. ILOX includes a notion of abstract classes of resources, akin to FRBR's manifestations, expressions, and works.
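Those FRBR-style levels can be pictured as nested records; a rough sketch (field names are mine, not the actual ILOX binding):

```python
from dataclasses import dataclass, field

# Rough sketch of the FRBR-style abstraction levels ILOX borrows.
# Field names and example values are illustrative, not the ILOX binding.

@dataclass
class Manifestation:           # one concrete format of an expression
    technical_format: str      # e.g. "text/html" or "application/pdf"
    location: str

@dataclass
class Expression:              # one realisation of the work, e.g. a translation
    language: str
    manifestations: list = field(default_factory=list)

@dataclass
class Work:                    # the abstract resource; shared fields live here
    title: str
    expressions: list = field(default_factory=list)

lesson = Work(title="Photosynthesis basics")
en = Expression(language="en")
en.manifestations.append(
    Manifestation("text/html", "http://example.org/photosynthesis.html"))
lesson.expressions.append(en)
# Discovery can show the Expression; delivery then picks a Manifestation.
```

Metadata that applies to every copy (like the title) sits at the Work level; format-specific metadata sits at the Manifestation level.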
ASPECT is currently working on a new version of the LRE metadata application profile of ILOX + LOM, v.4: this corrects errors, adds new vocabularies, and makes some tweaks, including to identifier formatting. The profile also includes an information model akin to FRBR, as profiled for LRE under LODE/ILOX. <br /><br />The ILOX schema that has already been made available to Link Affiliates for development work is stable: it will not change as a result of the current editing of the application profile. The application profile should be ready by the end of March. ASPECT will then ask content providers to format their metadata according to the new application profile, with the new binding based on ILOX. By the end of May ASPECT want to have infrastructure in place, to disseminate metadata following the profile.<br /><br />The content in the LRE is currently restricted to what can be rendered in a browser, i.e. online resources. After May, ASPECT will add SCORM, Common Cartridge and other such packaged content to their scope: they will seek to describe them also with ILOX, and to see whether packaging information can be reused in searches, in order to select the right format for content delivery. This would capitalise on the added value of ILOX metadata, to deal with content in multiple formats. <br /><br />Transformation services will be put in place to transform content. Most packaged content will be available in several (FRBR) manifestations. The first tests of this infrastructure will be by the end of September 2009; by February 2010 ASPECT aim to have sufficient experience to have the infrastructure running smoothly, and supporting pilot projects. EUN does not know yet if it will adopt this packaging infrastructure for the whole of LRE, or just ASPECT: this depends on the results of the pilots. There will be a mix of schemas in content delivery: content in ASPECT will use ILOX, while content in LRE will continue to use LOM.
This should not present a major problem; ASPECT will provide XSL transforms from LRE metadata to ILOX on the first release of their metadata transformation service. <br /><br />Within ASPECT, EUN have been working with KU Leuven on creating a tool to extract metadata straight out of a SCORM or Common Cartridge package, and generate ILOX metadata directly. KU Leuven have indicated that this should already be working, but they are now waiting for the application profile for testing. When the LRE is opened up to the outside world, it will offer both metadata formats, LRE LOM and LRE ILOX, so they can engage with other LODE partners who have indicated interest—particularly Canada and Australia.<br /><br />The binding of the Registry information model in LODE is proceeding, using IMS tools. ASPECT want an IMS-compatible binding. The registry work will proceed based on that binding. The registry work is intended for use not just in ASPECT, but as an open source project for wider community feedback and contribution. The Canadians involved in LODE will contribute resources, as will Australia. The registry project is intended to start work in the coming weeks. Development in ASPECT will mostly be undertaken by EUN and KU Leuven. Two instances of the registry will be set up, running and talking to each other, for testing. There may be different instances of registries run internationally to register content, and possibly a peer-to-peer network of registries to exchange information about learning resources. For example, a K-12 resources registry in Australia run by education.au could then talk to EUNet's registry.<br /><br />There has not yet been a decision on what kind of open source license the registry project will use. They are currently inclined to the GNU Lesser General Public License (LGPL), as it allows both open source and commercial development.
Suggestions are welcome.<br /><br />The LRE architecture is presented at <a href="http://www.slideshare.net/dmassart/LREInfrastructure">Slideshare</a>, with a <a href="http://lre.eun.org/node/4">more complete description</a> underway.<br /><br /><h3>Abstract hierarchies of resources</h3><br /><br />ASPECT is using abstract hierarchies of learning resources as modelled in ILOX, and derived from the abstract hierarchies of FRBR. ASPECT would like to display information on the (FRBR) Expression when a user does discovery, and then to automatically select the (FRBR) Manifestation of the object to deliver. Link Affiliates had proposed testing faceted search returning the different available expressions or manifestations of search items. ASPECT were not going to go all the way to facet-based discovery, and are not intending to expose manifestations directly to users: they prefer to have the search interface navigate through abstractions intelligently to end up at the most appropriate manifestation. Still, they are curious to see what facet-based discovery of resources might look like. Several parties are developing portals to LRE, and creating unexpected interfaces and uses of the LRE that they are interested in seeing.<br /><br />The current <a href="http://lreforschools.eun.org/LRE-Portal/Index.iface">test search interface</a> is available online.<br /><br />ASPECT would like to reuse the ILOX FRBR-ised schema for its collection descriptions. The ILOX schema takes different chunks of metadata, and groups them together according to what level of abstraction they apply to. (Some fields, such as "title", apply to all resources belonging to the same Work; some fields, such as "technical format", would be shared only by resources belonging to the same Manifestation.) A collection description can also be broken down in this way, since different elements of the content description correspond to different levels of abstraction: e.g.
the protocol for a collection is at Manifestation level, while the target service for the collection is at Item level. <br /><br />Promoting consistency of schemata across LODE is desirable, and would motivate schema reuse, leading to the same API for all usages; but motivating use cases are needed to work out how to populate such a schema, with different levels of abstraction, for a collection description. Collating different collection descriptions at different levels of abstraction is such a use case ("give me all collections supporting SRU search" vs. "give me all collections supporting any kind of search"). How this would be carried through can be fleshed out in testing.<br /><br /><h3>Registering content and collections in registries</h3><br /><br />ASPECT wanted to use OAI-PMH as just a synchronisation mechanism for content between different registries. The repository-to-registry ingest would occur through push (deposit), not through pull. OAI-PMH is overkill for the context of learning object registries, and the domain does not have well-defined federations of participants, which could be driven by OAI-PMH: any relevant party can push content into the learning object registries. SPI would also be overkill for this purpose: the detailed workflows SPI supports for managing publishing objects, and binding objects to metadata, are appropriate for Ariadne, but are too much for this context, as ASPECT is just circulating metadata, and not content objects. SWORD would be the likely protocol for content deposit. <br /><br />Adding repositories to the registry is an activity that needs a use case to be formulated. ASPECT envisages a web page (or something of that sort) to self-register repositories, following the LODE repository description schema. Once the repository is registered, harvesting and associated actions can then happen. People could describe their collections as well as their repositories on the same web page, as a single act of registration.
That does not deal with the case of a collection spanning multiple repositories. But the description of a collection is publicly accessible, and need not be bound to a single repository; it can reside at the registry, to span across the participating repositories.<br /><br />The anticipated model for repository discovery is that one repository has its description pushed into a network, and then the rest of the network discovers it: so this is automatic discovery, not automatic registration. A discovery service like UDDI would not work, because they are not using WSDL SOAP services.<br /><br /><h3>Collections use cases</h3><br /><br />Not all collection descriptions would reside in a learning object repository. There are clear use cases for ad hoc collections, built out of existing collections, with their description objects hosted at a local registry level instead (e.g. Wales hosts an ad hoc collection including science collections from Spain and Britain). Such an ad hoc collection description would be prepared by the registry provider, not individual teachers. Being ad hoc, the collection has to be stored in the registry and not a single source repository. There could be a widget built for repositories, so that repository managers could deploy it wherever they want, and enable the repository users to add in collection-level descriptions where needed.<br /><br />Collections use cases being considered at ASPECT are also of interest to GLOBE and LODE. Use cases need to detail: <br /><ul><br /><li>How to create collections, where. <li>How to define what objects belong to a collection, intensionally or extensionally (by property or by enumeration). <li>Describe collection. <li>Edit description of collection. <li>Combine collections through any set operation (will mostly be Set Union). <li>Expose collection (manual or automated). <li>Discover collection, at registry or client level (VLE, portal).
<li>Evaluate collection, undertaken by user, on behalf of themselves or a community: this depends on the collection description made available, but also can involve viewing items from the collection. <li>If a commercial collection is involved, there is a Procurement use case as well.<li> Disaggregate collection and Reaggregate collection: users may want to see the components/contributors of a virtual collection.</ul><br /><br />Some use cases specific to content also involve the registry: <br /><ul><li>Describe learning object extensionally, to indicate to what collection it belongs. <li>Discover learning objects: the collection objects can be used to limit searches. <li>Evaluate learning object, with respect to a collection (i.e. according to the collection's goals, or drawing on information specific to the collection). E.g. what quality assurance was used for the object, based on metadata that has been recorded only at collection level.</ul><br /><br /><h3>Further work</h3><br /><br /><br />The LRE application profile registry may feed into the Standards and Application Profiles registry work being proposed by Link Affiliates. BECTA have a profile registry running. At the moment it is limited to human-readable descriptions, namely profiles of LOM. LRE will be offering access to application profiles as a service available for external consumption.<br /><br /><a href="http://ebxmlrr.sourceforge.net/3.0/index.html">OMAR</a> and <a href="http://www.diglib.org/architectures/ockham.htm">OCKHAM</a> are two existing registries of learning/repository content. OMAR is in EBXML. ASPECT would like to incorporate content from such registries, and repackage their content to their ends as exemplars of implementations, and potential sources of reusable code.
The synchronisation protocols of these registries in particular may be an improvement over OAI-PMH.opoudjishttp://www.blogger.com/profile/02106433476518749382noreply@blogger.com0tag:blogger.com,1999:blog-2203124975196257845.post-3738118805559644912009-03-18T01:42:00.004+11:002009-04-02T14:33:44.349+11:00URN NBN Resolver DemonstrationWeb sites: <ul><li><a href="http://www.persistent-identifier.de/">http://www.persistent-identifier.de/</a> <li><a href="http://nbn-resolving.de/ResolverDemo-eng.php">http://nbn-resolving.de/ResolverDemo-eng.php</a></ul><br /><br />Demonstrated by Maurice Vanderfeesten.<br /><br />Actually his very cool <a href="http://prezi.com">Prezi</a> presentation will be more cogent than my notes: <a href="http://prezi.com/17406/">URN NBN Resolver Presentation</a>. [EDIT: Moreover, he included his own notes in his <a href="http://maurice.vanderfeesten.name/blog/2009/03/20/international-repositories-infrastructure-workshop-persistent-identifiers/">discussion of the identifier workshop session</a>.]<br /><br />A few notes to supplement this:<br /><br /><ul><br /><li>The system uses URNs based on National Bibliography Numbers (URN-NBN) as its persistent identifiers.<br /><li>So it's a well-established bibliographic identification scheme, which can certainly be expanded to the research repository world. (The German National Library already covers research data.) <br /><li>The pilot got coded at the start of 2009.<br /><li>They are using John Kunze's <a href="http://www.n2t.info/">Name-To-Thing</a> resolver as their HTTP URI infrastructure for making their URNs resolvable.<br /><li>Tim Berners-Lee might be surprised to see his <a href="http://www.w3.org/DesignIssues/LinkedData.html">Linked Data</a> advocacy brought up in this presentation in the context of URNs.
But as long as things can also be expressed as HTTP URIs, it does not matter.<br /><blockquote>The blood on the blade of the <a href="http://www.w3.org/2001/tag/doc/URNsAndRegistries-50">W3C TAG URN finding</a> is still fresh, I know.</blockquote><br /><li>Lots of EU countries are queuing up to use this as persistent identifier infrastructure.<br /><li>The Firefox plugin works on resolving these URNs with predictable smoothness. :-)<br /><li>They are working through what their granularity of referents will be, and what the long-term sustainability expectations are for their components (the persistence guarantees, in the terms of the PILIN project).<br /><li>They would like to update <a href="http://www.ietf.org/rfc/rfc2141.txt">RFC 2141</a> on URNs, and already have in place <a href="http://www.ietf.org/rfc/rfc3188.txt">RFC 3188</a> on NBNs.<br /><li>They now need to convince the community of the urgency and benefits of persistent identifiers and of this particular approach, and to get community buy-in.<br /><br /></ul><br /><br /><h3>UKOLN International Repository Workshop: Identifier Interoperability (2009-03-18)</h3><br /><br />[EDIT: Maurice Vanderfeesten has a <a href="http://maurice.vanderfeesten.name/blog/2009/03/20/international-repositories-infrastructure-workshop-persistent-identifiers/">fuller summary</a> of the outcomes.]<br /><br />First Report:<br /><br /><ul><br /><li>Many resonances with what was already said in other streams: support for the scholarly cycle, recognition of a range of solutions, disagreement on scope, needing to work with more than traditional repositories.<br /><li>Identifying: objects (not just data), institutions, and people in limited roles.
<br /><li>Will model relations between identifiers; there are both implicit and explicit information models involved.<br /><li>Temporal change needs to be modelled; there are lots of challenges. <br /><li>Not trying to build the one identifier system, but loose coupling of identifier services with already extant identifier systems. <br /><li>Start with small sets of functionality and then expand.<br /><li>Identifiers are created for defined periods and purposes, based on distinguishing attributes of things.<br /></ul><br /><br />Second Report:<br /><br /><ul><br /><li>We can't avoid the "more research needed" phase of work: need to work out workflows and use cases to support the identifier services, though the infrastructure will be invisible to some users. <br /><li>Need rapid prototyping of services, not waterfall. <br /><li>The mindmaps provided by the workshop organisers of parties involved in the repository space [will be published soon] are useful, and need to be kept up to date through the lifetime of the project. <br /><li>There may not be much to do internationally for object identification, since repositories are doing this already; but we likely need identifiers for repositories. <br /><li>Author identifiers: repositories should not be acting as naming authorities, but import that authority from outside.<br /><li>There are different levels of trust for naming authorities; assertions about authors change across time. <br /><li>An interoperability service will allow an author to bind multiple identities together, and give authors the control to prevent their private identities being included with their public personas.<br /><br /></ul><br /><br />Third Report:<br /><br /><ul><br /><li>The group has been pragmatic in its reduction of scope. <br /><li>There will be identifiers for: Organisations, repositories, people, objects. <br /><li>Identifiers are not names: we are not building a name registry, and name registries have their own distinct authority.
<li>Organisations:<br /><ul><li>Identifiers for these should be built on top of existing systems (which is a general principle for this work). <li>There could usefully be a collection of organisation identifiers, maintained as a federated system, and including temporal change in its model.<br /><li>The organisation registry can be tackled by geographical region, and start from existing lists, e.g. DNS. <br /></ul><br /><li>Repositories: <br /><ul><li>There shall be a registry for repositories. There shall be rules and vetting for getting on the registry, sanity checks. Here too there are temporal concerns to model: repositories come into and out of existence.<br /><li>The registry shall be a self-populating system, building on existing systems like OpenDOAR. It should also offer depopulation (a repository is pinged, and found no longer to be live). <br /><li>There is a many-to-many relation of repositories to institutions. <br /><li>The registry shall not be restricted to open access repositories. <br /></ul><br /><li>Objects:<br /><ul><li>We are not proposing to do a new identifier scheme. <br /><li>We are avoiding detailed information models such as FRBR for now.<br /><li>We propose to create an equivalence service at FRBR Manifestation level between two identifiers: e.g. a query on whether this ARK and this Handle are pointing to the same bitstream of data, though possibly at different locations. <br /><li>Later on we could build a Same FRBR Expression service (do these two identifiers point to digital objects with the same content).<br /><li>The equivalence service would be identifier-scheme independent [and would likely be realised in RDF]. <br /></ul><br /><li>People: <br /><ul><li>A people identification service could be federated or central.
<br /><li>People have multiple identities: we would offer an equivalence service and a non-equivalence service between multiple identities.<br /><li>The non-equivalence service is needed because this is not a closed-world set: people may assert that two identities are the same, or are not the same.<br /><li>The service would rely on self-assertions by the user being identified.<br /><li>The user would select identities, out of a possibly prepopulated list. <br /><li>People may want to leave identities out of their assertions of equivalence (i.e. keep them private).<br /></ul><br /></ul><br /><br /><h3>UKOLN International Repository Workshop: Repository Organisation (2009-03-18)</h3><br /><br />First Report:<br /><br /><ul><br /><li>Aim: to support repository concepts with a common purpose. <br /><li>To support the professional peer group, with bottom-up demand. <br /><li>To support interoperability, assuring data quality. <br /><li>To formulate guidelines, supporting national cooperation, to help recruit new repositories, to enable international interoperability. <br /><li>The activity can be compared to the international collaboration behind Dublin Core. <br /><li>The confederation would have a strategic role, providing support outside national boundaries to repository development. <li>It would provide a locus for interaction with other communities: researchers, publishers. <br /><li>It will be driven by improving the scholarly process, and not just by repositories as an aim in themselves.<br /></ul><br /><br />Second Report:<br /><br /><ul><br /><li>The group needed to define the nature of the organisation to work towards: finding a common point of departure was difficult.
<br /><li>Need to articulate benefits to stakeholders: <ul><li>a forum for information exchange, <li>promoting repository management as a profession, <li>reflecting community needs, <li>channelling demands for new software. </ul><br /><li>The relations underlying the confederation are in place already, but the types of relations will be worked out tomorrow. The group has to establish evidence of need for the confederation. <br /><li>The roles of the organisation will be worked through tomorrow: they will involve service to repositories and to researchers. <li>The workshop discussants have split into an advisory group, an investigatory group, and a visionary group.<br /></ul><br /><br />Third Report:<br /><br /><ul><br /><li>The organisation's goal is to enhance the scholarly process through a federation of open access repositories. <br /><li>They will approach funding agencies. The organisation must be independent, bottom-up, funded through membership.<br /><li>Sustainability, political authority, visibility. <br /><li>The organisation's core concepts will be formed around stakeholder needs and activities. These are varied; they need: <ul><li>clarity of roles, <li>strong governance, <li>a network of expertise, <li>carry-through of interoperability issues; <li>help in setting up repositories and repository advocacy; <li>certification & quality assurance. </ul><br /><li>Groups identified the contributions they could bring: money, expertise, ambassadors, suitable workflows. <br /><li>Deliverables & outcomes: e.g. hold meetings, sessions in conferences, make visible the repository manager profession; lobbying, websites, potentially a helpdesk. <br /><li>Governance model: organisational membership, partnership with software providers.
<br /><li>Timeframe: proof of concept to circulate April, formal model of confederation May, letter of request of participation June.<br /></ul><br /><br /><h3>UKOLN International Repository Workshop: Repository Handshake (2009-03-18)</h3><br /><br />First Report:<br /><br /><ul><br /><li>An attempt to rationalise the service requirements: working on PUT, not GET or KEEP.<br /><li>The aim is to populate repositories; support authors & friends (funders or institutions) making their research material available through open access.<br /><li>Have ingest support services that repositories will use downstream. <br /><li>Focus on research papers, although that may scope more widely. <br /><li>Balance of priorities between improving existing workflows vs. recruiting content from new depositors. <br /><li>What information is to be collected at point of ingest? —question unresolved. The group is scoping potential conflicts. <br /><li>Machine-to-machine interoperability vs. computer-assisted human-mediated deposit: these form a continuum. <br /><li>Workflow agreed on as the target of the group's work; the reification of "workflow" took three directions: e-research workflow; e-publication workflow; repository management.<br /></ul><br /><br />Second Report:<br /><br /><ul><br /><li>Over the past ten years people's expectations have not been realised.
<br /><li>People have had stabs at different services.<br /><li>Need to identify the sweet spot between useful services for the community [lots of metadata on ingest], and not imposing difficult requirements on authors [little metadata on ingest].<br /><li><i>[I lost track here I'm afraid.]</i><br /></ul><br /><br />Third Report:<br /><br /><ul><br /><li>Deposit is the focus of this activity.<br /><li>Handshake has two parts: PUT from the client, and BEG from the server [i.e. recruit content]. <br /><li>Use cases: these are deposit opportunities, and range outside the boundary of the repository. Repositories communicating with each other is only one such use case. <br /><li>Key words: <i>more</i>, <i>better quality</i> [of metadata], <i>easier</i> [remove obstacles to deposit], <i>rewarding</i> [for depositor]. Handshake must involve a social contract of reward. <br /><li>Plan, multiphase. <br /><ul><br /><li>Phase 1: rapid engagement internationally. Some nations have national leverage, but not all do. An international framework is still needed. <br /><li>Eight deposit opportunities have been identified; 2-3 to focus on in workplan Phase 1, over 6 months. For example:<br /><ul><li>Multi-authored paper, several institutions and countries—what does deposit look like, and how does it become once-only? (Will not be rich but minimally sufficient) <li>Use institutionally motivated deposit; <li>Communication between institutional and discipline repositories; <li>Publisher of journal offers open access service to author. </ul><br /><li>Seek real-life descriptions of those focus use cases, and exemplars already in use on the ground. <br /><li>Output of this focussed activity is descriptions of what practice is, not code or prototypes. <br /><li>Then gap analysis.
<br /><li>Overall 2-3 year time horizon, but not planning out so far yet.<br /></ul><br /></ul><br /><br /><h3>UKOLN International Repository Workshop: Citation Services (2009-03-18)</h3><br /><br />First Report:<br /><br /><ul><br /><li>Currently a small number of commercial service providers is dominant in this field. Are we evolving repository services [to accommodate the existing systems], or revolutionising them? <br /><br /><li>Since citations drive national funding, systems need to be trusted, auditable, and open. <br /><br /><li>Citations relate authors and ideas, and help connect concepts together; they provide literature ranking, and larger-scale analytic services across literature. <br /><br /><li>International coordination: the existing infrastructure of loosely coupled repositories can be the foundation of a robust, scalable solution.<br /></ul><br /><br />Second Report:<br /><br /><ul><br /><li>The group is producing no large plan and manifesto, but is going back to basics. <br /><br /><li>"Handshake" meant different things to different people; there are limitations to the metaphor. <br /><br /><li>There will be group activity, with two foci: business and technological. <br /><br /><li>Recruitment of content needs to happen outside the established repository space, including through desktop bibliographic tools such as Zotero.<br /><br /></ul><br /><br />Third Report:<br /><br /><ul><br /><li>There is a huge variety of presentations of citations, and there are partial solutions specific to communities. <br /><br /><li>Model how to deal with citations: isolate references from papers, then extract reference data and interpret it, from varying citation schemes.
<br /><br /><li>For a repository to be active in this without overconsuming resources, the repository shall be made responsible for handing on to external services the list of references extracted from its items (papers). <br /><br /><li>Plan of action:<br /><ul><br /><li>Establish a test bed of references, out of what repositories find interesting. <br /><li>Create a repository API, repository plugin, OAI-PMH profile. <br /><li>JISC developer competition to develop toolkits. <br /><li>Then liaise with e.g. Crossref and establish collaboration: the commercial bodies already have such services. <br /><li>Then create a reference item processor as an external service, decomposing references into constituent data. <br /><li>Then build services like Citeseer and Google Scholar—or use those existing services, if they will collaborate. <br /><li>Then build exemplar GUI end-user services, e.g. trackbacks, visualisations. <br /><li>Liaising with publishers is important but not a dependency for the remaining tasks.<br /></ul><br /></ul><br /><br /><h3>UKOLN International Repository Workshop: Introductory remarks (2009-03-18)</h3><br /><br />From Norbert Lossau of DRIVER<br /><br /><ul><br /><li>The Vision underlying the workshop is the <a href="http://oa.mpg.de/openaccess-berlin/berlindeclaration.html">Berlin 2003 declaration</a>: free & unrestricted access to human knowledge.<br /><br /><li>Need infrastructure to complete the research cycle: discovery > reuse > storage and preservation, for data as well as papers, at an international access level.
Establishment of online reputation for researchers is critical.<br /><br /><li>Researchers have their existing discovery procedures; these are to be harmonised, not supplanted.<br /><br /><li>We are already advanced in global harvesting, preservation of papers, repository storage.<br /><br /><li>A global network of repository infrastructure hubs, rather than one centralised infrastructure.<br /><br /></ul><br /><br /><h3>UKOLN International Repository Workshop (2009-03-18)</h3><br /><br />Have just finished at the <a href="http://www.ukoln.ac.uk/events/ir-workshop-2009/">UKOLN International Repository Workshop</a>, twittered at <a href="http://hashtags.org/tag/repinf09">#repinf09</a>. The workshop was a joint JISC/DRIVER event; it had international scope, but there were only a couple of East and South Asian participants, and Andrew Treloar and myself from Oceania.<br /><br />The intention of the workshop was to formulate action plans which would make sense to fund for international infrastructure for repositories—in the first instance, research publication repositories. I took part in the identifier infrastructure workshop, and I have been cited publicly (though anonymously) as saying that it was "surprisingly pragmatic". The information superstructures that can be imposed over identifiers—and what they identify—can get quite open-ended and intellectually satisfying; but our business was to formulate something concrete, fundable, and realisable over the next year or so. What you put on top of it later is for another workshop.<br /><br />There were four streams to the workshop: four different kinds of infrastructure that could be put in place.
The four streams were:<br /><br /><ol><br /><li><b>Repository Citation Services</b>: Improving the ways in which citation data relating to open access research papers is shared. Citation data may cover forward or backward citations. Includes the ability to recognise citations in repositories and the open web. <br /><li><b>Repository Handshake</b>: Improving ways in which repositories can be populated with research papers, including from authors, other repositories, publishers and research management systems. The "handshake" involves negotiation between a depositing agent and a repository, building on <a href="http://www.jisc.ac.uk/whatwedo/programmes/reppres/tools/sword.aspx">SWORD</a>.<br /><li><b>Repository Interoperable Identification Infrastructure</b>: Improve the identification of entities in repositories and the making of connections across repositories, and provide useful services to do so.<br /><li><b>Repository Organisation</b>: Provide international organisational support to enable research repositories to work together to meet the objectives of Open Access and eResearch through a confederation of repositories.<br /></ol><br /><br />I'll post:<br /><br /><ul><br /><li> summaries of what these streams reported back in the three summary get-togethers in the workshop: a couple of streams really changed direction through the workshop. <br /><li> Then, some notes on the first session of the identifier stream (which were behind the first report-back).
We did not change tack as drastically as some streams, so they will still help inform what the stream eventually came up with.<br /><li> A summary of the SURF demonstration of their persistent identifier work and their enhanced document work.<br /><li>And finally (if I get to be so bold), my own take on what the identifier stream came up with.<br /></ul><br /><br /><h3>XRI, Handle, and persistent descriptors, Pt 2 (2008-11-19)</h3><br /><br />(Back to <a href="http://interopporesearch.blogspot.com/2008/11/xri-handle-and-persistent-descriptors.html">Pt 1</a>)<br /><br />Let's now look at our favourite XRI, <code>=drummond</code>. If I retrieve the XRDS for <code>=drummond</code>, through the resolution service <a href="http://xri.net/=drummond?_xrd_r=application/xrds+xml">http://xri.net/=drummond?_xrd_r=application/xrds+xml</a>, I get (as of this writing!)<br /><br /><ul><li>a canonical (and persistent!) i-number corresponding to the i-name <code>=drummond</code>, <code>=!F83.62B1.44F.2813</code><br /><li>A Skype call service endpoint<br /><li>A Skype chat service endpoint<br /><li>A contact webpage service endpoint<br /><li>A forwarding webpage service endpoint<br /><li>An OpenID signon endpoint<br /></ul><br /><br />The XRDS does not anywhere say what <code>=drummond</code> is identifying; just some services associated with <code>=drummond</code> contingently. I could infer Drummond's full name from the Skype username being <code>drummondreed</code>, but that's hardly failsafe.
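To make this concrete, here is a sketch in Python of pulling service endpoints and the canonical i-number out of an XRDS document. The sample document below is my own simplified reconstruction, not live data: the contact URI comes from this post, but the OpenID endpoint URI is hypothetical, and the xri.net resolution service may no longer be running, so the sample is parsed inline rather than fetched.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical XRDS for =drummond; only the contact URI below
# is attested in the post, the OpenID URI is an invented placeholder.
SAMPLE_XRDS = """<?xml version="1.0"?>
<XRDS xmlns="xri://$xrds" xmlns:xrd="xri://$xrd*($v*2.0)">
  <xrd:XRD>
    <xrd:CanonicalID>=!F83.62B1.44F.2813</xrd:CanonicalID>
    <xrd:Service>
      <xrd:Type>xri://+i-service*(+contact)*($v*1.0)</xrd:Type>
      <xrd:URI>http://2idi.com/contact/=drummond</xrd:URI>
    </xrd:Service>
    <xrd:Service>
      <xrd:Type>http://openid.net/signon/1.0</xrd:Type>
      <xrd:URI>http://example.org/openid/=drummond</xrd:URI>
    </xrd:Service>
  </xrd:XRD>
</XRDS>"""

XRD_NS = "xri://$xrd*($v*2.0)"

def parse_xrds(xrds_text):
    """Return (canonical i-number, {service type: service URI}) from an XRDS."""
    root = ET.fromstring(xrds_text)
    xrd = root.find("{%s}XRD" % XRD_NS)
    canonical = xrd.findtext("{%s}CanonicalID" % XRD_NS)
    services = {}
    for svc in xrd.findall("{%s}Service" % XRD_NS):
        services[svc.findtext("{%s}Type" % XRD_NS)] = \
            svc.findtext("{%s}URI" % XRD_NS)
    return canonical, services

canonical, services = parse_xrds(SAMPLE_XRDS)
print(canonical)  # =!F83.62B1.44F.2813
print(services["xri://+i-service*(+contact)*($v*1.0)"])
```

Note that nothing in the parsed result describes what <code>=drummond</code> is; the service types are the only clue, which is the gap the rest of this post is about.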
What I would like is access to some text like...<br /><br /><blockquote>VP Infrastructure at Parity Communications (www.parity.inc), Chief Architect to Cordance Corporation (www.cordance.net), co-chair of the OASIS XRI and XDI Technical Committees (www.oasis-open.org), board member of the OpenID Foundation (www.openid.net) and the Information Card Foundation(www.informationcard.net), ... </blockquote><br /><br />Oh, as in the contact webpage that <a href="http://xri.net/=drummond"> http://xri.net/=drummond</a> resolves to, <a href="http://2idi.com/contact/=drummond"> http://2idi.com/contact/=drummond </a>. Well, yes, but I did not know ahead of time that the contact webpage would have the information I wanted, with enough bio information to differentiate Drummond from other candidates: it's a contact page, not a bio page. (Drummond providing bio info is a <a href="http://en.wikipedia.org/wiki/Lagniappe">lagniappe</A>, which simply proves he knows about identity issues.)<br /><br />What I want is some consistent way of getting from <code>=drummond</code> to a description of what <code>=drummond</code> identifies. XRDS is a descriptor already, which is why <code>=drummond</code> resolves to it: it describes the service interfaces that get to <code>=drummond</code>. But it's a descriptor of service endpoints and synonyms; it still doesn't persistently describe Drummond, the way the DESC field does in Handle. (Or would, if anyone ever used DESC).<br /><br />Now, the technology-independent description of what is being described is needed for persistent identifiers; it's not as important for reassignable identifiers. So even if <code>=drummond</code> doesn't take me directly to a persistent description, persistence is still satisfied if <code>=drummond</code> takes me to <code>=!F83.62B1.44F.2813</code>, and <code>=!F83.62B1.44F.2813</code> takes me to a persistent description. 
XRI allows <code>=drummond</code> and <code>=!F83.62B1.44F.2813</code> to have different XRDS (because they can have different services attached)—though typically when an i-name is registered against an i-broker, the XRDS is the same. The requirement would be for the persistent description to be accessed through the i-number's XRDS, which may not be the same as the i-name's.<br /><br />The easy way of adding a persistent description to an XRDS is treating it as yet another service endpoint on the identifier: I give you an identifier, I get back a persistent description. Drummond's contact page already accidentally provides the description. What I'd like is some canonical class of service for getting to the persistent description. It could be something as simple as a <code>+i-service*(+description)*($v*1.0)</code> service type, to match the <code>xri://+i-service*(+contact)*($v*1.0)</code> type which gave me Drummond's contact page.<br /><br /><blockquote>This description service is actually the reverse of David Booth's <a href="http://thing-described-by.org/">http://thing-described-by.org/</a>. David starts with the URL for a description as a web page, <code>http://dbooth.org/2005/dbooth/</code>, and creates an abstract identifier <code>http://thing-described-by.org?http://dbooth.org/2005/dbooth/</code> for the entity described by the web page. XRI starts with <code>@xri*david.booth</code> (I can't see David actually registering his own XRI), which is already an inherently abstract identifier—unlike HTTP URIs. <br /><br />Getting from there back to the description <code>http://dbooth.org/2005/dbooth/</code> is a resolution; we could access it through <code>http://is-description-of.org/?@xri*david.booth</code>. (We would likely access it through the normal HXRI proxy <code>http://xri.net/@xri*david.booth</code> too; the point is, we're constraining the HTTP resolution to a specific kind of representation.
David Is Not His Homepage.)<br /><br />I'll note that David's description is worth emulating: <i>"The URI http://thing-described-by.org?http://dbooth.org/2005/dbooth/ hereby acts as a globally unique name for the natural person named David Booth with email address dbooth@hp.com (as of 1-Jan-2005)."</i></blockquote><br /><br />The catch with that approach is, we're now relying on an external service to guarantee the persistent metadata for our persistent identifier. And as I argued in the previous post, you don't want to do that: your system for persistence should be self-contained, since you are accountable for it. It is easier for the description to persist if it sits inside the i-number's XRDS than outside it.<br /><br />Even that does not give much of a guarantee of archival-level persistence. It is a feature and not a bug of XRI that users manage their own XRDS for personal i-names: the i-broker refers resolution queries back out to the user's XRDS, and promises only not to reassign the i-number. i-brokers do not commit to registering their own persistent metadata against the i-number. But once the user's XRDS goes offline, no one is able to resolve the i-name or the i-number. The trick with persistence in identifiers is, it's always persistence of something. Once the service endpoints for your identifier go away, you lose persistence of actionability. Not reassigning the i-number maintains persistence of reference (the i-number can't start referring to something else). But without a description accessible down the road, it does not maintain persistence of resolution (a user finding out what it referred to, even if no service endpoints are available).<br /><br />Maybe that's OK: XRIs are addressing a particular issue—digital identity across multiple services. If the user is trusted to maintain their digital identity, then XRI is not geared to address long-term archival needs.
In the same way, the user-centered practice of <a href="http://www.eprints.org/openaccess/self-faq/">self-archiving</a> has nothing to do with long-term archives (as Stevan Harnad has to keep <a href="http://www.eprints.org/openaccess/self-faq/#1.Preservation">repeating</a>—with only himself to blame for introducing the term in the first place.)<br /><br /><blockquote>Oh, can't resist: <a href="http://en.wikipedia.org/wiki/Self-archiving#External_links">Wikipedia entry on self-archiving</a>:<br /><ul><li><a href="http://selfarchive.org">SelfArchive.org</a>: a self-archiving wiki - DEAD LINK<br /></ul><br />Bwahah. And don't get me started on "archivangelism" with its emphasis on "arch"...<br /></blockquote><br /><br /><h3>XRI, Handle, and persistent descriptors, Pt 1 (2008-11-19)</h3><br /><br />This post is to suggest that <a href="http://en.wikipedia.org/wiki/XRDS">XRDS</a> (or equivalent) includes not just service endpoints, but also persistent descriptions—potentially as a distinct service endpoint. It takes a while to build up the argument, so I'm splitting it in parts.<br /><br />One of the critical insights we came up with in the <a href="https://www.pilin.net.au/">PILIN</a> persistent identifier project is: if you want the identifier to persist, it's not enough to just keep updating the URLs that the identifier resolves to. You want to record somewhere a piece of metadata that tells you what the thing identified is—independent of the URLs. That piece of metadata will itself be persistent: it will not be affected by any changes in the service endpoints of your identifier. But it doesn't have to be machine-readable: it can be a description in prose.<br /><br /><ul><br /><li>Having that piece of information helps you in disaster recovery.
If all your URLs go out the window, you can still use the description to reconstruct how the identifier should resolve (and reformulate the URLs). And you can't really claim persistence if you don't have some kind of disaster recovery.<br /><li>Having that piece of information is also critical for archival use of identifiers—<span style="font-style:italic;">after</span> the services resolved to are no longer accessible. (And persistent identifiers should persist longer than the services they had resolved to.)<br /><li>Getting to that piece of metadata in itself involves a service, and in itself is a resolution. (That means it can integrate into the current XRDS as a service endpoint.)<br /><li>But if you entrust that piece of metadata to a service outside your identifier management system, you are putting persistence at risk. <br /></ul><br /><br />Let me first illustrate this principle with the technology we used in PILIN, <a href="http://www.handle.net">Handle</a>.<br /><br /><pre>info:hdl:102.100.272/0N8J991QH </pre><br /><br />resolves to the Handle record:<br /><br /><pre><br />URL: https://www.pilin.net.au<br />EMAIL: opoudjis@gmail.com<br />HS_ADMIN: [admin bit masks]<br /></pre><br /><br />I can update my URLs and emails as things change, but that's pretty poor information management. If I disappear, and the DNS registration expires, I'm not allowing anyone to reconstruct what the identifier resolved to. If someone's found the Handle <code>102.100.272/0N8J991QH</code> on a printout at some point in the distant future (like, say, 5 years), and they find a Handle resolver which gives the information above, they too are none the wiser about what the Handle was supposed to identify.
Because the Handle was supposed to be persistent, it has failed.<br /><br />But Handle also provides a DESCription field, which allows you to say what is being identified:<br /><br /><pre><br />URL: https://www.pilin.net.au<br />EMAIL: opoudjis@gmail.com<br />HS_ADMIN: [admin bit masks]<br />DESC: Website for the PILIN project (Persistent Linking Infrastructure), <br />funded by the Australian Government to investigate policy and technology <br />for digital identifier persistence.<br /></pre><br /><br />That description is at least a fallback if the URL does not get maintained. I'd argue further that the description is the real resolution of the identifier (as PILIN defined resolution this year: information distinctive to the thing identified, differentiating it from all other things). The description actually tells you what is being identified, and it stays the same even if the URL location of the website does not. It gives a persistent resolution of the Handle, which is not constrained by a particular service or protocol.<br /><br />Moreover, if the description is part of the Handle record, then it will persist so long as the Handle record itself persists. It does not depend on an external agent to guarantee it sticks around. Which is what you want for the metadata that will guarantee the persistence of the Handle. <br /><br />If on the other hand I put my descriptions in an external service, like http://description-of.org/hdl/102.100.272/0N8J991QH , then I will lose my persistent descriptions if http://description-of.org goes down: I am dependent on http://description-of.org for the long-term persistence of my identifiers. 
And I should not be dependent: persisting my 102.100.272/0N8J991QH Handle is my responsibility (for which I am accountable), and it's what I set up my identifier management system to do.<br /><br /><a href="http://interopporesearch.blogspot.com/2008/11/xri-handle-and-persistent-descriptors_19.html">Next Post</a>, we run that notion against XRI.
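The fallback role the DESC field plays in this argument can be sketched as a toy resolver in Python. The record mirrors the PILIN Handle example above; the fallback logic and the <code>url_is_live</code> flag are my own illustration of the principle, not actual Handle System behaviour.

```python
# Toy model of the Handle record above: resolution falls back to the DESC
# field when the URL service endpoint is no longer live. The liveness check
# is stubbed out as a flag; a real check would be an HTTP probe.

HANDLE_RECORD = {
    "URL": "https://www.pilin.net.au",
    "EMAIL": "opoudjis@gmail.com",
    "DESC": ("Website for the PILIN project (Persistent Linking "
             "Infrastructure), funded by the Australian Government to "
             "investigate policy and technology for digital identifier "
             "persistence."),
}

def resolve(record, url_is_live):
    """Resolve a Handle record to something a user can act on.

    If the URL endpoint is live, return it; otherwise fall back to the
    persistent prose description, which survives the death of the URL.
    """
    if url_is_live and "URL" in record:
        return ("URL", record["URL"])
    if "DESC" in record:
        return ("DESC", record["DESC"])
    # Neither a live endpoint nor a description: persistence has failed.
    return ("NONE", None)

kind, value = resolve(HANDLE_RECORD, url_is_live=False)
print(kind)  # DESC
```

The point of the sketch is the last branch: a record with neither a live endpoint nor a description is exactly the failed printout scenario described above.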