2008-06-28

Version Identification Framework

David Puplett.

Two streams. VERSIONS project then VIF. Much overlap. VERSIONS: e-prints in economics. Main output: toolkit, mainly for authors of e-prints, to understand the types of versions they would be encountering in the Open Access landscape (preprint, postprint, draft, etc.)

Beforehand RIVER project and NISO / ALPSP project had come up with terminologies for journal article versions; controversial, because focussed on "versions of record", which privileged postprint as publisher version.

VERSIONS did interviews on what academics' practice was, when they made things public, to whom (e.g. scholarly networks, departments, repositories); then started own version of terminology. Also what behaviours to realistically expect of authors when contributing content: what engagement to realistically expect in differentiating versions on their own, and how aware they were of issues. (Lots of academics deleted as they went, left only printed versions behind.) Reports on surveys on VERSIONS website: lots of anecdotal material from author point of view. Output toolkit to disseminate to repositories: how to describe different versions, and how to make them useful in the repository context. Draft, Submitted, Accepted, Published, Updated.

Since specific to domain and e-prints, could go on with more recommendations; e.g. embedding versioning into cover sheet of e-print (with disclaimers). At mo', added manually. Coversheet embedding ensures googling still gets you the metadata. Inspired by arXiv's use of watermarking into the margins of the PDF.
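For concreteness, a sketch of how such a version coversheet might be stamped onto an e-print programmatically --- my own illustration, not theirs, assuming the pypdf and reportlab libraries; the field values follow the VERSIONS terms and are otherwise invented:

```python
from io import BytesIO

from pypdf import PdfReader, PdfWriter
from reportlab.pdfgen import canvas


def add_version_coversheet(src_pdf: str, dst_pdf: str, fields: dict) -> None:
    """Prepend a one-page coversheet carrying version metadata to an e-print."""
    buf = BytesIO()
    sheet = canvas.Canvas(buf)               # draw the coversheet on a blank page
    y = 780
    for key, value in fields.items():
        sheet.drawString(72, y, f"{key}: {value}")
        y -= 18
    sheet.save()
    buf.seek(0)

    writer = PdfWriter()
    for page in PdfReader(buf).pages:        # coversheet first...
        writer.add_page(page)
    for page in PdfReader(src_pdf).pages:    # ...then the original e-print
        writer.add_page(page)
    with open(dst_pdf, "wb") as fh:
        writer.write(fh)


# Example fields, using the VERSIONS toolkit terminology (illustrative only):
add_version_coversheet("paper.pdf", "paper-with-coversheet.pdf", {
    "Version": "Accepted",
    "Citation": "Author (2008), Journal of Examples",
    "Disclaimer": "Author's accepted manuscript, not the published version.",
})
```

Because the coversheet text is part of the PDF itself, a search engine that indexes the file still picks up the version statement.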

Open Access has been driver. VIF was about all objects in repositories, so not just items in publication cycle; e.g. also research data, videos. So not involved in Open Access debate. VERSIONS tried to be agnostic towards Open Access, but does support it through encouraging content depositing.

VERSIONS was mostly scoping. Need identified for broader solution to version identification problem; e.g. organising content in repository, versioning of metadata, cross-repository discovery (deduplication).

VIF applicable to any object in a repository. 10 months long, with concrete deliverables at the end. (Will be a short followup in September, surveying takeup and further publicity.) Started with its own survey, of academics and repository managers: how they discover content, when they find multiple versions, and how they found the version they wanted (or near enough). Bad news: very few people found it easy: terminology is confusing, or no metadata about versions presented at all. For accepted copies, self-deposited (esp. early on), there was minimal metadata gathered. People constantly going back to google, coz no metadata embedded on what they'd retrieved. Led to the framework work.

Education component: raise awareness, help contributors reflect on what versions are: there hasn't been enough repository outreach on educating, since effort has been just on gathering content. Needed to do doom-mongering: versioning is essential to establishing authority for research outputs. Audiences:

  1. repository managers (key audience);
  2. content creators (a difficult group to engage with --- even those already engaged knew only the bare basics and cared only about the bare basics, not repository mechanics. So minimised the advice burden: no overkill, rely on the toolkit to get people started. The toolkit covers advocacy by repository managers towards content providers.);
  3. software community: developers of the major repository packages as well as local systems teams customising repositories.

Got progress with EPrints, who were engaged, and have integrated version tagging à la VIF into their development; DSpace much further behind --- have scoped on their own that versioning needs to happen, but have not prioritised it for development. Fedora does versioning, through datastreams: not very flexible: assumes a linear sequence of versions, which VIF was not restricted to. Recommendations were not to individual software packages but generic. Version support in the three repository packages is different.

VIF counts as versions both FRBR expressions and manifestations: there is disparity in what different people count as a version. The Work (author-title) brings both kinds of object together. Not high awareness at the time of FRBR in the repository community.

E-prints application profile for scholarly metadata. Several app profiles coming through, ongoing development: images, geospatial data. E-prints have a straightforward FRBR structure; images much more problematic: e.g. what is the subject matter unifying the objects into a single Work. JISC wants the app profile groups to work together more for consistent outputs. Given disparity in resourcing of repositories, challenge of getting coherent policy nationally. To that end, the coversheet is much more doable than abstractions of FRBR and app profiles: have had to be pragmatic, not a technical project. Certainly not doing the big philosophical questions of what is a version, but tangible solutions and easy wins. (Unusual for JISC projects, which are typically more experimental.)

Blog updating existing subscribers on developments after project conclusion. Are getting generic questions about metadata which have version issues involved. Written articles to maintain awareness. Have not gotten into learning objects: versioning had not been an issue in UK, because only the latest version is maintained. Also large scale repository issues with mapping of astronomical data into different versions are way too complex to be in scope.

Some limited vocab work, but not focus of project; reflecting existing best practice.

2008-06-26

e-IUS

Whole project team.

JISC 2 year project, 9 months left. Aim: to capture experience of using e-infrastructure to support research, driven by research community viewpoint.


  • Interviews, 1-to-1 and 1-to-team: user experience reports;
  • thence, use cases: actually scenarios, based on fact not fantasy.
  • Once critical mass of scenarios, Service Usage Models (SUMs).


Primarily research community, aim to raise awareness of how e-infrastructure can do things for them. Do need to make outputs applicable cross-domain. Also engagement at higher level with JISC/funding: SUMs feed into funding process. Project can serve as snapshot of takeup of e-infrastructure, now that that phase of deployment is wrapping up.

Mercedes has been full time RA doing e-framework writing since start of project. Others have come in later. Have been recruitment delays. Initial output was scoping study, core document (must read): methodology with critical evaluation, sample use cases from interviews, with reflections both from team and researchers. Tried and true methodology, used in other projects already. Their SUMs are higher level than some of the JISC SUMs.

Methodology challenge: establishing contact for interviewing. Not enough resources to embed into projects (as the anthropologists would prefer), so hour-long open-ended interviews. Discuss day in the life of their research. Do not ask what infrastructure they use. Establish points of connection between their research and the extant e-infrastructure services by elicitation. There is still much overlap between practitioners and developers, so still fuzzy in establishing that boundary.

e-infrastructure is broadly meant: not just e-science, but not as broad as just using Word either. Advanced networked IT is what they understand as e-infrastructure: the national Grid is archetypal such e-infrastructure: common, service-oriented. Much e-infrastructure is not yet services, still pilot oriented to a particular project.

Use cases: long version (prose) => short version (pictures). Have ensured they're aligned to each other. Given scenario, identify what is behind it at different levels. E.g. for grid SUMs, authentication is a separate, omnipresent level of functionality; so is job monitoring. Only some of the functions are used in any one scenario. Also translate requirements statement (GEMS-I: Grid enabling MIMAS services project) into business processes.

Used SUMs to compare systems: which service genres do they have in common, and what effort are they replicating. Very good summary paper which does so for GEMS-I, GEMEDA, MOSES.

In general: cogent --- I'd say compelling --- tie-in of business analysis to service usage models, anchoring one to the other, and an excellent model for getting stakeholder buy-in into the e-framework. I think they've got it exactly right.

Virtual Research Environment for the Humanities

Ruth Kirkham, project manager

2005: requirements gathering. Go to humanities researchers, and see what they wanted --- not build it first and force it on them. How would it integrate into their daily work, what tools do they currently use, what would be useful. 6 months. Towards the end, four demonstrations to run past them:


  • Research discovery service (already in place in medical sciences; pushing Oxford humanities research into the web). Because Oxford is so decentralised, people didn't even know what was happening within faculties (people talk within colleges instead). Customising medical discovery solution to the humanities.
  • Physical tools: digital pen and paper (also done with biology vre), allowed digital notetaking. But libraries would not allow anything in the building with ink in it! Are working on pencil version. Hasn't been followed up so far, will likely get resumed next phase.
  • Access grid, personal interface. "But we have books, I want to use my office." Overtaken by developments like Skype.
  • English department, Jane Austen manuscripts being digitised and compared: cross searching of databases into their environment: English Books Online, Samuel Johnson dictionary, 18th century bibliographical dictionaries, etc. This did not become a fully fledged service, but has gone forward.



Ancient Documents demonstrator got more funding, became more robust, fully functional mockup. The English engagement was going on while the ancient documents work was going on, and is on-going. The VRE for docs & mss was funded last year, running March 2006-March 2009: broadening outputs from the previous project and working them into the VRE. Focussing on those two demonstrators.

Development & user requirements are iterative: feedback from users every three months. Now have functioning workspace; most goodies are on the development site, not the more public site. Most work is still with the ancient documents people. Recently have started speaking to English again. Archaeology Virtual Research Environment at Reading (VERA) is trying to pull in data from databases; working with them to situate artefacts within their archeological contexts; currently proof of concept. (The Reading work is on a pre-literate site at Silchester, so the data isn't there yet, but the conceptual work can still be demonstrated.)

Recently moving to generic RDF triple store, which will accommodate disparate unstructured data better. Of course not well optimised for access like relational databases are, so ultimately may be scalability issue (but probably not in this discipline). Annotations stored as RDF, as well as metadata about images fitting an ontology. Had to homebrew ontology, what was out there did not match requirements. Intend ultimately to link to CIDOC ontology (museums & archives), which is gaining them traction for artefacts like monuments.
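For concreteness, a sketch of what one of those annotations might look like going into the triple store, using rdflib; the namespace and property names are stand-ins for their homebrew ontology, not the real thing:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, DCTERMS

# Hypothetical namespace standing in for the project's homebrew ontology.
VRE = Namespace("http://example.org/vre-ontology#")

g = Graph()
g.bind("vre", VRE)

image = URIRef("http://example.org/images/stylus-tablet-42.tif")
annotation = URIRef("http://example.org/annotations/1")

g.add((image, RDF.type, VRE.ArtefactImage))
g.add((annotation, RDF.type, VRE.Annotation))
g.add((annotation, VRE.annotates, image))
g.add((annotation, DCTERMS.creator, Literal("A. Scholar")))
g.add((annotation, VRE.reading, Literal("Possible ligature at line 3")))

print(g.serialize(format="turtle"))
```

The appeal of the "woolly mess" is exactly this: new properties can be added per annotation without migrating a schema.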

Hey, the ontology website is hosted in Greece. Must keep in mind for next junket. :-)


Lots of linked images of the same artefact, varying from each other minutely; no standard to capture that distinction. Other VRE is looking at polynomial texture map: how the object reacts to light, a polynomial for each pixel. Need lots of photos (30-50). Effect is virtual "moving a torch across" the artefact, seeing it under different lighting. (Is not going so far as a 3D model.) Archaeologists already using this tech live. There will be bulk photo'ing of the Vindolanda tablets at the British Museum in July.
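The fitting step behind a polynomial texture map can be sketched roughly as follows --- my own illustration, assuming the usual biquadratic-in-light-direction form of the per-pixel polynomial, not their code:

```python
import numpy as np

def fit_ptm(intensities: np.ndarray, light_dirs: np.ndarray) -> np.ndarray:
    """Fit a biquadratic polynomial per pixel from many photos under known lights.

    intensities: (n_photos, n_pixels) array of pixel luminances.
    light_dirs:  (n_photos, 2) array of light direction components (lu, lv).
    Returns (n_pixels, 6) coefficients a0..a5 with
        L = a0*lu^2 + a1*lv^2 + a2*lu*lv + a3*lu + a4*lv + a5.
    """
    lu, lv = light_dirs[:, 0], light_dirs[:, 1]
    design = np.column_stack([lu**2, lv**2, lu * lv, lu, lv, np.ones_like(lu)])
    coeffs, *_ = np.linalg.lstsq(design, intensities, rcond=None)
    return coeffs.T

def relight(coeffs: np.ndarray, lu: float, lv: float) -> np.ndarray:
    """Evaluate the fitted polynomials for a new 'torch' position."""
    basis = np.array([lu**2, lv**2, lu * lv, lu, lv, 1.0])
    return coeffs @ basis
```

Hence the 30-50 photos: each pixel's six coefficients are over-determined by the set of known light directions.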

Plugging in work from Ségolène Tarte, who is also working for Prof Bowman: imaging ancient documents, image analysis of artefacts. Removing woodgrain from pics of stylus tablets, to highlight the writing.

Will be pursuing collaboration with Aphrodisias project at King's College --- and anyone else relevant in the field.

Oxford CMS is SAKAI. Intend seamless integration. Classicists haven't picked up on CMS yet as a practice. Publishing images, annotations to repositories not something on the classicists' horizon yet (too traditional, and community moving very slowly in that direction), but Oxford institutional repository very much interested. Will wait on institutional repository to give the lead, and the team will do the interfaces to it. Single button deposit vision would make life easiest for them. The project is careful to follow the users and not dictate solutions to them.

Lessons:


  • Cool --- and useful --- eye candy
  • They know well the cultural resistance they will find, so they are developing features slowly and incrementally; and always what the users want and see the point of, rather than what is blindingly obvious (e.g. electronic publication)
  • In the humanities, it's all about the metadata, not the data. (The data are just photos)
  • Hence the embrace of the woolly mess of RDF --- a very far way from the rigid schemas of CCLRC
  • Portal environment for collaborationware, and to provide access to online databases. Just access: they're not going to try and RDF the Perseus project (mercifully)
  • On the other hand, bold vision of mashing up classics and archeological data (from VERA) to situate their data in context.

    • That's the kind of cross-disciplinary work that needs to happen and doesn't
    • They're lucky to have champions in both disciplines who "get it"
    • Mashing up RDF with RDF? It'll be wonderful when it happens; it'll also be very difficult.

2008-06-24

RIDIR

The proper pronunciation of the project name is with the first i long: [ɹaidəɹ].

Spent well over an hour gesticulating my way through the high-level use case diagrams of the PILIN national service, and the deliverables. (No software demo.)

RIDIR is leading up to UK mgt of national identifier scheme: fraught, but doable. JISC interested in scoping whether PILIN stuff is useful; if this happens, it'll be in the next nine months. UK needs to be sold on uses of identifiers to support national identifier infrastructure. PILIN has some interesting ideas they want to hear at more length to use in their advocacy.

Martin Dow and Steve Bayliss (ex-Rightscom, now Acuity Unlimited) have done extensive ontology work; we must hook up with them to find out how for our own ontology formalisation. Their chosen ontology IRE (built on top of DOLCE) can manage workflows and time-dependencies. They have also worked in the past on OntologyX.

Some subject domains get central data services here more than others (e.g. atmospheric data service is centralised, but there is no central geo data service: they just make data and throw it away).

JISC had started out from Norman Paskin's d-lib article, understanding that identifiers were happening in the physical world and with Rightscom's work, and then wanted to investigate analogies with the issues in the repository world. Hence the initial workshops; no obvious use cases or pain points ("fear factors") emerged from the workshops. Was clear to RIDIR, a few months in, that the communities had no clear requirements for identifiers. There were only vague indications of the importance of identifiers to the communities. Rightscom had to go back to JISC for a steer half-way through: they could either (a) talk about value of identifiers & persistence in itself; or (b) illustrate how to do persistence ("cost" approach), taking the need for persistent identifiers as given. The latter was preferred, muchly because it was felt PILIN had already captured the why's and technical details of identifier persistence. (They are satisfied from my presentation that PILIN and RIDIR are still complementary.)

RIDIR went for corner cases, which sit outside policy and identifier infrastructure. Still difficult to arrive at concrete cases. They are at the last quarter of designing & building software, which had been held up because of lack of use cases. They only had time for a couple of software iterations. But in tandem with PILIN, will be useful work.

Would be v. interested in getting what they can from PILIN, about the usefulness and structure of a national provider, before they finalise their report.

Cf. Freebase as dynamically registered type language framework, lets you identify anything with your own vocabulary.

Demonstrators: the achievable use cases to highlight issues:

  • Lost resource finder, backtracking through OAI-PMH provenances, to mitigate when identifiers really do get broken (cited resources which no longer resolve). Accepting that the world isn't perfect, and dealing with the problems as they arise.
  • Locating related versions of things. Vocabulary for relations between things: making the assertions of relation first-class objects, with names and types and authorities, drawing on existing vocabularies. There is no agreed vocab of what is identified, so needed to put in place a free-form, user-driven semantic workspace: they choose their ontologies, you just provide the infrastructure for it.


Could go further with this; e.g. OpenCalais, which extracts vocabs of concepts from a corpus, and could reconcile them with the labels of the concept graphs.

RIDIR is about identifier interoperability, which means:
  1. metadata interoperability, reconciling different schemas; metadata are claims about relation between identifier and thing;
  2. mechanisms for expression of relationship between different referents (e.g. relationship service);
  3. creation of common services, consistent & predictable user experience across services


Demo 1: Lost Resource Finder. Capture relations between URLs that no longer work and new URLs; no reverse lookup via a Handle. Two relations between URLs; the relations are RDF driven. First, a manager can register a new URL corresponding to an old URL --- leading to a redirection splash page (it's an *authoritative* redirect record). This deals with decommissioned repositories. Second, can also crowdsource redirections of lost resources: "RIDIR users suggest that this has been redirected to...", complete with confidence rating. Redirect allows the user to supply their own vote: "that's it", "no it isn't", etc. Can also search for alternate versions of content through searching the OAI-PMH harvest of all [e-research] repositories in the UK. OpenURL metadata driven query can be used to plug in to redirect, driven by metadata, to the new resource --- and identifiers are not invoked at all: this takes place, after all, as a fallback when the persistent identifier has already failed. This is a custom 404 page, offering heuristic alternative resolutions.
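My own toy rendering of the fallback logic on that 404 page, with in-memory stand-ins for the authoritative redirect register, the crowdsourced suggestions, and the OAI-PMH-harvested metadata (all names and URLs invented):

```python
# Toy data structures standing in for RIDIR's stores (all hypothetical).
AUTHORITATIVE = {  # manager-registered redirects: old URL -> new URL
    "http://old.repo.example/eprints/123": "http://new.repo.example/items/123",
}
CROWDSOURCED = {   # user suggestions: old URL -> [(new URL, up-votes, down-votes)]
    "http://old.repo.example/eprints/456": [
        ("http://mirror.example/456", 12, 1),
    ],
}
HARVESTED = [      # metadata harvested over OAI-PMH from UK repositories
    {"title": "A study of things", "url": "http://another.repo.example/789"},
]

def resolve_lost(url: str, title_hint: str = ""):
    """Heuristic resolution offered on the custom 404 page."""
    if url in AUTHORITATIVE:
        return {"kind": "authoritative redirect", "target": AUTHORITATIVE[url]}
    if url in CROWDSOURCED:
        best = max(CROWDSOURCED[url], key=lambda s: s[1] - s[2])
        confidence = best[1] / (best[1] + best[2])
        return {"kind": "crowdsourced suggestion", "target": best[0],
                "confidence": round(confidence, 2)}
    # Last resort: metadata-driven search over the harvested records.
    matches = [r for r in HARVESTED
               if title_hint and title_hint.lower() in r["title"].lower()]
    return {"kind": "metadata search", "candidates": matches}

print(resolve_lost("http://old.repo.example/eprints/123"))
print(resolve_lost("http://dead.example/x", title_hint="study of things"))
```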

This was accompanied by a demo of the underlying RDF (not in final release); they have reified all the assertions involved in the model, so they can make claims about them. ORE came into being too late for the RIDIR project to engage with substantively, though they like what they see; maybe next extension.
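For the record, standard RDF reification is the trick that lets an assertion itself carry an authority and a confidence value; a minimal rdflib sketch, with the vocabulary URIs invented rather than RIDIR's own:

```python
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/ridir#")  # hypothetical vocabulary

g = Graph()
old = URIRef("http://old.repo.example/eprints/123")
new = URIRef("http://new.repo.example/items/123")

# The assertion itself, plus a reified version of it we can talk about.
g.add((old, EX.redirectsTo, new))
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, old))
g.add((stmt, RDF.predicate, EX.redirectsTo))
g.add((stmt, RDF.object, new))

# Claims about the assertion: who made it, and with what confidence.
g.add((stmt, EX.assertedBy, URIRef("http://example.org/users/repository-manager")))
g.add((stmt, EX.confidence, Literal(0.9)))

print(g.serialize(format="turtle"))
```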

Demo 2: Locate Related Versions. We have identifiers for stuff and relations between them again, the relation changes to versioning. Demonstrator works off TRILT (all BBC broadcasts in UK), and Spoken Word Service (repository of BBC items useable as learning resources); wanted to do relationship assertions across repositories to build added discovery value. Had a ready list of relations (not just "identical-to"); but building a complete vocab was completely out of scope and inappropriate, esp. given disparate communities of users. Can crowdsource suggestions of other related versions of the resource, and indeed of other related content. Very very cool.

And I demo'd icanhascheezburger.com as instance of crowdsourcing links between resources.

REPOMMAN


Two goals:

1. Facilitate workflow interaction with the repository, to support personal interaction. There has been less takeup of repositories because people are only asked for input at the end of the creative process, so the input looks to them like an imposition, an extra task. REPOMMAN aims to allow use of the repository at the start of the creative process, to capitalise on the benefits of a repository: e.g. a first draft deposited in the repository is secure and backed up. REPOMMAN uses FEDORA, so it has versioning, allows for backdating, revert. Web accessible tool, so users can interact with their own files from anywhere on the internet, more flexibly than they would with a network drive. Many many types of digital content, so didn't want to restrict to any one genre: hence FEDORA. Not focusing on open access (Hull is not research intensive) or e-prints, but enabling structured management whatever the content. Much organisational change at Hull about learning materials, which is now settling down, and will decide takeup in that sphere. Pursuing e-thesis content as well.

Pragmatically, too much variation to serve all needs, so REPOMMAN could not go down the ICE path of bolting workflow on to content. They treat it as a network drive, competing with Sharepoint, to get user engagement. Interface mimics FTP.

2. Automated generation of metadata. Another perceived barrier to takeup of repositories, esp. for self-archiving. Still no perfect solution, but some things can be done with descriptive metadata. To capture metadata: aspects of the profile of the user depositing --- deposit happens through a portal. Can capture tech metadata (JHOVE). For descriptive metadata, went hunting for tools; the best one was IVEA (backend) within the Data Fountains project (frontend), ex UC-Riverside. Available as a download as well as an online demo; Linux (Debian & Red Hat). Easy install.

Most extractors match texts against standard vocabularies/schemas, to identify key terms; e.g. Nat Library NZ, agricultural collections for metadata extraction (KEA) for preservation metadata. But such solutions need an established vocabulary, and work best with single subject repositories, not practical for institutional repositories. IVEA does not require standard vocabs. Has been trained to deal with a wide range of data; not infallible, but good enough. Deals with anything with words in it. So long as the metadata screen is partially populated, easier to complete population.
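Not the extraction tool itself, but to fix ideas: a crude sketch of pre-populating the descriptive part of a deposit form with naive stopword-filtered term frequency. Everything here is my own illustration, not the RepoMMan pipeline:

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "a", "to", "in", "is", "for", "on", "that", "with", "are"}

def suggest_descriptive_metadata(text: str, n_keywords: int = 5) -> dict:
    """Crudely pre-populate a deposit form; the depositor corrects and completes it."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    terms = Counter(w for w in words if w not in STOPWORDS)
    first_line = text.strip().splitlines()[0] if text.strip() else ""
    return {
        "title": first_line[:120],                 # guess: first line of the document
        "keywords": [w for w, _ in terms.most_common(n_keywords)],
    }

print(suggest_descriptive_metadata(
    "Automated metadata generation for institutional repositories\n"
    "We evaluate extraction tools for pre-populating deposit forms..."
))
```

"Good enough to edit" is the bar: the depositor fixes the guesses rather than facing an empty form.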

Joan Grey is setting up Data Fountains account from ARROW. Proposed to do parallel QA tests with Hull. Must follow up; must hook ARCHER up as well.

User requirements gathering done at start: no surprises. Interviews with researchers, admins, teachers. Not released in wild yet: sustainability issue (gradually working towards), and people need to cross the curation boundary from private to public repository, so the additional publish step needs to be scoped as well. REMAP is the followup project dealing with that, underway this year. It refocusses REPOMMAN on records management & preservation. These are library processes that repositories should be supporting, and can help get records in the right shape for subsequent processes. REMAP sets flags & notifications for when tasks should happen; e.g. review annually, obsolete, archive (e.g. PRONOM, national archives, AONS). Proactive records management. Notifications are to humans. They use BPEL, and BPEL for People. REPOMMAN identified the need for the repository to support admin processes. These were low hanging fruit.

Scope for more testing, hoping to do so in parallel with ARROW. Need to generate "good enough" metadata, not perfect; already appears to be there. Working on the institutional takeup; e-theses most promising avenue.

Although REPOMMAN is personal space, would like to get into collab space as well. I noted the definition of the curation boundary allowing the distinction between collab space and public space to be formulated.

2008-06-22

Everything Oxford

Anne Trefethen



Director, Oxford e-research Centre

Have tools for data integration in climate research in collab with Reading; much of it based on Google Earth overlays, RESTful computation services. Will be enabling community participation à la Web 2.0, to make it a shared resource.

SWITCH have developed a short-lived certificate service: Shibboleth => ticket access to the Grid. JISC is doing similar work for the national Grid service.

UK is missing federated ID in their shibb, and Simon Cole's group (Southampton) keen on taking it up. Australia has finally gotten the federated attribute through in AAF.

Project for integrating OxGrid outputs into Fedora research repository; will be interested in ARCHER export to METS facility.

Open Middleware Infrastructure Institute (OMII-UK; Neil Chue Hong, director; omii.ac.uk) would be an opening for concerted dissemination of e-research workflow stuff from ARCHER.

Concerns over business model to sustain infrastructure in the longer term, especially as data management plans are requiring maintaining data for 10 years.

Current feasibility study on UK research national data service: Jean Sykes

----

Mike Fraser



Head of infrastructure, OCS. Access Mgt and hierarchical file service (archiving & backup, not generalised file service). Backup of clients whatever their purposes; the archiving service is research-oriented: long term file store, no data curation, tape in triplicate. IBM Tivoli system. 1 TB free, review after 5 yrs. Need a contact person for the data for follow up. Data not always well documented. Migration & long term store by policy.

Currently metadata is at the depositor's discretion. Archiving has become more prominent than the backup service. Not providing proactive guidelines on how to bundle data with metadata & documentation for longterm reuse. No centralised filestore, data storage is funded per project. Oxford is decentralised, so is a federation of file storage; SRB as middleware for federation is attractive. Need to be able to integrate into existing Oxford workflows.

Luis Martinez-Uribe



ex-data librarian, LSE. Scoping digital repository services for research data management. Currently interviewing researchers for their requirements and current practice in managing their data. They do want help from the central university. Requirements: solution for storage of large research data, esp. medicine & hard sciences (simulation data); infrastructure for sustainable publication & preservation of research data; guidance & advice. Publishing in a couple of weeks. Used the data collected as a case study for scoping of the UK Research Data Service.

John Pybus



Building Virtual Research Environment for Humanities. (I'll be revisiting the project manager for this project, Ruth Kirkham, next Wednesday.)

Collab environment, not just backroom services. First phase 2005 was requirements analysis: use cases, surveys of Humanities at Oxford. Identify what the differences were from sciences and within humanities. Large scale collab is much less important: a lot of lone researchers, collabs are small and ad hoc. Then, second phase was pilot: Centre for Study of Ancient Documents, director Prof Bowman. Imaging of papyri & inscriptions. VRE is focussing on workspace environment beyond image capture, to produce editions. Collaborative decisions on readings of mss are rare, since ppl don't often get together around the artefacts.

Technically, standards-compliant portlets (JSR 168), deployed in uPortal. Not as bad to develop as it might have been. Java, not sexy like My Experiment (Ruby on Rails). End goal is a set of tools that can be deployed outside Oxford Classics, so need the ability to develop custom portlets on top.


  • High-res zoomable Image viewer & annotation tool: online editor; intend to make harvestable: will be Annotea schema (though not Annotea server). Also annotations on annotations. There are private, group, and public annotations.
  • Chat portlet and other standard collab tools.
  • Link to databases &c., bring them into the portlet environment.


Schema & API/web services need to reuse existing material.

Common desire within the research fields to find other ppl, even within Oxford, to collaborate with.

Iterative refinement of tools.

Where possible, shibbolised, given how dispersed ppl are in the field. There are enough academics already on IDPs, that this is doable.

Difficulty of conveying to users which of the content they're contributing is publicly viewable and which isn't. My Experiment does this as a preview.

Ross Gardler



EDIT: see comments for corrections

OSS Watch: open source advisory service for HigherEd and JISC funded projects. Institutional: procurement (advice on the choice of open versus closed infrastructure). Project level: project calls include a para on sustainability, which involves them in an advisory role. Sustainability only a priority for JISC in the past two years: a change in the existing project structure --- no projects had budgeted for sustainability.

Until now, projects evaluated on user satisfaction, not sustainability. New funding round (August): self-supporting, community-driven, knowledge-base approach rather than a central advisory body. OSS Watch also assign funded resources to strategic projects as a priority for sustainability, to give them the necessary leg-up to make them sustainable and generic without disrupting the project.

Overlap with the Open Middleware Infrastructure Institute, which also develops software; OSS Watch won't, but instead builds communities. Sustainability needs larger communities than just institutions: OSS Watch interested in linking up projects across institutions for critical mass. (Especially because the champions for open source projects are thin on the ground.)

Happens through selecting the right power users and managing their expectations, so they can be hand-held through playing with the development. OSS Watch currently mediate between developer expertise and users, translating feedback as a buffer. In the longer term, they will showcase successful sustainable open source projects to advocate projects signing up of their own will.

Problem in selecting which project is a strategic priority without domain knowledge across all domains.

2008-06-19

BECTA presentation

Gave BECTA folks summary of our work and deliverables on PILIN and FRED. BECTA are themselves at the scoping stage, and had some welcome scepticism about whether it was worth doing certain things: investing in persisting lots of identifiers, dealing with Their Stuff vs. Our Stuff, setting up formal repositories with metadata as opposed to leaving things to Google custom search. Aware of the non-resolvability of hdl:; I retorted with the Standard Resolver solution, as PILIN advocated with resolver.net.au. Wanted to be able to establish flexibility of resolution services, and granularity; we discussed information modelling. Were interested in the Reverse Citation service as a solution to the bookmarking-of-Handles problem.

Were curious about the extent of persistent identifier takeup by e-learning content providers, and what pain points LORN was finding. Liked the service approach in FRED, but were aware that little inconsistencies in profiling would ruin interop. Found the decentralisation constraints (for different reasons) of the VET and School sectors in Australia interesting. Interested in the FRED SUM as a resource. Would be outsourcing at least some functionality to commercial providers for greater persistence (market-sustained rather than project-based); interest in tuning ranking of searches rather than sticking with elaborate metadata searches. Rights management piggybacking on institutional affiliation of the requester would presumably exploit shibboleth attributes. Want to get engaged with IMS LODE again. Interested in keeping the channel open with us.

Here endeth the point form.

TILE Workshop

TILE Reference Group Meeting

Phil Nicholls in attendance (who has already named me as an "acolyte of Kerry"); he has produced SUMs for the Library 2.0 requirements of data mining for User Context data (strip-mining logs, I guess), and identifying content as relevant to a given user context (correlating courses to reading lists and library loans). In other words: how to extract user context from users like me, to recommend content to me; and how to datamine that user context into existence in the first place. Reading lists, loans records, enrolments, repository logs, user feedback.
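A toy sketch (mine, not TILE's) of the "borrowers of X also borrowed Y" logic that such recommendations boil down to, working from anonymised loan records; data and function names are invented:

```python
from collections import Counter

# Anonymised loan records: user id -> set of borrowed item ids (toy data).
loans = {
    "u1": {"intro-stats", "research-methods", "spss-guide"},
    "u2": {"intro-stats", "research-methods"},
    "u3": {"intro-stats", "qualitative-methods"},
}

def co_borrowed(item: str, top_n: int = 3):
    """Rank items most often borrowed by the same (anonymised) users as `item`."""
    counts = Counter()
    for borrowed in loans.values():
        if item in borrowed:
            counts.update(borrowed - {item})
    return counts.most_common(top_n)

print(co_borrowed("intro-stats"))
# [('research-methods', 2), ('spss-guide', 1), ('qualitative-methods', 1)]
```

Course enrolments and reading lists slot in the same way: they are just further sources of co-occurrence against which loans can be correlated.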

Larger context: the Read Write Web: recommendation engines --- a key tech in 2008 on the web. The space for TILE is already populated by non-library providers. The domain is responding to the requirement: libraries talk to Amazon, have borrowing suggestions at Huddersfield. MESUR project has farmed ginormous amounts of data on loans and citations. Higher ed libraries have huge amounts of data that can be capitalised on for resource discovery, and which is comparatively well defined.

Going e-framework to get synergies with the other domains the e-framework is in (e-learning, e-research).

Tension of e-framework specificity and leveraging/reuse vs. "constant beta", flexible software development, which is contra rigid specs. Need to experiment for a length of time before fixing things down in the e-framework. The approach is seen as more questionable at a local institutional level than in a national context.

Need shared vocab, not just shared software, to move forward in the library field --- enable dialogue between participants in the national context; e-framework can help build up the vocabulary again.

Sidestepping researcher identity as feeding into this: too hard for now (not familiar with the domain), quite diverse in interface and sparsely populated. The student data is rich and uniform, so working with that as a priority.

Pain points: why isn't your uni library catalogue already like Amazon? How do you get the bits of the uni to talk to each other to deliver this? Why, and when, do people want an Amazon experience on their library catalogue?

Students already compare notes informally about their reading, which *might* motivate this kind of recommendation structure. But libraries are worried about data privacy; and US are even more touchy. Data will be anonymised; but Student Admin will ask questions once data is aggregated by individual subject, let alone grades awarded.

Peak use of loan recommendations in the existing prototype (Huddersfield) is a month after start of term, when students start exploring beyond their prescribed reading lists.

Reading lists are useful inputs, but not necessarily useful outputs: they are fixed by academics just the once.

e-portfolio a more important parameter for driving this tech than transitory external social networks like Facebook.

Contexts are multiple: can be institutional as well as individual ("what are our students reading?"), and people have multiple identities (Facebook vs. enrolment record): context needs to be tied down, to work out what to harvest. Students are also enrolled in more than one institution! If recommendations should be driven by learner-centered approach, then learner should have control of how their recommendations are used.

--- But if we just throw information into the open, without prescribing context, then contexts will form themselves around what data is available: users will drive it. (Web 2.0 thinking: no prescribed service definition, but data-centric driving.)

Systems need to be able to capitalise on this data to improve e.g. discovery (clustering).

***

Architectures of participation: the efforts of the many can improve the experience of the individual. Need to articulate benefits to users to motivate them to crowdsource.

Deduplication is key to users of catalogues. But surely that shouldn't mean JISC implements its own search engine?

OPACs not very good for discovery (no stemming, spellcheck); don't do task support (e.g. suggest new search terms).

Impediments: control, cultural imperatives; user base --- include lifelong learners?; trust, data quality (tag noise); data longevity; task/workflow support (may not support full workflows, which are not well understood, but can support defined tasks); cost; granularity of annotation target.

Unis are already silo'ing in their learning environments (Blackboard), and that's where they put their reading lists: how do you get information outside the silo?

JISC build a search engine? No, JISC get providers to open up their data, so the existing open source etc. efforts can inform their own search engines with the providers' contextual information.

CNRI Handle System Workshop: Middle Half

Daan Broeder, Max-Planck Institute for Psycholinguistics: Handle System in European Research Infrastructure Projects



Already been involved in infrastructure for several e-research projects.

Reliable references and citations of net accessible resources, particularly in language: audio-visual, lexica, concepts...

Number of resources can be large, especially if disaggregating corpora. (Then again, a persistent identifier for each paragraph is overkill, and has a cost.)

Identifiers are cited, and embedded, and in databases.

CLARIN project: making language resources more available. Aims to create federation of language resources, mediated through persistent identifiers. Preparatory phase 2008-2011, construction up to 2020. Builds on DAM-LR project: unified metadata catalogue, shibboleth federation, Handle system.

For flexibility, DAM-LR minimised amount of sharing required. Developed mover (move data + update identifier), and restore Handle DB from scratch. All data needs to be recoverable from the archives themselves. Found that federation is not for all organisations, does impose an IT burden. Need centralised registration.

Max-Planck Society wants PID system throughout the Max-Planck society, which will also support external German scientific organisations.

Requirements for CLARIN: political independence: European GHR and no single point of failure; wide(r) acceptance of PID scheme (w3c!); support for object part addressing (ISO TC37/SC4 CITER: citation of electronic language resources); secure management of resource copies.

CLARIN will do third party registration for small archives.

Ongoing static from W3C in ISO. Proposes URLified Handles, suggests ARK model: http://hdl.handle.net/hdl:/1039/R5

Part identifiers: just like fragments in URIs: A#z => objectA?part=z, with a standard syntax for "z" for the given data type, exploiting existing standards.
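A tiny sketch (my own gloss) of that rewrite: mapping a fragment-style part identifier onto a query parameter, leaving the syntax of the part itself to the relevant data-type standard. The URL is invented:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

def rewrite_part_identifier(identifier: str) -> str:
    """Turn objectA#z into objectA?part=z; the syntax of 'z' is type-specific."""
    base, _, part = identifier.partition("#")
    if not part:
        return identifier
    scheme, netloc, path, query, _ = urlsplit(base)
    extra = urlencode({"part": part})
    query = f"{query}&{extra}" if query else extra
    return urlunsplit((scheme, netloc, path, query, ""))

print(rewrite_part_identifier("http://hdl.handle.net/1234/example-corpus#p3"))
# -> http://hdl.handle.net/1234/example-corpus?part=p3
```

The point of the query-parameter form is that the part request reaches the server hosting the object, rather than being stripped off client-side like a fragment.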

Replicas: federated edit access to handle record by old and new managers. Known issue of access by multiple parties, trust. Could also have indirect Handles, i.e. aliasing. Not everything supports aliases well, doubtful status of the new alias for citation.

Value-add services: document integrity; collection registration service (single PID for collection, with aggregation map à la ORE); citation information service (acknowledgements, preferred citation format to be included in citation); lost resource detective (trawl the logs, the web, etc to find where the resource has ended up, including tracking provenance history of who last deposited).



---

What have I learned from the Handle workshop?


  • Handle type registry is coming. Complete with schema (which will need work) and policies (which look to be way too laissez faire)
  • The Nijmegen folks are gratifyingly coming to similar conclusions about things as us (e.g. REST resolver queries)
  • That European digital library is going to be huge... if it can hold together.
  • Selling the entire Max-Planck Gesellschaft on using persistent identifiers—that's huge too.
  • Scholars blog. And want credit for blogging.
  • There are ISO standards for disaggregating texts, among other media. (I can gets standardz?) And Nijmegen looks kindly on ORE.
  • The ADL-R is being released in a genericised form: DO registry.
  • Handle is being integrated into Grid services (but we already halfway knew that)
  • OpenHandle is still a good thing
  • XMP, for embedding metadata into digital objects, is now getting currency, and can be used to brand objects with their identifiers (amongst other things) and update that metadata with online reference (as I identified in a use case last year, methinks)
  • W3C continues to be all W3C-ish about non-HTTP URIs. People have not given up on registering hdl: schema.

2008-06-17

CNRI Handle System Workshop: Second Half

Nigel Ward, Link Affiliates: Towards an Australian Persistent Identifier System: Thoughts on Services and Policy



Well, I already know what he's saying, having reviewed the presentation last night...



Larry Lannom: Grid Identifier Service



Globus approached CNRI, liked the persistent identifiers, distributed and scalable as they are, which could be embedded in Grid software.

Challenges: lots of local prefix (namespace) administration: much greater scale than existing CNRI practice. This means GHR queries may have to be passed on to a delegate. This has been prototyped but not deployed; anticipate implementation next year. Can register and resolve Handles through standard grid protocol --- fully embedded in the Grid.

To do more than just HTTP redirects, need to do interpretation of Handle records, with intelligent Handle types. Handle does not validate types, though apps may. Developers do make up types, but that compromises interop.

So every Handle type should be a registered Handle. hdl:0.TYPE/ is cumbersome and restricted to CNRI. Recommend people use their own custom Handles. CNRI are creating a Handle Value Type registry, to search for types, and open to the public (policies TBD, but look to be minimalist). Can be used by apps for custom transformation. A schema for types is in internal testing; e.g. 0.TYPE/URL, 10320/HVT-R. Some fields can be developed with reference to existing fields; e.g. MIME types and RFCs, which should themselves have Handles.
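My own toy gloss on what "interpretation of Handle records with intelligent types" means for an application: pick out the values whose (registered) type handles it understands. The record shape and the custom types here are invented:

```python
# A Handle record as a list of (type, value) pairs; types are themselves Handles.
record = [
    ("0.TYPE/URL", "http://repo.example/items/123"),
    ("1234/EMAIL", "curator@repo.example"),          # hypothetical custom type
    ("5678/UNKNOWN-TYPE", "opaque-value"),
]

# Types this application knows how to act on, in order of preference.
UNDERSTOOD = ["0.TYPE/URL", "1234/EMAIL"]

def interpret(record):
    """Return the first value whose registered type the application understands."""
    by_type = {t: v for t, v in record}
    for handle_type in UNDERSTOOD:
        if handle_type in by_type:
            return handle_type, by_type[handle_type]
    return None  # nothing actionable; fall back to a plain redirect or an error

print(interpret(record))  # ('0.TYPE/URL', 'http://repo.example/items/123')
```

Registering the types is what keeps two applications from inventing incompatible private meanings for the same record.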

CNRI Handle System Workshop: First Half

Larry Lannom, CNRI: Handle System Update



DOI: 2625 prefixes; Other: 1172

Handles: 35M in DOI.

Three global service sites.

"Chooseby" datatype: structured alternatives in a single Handle value, including selection criteria.

Computed Handles: resolve Handles that have not been registered. Data store of records can be replaced by a computed Handle if the record contents are predictable. e.g. 123/456.x => URL xyz/x (rewrite rule). Resolution attempts registered Handle first, then computed Handle.
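A toy sketch of the resolution order described --- registered Handle first, then the computed rewrite rule --- with invented URLs standing in for the "123/456.x => URL xyz/x" example:

```python
# Toy resolver: registered Handles first, then a computed (rewrite-rule) fallback.
REGISTERED = {
    "123/455": "http://repo.example/legacy/455",
}

# Computed-handle rule for the 123 prefix: 123/456.x -> http://repo.example/xyz/x
def computed(handle: str):
    prefix, _, suffix = handle.partition("/")
    if prefix == "123" and suffix.startswith("456."):
        return "http://repo.example/xyz/" + suffix[len("456."):]
    return None

def resolve(handle: str):
    if handle in REGISTERED:          # a registered record wins
        return REGISTERED[handle]
    return computed(handle)           # otherwise derive the URL from the rule

print(resolve("123/455"))      # registered
print(resolve("123/456.abc"))  # computed: http://repo.example/xyz/abc
print(resolve("123/999"))      # None: no record, no rule
```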

More global mirrors: a new one from Crossref.

Update to RFCs: delegation has changed; execute permissions are obsolete.

Attempt at URI registration of Handle scheme.

Tony Hammond, Nature: A Distributed Metadata Architecture



Downloaded assets have no metadata linking them back to source.

Services: generic (Handle Form, OpenHandle), specific (proxy).

OpenHandle: Accessible as web service, easy read, expose values as markup. In various programming languages and serialisations: RDF/XML, JSON.

Coming up: Web Admin. cf. Freebase MQL transacted through JSON.

XMP: Extensible Metadata Platform. Embeds XML into arbitrary file formats, e.g. PDFs. XMP packet would be RDF/XML document with wrapper. That would include DOI watermarking. Can be viewed through an inspector app. Metadata can be hooked up with services. Can also include "see also" link to more metadata curated online, to allow updating of metadata accessed from branded object.
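Roughly what such a packet looks like, as I understand it: an RDF/XML description wrapped in an xpacket, carrying the object's identifier plus a "see also" pointer to metadata curated online. The DOI, the namespaces chosen, and the URL here are illustrative only, not Nature's actual profile:

```python
# Build an illustrative XMP-style packet as a string (sketch, not a real profile).
DOI = "10.1000/example.123"
SEE_ALSO = "https://metadata.example.org/10.1000/example.123"

xmp_packet = f"""<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="">
      <dc:identifier>doi:{DOI}</dc:identifier>
      <rdfs:seeAlso rdf:resource="{SEE_ALSO}"/>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""

print(xmp_packet)
```

The "see also" link is what makes the branding updatable: the embedded packet stays small, and the authoritative metadata lives online.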

Ed Pentz, Crossref: DOI Impact on End Users



Trust. Crossref have business interest in trustworthiness of their hyperlinks between texts.

Filling in citation data as they expand. Network effects driven by digitisation of back content. Crossref can automatch biblio citations back to DOIs because of biblio metadata.

Publishers flocking to getting DOIs as imprimatur. (Not necessarily a realistic interpretation of what DOI is!)

IDF Open Meeting: Second Half

Jan Brase, German National Library for Science & Technology (TIB): Access to non-textual information



Data > Publication > Knowledge: the knowledge is accessible; the data itself is not published or accessible. Known problem: verifiability, duplication, data reuse. Data accessibility for 10 years has been mandated in Germany --- and ignored.

Solution: strong data centres; global access to datasets & metadata through existing library catalogues; persistent identifiers.

Results: citability of primary data. High visibility. Verification. Data re-use. Bibliometrics. Enforcing good scientific practice.

Use of DOI to that end: TIB is now a non-commercial DOI registration agency for datasets. Datasets get DOIs, catalogue entries. Can disaggregate datasets (e.g. multiple measurements), accept conditions, choose formats etc through portal access to dataset.

TIB registers data worldwide, and any community funded research in Europe. Half a million objects registered (but not stored at TIB --- they are not a data centre).

Scientific info is not only text: data, tables, pictures, slides, movies, source code, ... which should also be accessible through library catalogues, as publicly funded research outputs. The catalogue becomes a portal within a network of trusted content providers, with persistent links.

Institutions often find it hard to get DOIs from a foreign library (TIB currently being the only show in town); so TIB want to set up new worldwide agency, paralleling CrossRef, as consortium registering DOIs for scientific content by libraries. So far signed up ETH Zürich, INIST France.

ICSTI has started project for citing numerical data and integrating it with text, in which TIB is participant.

Jill Cousins, European Digital Library Foundation: Access to National Resources



European Library: consortium of Council of Europe national libraries. Federated search: starts at federated registry, and also goes to library servers (SRU, Z39.50). Helped national libraries have low barrier to entry, annoying as it is to the user. National libraries themselves know that this won't scale, and are moving from Z39.50 to OAI harvesting.

Persistent identifiers were not a priority for national libraries: they hadn't digitised much (5 million items for the whole continent), and Z39.50 didn't need to interoperate with external systems. This will change: 100 million items digitised in the next 5 years, born digital content, move to OAI-PMH, OpenURL.

CENL (Conference of European National Librarians) recommends there must be resolution services, based on URNs primarily from NBN namespaces. Each nat lib to have its own resolver service to access its own stuff, following European standards. URN service must deal with other id schemes. For long term survival, DOIs can eventually redirect to nat lib copies (copies of last resort).

NBNs are already being used. They do identify Items not Works, though now that libraries digitise themselves, they move away from that. Resolvers need to deal with appropriate copy.

SURFnet are proposing a global resolver; interested "because it's free (at the moment)", and are prepared to work with both NBNs and DOIs. Nat libs are still learning what the point of persistent identifiers is. Are not working with IDF because of perception of costs and little return (ah, but did they negotiate?); and need to resolve the "last resort" issue, which is not dependent on IDF. Also, a lot of "not invented here", wanting to avoid external providers. Libraries already have working NBNs which work internally, so they haven't had the pressure until now to resolve consistently.

European Digital Library (Europeana) underway. No standards for unique identification yet! Still trying to work out how to realise it (e.g. decentralised?)

IDF Open Meeting: First Half

Norman Paskin: background



DOI is being standardised through ISO: early 2009 formal standard.

Unique persistent identification requires interoperable structural descriptions. DOI does not have apps, but is a basis for apps. Builds on existing schemes, standards, data types.

Rights metadata as example of structured description. Distributed metadata, distributed rights management. Each object involved --- agreements, permissions, requirements --- is itself a digital object to be identified, linked, and disaggregated. DOI does interoperability of those digital objects.

DOI services: metadata@10.1000/123456, rights, abstract, sample, buy, license, pdf, etc. DOI data vocab needs to be carefully set up to support that. (The Old "one identifier, multiple standards" issue that we've come up against in our own identifier work.)

Gordon Dunsire, U. Strathclyde: Resource description & access for the digital world




  • RDA: New content standard for bibliographic metadata, to be published early 2009
  • Wider range of info carriers, digital & physical
  • Authors themselves are writing metadata, not just professional cataloguers
  • New metadata formats in use
  • Opportunities to harness new digital environment
  • Internationalised and globalised info services


RDA: attributes; guidance on creating content; more controlled vocabs. Esp. carrier type (e.g. "online resource"), and content type.

The wider context of resource description: e.g. the International Standard for Bibliographic Description (ISBD) and the Statement of International Cataloguing Principles, which almost harmonise with RDA. There is a wider range of related standards, increasingly interlinked.

  • FRBR (Biblio Records), FRAD (Authority Data).
    • FRBR has been extended to Object Oriented (FRBRoo), based on CIDOC conceptual reference model
    • Project underway to open up FRBR to other developers, esp. through RDF, SKOS (Simple Knowledge Organisation System: formalises vocabs and their interdependencies)

  • RDA/ONIX ontology to improve metadata interop with publishers <=> libraries
    • can build up vocabs, high level content and carrier types

  • Dublin Core: DCMI/RDA Task Group
    • Task group: broader use of RDA by Dublin Core & others (SKOS, LOM).
    • Define RDA modelling entities as an RDF vocab, to be consumed by these schemas
    • Identify RDA vocabs for publication as RDF schemas, for consumption by SKOS
    • Dublin Core specific application profiles for RDA, based on FRBR/FRAD

  • Do also need to tie RDA in with MARC, since it's not going away


Pairwise interop of these standards is starting. RDA can serve as middle of chain(s)/networks of standards interdependencies.

Common semantic foundations: Semantic Web; RDF; SKOS (based on RDF); OWL (slight overlap with SKOS, bridges between vocabs)

DOI has the same underlying approach to ontology as RDA/ONIX. IDF has requested funding from JISC to extend the framework: comprehensive vocab of resource relators and categories, for mapping/crosswalks, and reference set (chain above)

Reflection



So what have we learned?


  • RDA is setting itself up as semantic middleware for a range of biblio-related metadata standards.
  • Standards are relying on and exploiting other standards already, pairwise
  • The whole world will bow down before the Semantic Web: it will outdate current catalogue entries


Brian Green, EDItEUR: Standards for rights expression within the ONIX family



ONIX: family of formats for metadata about publications, with common data elements, developed & maintained by EDItEUR.

ONIX for Licensing Terms. Subprojects: Publications Licenses (ONIX-PL); Reproduction Rights Organisations; contributing to ACAP.

Problem
  • Need to automate electronic resource management
  • variations in license terms: automatic policing at the coalface of the license, which has typically been filed away on paper and is inaccessible


Libraries want: machine readable rights; dissemination of rights info in resource metadata; exposure of rights to users.

ONIX-PL seeks to address this. Expresses complete publisher/library license, for import into libraries' Electronic Resource Mgt system.

Not a technical protection measure, does not include enforcement technologies (like ODRL), but allows flexibility, compliance-based. Communicates usage terms, allows library override e.g. fair use.

Will provide ONIX-PL editing tools (JISC funded): form-filling. Open source, available now. JISC Collections are using the prototype. Bits of the license will be exposed to different parties at a time.

ACAP uses ONIX Licensing Terms semantics.

Both ONIX and IDF use indecs view of metadata. Ongoing sharing of dictionary work beyond indecs. IDF is active member of ACAP.

Reflection




  • Communicating licenses means the license should remain human-readable
  • The license is in chunks so that relevant bits can be excerpted and shown to humans
  • The license remains machine-readable enough in bits that machines can act on it
  • Common meta-metalanguage (indecs) necessary to preserve semantic interop

2008-06-14