2008-11-19

XRI, Handle, and persistent descriptors, Pt 2

(Back to Pt 1)

Let's now look at our favourite XRI, =drummond. If I retrieve the XRDS for =drummond through the resolution service http://xri.net/=drummond?_xrd_r=application/xrds+xml, I get (at the time of writing!) the following; a sketch of the retrieval appears after the list.

  • A canonical (and persistent!) i-number corresponding to the i-name =drummond, =!F83.62B1.44F.2813
  • A Skype call service endpoint
  • A Skype chat service endpoint
  • A contact webpage service endpoint
  • A forwarding webpage service endpoint
  • An OpenID signon endpoint
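
That sketch, for the record: it assumes the xri.net proxy resolver is still answering and honours the _xrd_r override as above, and it handles namespaces loosely, since all I want back is the list of declared service types.

```python
# Sketch only: fetch the XRDS for an XRI via the xri.net proxy resolver
# and list the service types it declares. Assumes the proxy still answers.
import urllib.request
import xml.etree.ElementTree as ET

def xrds_service_types(xri="=drummond"):
    url = "http://xri.net/%s?_xrd_r=application/xrds+xml" % xri
    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)
    types = []
    # Match on local names, so we don't have to hard-code the XRD namespace URI.
    for svc in tree.iter():
        if svc.tag.rsplit("}", 1)[-1] != "Service":
            continue
        for child in svc:
            if child.tag.rsplit("}", 1)[-1] == "Type" and child.text:
                types.append(child.text.strip())
    return types

if __name__ == "__main__":
    print(xrds_service_types("=drummond"))
```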


Nowhere does the XRDS say what =drummond identifies; it just lists some services contingently associated with =drummond. I could infer Drummond's full name from the Skype username being drummondreed, but that's hardly failsafe. What I would like is access to some text like...

VP Infrastructure at Parity Communications (www.parity.inc), Chief Architect to Cordance Corporation (www.cordance.net), co-chair of the OASIS XRI and XDI Technical Committees (www.oasis-open.org), board member of the OpenID Foundation (www.openid.net) and the Information Card Foundation (www.informationcard.net), ...


Oh, as in the contact webpage that http://xri.net/=drummond resolves to, http://2idi.com/contact/=drummond . Well, yes, but I did not know ahead of time that the contact webpage would have the information I wanted, with enough bio information to differentiate Drummond from other candidates: it's a contact page, not a bio page. (Drummond providing bio info is a lagniappe, which simply proves he knows about identity issues.)

What I want is some consistent way of getting from =drummond to a description of what =drummond identifies. XRDS is a descriptor already, which is why =drummond resolves to it: it describes the service interfaces that get to =drummond. But it's a descriptor of service endpoints and synonyms; it still doesn't persistently describe Drummond, the way the DESC field does in Handle. (Or would, if anyone ever used DESC).

Now, the technology-independent description of what is being described is needed for persistent identifiers; it's not as important for reassignable identifiers. So even if =drummond doesn't take me directly to a persistent description, persistence is still satisfied if =drummond takes me to =!F83.62B1.44F.2813, and =!F83.62B1.44F.2813 takes me to a persistent description. XRI allows =drummond and =!F83.62B1.44F.2813 to have different XRDS (because they can have different services attached)—though typically when an i-name is registered against an i-broker, the XRDS is the same. The requirement would be for the persistent description to be accessed through the i-number's XRDS, which may not be the same as the i-name's.

The easy way of adding a persistent description to an XRDS is to treat it as yet another service endpoint on the identifier: I give you an identifier, I get back a persistent description. Drummond's contact page already accidentally provides the description. What I'd like is some canonical class of service for getting to the persistent description. It could be something as simple as an +i-service*(+description)*($v*1.0) service type, to match the xri://+i-service*(+contact)*($v*1.0) type which gave me Drummond's contact page.
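
By way of illustration only: the +description service type below is just my suggestion, not a registered i-service, and the helper simply picks the first URI off a matching service endpoint in an already-parsed XRDS (say, the tree fetched in the sketch above).

```python
# Hypothetical: this type URI is the suggestion above, not a registered i-service.
DESCRIPTION_TYPE = "xri://+i-service*(+description)*($v*1.0)"

def description_endpoint(xrds_tree):
    """Return the first URI of a service endpoint declaring the proposed type."""
    for svc in xrds_tree.iter():
        if svc.tag.rsplit("}", 1)[-1] != "Service":
            continue
        types = [c.text.strip() for c in svc
                 if c.tag.rsplit("}", 1)[-1] == "Type" and c.text]
        if DESCRIPTION_TYPE in types:
            for c in svc:
                if c.tag.rsplit("}", 1)[-1] == "URI" and c.text:
                    return c.text.strip()
    return None
```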

This description service is actually the reverse of David Booth's http://thing-described-by.org/. David starts with the URL for a description as a web page, http://dbooth.org/2005/dbooth/, and creates an abstract identifier http://thing-described-by.org?http://dbooth.org/2005/dbooth/ for the entity described by the web page. XRI starts with @xri*david.booth (I can't see David actually registering his own XRI), which is already an inherently abstract identifier—unlike HTTP URIs.

Getting from there back to the description http://dbooth.org/2005/dbooth/ is a resolution; we could access it through http://is-description-of.org/?@xri*david.booth . (We would likely access it through the normal HXRI proxy http://xri.net/@xri*david.booth too; the point is, we're constraining the HTTP resolution to a specific kind of representation. David Is Not His Homepage.)
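
Spelled out (keeping in mind that thing-described-by.org is David's real service, while is-description-of.org is purely hypothetical):

```python
# David's direction: start from the description URL, and mint an abstract
# identifier for the thing it describes.
description_url = "http://dbooth.org/2005/dbooth/"
abstract_id = "http://thing-described-by.org?" + description_url

# The reverse, as proposed here: start from an XRI (already abstract) and
# resolve it to its description. is-description-of.org is a hypothetical service.
xri = "@xri*david.booth"
description_lookup = "http://is-description-of.org/?" + xri
```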

I'll note that David's description is worth emulating: "The URI http://thing-described-by.org?http://dbooth.org/2005/dbooth/ hereby acts as a globally unique name for the natural person named David Booth with email address dbooth@hp.com (as of 1-Jan-2005)."


The catch with that approach is, we're now relying on an external service to guarantee the persistent metadata for our persistent identifier. And as I argued in the previous post, you don't want to do that: your system for persistence should be self-contained, since you are accountable for it. It is easier for the description to persist if it sits inside the i-number's XRDS than outside it.

Even that does not give much of a guarantee of archival-level persistence. It is a feature and not a bug of XRI that users manage their own XRDS for personal i-names: the i-broker refers resolution queries back out to the user's XRDS, and promises only not to reassign the i-number. i-brokers do not commit to registering their own persistent metadata against the i-number. But once the user's XRDS goes offline, no one is able to resolve the i-name or the i-number. The trick with persistence in identifiers is, it's always persistence of something. Once the service endpoints for your identifier go away, you lose persistence of actionability. Not reassigning the i-number maintains persistence of reference (the i-number can't start referring to something else). But without a description accessible down the road, it does not maintain persistence of resolution (a user finding out what it referred to, even if no service endpoints are available).

Maybe that's OK: XRIs are addressing a particular issue—digital identity across multiple services. If the user is trusted to maintain their own digital identity, then XRI is simply not geared to address long-term archival needs. In the same way, the user-centered practice of self-archiving has nothing to do with long-term archives (as Stevan Harnad has to keep repeating—with only himself to blame for introducing the term in the first place.)

Oh, can't resist: Wikipedia entry on self-archiving:

Bwahah. And don't get me started on "archivangelism", with its emphasis on "arch"...

XRI, Handle, and persistent descriptors, Pt 1

This post is to suggest that XRDS (or its equivalent) should include not just service endpoints, but also persistent descriptions—potentially as a distinct service endpoint. It takes a while to build up the argument, so I'm splitting it into parts.

One of the critical insights we came up with in the PILIN persistent identifier project is this: if you want an identifier to persist, it's not enough to just keep updating the URLs that the identifier resolves to. You want to record somewhere a piece of metadata that tells you what the thing identified is—independent of the URLs. That piece of metadata will itself be persistent: it will not be affected by any changes in the service endpoints of your identifier. But it doesn't have to be machine-readable: it can be a description in prose.


  • Having that piece of information helps you in disaster recovery. If all your URLs go out the window, you can still use the description to reconstruct how the identifier should resolve (and reformulate the URLs). And you can't really claim persistence if you don't have some kind of disaster recovery.
  • Having that piece of information is also critical for archival use of identifiers—after the services resolved to are no longer accessible. (And persistent identifiers should persist longer than the services they had resolved to.)
  • Getting to that piece of metadata in itself involves a service, and in itself is a resolution. (That means it can integrate into the current XRDS as a service endpoint.)
  • But if you entrust that piece of metadata to a service outside your identifier management system, you are putting persistence at risk.


Let me first illustrate this principle with the technology we used in PILIN, Handle.

info:hdl/102.100.272/0N8J991QH


resolves to the Handle record:


URL: https://www.pilin.net.au
EMAIL: opoudjis@gmail.com
HS_ADMIN: [admin bit masks]


I can update my URLs and emails as things change, but that's pretty poor information management. If I disappear and the DNS registration expires, I'm not allowing anyone to reconstruct what the identifier resolved to. If someone finds the Handle 102.100.272/0N8J991QH on a printout at some point in the distant future (like, say, 5 years), and they find a Handle resolver which gives the information above, they too are none the wiser about what the Handle was supposed to identify. Because the Handle was supposed to be persistent, it has failed.

But Handle also provides a DESCription field, which allows you to say what is being identified:


URL: https://www.pilin.net.au
EMAIL: opoudjis@gmail.com
HS_ADMIN: [admin bit masks]
DESC: Website for the PILIN project (Persistent Linking Infrastructure),
funded by the Australian Government to investigate policy and technology
for digital identifier persistence.


That description is at least a fallback if the URL does not get maintained. I'd argue further that the description is the real resolution of the identifier (as PILIN defined resolution this year: information distinctive to the thing identified, differentiating it from all other things). The description actually tells you what is being identified, and it stays the same even if the URL location of the website does not. It gives a persistent resolution of the Handle, which is not constrained by a particular service or protocol.
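
To make that concrete, here's a toy model of the record above (plain Python data, nothing to do with the Handle client libraries): the URL is one actionable endpoint that may rot, while the DESC is the resolution proper.

```python
# Toy model of the Handle record above; not the Handle System API.
handle_record = {
    "handle": "102.100.272/0N8J991QH",
    "values": [
        ("URL", "https://www.pilin.net.au"),
        ("EMAIL", "opoudjis@gmail.com"),
        ("DESC", "Website for the PILIN project (Persistent Linking Infrastructure), "
                 "funded by the Australian Government to investigate policy and "
                 "technology for digital identifier persistence."),
    ],
}

def first_value(record, value_type):
    return next((v for t, v in record["values"] if t == value_type), None)

def actionable(record):
    """The service endpoint: useful while it lasts, but liable to rot."""
    return first_value(record, "URL")

def resolution(record):
    """The resolution proper: what the identifier identifies, service-independent."""
    return first_value(record, "DESC")
```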

Moreover, if the description is part of the Handle record, then it will persist so long as the Handle record itself persists. It does not depend on an external agent to guarantee it sticks around. Which is what you want for the metadata that will guarantee the persistence of the Handle.

If on the other hand I put my descriptions in an external service, like http://description-of.org/hdl/102.100.272/0N8J991QH , then I will lose my persistent descriptions if http://description-of.org goes down: I am dependent on http://description-of.org for the long-term persistence of my identifiers. And I should not be dependent: persisting my 102.100.272/0N8J991QH Handle is my responsibility (for which I am accountable), and it's what I set up my identifier management system to do.

In the next post, we run that notion against XRI.

Introduction to XRI

Yet another introduction to XRI, which I presented at the !DEA 2008 workshop.

Introduction to XRI

2008-11-11

Using UML Sequence diagrams to derive e-Framework Service Usage Models

The e-Framework is a documentation standard for service-oriented system development that I've been involved in. It has a registry of abstract services (service genres) and of services profiled to communities and standards (service expressions). It also has service usage models (SUMs), which present the services needed to realise a system, by lining up the services and data sources that each business process in the system uses. Like this:



That's just the SUM diagram; there is a whole document that goes with it, explaining how the business processes map to services via system functions, the usage scenarios, what situations the SUM is applicable to, design considerations, and so on. But the SUM diagram already gives an overview of the wherewithal for putting such a system together. And so long as the services and data sources are kept reasonably abstract, the diagram can be used to compare different systems from different domains, and work out their common infrastructure requirements.

In a project I've been working on recently for Link Affiliates, I had to come up with a range of implementation options for the solution I was describing, and use the e-Framework to do so. I had been describing the solutions with UML Sequence diagrams. The following is the mapping I came up with from UML Sequence diagrams to e-Framework SUMs; it's pretty obvious, but I thought it might be of interest anyway.

I am assuming you're already familiar with UML Sequence diagrams.



UML Sequence diagrams are a good match for service usage models, because both are concerned with how a system interacts with the outside world. The interactions are drawn explicitly in the UML; in the SUM, interactions happen through the services that the system exposes. (That's why it's a service usage model: it's how external users interact with the system.)

We make the following assumptions:


  1. Any sequence of interactions initiated by a human actor corresponds to a business process meaningful to a human. Some sequences initiated by computer agents are also potentially meaningful business processes.
  2. All interactions between objects are through services. (We are taking a service-oriented view of the interactions, after all.)
  3. All objects sending or receiving data through messages are potentially data sources and data sinks. (The two are not differentiated in the e-Framework.)


Given the first assumption, we can break up a large sequence of interactions into several business processes, depending on how actors intervene:



Of course, this step is cheating: you probably already have an idea of what business processes you want to see. Anyway.

Given the next assumption, if you want to know what services your application uses, just read off the messages from the UML diagram. Each of those messages should be communicated as a service --- through a defined interface between systems. So the messages are all service calls.

Some provisos:

  • Like we said, the SUM is about how the system interacts with external users and systems. So any interactions within a system are out of scope: they aren't exposed as a service.
  • Some services will in fact involve several subsidiary service interactions. They would be described in a distinct service usage model, which can be modularised out of the current SUM.
  • Return messages are included in the definition of a service; so they do not need to be counted separately.
  • A message forwarding a request from one actor to another may be ignored, as it does not represent a new service instance.

    • For instance, we choose not to model the Ordering message from the Orders system to the Warehouse Manager; that message is really only forwarding the initial order made by the customer, and can instead be counted as service choreography.

  • The e-Framework consolidates services into a minimal-ish vocabulary (at the service genre level). So the messages should be mapped to established types of services wherever possible; the point of the exercise is to compare between systems, and that means the services have to make sense outside their particular business context.

    • So in the example below, "Disambiguate" will actually be done through a search; so that message is counted as an instance of Search.

  • Likewise, if a message is described only in terms of its payload, you will have to come up with a sensible service to match.

    • The message from the Orders system to the Warehouse Management system is described just as "Part Name". Because this is a retrieval of information based on the part name, we describe it explicitly as Search.






Likewise, the swimlanes acting as data sources and sinks are interpreted as e-Framework data sources.

Now that we know the business processes, the services, and the data sources from our UML Sequence diagram, we only have to line them up into the e-Framework SUM diagram:
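
As a toy illustration of that read-off (this is not e-Framework tooling; the lifelines, message labels and genre mapping below are just the Orders/Warehouse example from the provisos above):

```python
from dataclasses import dataclass

@dataclass
class Message:
    sender: str
    receiver: str
    label: str
    actor_initiated: bool = False  # True when a human actor starts the interaction

# Illustrative sequence, following the Orders/Warehouse example above.
messages = [
    Message("Customer", "Orders system", "Place order", actor_initiated=True),
    Message("Orders system", "Warehouse Management system", "Part Name"),
    Message("Orders system", "Warehouse Manager", "Ordering"),  # forwarding only
]

# Context-specific labels mapped to established service genres (see provisos).
GENRE_MAP = {"Part Name": "Search", "Disambiguate": "Search"}
FORWARDING = {"Ordering"}  # choreography, not a new service instance

def services(msgs):
    """Every non-forwarding message is a service call, mapped to a genre."""
    return sorted({GENRE_MAP.get(m.label, m.label)
                   for m in msgs if m.label not in FORWARDING})

def data_sources(msgs):
    """Any lifeline sending or receiving data is a candidate data source/sink."""
    return sorted({m.sender for m in msgs} | {m.receiver for m in msgs})

def business_processes(msgs):
    """Split the sequence wherever a human actor initiates an interaction."""
    processes, current = [], []
    for m in msgs:
        if m.actor_initiated and current:
            processes.append(current)
            current = []
        current.append(m)
    if current:
        processes.append(current)
    return processes

print(services(messages))        # ['Place order', 'Search']
print(data_sources(messages))
print(len(business_processes(messages)))  # one business process in this toy sequence
```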

2008-10-20

This is the conclusion...

... of the blog summaries of presentations associated with eResearch Australasia 2008.

Presentation on Centre for e-Research, KCL: Tobias Blanke, Mark Hedges

Centre for e-Research, King's College. Presentation given at VERSI (Victorian E-Research Strategic Initiative), 2008-10-06

CeRch was formed out of the Arts & Humanities Data Service; once AHDS was discontinued, KCL set up the centre to keep the work going. (The current obligation for data preservation in the UK is now to hand data over to an institution committed to maintaining it for "at least 3 years".)

The size of the collections had been skyrocketing, because of the introduction of video resources (45 TB at the time AHDS was axed). The resources got split up: KCL does performing arts, history went to the Social Sciences data service, archaeology remained independent, and language and literature (starting with the Oxford Text Archive) went to the Oxford Research Archive.

CeRch has been going for a year, and is designated to support the entire research lifecycle at KCL, including planning and proposals. They will be teaching a Masters on Digital Asset Management from next year, in collaboration with the Centre for Computing in the Humanities. They research e-research.

Grids & Innovation Lab: build on Access Grid as creative space for teaching, strong link to Theatre department. Setting up Campus Grid: KCL still not on board with national grid. Currently piloting with early adopters.

ICTGuides: database of projects, methods and tools in Arts & Humanities. www.Arts-humanities.net: a collaborative environment for e-Humanities. DARIAH. CLARIN. Are starting to move beyond arts to medicine; seeking to work with industry & business as well, as a business service (starting with museums and libraries).


Arts & Humanities e-Science Support Centre

UK infrastructure has been built around national centres.

Push to get more users using services, to recoup costs in e-research; arts & humanities have a lot of users. Plus, network effect because users are more familiar with what is available. Humanities are more about creating sustainable artefacts than science is.

User interface design is more important, because end users are difficult to train up.

Linking Structured Humanities data (LaQuAT). Diverse/non-standard and isolated data resources: allow integration and useful resources. OGSA-DAI as linking infrastructure, allows researchers to retain local ownership. Ref. www.ahessc.ac.uk for projects.

Antonio Calanducci et al.: Digital libraries on the Grid to preserve cultural heritage

Project description

Author digitised: De Roberto.

High resolution scans into multipage works; 2 TB, 8000 pp. Embedded metadata: physical features and some semantics, added in Adobe XMP.

Intended easy navigation, constant availability, long-term preservation (LOCKSS)

Storage element is pool of hard disks; file catalogue is virtual file system across storage systems. Metadata organised by collection.

gLibrary project allows digital library to be stored and accessed through Grid. Filtering browsing, like iTunes.

Andreas Aschenbrenner & Jens Mittelbach: TextGrid: Towards a national e-infrastructure for the arts and humanities in Germany

Project site

TEI community. Funded to €2 million. Virtual Research Environment: data grid as a virtual archive for data curation; service grid for collab work, including existing TEI tools; and a collaborative platform for scientific text data processing.

TextGridLab is the grid client. Globus-based infrastructure. The physicists throw away their Grid data because individual pieces of data are of relatively low value; TextGrid want to archive data as they go.

Usage by: end users (accessing scholarly texts); editors (philologists); tool developers; institutional content providers.

Tobias Blanke, Mark Hedges & Stuart Dunn: Grassroots research in Arts & Humanities e-Science in the UK

Use networks to connect resources. e-science agenda was driven by the Grid, to bridge across administrative domains. In e-humanities, books need to talk to each other.

Challenges: ongoing growth of corpora; digital recording of current human developments; computational methods to deal with inconsistent data; reluctance by humanities scholars to collaborate in research.

Engage researchers by giving them money and letting them do something at grass-roots level, though with some coordination.



DARIAH: Digital Research Infrastructure for the Arts & Humanities, European project to facilitate long-term access and use of cultural heritage digital information --- connecting the national data centres. Fedora demonstrator, flat texts across Grid; Arena demonstrator: database integration with web services.

AHRC ICT Methods Network.

Peter Higgs: Business data commons: addressing the information needs of creative businesses

Project investigating creative industries, to support better funding. 8% survey response rate, so data quality from the survey is not good. Falling back on secondary data, like censuses. No data available about what is happening in business life cycles, or how they change across time; poor granularity. Surveys don't work in general; can't aggregate readily.

Alternative? Build trust, manage identity & confidentiality; crucially, provide benefits to the business for the service, and to the business gateways. Collection must be with open consent; upsell data back to the survey subjects (can provide them more data, and articulate benefits). Ended up escalating information gathering.

Need pseudo-anonymity; can't have anonymity, since are feeding the data back to the survey subjects (delta data gathering). Use OpenID to do pseudo-anonymity. Do not reask questions already gathered. Benefit to business is personalised benchmark report: customised to each recipient.

This appears to be a good strategy for collecting embarrassing data like hospital performance; emphasise that this is benchmarking, so can improve your own performance over time, given the data you have made available to the survey.

Kylie Pappalardo: Publishing and Open Access in an e-Research Environment

OAKLAW: Open Access to Knowledge Law Project

Survey of academic authors on publishing agreements and open access: 509 participants. Generally they support open access. They are OK with end user free access, and reuse. Many authors don't deposit into repositories because they don't know there is a repository -- or what the legal status of their publication was. Most did not know whether they had licensed or assigned copyright (i.e. whether they retained rights). General lack of understanding of copyright issues; authors are not asking the publishers the right questions.

Arts & Humanities scholars are more clued in about copyright issues than Science scholars.

OAKLIST, tracking publisher positions on open access and repositories. OAKLAW have also produced guide to open access for researchers; sample publishing agreement (forthcoming): exclusive license to publish, non-exclusive license for other rights.

Peter Sefton: Priming digital humanities support services

Visual ethnography project: recording messages from schools to community. They use nVivo. Analyses locked up in proprietary tool.

Recommends using image tags instead of nVivo to embed qualitative evaluation. Data aware publications, with the ICE authoring tool and microformats.

Given how scholarly publishing is run, data aware publications are currently unused -- publishers just take Word documents and put them to paper. Need at the least a standards group for semantic documents; and ARCS will have some tools to bear.

Elzbieta Majocha: "So many tools"

Humanities scholars are loners writing monographs; funding agencies, however, like collaborative interdisciplinary projects. Resource production in the humanities is labour intensive, and reuse is desired (though not practiced).

Example attempt: collaborative wiki for the Early Mediaeval History Network. This was imposed onto an existing research community. Takeup was only after training, demo, and leading by example. And then --- it died again. So was it a success?

Shared resource library, semantic web underpinnings, on PLONE. "We built it, why aren't they coming?" (But there are no images in the library, no metadata, and no updates!)

Setting up collaboration: "What, not on email? Not another portal?" No activity, because the collaborationware was bolted on to the group -- who had no time to spend on the project anyway.

So how to make the culture change? Mandate works when there is support, but evaporates when there is not. Build And They Will Come? No: Hope is Not a Plan. Those who have to use it will use it? But it is not clear that they will.

To get takeup of collaborative infrastructure, make it part of the researcher's daily workflow -- e.g. daily workdiary.

Katie Cavanagh: How Humanities e-Researchers can come to love infrastructure

Flinders Humanities Research Centre for Cultural Heritage and Cultural Exchange

Do e-research structures form the questions, or do the questions define the structures? There should be a feedback loop connecting the two; but there is a worry that the tools are not actually helping ask the right scholarly questions.

Archiving influences the construct of the archive itself.

Institutional repositories are driven by capture and preservation, not retrieval and interpretation. Institutional repository content is not googleable: the archive is orphaned from its context, so it is no longer retrievable into a sensible context. If institutional repositories are for research, where is the middleware to provide access to the repositories? Must all projects be bespoke, and can unique solutions interact? What can you currently do with institutional repositories, other than print out PDFs?

Humanities queries memory and cultural heritage, not just data sets; so it depends on context. Important to curate collections, not just archive them. And doing so is no quicker than with paper collections; nor is it immediately obvious to researchers that it's more useful to do so digitally.

Metadata and preservation are not the problems to be solved any more; making the content usable and discoverable is.

Multi-pronged approach: build a community around modest tools; create tools to underscore current research practice (e.g. OCR); user centered design.

Also, track researchers who are already doing good practice and have the IT skills. Create forums, ICT guides, etc.

Make the infrastructure indispensable.

Jo Evans: Designing for Diversity

e-Scholarship Research Centre, University of Melbourne.

"The system won't let you do that?" No, the designer didn't. Technologists need to know about real requirements; scholars need to be able to articulate requirements.

The humanities involve heterogeneous data, used with a long tail, and with limited availability of resources for humanities. Systems for e-research in the humanities must be designed with those constraints in mind, allowing for tailoring: What and when to standardise, what to customise, and what to customise in standardised ways (once the technologies allow it, e.g. CSS/XML).

Design philosophy: standardise the back-end, customise the front-end. The back-end must be of archival quality, and scholarly.

They use OHRM (Online Heritage Resource Manager) as a basis for their systems: OHRM describes entities and contexts separately, and allows custom ontology reflecting community standards. Extensible types. Front end has exhibition functionality. Templates to add pages to presentation.

The tailored exhibition is a new research narrative. Can have service oriented approach to link to other information. The centre encourages OHRM as a tool for active research, not just for research outputs. Build incrementally, not in response to imagined needs.

Paul Turnbull & Mark Fallu. Making cross-cultural history in networked digital media

Detailed paper

Too often, solutions in e-humanities complicate rather than supporting work. The point of e-research is to enable work.

Building on existing web project on the history of the South Seas with NLA. Political, cultural, and technical problems have been encountered. They have made a point to use techniques and technical standards for web-based scholarship. Historians have disputed that they used the web instead of computers to communicate -- but they do anyway.

Will now use TEI P5 markup, which they couldn't use in 2000. Want web-based collaborative editing. Images with Persistent identifiers; can scrape metadata off Picture Australia, and otherwise capitalise on other existing online resources.

Collaboration with AUSTeHC allowed solid grounding of knowledge management.

By 2000 the future of digital history was distributed/collaborative editorship. Visual appraisal of historical knowledge is important (Yet Another Google Maps Mashup; also timelines) (CIDOC CRM ontology entries) --- this is now feasible; it wasn't really back then.

Complex and contested knowledge: a nodal architecture, using ontologies and based on PLONE (CIDOC-CRM, with Finnish History Ontology (HISTO) and ABC Harmony). Lots of tools now available, should be able to integrate rather than redesign.

Information will not stay static or be represented in a single way; need to create connections between information on the fly. TEI P4 markup is sound foundation, stores the original textual record; TEI P5 introduces model of relation between content and real world (semantics). They have built on that with microformats --- not embedded in the original XML, but in annotation.

Migrating content online: build conditions of trust. Without a solid architecture, cannot trust presentation of knowledge enough to build scholarly debate on it.

Must be careful of language use in persuading academics to adopt technology. e.g. you are representing Knowledge, not mere Content.

e-history is a partnership between historian and IT: it is not just the historian's intellectual achievement. Yet the problem in the past has been foregrounding IT, rather than what IT can do for scholarship.

Steve Hays & Ian Johnson. Building integrated databases for the web

Archaeological Computing Laboratory, University of Sydney

Novel data modelling approach. Real world relationships aren't as simple as what is modelled by entity relationship diagrams; there can be multiple contingent relations changing over time, and entities can split into complex types as knowledge grows.

Heurist knowledge management model: start with a table of record types, then a table of detail types (= fields), and a requirements table, binding details to records and defining how they behave. Summary data is stored in a record; detail information is stored as name/value pairs. Relationships are modelled as first-order records. (Reifying the relationship allows it to have attributes.)
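
A toy sketch of the pattern being described (not Heurist's actual schema; the record and detail names are made up): summary data in the record, details as name/value pairs, and the relationship reified as a record in its own right so it can carry attributes.

```python
# Toy sketch of the pattern described above; record/detail names are invented.
records = {
    1: {"type": "Person", "title": "Jane Researcher",
        "details": {"affiliation": "Example University"}},
    2: {"type": "Site", "title": "Example Excavation Site",
        "details": {"country": "Australia"}},
    # The relationship is itself a first-order record, so it can carry
    # attributes (here, a role and a contingent date range).
    3: {"type": "Relationship", "title": "excavated",
        "details": {"from": 1, "to": 2, "role": "director",
                    "from_year": 1997, "to_year": 2005}},
}

def related(record_id):
    """Relationship records that point at the given record."""
    return [r for r in records.values()
            if r["type"] == "Relationship"
            and record_id in (r["details"]["from"], r["details"]["to"])]

print([r["title"] for r in related(1)])   # ['excavated']
```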

Raw querying performance is poor; can't use complex SQL queries; obscure to explain. But performs acceptably with 100k records; export to RDF triple store with SPARQL to improve performance. Increase in flexibility will outweigh drawbacks.

Point is to create a meta-database, linking info to info across archives. (Would like to use persistent identifiers to do so.)

Presentations from the workshop on e-Research in the Arts, Humanities and Cultural Heritage

Abstract

The following posts are summaries of presentations in the workshop on e-Research in the Arts, Humanities and Cultural Heritage, held after the eResearch Australasia 2008 conference.

eResearch Australasia Workshop: ANDS: Developing eResearch Capabilities

Abstract

Core ANDS team has 3 EFT dedicated to developing capabilities: community engagement, activity coordination, needs analysis, materials preparation & editing, event logistics, knowledge transfer/ train the trainer, surveys, reviews of progress.

$150K for materials & curriculum development; $50K in training and events logistics. Course delivery will be undertaken by the institutions, not ANDS.

Assumptions: ANDS is not funded to train everybody. It will partner with organisations to develop content to train from. A structured, nationally coordinated set of training materials will make a difference. ANDS will do some train-the-trainer activities; it will partner with strategic communities to deliver capability building. Training will be complemented by ad hoc workshops.

Outcomes: a structured set of modules; partner to develop & maintain them; partners to deliver them; a certification framework.

Cultural collections sector is in scope for ANDS. However, the cultural collections sector is not prioritised for training, because they are not ready; they need a different approach to bring them up to speed. (The Atlas of Living Australia project involves cultural collections, so that will accelerate engagement with the cultural sector.)

Tracey Hind, CSIRO: Enterprise perspective of Data Management

CSIRO has 5 areas and 16 research divisions. 9 Flagship programs to produce research across the streams. 100 themes and many hundred projects and partnerships. There are localised solutions to data management, but CSIRO needs an enterprise solution. Not all divisions actually use the enterprise data storage solution already in place.

CSIRO needs to discover and partner data with other researchers; maximise value of their investment in infrastructure; open access to data. Will enable the flagship projects to move forwards.

CSIRO does not yet recognise that data management is an issue: the e-Science Information Management (eSIM) strategy is still unfunded. Scientists are not working well across disciplines --- they don't know what they don't know. Data management is not a technology issue, but a human problem. Easy discovery of data is key, but divisions do not understand the potential of unlocking their own data. Data Management is a hard sell: there are no showpieces like machine rooms; researchers don't understand the benefits immediately. Need exemplar projects to demonstrate benefits of data management, to get buy-in.

eSIM model to build capabilities: people, processes, technology, governance. People challenges include, e.g., incentives for people to deposit data into repositories. (May even need changes in job descriptions.) Governance includes proper enterprise funding and data management plan requirements.

Exemplar projects: AuScope, Atlas of Living Australia, Corporate Communications (managed data repository, enterprise workflow and process, corporate reporting). Exemplars will drive changing behaviours.

Researchers will be easier to convince of the benefits of integrated data management than the lawyers will be.

eResearch Australasia Workshop: ANDS: Seeding the Australian Data Commons

Abstract

ANDS: Australian National Data Service

Goal: greater access to research data assets, in forms that support easier and more effective data use and reuse. ANDS will be a "voice for data".

Not all data shared will be open; institutionally supported storage solutions (not enough funding to do its own storage); ANDS will only start to build the Data Commons.

Largely the Commons will be virtual access: no centralised point of storage.

ANDS delivery: developing frameworks, providing utilities, seeding the commons, building capabilities. Mediated through NeAT (National e-Research Architecture Taskforce).

One way of seeding Data Commons is through discovery service. Make more things available for harvesting, and link them with persistent identification. Discovery service will be underpinned by ISO 2146 (Registry Services for Libraries and Related Organisations). Will need to collect ISO 2146 data to describe entities for discovery service.

Seeding the commons will involve opportunistic content recruitment in the first year, and targeted areas in years 2 and 3, to improve data management, content, and capture. Are working with repository managers to identify candidates for content recruitment.

Content systems enhancement:
  • convene a tech forum (ANU responsibility), map the landscape;
  • model a reference data repository software stack, available for easy deployment;
  • repository interface toolkit for easier submission --- working with SWORD deposit protocol;
  • relationships with equivalent operations overseas.


Building Capabilities: train-the-trainer model; initial targets: early career researchers, research support staff. Build a community around data management.

Establishment project has met its deliverable; DIISR has signed contract. First business plan available online, runs to June 2009.

ANDS and ARCS are closely related. ARCS are tying state-based storage fabric into national fabric. ANDS are agnostic as to what storage you use.

2008-10-08

James Dalziel, Macquarie.U: Deployment Strategies for Joining the AAF Shibboleth Federation

Abstract

Trust Federations have emerged as alternatives to services running their own accounts. Identity provider, Service provider, trust federation connecting them -- with policy and technical framework.

A trust federation requires identity providers to establish who their users are and how they know about them; the identity provider joins the trust federation; a service provider joins the trust federation, and is given user attributes to determine access.

The MAMS testbed federation has 27 ID providers (900k identities), 28 service providers -- from repositories to wikis to forums. The core infrastructure, including WAYF (Where Are You From service), is already production quality. MAMS working on software to help deployment.

"AAF" (Australian Access Federation) is the legal framework; "Shibboleth Federation" is the technical framework: the fabric to realise the legal federation.

A Shibboleth federation does not require Shibboleth software: it only commits to Shibboleth data standards. Shibboleth software is the reference implementation, but there are others.

Principles for federation have been formulated and are available.

Deployment models. Partial deployment models are allowed. Requirement AAF Req14 has core and optional attributes, and identity providers can limit deployment at first to staff who are known to use the federation, rather than blanketing all staff. Could have a separate directory for just that staff, with just their core attributes. Alternative: staff-only deployment, or full staff identity records and partial student records.

Also facility for the federation to map between native and AAF attributes.

Shibboleth version 1 is the safest to use; Shibboleth version 2 can be used with due caution about interoperability. OpenID supports weaker trust than a proper education federation; it can be added to a Shibboleth Identity Provider as a plugin.

Service provider: need to connect the Shibboleth Service Provider software (or equivalent) to your software; then determine the required attributes for access. Could specify authentication protocol as attribute; service description contains one or more service offerings. Can use "People Picker" to nominate individuals rather than entire federation. Could specify other policies on top, e.g. fees; that's a non-technical arrangement.

Ron Chernich, U.Queensland: A Generic Schema-Driven Metadata Editor for the eResearch Community

Abstract

Schema-based metadata editor (MDE): to ensure highest-quality data through conformance. Lightweight client, Web 2.0. Builds on ten years of previous editor experience. It has emerged that users wanted an editor with conformance checking, usable in their browser (no installation).

Generic metadata editor: cross-browser, generic to schemata. Schema driven, where the schema includes validation constraints. Help as floating messages. Cannot enforce a persistence mechanism, because that is app-specific. Nesting of elements. Live validation.

Web server supports the embedding application, the application calls MDE. MDE talks to Service Provider Interface via a broker, to fetch the metadata record given the record identifier (from the application); and MDE talks to the metadata schema repository to get schema.

Problems: cross-browser portability; the EXT JavaScript library has given them good portability. Security is delegated to the Service Provider.

Schemas are typically in XSD, which is not normalised and can contain embedded schemata. Normalisation and flattening require preprocessing, so they use the MSS format as a type of flattened XSD. Reuse is encouraged with documentation and a reference implementation.

Why not XForms? Not very user friendly.

Available at metadata.net. To do: add element refinements; implement encoding scheme support; provide ontology and thesaurus tie-in.

Chris Myers, VERSI: Virtual Beamline eResearch Environment at the Australian Synchrotron

Abstract

Optimise use of expensive synchrotrons: remote usage.
User friendly, safe, reliable, fast, modularly designed.

  • Web interface to synchrotron for monitoring.
  • WML interface to phones.
  • Educational Virtual BeamLine.
  • Online Induction System (slides + video).
  • Beamline Operating Scheduling System (scheduling).
  • Instant Messaging. Transfer portal into e.g. SRB.

Alan Holmes, Latrobe: Virtuosity: techniques, procedures & skills for effective virtual communications

Abstract

The technically non-savvy have lots of misunderstandings about how to use the Internet. Much is taken for granted in face-to-face -- visual cues. Absent those cues, different strategies are needed to make communication effective.

Survey, 26 interviews, of people spending more than 60 hrs/wk online.


  • Initiation,
  • Experimentation (to establish, in a feedback loop,
    • Efficiency,
    • Identity, and
    • Networking),
  • Integration.

These steps apply to any new tool.


Initiation:
intrinsic factors (technophobic?), extrinsic factors (drivers, availability of assistance)

Experimentation:
  • Efficiency;
  • Identity: how I present myself to the world;
  • Networking: socialisation, developing group norms.


Integration:
  • enhance life (improve what you do, as driver);
  • limiting own usage (avoiding isolation, stress and burnout);
  • technology usage (media manipulation, convergence)


Typology of high end users:

  • New Frontiersmen (early adopters, male, utopians; have self-limited their usage, disillusioned with how others have used internet; are no longer big socialisers);
  • Pragmatic Entrepreneurs (small business, mainly women, net is mechanism for business, have multiple businesses, little exploring and socialising, have relative or friend to provide tech assistance, are very protective of computer and physical environment)
  • Technicians (like gadgets, mainly male, net is huge library, work in IT, disparage most people's use of net; distrust net-based non-verifiable info; like strategy games but not to socialise; into anime & scifi tv; aware of addiction and actively self-limit);
  • Virtual Workers (the virtual environment is just a workspace; even gender split; separate virtual persona from real life; use net for info and trust it; good at manipulating virtual environment; focus on speed & efficiency)
  • Entertainers (the Net is a carnival; even gender split; lots of socialising games incl. poker; dating, download, contact, socialising; love the newness of the Net, early adopters of SecondLife and social networking sites, and move on quickly to next thing; unaware of addiction, and tend to social isolation in real life);
  • Social Networkers (the Net is all about me; mainly women under 25; the Net allows them to stay in contact with friends; not open to communicating outside narrow circle; not fussed about privacy; outgoing, share info, gossip, photos; use the Net as proof and way to brag; love the tech convergence which helps them document their lives; don't like downloading stuff -- takes too long)

2008-10-06

Ashley Wright, ARCS: ARCS Collaboration Services

Abstract

ARCS: Australian Research Collaboration Services

ARCS provides long-term e-research support, esp. collaboration. Modalities: Video; Web Based; Custom.

Video service is Desktop-based; allows short-burst communications, instead of physical attendance or extensive timespans (e.g. Access Grid). Needs to scale to large numbers, allow encryption. Obviate need for special room booking.

EVO: Enabling Virtual Organisations; Access Grid. ARCS provide advice on deployment.

EVO licensed by Caltech. Can create user communities on request. 222 registered users under the ARCS-AARNET community; other Australian communities: AuScope, TRIN. More than 100 communities worldwide, mostly high-energy physics. Soon: phone dial-in to meetings, interaction with AARNET audio/video conferencing; an Australian portal for registration & support.

Access Grid: advice on equipment & installation; quality assurance.

Web Collaboration: CMS, collaborative environments, shared apps (like Google Apps). Wiki, forums, file sharing, annotation. Full control over user permissions & visibility. Hosting by ARCS, or they can help set up locally. Sakai, Drupal, Plone, Wikis.

Customisation. They have staff on hand to help: advice & options.

Are open to adopting New Tools.

Nicki Henningham & Joanne Evans, U. Melbourne: Australian Women's Archives

Australian Women's Archives: Next generation infrastructure for women's studies

Abstract

Fragmented record keeping in the past, because the organisations were not institutional; best addressed by personal papers, which are much more susceptible to loss. Project initiated from awareness of the impermanence of the data. Encourages women's organisations to protect their records and deposit them. Maintains a register of data.

Based on OHRM. All-in-one biographical dictionary, with annotated bibliography.

A working model of federated info architecture for sustainable humanities computing. Enhanced capabilities, improved sustainability.

Both content development, acting as aggregator and annotator; and technological development, to support creation, capture, and reuse of data. Feeds into People Australia through harvest; will be populating into registry by harvesting from researchers (and vice versa). Need lightweight solutions because of diversity of platforms.

Margaret Birtley, Collections Council of Australia: Integrating systems to deliver digital heritage collections

Abstract

2500 collecting organisations in Australia; uneven staffing and resourcing. Still much to go with digitising collections.

Digital heritage is a subset of made or born digital information, prioritised for significance. Digital heritage is organised and structured, managed for access and use.

Collecting organisations innovate with Web 3.0, data selection & curation (metadata protocols).

Researcher issues: collection visibility, accessibility, availability, interoperability.

Collecting organisations issues: funding, lack of coordination, standards

Working towards national framework; action plan 2007, being broken up into advocacy plans and development plans.

Mark Hedges, King's College London: ICTGuides

ICTGuides: advancing computational methods in the digital humanities

Abstract

ICTGuides: Taxonomy of methods in humanities to allow reuse; facilitate communities of practice around common computational methods.

Listing of projects includes metadata and service standards. Includes tutorials, available tools.

Mark Birkin, U. Leeds: An architecture for urban simulation enabled by e-research

Abstract

Model impact of demographic change for service provision in cities.

Construct a synthetic population, with fully enumerated households; Demographic projections: dynamically model individual state transitions; look at particular application domains.

Components, MoSeS (Modelling and Simulation for e-Social Science): Data : Analysis : Computation : Visualisation : Collaboration. Aim to use secondary as well as primary data. User functionality through JSR-168 portlets. Moving to loose coupling with web services; allows workflow enactors like Taverna.

e-infrastructure through sharing resources through the Grid.

Complication that IT has to catch up, so still need to develop both IT and scholarship at the same time.

Kerry Kilner & Anna Gerber, UQ: Austlit

Transforming the Study of Australian Literature through a collaborative eResearch environment

Abstract

Austlit aims to be the central virtual research resource for Australian literature research & teaching. Has provided extensive biographical & bibliographical records, specialist data set creations. Was converting paper to web-based projects (Bibliography of Australian Literature); moves to process rather than product view of scholarship.

Supports research community building. Upcoming: Aus-e-Lit, deeper engagement with new forms of scholarly communication & publication.


  • Federated search, visual reports (graphs, maps: New Empiricism). Allow intelligent metadata queries.
  • Tagging & Annotation: collaborative (Scholarly editions; simple tagging)
  • Compound Object Authoring tools (OAI ORE), for publishing as aggregates
    • Data model: Literature Object Reuse & Exchange (based on FRBR)

Michael Fulford, University of Reading: From excavation to publication

From excavation to publication: the integration of developing digital technologies with a long-running archaeological project

Plenary session; Abstract

Archaeological project on Silchester has been running for 12 years. A complete town. Project involves management of a large number of researchers, including undergrads; a non-trivial logistical exercise. Stratigraphic complexity.

Integrated Archaeological Database (IADB): used since the start in 1997; has evolved with the project. Contains most records gathered on site, incl. field records and records of context sheets. Aimed to provide integrated access to excavation records, in a virtual research environment. Scope has broadened to include archival functions, project management, and now web publication.

Digital research essential: no pre-digital archaeological town studies have ever been properly published.

VERA: Virtual Environment for Research in Archaeology. JISC funded, has contributed to current excavations. Aims to enhance how data is documented; web portal, develop novel generic tools, and test them with archaeologists.

Have piloted a digital pen for context notes, and are now using it throughout: 50% of notes. Speeds up post-excavation work by removing transcription costs. Had also experimented with iPAQs, tablet PCs (problem with sunlight), and the DigiMemo pad (not robust).

Capturing 2D plans. Have started trialling GPS.

Stratigraphy based on gathered data in IADB. Collaborative authoring. LEAP Project: linking electronic resources and publications, so can inspect data holdings supporting papers. Has not been easy in archaeology until now.

There is no one answer on how to use the tech: it must be driven by the research. Resourcing constrains what can be done.

Summaries from e-research Australasia

The next several posts capture sessions I attended during the e-research Australasia 2008 conference. I attended sessions Monday and Wednesday, and workshops Thursday and Friday. The notes are short jottings, and will probably not go any further than the inevitable Powerpoints when they are published.

2008-06-28

Version Identification Framework

David Puplett.

Two streams: the VERSIONS project, then VIF. Much overlap. VERSIONS: e-prints in economics. Main output: a toolkit, mainly for authors of e-prints, to understand the types of versions they would be encountering in the Open Access landscape (preprint, postprint, draft, etc.)

Beforehand RIVER project and NISO / ALPSP project had come up with terminologies for journal article versions; controversial, because focussed on "versions of record", which privileged postprint as publisher version.

VERSIONS did interviews on what academics' practice was, when they made things public, to whom (e.g. scholarly networks, departments, repositories); then started own version of terminology. Also what behaviours to realistically expect of authors when contributing content: what engagement to realistically expect in differentiating versions on their own, and how aware they were of issues. (Lots of academics deleted as they went, left only printed versions behind.) Reports on surveys on VERSIONS website: lots of anecdotal material from author point of view. Output toolkit to disseminate to repositories: how to describe different versions, and how to make them useful in the repository context. Draft, Submitted, Accepted, Published, Updated.

Since specific to domain and e-prints, could go on with more recommendations; e.g. embedding versioning into cover sheet of e-print (with disclaimers). At mo', added manually. Coversheet embedding ensures googling still gets you the metadata. Inspired by arXiv's use of watermarking into the margins of the PDF.

Open Access has been driver. VIF was about all objects in repositories, so not just items in publication cycle; e.g. also research data, videos. So not involved in Open Access debate. VERSIONS tried to be agnostic towards Open Access, but does support it through encouraging content depositing.

VERSIONS was mostly scoping. Need identified for broader solution to version identification problem; e.g. organising content in repository, versioning of metadata, cross-repository discovery (deduplication).

VIF applicable to any object in a repository. 10 months long, with concrete deliverables at the end. (Will be a short followup in September, surveying takeup and further publicity.) Started with its own survey, of academics and repository managers: how they discover content, when they find multiple versions, and how they found the version they wanted (or near enough). Bad news: very few people found it easy: terminology is confusing, or no metadata about versions is presented at all. Accepted copy, self-deposited (esp. early on): there was minimal metadata gathered. People constantly going back to Google, because no metadata was embedded in what they'd retrieved. Led to the framework work.

Education component: raise awareness, help contributors reflect on what versions are: there hasn't been enough repository outreach on educating, since effort has been just on gathering content. Needed to do doom-mongering: versioning is essential to establishing authority for research outputs. Audiences:

  1. repository managers (key audience);
  2. content creators (difficult group to engage with --- if they were already engaged, they knew the bare basics, and only cared about the bare basics, don't care about repository mechanics. So minimised advice burden: no overkill, rely on the toolkit to get people started.) (Toolkit goes into advocacy of repository managers towards content providers.)
  3. Software community: developers of the major repository packages as well as local systems teams customising repositories.

Got progress with EPrints, who were engaged, and have integrated version tagging à la VIF into their development; DSpace is much further behind --- they have scoped on their own that versioning needs to happen, but have not prioritised it for development. Fedora does versioning, through datastreams: not very flexible: it assumes a linear sequence of versions, which VIF was not restricted to. Recommendations were not addressed to individual software packages but kept generic. Version support in the three repository packages is different.

VIF counts as versions both FRBR expressions and manifestations: there is disparity in what different people count as a version. The Work (author-title) brings both kinds of object together. There was not high awareness of FRBR in the repository community at the time.

e-prints application profile for scholarly metadata. Several app profiles coming through, with ongoing development: images, geospatial data. E-prints have a straightforward FRBR structure; images are much more problematic: e.g. what is the subject matter unifying the objects into a single Work? JISC wants the app profile groups to work together more for consistent outputs. Given the disparity in resourcing of repositories, getting coherent policy nationally is a challenge. To that end, the coversheet is much more doable than the abstractions of FRBR and app profiles: VIF has had to be pragmatic; it is not a technical project. Certainly not doing the big philosophical questions of what a version is, but tangible solutions and easy wins. (Unusual for JISC projects, which are typically more experimental.)

Blog updating existing subscribers on developments after project conclusion. Are getting generic questions about metadata which have version issues involved. Written articles to maintain awareness. Have not gotten into learning objects: versioning had not been an issue in the UK, because only the latest version is maintained. Also, large-scale repository issues with mapping astronomical data into different versions are way too complex to be in scope.

Some limited vocab work, but not focus of project; reflecting existing best practice.

2008-06-26

e-IUS

Whole project team.

JISC 2 year project, 9 months left. Aim: to capture experience of using e-infrastructure to support research, driven by research community viewpoint.


  • Interviews, 1-to-1 and 1-to-team: user experience reports;
  • thence, use cases: actually scenarios, based on fact not fantasy.
  • Once critical mass of scenarios, Service Usage Models (SUMs).


Primarily research community, aim to raise awareness of how e-infrastructure can do things for them. Do need to make outputs applicable cross-domain. Also engagement at higher level with JISC/funding: SUMs feed into funding process. Project can serve as snapshot of takeup of e-infrastructure, now that that phase of deployment is wrapping up.

Mercedes has been full time RA doing e-framework writing since start of project. Others have come in later. Have been recruitment delays. Initial output was scoping study, core document (must read): methodology with critical evaluation, sample use cases from interviews, with reflections both from team and researchers. Tried and true methodology, used in other projects already. Their SUMs are higher level than some of the JISC SUMs.

Methodology challenge: establishing contact for interviewing. Not enough resources to embed into projects (as the anthropologists would prefer), so hour-long open-ended interviews. Discuss day in the life of their research. Do not ask what infrastructure they use. Establish points of connection between their research and the extant e-infrastructure services by elicitation. There is still much overlap between practitioners and developers, so still fuzzy in establishing that boundary.

e-infrastructure is broadly meant: not just e-science, but not as broad as just using Word either. Advanced networked IT is what they understand as e-infrastructure: the national Grid is archetypal such e-infrastructure: common, service-oriented. Much e-infrastructure is not yet services, still pilot oriented to a particular project.

Use cases: long version (prose) => short version (pictures). Have ensured they're aligned to each other. Given scenario, identify what is behind it at different levels. E.g. for grid SUMs, authentication is a separate, omnipresent level of functionality; so is job monitoring. Only some of the functions are used in any one scenario. Also translate requirements statement (GEMS-I: Grid enabling MIMAS services project) into business processes.

Used SUMs to compare systems: which service genres do they have in common, and what effort are they replicating. Very good summary paper which does so for GEMS-I, GEMEDA, MOSES.

In general: cogent --- I'd say compelling --- tie-in of business analysis to service usage models, anchoring one to the other, and an excellent model for getting stakeholder buy-in into the e-framework. I think they've got it exactly right.

Virtual Research Environment for the Humanities

Ruth Kirkham, project manager

2005: requirements gathering. Go to humanities researchers, and see what they wanted --- not build it first and force it on them. How would it integrate into their daily work, what tools do they currently use, what would be useful. 6 months. Towards the end, four demonstrations to run past them:


  • Research discovery service (already in place in medical sciences; pushing Oxford humanities research into the web). Because Oxford is so decentralised, people didn't even know what was happening within faculties (people talk within colleges instead). Customising medical discovery solution to the humanities.
  • Physical tools: digital pen and paper (also done with biology vre), allowed digital notetaking. But libraries would not allow anything in the building with ink in it! Are working on pencil version. Hasn't been followed up so far, will likely get resumed next phase.
  • Access grid, personal interface. "But we have books, I want to use my office." Overtaken by developments like Skype.
  • English department, Jane Austen manuscripts being digitised and compared: cross searching of databases into their environment: English Books Online, Samuel Johnson dictionary, 18th century bibliographical dictionaries, etc. This did not become a fully fledged service, but has gone forward.



The Ancient Documents demonstrator got more funding, became more robust, a fully functional mockup. The English engagement was going on while the ancient documents work was going on, and is ongoing. The VRE for docs & mss was funded last year, March 2006-March 2009: broadening outputs from the previous project and working them into the VRE. Focussing on those two demonstrators.

Development & user requirements are iterative: feedback from users every three months. Now have a functioning workspace; most goodies are on the development site, not the more public site. Most work is still with the ancient documents people. Recently have started speaking to English again. The Archaeology Virtual Research Environment at Reading (VERA) is trying to pull in data from databases; working with them to situate artefacts within their archaeological contexts; currently proof of concept. (The Reading work is on a pre-literate site at Silchester, so the data isn't there yet, but the conceptual work can still be demonstrated.)

Recently moving to a generic RDF triple store, which will accommodate disparate unstructured data better. Of course not well optimised for access like relational databases are, so ultimately there may be a scalability issue (but probably not in this discipline). Annotations stored as RDF, as well as metadata about images fitting an ontology. Had to homebrew the ontology; what was out there did not match requirements. Intend ultimately to link to the CIDOC ontology (museums & archives), which is gaining them traction for artefacts like monuments.
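
As a concrete (and entirely hypothetical) illustration of the triple-store approach --- storing an annotation and some image metadata as RDF --- here is a minimal Python sketch using rdflib; the namespace and property names are invented, not the project's homebrew ontology:

    from rdflib import Graph, Namespace, URIRef, Literal
    from rdflib.namespace import DC, RDF

    VRE = Namespace("http://example.org/vre-ontology#")   # hypothetical homebrew ontology

    g = Graph()
    img = URIRef("http://example.org/images/stylus-tablet-42")
    note = URIRef("http://example.org/annotations/1")

    # Descriptive metadata about the image, against the (hypothetical) ontology
    g.add((img, RDF.type, VRE.ArtefactImage))
    g.add((img, DC.title, Literal("Stylus tablet, Vindolanda")))

    # An annotation pointing at the image; free-text body kept as a literal
    g.add((note, RDF.type, VRE.Annotation))
    g.add((note, VRE.annotates, img))
    g.add((note, VRE.body, Literal("Possible alternative reading of line 2")))

    print(g.serialize(format="turtle"))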

Hey, the ontology website is hosted in Greece. Must keep in mind for next junket. :-)


Lots of linked images of the same artefact, varying from each other minutely; no standard to capture that distinction. The other VRE is looking at polynomial texture maps: how the object reacts to light, a polynomial for each pixel. Need lots of photos (30-50). Effect is virtually "moving a torch across" the artefact, seeing it under different lighting. (Is not going so far as a 3D model.) Archaeologists already using this tech live. There will be bulk photo'ing of the Vindolanda tablets at the British Museum in July.
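
For reference, the per-pixel polynomial in a PTM is a biquadratic in the projected light direction; here is a minimal numpy sketch of fitting and relighting, with illustrative shapes and a plain least-squares fit rather than the VRE's actual pipeline:

    import numpy as np

    def fit_ptm(images, light_dirs):
        """images: (N, H, W) grey values; light_dirs: (N, 2) projected (lu, lv)."""
        lu, lv = light_dirs[:, 0], light_dirs[:, 1]
        # One row per photo: lu^2, lv^2, lu*lv, lu, lv, 1
        A = np.stack([lu**2, lv**2, lu*lv, lu, lv, np.ones_like(lu)], axis=1)
        n, h, w = images.shape
        coeffs, *_ = np.linalg.lstsq(A, images.reshape(n, -1), rcond=None)
        return coeffs.reshape(6, h, w)               # six coefficients per pixel

    def relight(coeffs, lu, lv):
        """Evaluate the per-pixel polynomial for a new light direction."""
        basis = np.array([lu**2, lv**2, lu*lv, lu, lv, 1.0])
        return np.tensordot(basis, coeffs, axes=1)   # (H, W) relit image

    # e.g. 40 photos of a 100x100 crop under varying (made-up) light directions
    imgs = np.random.rand(40, 100, 100)
    dirs = np.random.uniform(-1, 1, size=(40, 2))
    print(relight(fit_ptm(imgs, dirs), 0.3, -0.5).shape)   # (100, 100)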

Plugging in work from Ségolène Tarte also working for Prof Bowman: imaging ancient documents, image analysis of artefacts. Removing woodgrain from pics of stylus tablets, to highlight the writing.

Will be pursuing collaboration with Aphrodisias project at King's College --- and anyone else relevant in the field.

Oxford's CMS is Sakai. Intend seamless integration. Classicists haven't picked up on CMS yet as a practice. Publishing images and annotations to repositories is not something on the classicists' horizon yet (too traditional, and the community is moving very slowly in that direction), but the Oxford institutional repository is very much interested. Will wait on the institutional repository to give the lead, and the team will do the interfaces to it. A single-button deposit vision would make life easiest for them. The project is careful to follow the users and not dictate solutions to them.

Lessons:


  • Cool --- and useful --- eye candy
  • They know well the cultural resistance they will find, so they are developing features slowly and incrementally; and always what the users want and see the point of, rather than what is blindingly obvious (e.g. electronic publication)
  • In the humanities, it's all about the metadata, not the data. (The data are just photos)
  • Hence the embrace of the woolly mess of RDF --- a very long way from the rigid schemas of CCLRC
  • Portal environment for collaborationware, and to provide access to online databases. Just access: they're not going to try and RDF the Perseus project (mercifully)
  • On the other hand, bold vision of mashing up classics and archaeological data (from VERA) to situate their data in context.

    • That's the kind of cross-disciplinary work that needs to happen and doesn't
    • They're lucky to have champions in both disciplines who "get it"
    • Mashing up RDF with RDF? It'll be wonderful when it happens; it'll also be very difficult.

2008-06-24

RIDIR

The proper pronunciation of the project name is with the first i long: [ɹaidəɹ].

Spent well over an hour gesticulating my way through the high-level use case diagrams of the PILIN national service, and the deliverables. (No software demo.)

RIDIR is leading up to UK mgt of national identifier scheme: fraught, but doable. JISC interested in scoping whether PILIN stuff is useful; if this happens, it'll be in the next nine months. UK needs to be sold on uses of identifiers to support national identifier infrastructure. PILIN has some interesting ideas they want to hear at more length to use in their advocacy.

Martin Dow and Steve Bayliss (ex-Rightscom, now Acuity Unlimited) have done extensive ontology work; we must hook up with them to find out how they did it, for our own ontology formalisation. Their chosen ontology IRE (built on top of DOLCE) can manage workflows and time-dependencies. They have also worked in the past on OntologyX.

Some subject domains get central data services here more than others (e.g. atmospheric data service is centralised, but there is no central geo data service: they just make data and throw it away).

JISC had started out from Norman Paskin's D-Lib article, understanding that identifiers were happening in the physical world and with Rightscom's work, and then wanted to investigate analogies with the issues in the repository world. Hence the initial workshops; no obvious use cases or pain points ("fear factors") emerged from the workshops. It was clear to RIDIR, a few months in, that the communities had no clear requirements for identifiers. There were only vague indications of the importance of identifiers to the communities. Rightscom had to go back to JISC for a steer half-way through: they could either (a) talk about the value of identifiers & persistence in itself; or (b) illustrate how to do persistence ("cost" approach), taking the need for persistent identifiers as given. The latter was preferred, largely because it was felt PILIN had already captured the whys and technical details of identifier persistence. (They are satisfied from my presentation that PILIN and RIDIR are still complementary.)

RIDIR went for corner cases, which sit outside policy and identifier infrastructure. Still difficult to arrive at concrete cases. They are at the last quarter of designing & building software, which had been held up because of lack of use cases. They only had time for a couple of software iterations. But in tandem with PILIN, will be useful work.

Would be v. interested in getting what they can from PILIN, about the usefulness and structure of a national provider, before they finalise their report.

Cf. Freebase as a dynamically registered type-language framework, which lets you identify anything with your own vocabulary.

Demonstrators: the achievable use cases to highlight issues:

  • Lost resource finder, backtracking through OAI-PMH provenances, to mitigate when identifiers really do get broken (cited resources which no longer resolve). Accepting that the world isn't perfect, and dealing with the problems as they arise.
  • Locating related versions of things. Vocabulary for relations between things: making the assertions of relation first-class objects, with names and types and authorities, drawing on existing vocabularies. There is no agreed vocab of what is identified, so needed to put in place a free-form, user-driven semantic workspace: they choose their ontologies, you just provide the infrastructure for it.


Could go further with this; e.g. OpenCalais, which extracts vocabs of concepts from a corpus, and could reconcile them with the labels of the concept graphs.

RIDIR is about identifier interoperability, which means:
  1. metadata interoperability, reconciling different schemas; metadata are claims about relation between identifier and thing;
  2. mechanisms for expression of relationship between different referents (e.g. relationship service);
  3. creation of common services, consistent & predictable user experience across services


Demo 1: Lost Resource Finder. Capture relations between URLs that no longer work and new URLs; no reverse lookup via a Handle. Two relations between URLs; the relations are RDF-driven. First, a manager can register a new URL corresponding to an old URL --- leading to a redirection splash page (it's an *authoritative* redirect record). This deals with decommissioned repositories. Second, can also crowdsource redirections of lost resources: "RIDIR users suggest that this has been redirected to...", complete with confidence rating. The redirect allows users to supply their own vote: "that's it", "no it isn't", etc. Can also search for alternate versions of content through searching the OAI-PMH harvest of all [e-research] repositories in the UK. An OpenURL metadata-driven query can be used to plug in to redirect, driven by metadata, to the new resource --- and identifiers are not invoked at all: this takes place, after all, as a fallback when the persistent identifier has already failed. This is a custom 404 page, offering heuristic alternative resolutions.
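
A toy sketch (assumed names and structures throughout, not RIDIR's implementation) of the two kinds of redirect record: an authoritative redirect registered by a manager, and crowdsourced suggestions whose confidence comes from user votes:

    from dataclasses import dataclass

    @dataclass
    class Suggestion:
        new_url: str
        up: int = 0
        down: int = 0

        @property
        def confidence(self):
            total = self.up + self.down
            return self.up / total if total else 0.0

    class RedirectRegistry:
        def __init__(self):
            self.authoritative = {}   # old URL -> new URL (manager-registered)
            self.suggestions = {}     # old URL -> [Suggestion, ...] (crowdsourced)

        def register(self, old_url, new_url):
            """Manager registers an authoritative redirect (e.g. a decommissioned repository)."""
            self.authoritative[old_url] = new_url

        def suggest(self, old_url, new_url):
            self.suggestions.setdefault(old_url, []).append(Suggestion(new_url))

        def resolve(self, old_url):
            """What the custom 404 page would show: the authoritative target if any,
            otherwise crowdsourced candidates ranked by confidence."""
            if old_url in self.authoritative:
                return ("authoritative", self.authoritative[old_url])
            ranked = sorted(self.suggestions.get(old_url, []),
                            key=lambda s: s.confidence, reverse=True)
            return ("suggested", [(s.new_url, s.confidence) for s in ranked])

    r = RedirectRegistry()
    r.register("http://old.repo.example/123", "http://new.repo.example/xyz")
    print(r.resolve("http://old.repo.example/123"))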

This was accompanied by a demo of the underlying RDF (not in the final release); they have reified all the assertions involved in the model, so they can make claims about them. ORE came into being too late for the RIDIR project to engage with substantively, though they like what they see; maybe next extension.

Demo 2: Locate Related Versions. We have identifiers for stuff and relations between them again; the relation changes to versioning. The demonstrator works off TRILT (all BBC broadcasts in the UK) and the Spoken Word Service (repository of BBC items usable as learning resources); wanted to do relationship assertions across repositories to build added discovery value. Had a ready list of relations (not just "identical-to"); but building a complete vocab was completely out of scope and inappropriate, esp. given disparate communities of users. Can crowdsource suggestions of other related versions of the resource, and indeed of other related content. Very very cool.

And I demo'd icanhascheezburger.com as instance of crowdsourcing links between resources.

REPOMMAN


Two goals:

1. Facilitate workflow interaction with the repository, to support personal interaction. There has been less takeup of repositories because people are only asked for input at the end of the creative process, so the input looks to them like an imposition, an extra task. REPOMMAN aims to allow use of the repository at the start of the creative process, to capitalise on the benefits of a repository. E.g. a first draft deposited in the repository is secure and backed up. REPOMMAN uses FEDORA, so it has versioning, allows for backdating, revert. Web-accessible tool, so users can interact with their own files from anywhere on the internet, more flexibly than they would with a network drive. Many many types of digital content, so didn't want to restrict to any one genre: hence FEDORA. Not focusing on open access (Hull is not research intensive) or e-prints, but enabling structured management whatever the content. Much organisational change at Hull about learning materials, which is now settling down, and will decide takeup in that sphere. Pursuing e-thesis content as well.

Pragmatically, too much variation to serve all needs, so REPOMMAN could not go down the ICE path of bolting workflow on to content. They treat it as a network drive, competing with SharePoint, to get user engagement. Interface mimics FTP.

2. Automated generation of metadata. Another perceived barrier to takeup of repositories, esp. for self-archiving. Still no perfect solution, but some things can be done with descriptive metadata. To capture metadata: aspects of the profile of the depositing user --- deposit happens through the portal. Can capture tech metadata (JHOVE). For descriptive metadata, went hunting for tools; the best one was IVEA (backend) within the Data Fountains project (frontend), ex UC-Riverside. Available as a download as well as an online demo; Linux (Debian & Red Hat). Easy install.

Most extractors match texts against standard vocabularies/schemas, to identify key terms; e.g. National Library of NZ, agricultural collections for metadata extraction (KEA) for preservation metadata. But such solutions need an established vocabulary, and work best with single-subject repositories, which is not practical for institutional repositories. IVEA does not require standard vocabs. Has been trained to deal with a wide range of data; not infallible, but good enough. Deals with anything with words in it. So long as the metadata screen is partially populated, it is easier to complete the population.
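
This is not IVEA, but a crude frequency-based sketch in Python of how candidate subject terms might be proposed to partially populate a metadata screen; a real extractor does far more (training, phrase detection, weighting):

    import re
    from collections import Counter

    STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "for", "that", "with", "on"}

    def suggest_keywords(text, n=5):
        words = re.findall(r"[a-z]{3,}", text.lower())
        counts = Counter(w for w in words if w not in STOPWORDS)
        return [w for w, _ in counts.most_common(n)]

    sample = ("Versioning of repository content: how repository managers and "
              "content creators describe versions of research outputs.")
    print(suggest_keywords(sample))   # candidate terms to seed the metadata screen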

Joan Grey is setting up Data Fountains account from ARROW. Proposed to do parallel QA tests with Hull. Must follow up; must hook ARCHER up as well.

User requirements gathering done at the start: no surprises. Interviews with researchers, admins, teachers. Not released in the wild yet: sustainability issue (gradually working towards), and people need to cross the curation boundary from private to public repository, so the additional publish step needs to be scoped as well. REMAP is the followup project dealing with that, underway this year. It refocusses REPOMMAN on records management & preservation. These are library processes that repositories should be supporting, and they can help get records in the right shape for subsequent processes. REMAP sets flags & notifications for when tasks should happen; e.g. review annually, obsolete, archive (e.g. PRONOM, national archives, AONS). Proactive records management. Notifications are to humans. They use BPEL, and BPEL for People. REPOMMAN identified the need for the repository to support admin processes. These were low-hanging fruit.
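
A minimal sketch of the flag-and-notify idea --- nothing like REMAP's BPEL processes, just an illustration: each deposited item carries simple lifecycle flags, and a periodic job emits human-readable notifications when a task falls due. The PID and dates are made up:

    from dataclasses import dataclass
    from datetime import date, timedelta
    from typing import Optional

    @dataclass
    class Item:
        pid: str
        deposited: date
        review_every: timedelta = timedelta(days=365)   # e.g. review annually
        last_reviewed: Optional[date] = None

        def review_due(self, today):
            anchor = self.last_reviewed or self.deposited
            return today - anchor >= self.review_every

    def notifications(items, today):
        for item in items:
            if item.review_due(today):
                yield f"Review due for {item.pid} (last action {item.last_reviewed or item.deposited})"

    items = [Item("hull:1234", date(2007, 5, 1))]     # hypothetical PID
    for msg in notifications(items, date(2008, 6, 20)):
        print(msg)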

Scope for more testing, hoping to do so in parallel with ARROW. Need to generate "good enough" metadata, not perfect; already appears to be there. Working on the institutional takeup; e-theses most promising avenue.

Although REPOMMAN is a personal space, they would like to get into collab space as well. I noted that the definition of the curation boundary allows the distinction between collab space and public space to be formulated.

2008-06-22

Everything Oxford

Anne Trefethen



Director, Oxford e-research Centre

Have tools for data integration in climate research in collab with Reading; much of it based on Google Earth overlays, RESTful computation services. Will be enabling community participation à la Web 2.0, to make it a shared resource.

SWITCH have developed a short-lived certificate service: Shibboleth => ticket access to the Grid. JISC is doing similar work for the national Grid service.

The UK is missing federated ID in their shibb, and Simon Cole's group (Southampton) is keen on taking it up. Australia has finally gotten the federated attribute through in the AAF.

Project for integrating OxGrid outputs into Fedora research repository; will be interested in ARCHER export to METS facility.

The Open Middleware Infrastructure Institute (Neil Chue Hong, director; omii.ac.uk) would be an opening for concertedly disseminating e-research workflow stuff from ARCHER.

Concerns over business model to sustain infrastructure in the longer term, especially as data management plans are requiring maintaining data for 10 years.

Current feasibility study on UK research national data service: Jean Sykes

----

Mike Fraser



Head of infrastructure, OCS. Access management and hierarchical file service (archiving & backup, not a generalised file service). Backup of clients whatever their purposes; the archiving service is research-oriented: long-term file store, no data curation, tape in triplicate. IBM Tivoli system. 1 TB free, review after 5 yrs. Need a contact person for the data for follow-up. Data not always well documented. Migration & long-term store by policy.

Currently metadata is at the depositor's discretion. Archiving has become more prominent than the backup service. Not providing proactive guidelines on how to bundle data with metadata & documentation for long-term reuse. No centralised filestore; data storage is funded per project. Oxford is decentralised, so is a federation of file storage; SRB as middleware for federation is attractive. Need to be able to integrate into existing Oxford workflows.

Luis Martinez-Uribe



Ex-data librarian, LSE. Scoping digital repository services for research data management: scoping requirements. Currently interviewing researchers for their requirements and current practice in managing their data. They do want help from the central university. Requirements: a solution for storage of large research data, esp. medicine & hard sciences (simulation data); infrastructure for sustainable publication & preservation of research data; guidance & advice. Publishing in a couple of weeks. Used the data collected as a case study for the scoping of the UK Research Data Service.

John Pybus



Building Virtual Research Environment for Humanities. (I'll be revisiting the project manager for this project, Ruth Kirkham, next Wednesday.)

Collab environment, not just backroom services. First phase 2005 was requirements analysis: use cases, surveys of Humanities at Oxford. Identify what the differences were from sciences and within humanities. Large scale collab is much less important: a lot of lone researchers, collabs are small and ad hoc. Then, second phase was pilot: Centre for Study of Ancient Documents, director Prof Bowman. Imaging of papyri & inscriptions. VRE is focussing on workspace environment beyond image capture, to produce editions. Collaborative decisions on readings of mss are rare, since ppl don't often get together around the artefacts.

Technically, standards-compliant portlets (JSR 168), deployed in uPortal. Not as bad to develop as it might have been. Java, not sexy like myExperiment (Ruby on Rails). End goal is a set of tools that can be deployed outside Oxford Classics, so need the ability to develop custom portlets on top.


  • High-res zoomable image viewer & annotation tool: online editor; intend to make harvestable: will be the Annotea schema (though not an Annotea server). Also annotations on annotations. There are private, group, and public annotations (see the sketch after this list).
  • Chat portlet and other standard collab tools.
  • Link to databases &c., bring them into the portlet environment.
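
A sketch of what that annotation model might look like as a data structure --- targets are either an image or another annotation, with private/group/public visibility; the names and example values are illustrative, not the project's schema:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Visibility(Enum):
        PRIVATE = "private"
        GROUP = "group"
        PUBLIC = "public"

    @dataclass
    class Annotation:
        author: str
        body: str
        target: str                            # image URI, or the id of another annotation
        visibility: Visibility = Visibility.PRIVATE
        parent: Optional["Annotation"] = None  # annotation on an annotation

    reading = Annotation("ruth", "Possible alternative reading of line 2",
                         target="http://example.org/images/tablet-42",
                         visibility=Visibility.GROUP)
    reply = Annotation("john", "Agreed, though the second letter is doubtful",
                       target="anno:1", parent=reading, visibility=Visibility.GROUP)
    print(reply.parent.body)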


Schema & API/web services need to reuse existing material.

Common desire within the research fields to find other ppl, even within Oxford, to collaborate with.

Iterative refinement of tools.

Where possible, shibbolised, given how dispersed ppl are in the field. There are enough academics already on IDPs, that this is doable.

Difficulty of conveying to users which of the content they're contributing is publicly viewable and which isn't. myExperiment does this with a preview.

Ross Gardler



EDIT: see comments for corrections

OSS Watch: open source advisory service for higher ed and JISC-funded projects. Institutional: procurement (advise over the choice of open against closed infrastructure). Project level: project calls include a para on sustainability, which involves them for advisory. Sustainability has only been a priority for JISC in the past two years: a change in the existing project structure --- no projects had budgeted for sustainability.

Until now, projects were evaluated on user satisfaction, not sustainability. New funding round (August): self-supporting, community-driven, knowledge-base approach rather than a central advisory body. OSS Watch also assign funded resources to strategic projects as a priority for sustainability, to give them the necessary leg-up to make them sustainable and generic without disrupting the project.

Overlap with the Open Middleware Infrastructure Institute, which also develops software; OSS Watch won't, but instead builds communities. Sustainability needs larger communities than just institutions: OSS Watch is interested in linking up projects across institutions for critical mass. (Especially because the champions for open source projects are thin on the ground.)

Happens through selecting the right power users and managing their expectations, so they can be hand-held through playing with the development. OSS Watch currently mediate between developer expertise and users, translating feedback as a buffer. In the longer term, they will showcase successful, sustainable open source projects to encourage projects to sign up of their own accord.

Problem in selecting which project is a strategic priority, without domain knowledge across all domains.

2008-06-19

BECTA presentation

Gave the BECTA folks a summary of our work and deliverables on PILIN and FRED. BECTA are themselves at the scoping stage, and had some welcome scepticism about whether it was worth doing certain things: investing in persisting lots of identifiers, dealing with Their Stuff vs. Our Stuff, setting up formal repositories with metadata as opposed to leaving things to Google custom search. Aware of the non-resolvability of hdl:; I retorted with the Standard Resolver solution, as PILIN advocated with resolver.net.au. Wanted to be able to establish flexibility of resolution services, and granularity; we discussed information modelling. Were interested in the Reverse Citation service as a solution to the problem of bookmarking Handles.

Were curious about the extent of persistent identifier takeup by e-learning content providers, and what pain points LORN was finding. Liked the service approach in FRED, but were aware that little inconsistencies in profiling would ruin interop. Found the decentralisation constraints (for different reasons) of the VET and School sectors in Australia interesting. Interested in the FRED SUM as a resource. Would be outsourcing at least some functionality to commercial providers for greater persistence (market-sustained rather than project-based); interest in tuning the ranking of searches rather than sticking with elaborate metadata searches. Rights management piggybacking on the institutional affiliation of the requester would presumably exploit Shibboleth attributes. Want to get engaged with IMS LODE again. Interested in keeping the channel open with us.

Here endeth the point form.

TILE Workshop

TILE Reference Group Meeting

Phil Nicholls in attendance (who has already named me as an "acolyte of Kerry"); he has produced SUMs for the Library 2.0 requirements of data mining for User Context data (strip-mining logs, I guess), and identifying content as relevant to a given user context (correlating courses to reading lists and library loans). In other words: how to extract user context from users like me, to recommend content to me; and how to datamine that user context into existence in the first place. Reading lists, loan records, enrolments, repository logs, user feedback.
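
The data mining behind "people who borrowed this also borrowed" can be sketched very simply: build item-to-item co-occurrence counts from anonymised loan records and recommend the most frequently co-borrowed items. A toy Python illustration with made-up data (real recommenders, and TILE's SUMs, go well beyond this):

    from collections import Counter, defaultdict
    from itertools import combinations

    loans = {                      # anonymised borrower id -> items borrowed (invented)
        "u1": {"bookA", "bookB", "bookC"},
        "u2": {"bookA", "bookC"},
        "u3": {"bookB", "bookC", "bookD"},
    }

    cooccur = defaultdict(Counter)
    for items in loans.values():
        for a, b in combinations(sorted(items), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1

    def recommend(item, n=3):
        """Items most often borrowed alongside this one."""
        return [other for other, _ in cooccur[item].most_common(n)]

    print(recommend("bookA"))      # e.g. ['bookC', 'bookB']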

Larger context: the Read Write Web: recommendation engines --- a key tech in 2008 on the web. The space for TILE is already populated by non-library providers. The domain is responding to the requirement: libraries talk to Amazon, and there are borrowing suggestions at Huddersfield. The MESUR project has farmed ginormous amounts of data on loans and citations. Higher ed libraries have huge amounts of data that can be capitalised on for resource discovery, and which is comparatively well defined.

Going with the e-framework to get synergies with the other domains the e-framework is in (e-learning, e-research).

Tension of e-framework specificity and leveraging/reuse vs. "constant beta", flexible software development, which is contra rigid specs. Need to experiment for a length of time before fixing things down in e-framework. The approach seen as more questionable at a local institutional level than in a national context.

Need shared vocab, not just shared software, to move forward in the library field --- enable dialogue between participants in the national context; e-framework can help build up the vocabulary again.

Sidestepping researcher identity as feeding into this: too hard for now (not familiar with the domain), quite diverse in interface and sparsely populated. The student data is rich and uniform, so working with that as a priority.

Pain points: why isn't your uni library catalogue already like Amazon? How do you get the bits of the uni to talk to each other to deliver this? And why (and when) do people want an Amazon experience on their library catalogue?

Students already compare notes informally about their reading, which *might* motivate this kind of recommendation structure. But libraries are worried about data privacy; and the US is even more touchy. Data will be anonymised; but Student Admin will ask questions once data is aggregated by individual subject, let alone grades awarded.

Peak use of loan recommendations in the existing prototype (Huddersfield) is a month after start of term, when students start exploring beyond their prescribed reading lists.

Reading lists are useful inputs, but not necessarily useful outputs: they are fixed by academics once and for all.

e-portfolio a more important parameter for driving this tech than transitory external social networks like Facebook.

Contexts are multiple: can be institutional as well as individual ("what are our students reading?"), and people have multiple identities (Facebook vs. enrolment record): context needs to be tied down, to work out what to harvest. Students are also enrolled in more than one institution! If recommendations should be driven by learner-centered approach, then learner should have control of how their recommendations are used.

--- But if we just throw information into the open, without prescribing context, then contexts will form themselves around what data is available: users will drive it. (Web 2.0 thinking: no prescribed service definition, but data-centric driving.)

Systems need to be able to capitalise on this data to improve e.g. discovery (clustering).

***

Architectures of participation: the efforts of the many can improve the experience of the individual. Need to articulate benefits to users to motivate them to crowdsource.

Deduplication is key to users of catalogues. But surely that shouldn't mean JISC implements its own search engine?

OPACs not very good for discovery (no stemming, spellcheck); don't do task support (e.g. suggest new search terms).

Impediments: control, cultural imperatives; user base --- include lifelong learners?; trust, data quality (tag noise); data longevity; task/workflow support (may not support full workflows, which are not well understood, but can support defined tasks); cost; granularity of annotation target.

Unis are already silo'ing in their learning environments (Blackboard), and that's where they put their reading lists: how do you get information outside the silo?

JISC build a search engine? No, JISC get providers to open up their data, so the existing open source etc. efforts can inform their own search engines with the providers' contextual information.

CNRI Handle System Workshop: Middle Half

Daan Broeder, Max-Planck Institute for Psycholinguistics: Handle System in European Research Infrastructure Projects



Already been involved in infrastructure for several e-research projects.

Reliable references and citations of net accessible resources, particularly in language: audio-visual, lexica, concepts...

Number of resources can be large, especially if disaggregating corpora. (Then again, a persistent identifier for each paragraph is overkill, and has a cost.)

Identifiers are cited, and embedded, and in databases.

CLARIN project: making language resources more available. Aims to create federation of language resources, mediated through persistent identifiers. Preparatory phase 2008-2011, construction up to 2020. Builds on DAM-LR project: unified metadata catalogue, shibboleth federation, Handle system.

For flexibility, DAM-LR minimised the amount of sharing required. Developed a mover (move the data + update the identifier), and the ability to restore the Handle DB from scratch. All data needs to be recoverable from the archives themselves. Found that federation is not for all organisations, as it does impose an IT burden. Need centralised registration.
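
A minimal sketch of the restore-from-scratch property: if every archived object records its own handle and current location, the handle database can be rebuilt by walking the archives. The record format and values here are invented for illustration:

    # Made-up records: each archived object carries its own handle and location,
    # so the handle database can be rebuilt by harvesting the archives.
    archive_records = [
        {"handle": "1234/objectA", "url": "http://archive1.example/objects/A"},
        {"handle": "1234/objectB", "url": "http://archive2.example/objects/B"},
    ]

    def rebuild_handle_db(records):
        return {r["handle"]: r["url"] for r in records}

    print(rebuild_handle_db(archive_records))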

Max-Planck Society wants PID system throughout the Max-Planck society, which will also support external German scientific organisations.

Requirements for CLARIN: political independence: a European GHR (Global Handle Registry) and no single point of failure; wide(r) acceptance of the PID scheme (W3C!); support for object part addressing (ISO TC37/SC4 CITER: citation of electronic language resources); secure management of resource copies.

CLARIN will do third party registration for small archives.

Ongoing static from W3C in ISO. Proposes URLified Handles, suggests ARK model: http://hdl.handle.net/hdl:/1039/R5

Part identifiers: just like fragments in URIs: A#z => objectA?part=z, with a standard syntax for "z" for the given data type, exploiting existing standards.
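
A tiny sketch of that part-addressing convention: a fragment-style part reference on a Handle URI gets rewritten into a query the repository can act on. The exact syntax is illustrative, not the ISO proposal:

    def part_query(uri):
        """Rewrite a fragment-style part reference into a part query (illustrative syntax)."""
        base, _, part = uri.partition("#")
        return f"{base}?part={part}" if part else base

    print(part_query("http://hdl.handle.net/1234/objectA#z"))
    # -> http://hdl.handle.net/1234/objectA?part=z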

Replicas: federated edit access to the handle record by old and new managers. Known issue of access by multiple parties, trust. Could also have indirect Handles, i.e. aliasing. Not everything supports aliases well, and the status of the new alias for citation is doubtful.

Value-add services: document integrity; collection registration service (single PID for collection, with aggregation map à la ORE); citation information service (acknowledgements, preferred citation format to be included in citation); lost resource detective (trawl the logs, the web, etc to find where the resource has ended up, including tracking provenance history of who last deposited).
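
The document integrity service, at its simplest, is a checksum recorded against the PID at registration time and checked against retrieved copies later; a minimal sketch, with a plain dict standing in for whatever store the service would actually use:

    import hashlib

    integrity_registry = {}   # PID -> sha256 hexdigest (stand-in for the real store)

    def register(pid, data):
        integrity_registry[pid] = hashlib.sha256(data).hexdigest()

    def verify(pid, data):
        return integrity_registry.get(pid) == hashlib.sha256(data).hexdigest()

    register("hdl:1234/objectA", b"contents of the deposited object")
    print(verify("hdl:1234/objectA", b"contents of the deposited object"))   # True
    print(verify("hdl:1234/objectA", b"tampered contents"))                  # False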



---

What have I learned from the Handle workshop?


  • Handle type registry is coming. Complete with schema (which will need work) and policies (which look to be way too laissez faire)
  • The Nijmegen folks are gratifyingly coming to similar conclusions about things as us (e.g. REST resolver queries)
  • That European digital library is going to be huge... if it can hold together.
  • Selling the entire Max-Planck Gesellschaft on using persistent identifiers—that's huge too.
  • Scholars blog. And want credit for blogging.
  • There are ISO standards for disaggregating texts, among other media. (I can gets standardz?) And Nijmegen looks kindly on ORE.
  • The ADL-R is being released in a genericised form: DO registry.
  • Handle is being integrated into Grid services (but we already halfway knew that)
  • OpenHandle is still a good thing
  • XMP, for embedding metadata into digital objects, is now getting currency, and can be used to brand objects with their identifiers (amongst other things) and update that metadata with online reference (as I identified in a use case last year, methinks)
  • W3C continues to be all W3C-ish about non-HTTP URIs. People have not given up on registering the hdl: scheme.