In this issue:
- What's on, and What's new
- Editorial: Digital curation skills: Who needs them and how do you get them? Joy Davidson, Associate Director, Digital Curation Centre (DCC).
- Who's who: Sixty-second interview with Jen Mitcham of the Archaeology Data Service
- One world: Maggie Jones, Australia (former Executive Secretary of the DPC)
- Your view: commentary, questions and debate from readers
What's on:
What's New:
Editorial: Joy Davidson, Associate Director, Digital Curation Centre (DCC)
Digital curation skills: Who needs them and how do you get them?
In the words of William Kilbride in last month's What's New editorial,
'Information management is not trivial and it's not new: but dependence on digital sources for research and a massive increase in data volumes and complexity means that researchers face new challenges'.
Couldn't have said it better myself! Over the last decade we've seen a vast increase in general awareness of the need for digital curation and preservation activity. Indeed, many funding bodies now seek assurances at the bid stage from institutions and the researchers they employ that they are ready and able to manage access to their digital information over time. The DCC's handy policy overview shows just what UK funders currently expect. But just who is responsible for digital curation, and how do those charged with responsibility get the skills they need to do the job?
First things first - who is responsible for digital curation? Well, there really isn't any single role within an institution that can take on the effective management of digital information from creation through to re-use in isolation. Whether you're a researcher, a librarian, an IT specialist, or a senior manager, you have a role to play. What is less clear at this point is when and how the various roles should interact to best effect. Obviously, this will vary from institution to institution depending on infrastructures and support systems. The ongoing JISC Research Data Management Infrastructure (RDMI) Programme projects should go some way towards identifying current and good practice in this respect and will produce exemplars that other UK HEIs can follow. What we do know for certain is that clear and regular communication between the range of stakeholders is essential from the outset of any digital curation activity. It is easy to underestimate just how vital communication is when you are focused on seemingly more complex technical and operational challenges. But without clear communication about just what it is you are aiming to achieve, you may find yourself facing unnecessary roadblocks and costly delays. It may seem petty, but do spend some time early on agreeing and communicating key terms - like just what you mean by the term 'digital preservation' - as different stakeholders can have very different interpretations!
OK - we know who is responsible for digital curation, but now we need to determine how the various stakeholders can get the skills they need to undertake their specific roles in the digital curation lifecycle. There has been a lot of work in recent years to develop intensive training courses for data custodians and digital preservation practitioners, such as the Digital Preservation Training Programme (DPTP), Digital Curation 101, and Digital Futures. These courses all aim to attract participants from a range of professional backgrounds to ensure that a wide variety of perspectives are shared and that viable curation approaches can be jointly developed and implemented at the institutional level. They have proved very successful to date and have led to some real changes in working practice within institutions. There are also numerous postgraduate courses emerging that aim to produce professional data curators, such as the MA in Digital Asset Management (MADAM) at King's College London and the Data Curation specialisation within the MSc at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
So data custodians and preservation practitioners have some formal and informal training options, but how about reaching those who are generating research data in various disciplines? In their 2008 report to JISC, Swan and Brown recommended the development of short, postgraduate training courses aimed at researchers to help ensure that basic data management and curation skills are embedded into professional research practice. Taking this recommendation forward, JISC has recently issued a call for bids to develop discipline-specific research data management training programmes. In addition, several of the current JISC RDMI projects are also seeking to develop and implement postgraduate-level training as part of their sustainability plans. Researchers also have increasing access to a number of high-quality resources and dedicated support, like those being produced by the UK Data Archive, to assist them with their data management and curation activity. So researchers from a range of disciplines should have access to a number of postgraduate training options for honing their curation skills in the coming years.
But how can we start to ensure that formal and informal educational programmes for professional data curators and researchers complement each other and allow for portability of skills across both institutions and countries? At the moment, several international working groups are trying to establish basic skill-sets for emerging professional data curators. The International Data Curation Education Action (IDEA) and the European Commission MSc in Digital Curation and Preservation working groups are both aiming to pin down minimum requirements to allow for greater comparability of skills across current educational programmes. The RIN Information Handling Working Group is also active in this area, using the draft Vitae Researcher Development Framework (RDF) - which is intended as a professional and career development planning tool for researchers at all stages of their careers - as a means of benchmarking emerging training resources and providing pathways through heterogeneous professional development training offerings.
As you can see, there is a lot going on at the moment to help build capacity and hone curation skills. However, it seems the more we do, the more we realise needs to be done - so watch this space to keep up to date with what's new and what's on the horizon in the field of digital curation training.
Who's Who: sixty-second interview with Jen Mitcham, Archaeology Data Service
Where do you work and what's your job title?
I am a Curatorial Officer at the Archaeology Data Service (ADS) based at the University of York. Our offices are in King’s Manor which is a beautiful Grade I listed building just outside York city walls.
Tell us a bit about your organisation
We are a digital archive for archaeological data from all sectors. We were originally set up in 1996 and hosted one of the five subject centres of the Arts and Humanities Data Service (we were also known as AHDS Archaeology), but in 2008 funding for the AHDS was discontinued, and AHDS Archaeology was no more. The AHRC agreed to continue to support archaeology, though, and the ADS has been directly funded by them for the last couple of years.
How did you end up in digital preservation?
By accident, really. I started off with a degree in archaeology and worked as a field archaeologist for a few years. Then I did an MSc course in archaeological computing and got more involved in the computing side of archaeology (databases, websites, Geographic Information Systems, that sort of thing). On the basis of my computer skills, I managed to get a job at the Archaeology Data Service and have been here for seven years now. Digital preservation is something I have learnt 'on the job' while I have been here.
What projects are you working on at the moment?
We are really excited to be building up to the big launch of our brand new website. This is something we have been working on behind the scenes for many months now, but it is great to see it all finally coming together. I have been concentrating on fine-tuning the web delivery of all of our digital archives. Whereas in the past we hard-coded the details of each archive into the HTML pages, we now have a more dynamic website which allows many of the details of each of our 300+ collections to be pulled out of an underlying database. This will enable us to display each of our archives in a much more consistent way, which makes me very happy indeed!
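The database-driven delivery described here follows a common pattern: store each collection's details once, and render every page from a single template. A minimal sketch of that pattern, using Python's standard-library sqlite3 - all table, field and collection details below are hypothetical illustrations, not the ADS schema:

```python
# Sketch of database-driven page rendering: collection details live in
# one table, and every page is generated from the same template, so all
# archives are displayed consistently. The schema and sample rows are
# invented for illustration only.
import sqlite3
from string import Template

PAGE = Template("<h1>$title</h1><p>Deposited: $year. $summary</p>")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collections (title TEXT, year INTEGER, summary TEXT)")
conn.executemany(
    "INSERT INTO collections VALUES (?, ?, ?)",
    [("Stansted Framework Project", 2009, "Excavation database and GIS."),
     ("Medieval Britain and Ireland", 2010, "Searchable fieldwork summaries.")],
)

def render_pages(conn):
    """Render one consistently formatted HTML page per collection row."""
    rows = conn.execute("SELECT title, year, summary FROM collections ORDER BY title")
    return [PAGE.substitute(title=t, year=y, summary=s) for t, y, s in rows]

pages = render_pages(conn)
```

Adding or correcting a collection then means changing one row, not editing a hand-built HTML page.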
I am also busy working on lots of other bits and pieces. As well as individual archives such as a searchable database of fieldwork summaries for Medieval and Post-Medieval Britain and Ireland, I am also interested in the ‘bigger picture’ of digital archiving. Certification of archives is something I need to spend more time looking at when I get the chance.
What are the challenges of digital preservation for data services such as yours?
A lot of the challenges are the same as any other repository would have to deal with, though some headaches are caused by the wide range of file types that we have to handle. A lot of the projects we are asked to archive feature cutting-edge research using new and innovative technologies. As well as the more common documents, databases and images that can be found in the majority of archives, we also have to deal with the outputs from maritime and terrestrial geophysics, photogrammetry, Lidar, virtual reality and anything else that our depositors come up with. Finding ways to preserve these sorts of data can be a big challenge. They are invariably large in size and come in a huge variety of proprietary and binary data formats… sigh!
We are also continuing to work with data creators to think about the costs of digital archiving during the earliest stages of a project. Digital archiving is often perceived as expensive, and it is a challenge to convey to potential depositors the difference between digital archiving (and the time and costs involved in that) and simply disseminating something online with no archival backup.
Another more specific issue for archaeological data is that a lot of these born-digital data sets come from fieldwork or excavations that are non-repeatable. In excavating an archaeological site you are in effect destroying the remains that are in the ground. Once it has been excavated and recorded, there is no way you can go back and repeat the exercise. The digital records created on site during these excavations are therefore a very valuable resource, and that is why it is important that we do a good job of preserving them.
What projects would you like to work on in the future?
We are just starting a project to implement Fedora Commons as our digital object management system. This is one of those things we have been talking about for a long time but never had the time/money to implement. We currently have 300+ collections with an estimated one million files, so we didn't think it was going to be a trivial task to get this all set up and all of our current data recorded in the new system! We are very lucky to have received funding from DEDEFI to make this dream a reality, and this is something I am hoping to get involved with as the project gets underway.
As I said earlier, I am interested in the certification of digital archives and want to continue following closely any developments in this area. The TRAC checklist and the Data Seal of Approval have so far been useful benchmarks for us in making sure the ADS is moving in the right direction, though we haven't formalised any of this as yet. This is definitely something for the future.
What sort of partnerships would you like to develop?
We are always keen to work with other archaeological digital archives in order to avoid duplication of effort in archiving material deposited in other locations, whilst still ensuring that methodologies are employed to ensure users can find the archives regardless of their home (through the use of metadata harvesting, web services etc). We currently work closely with the London Archaeological Archive and Research Centre (LAARC), the Royal Commission on the Ancient and Historical Monuments of Scotland (RCAHMS) and the British Geological Survey (BGS) for example.
If just one tool or standard could be brought into existence that would make your job easier, what would it be?
Two things, actually (am I being greedy?):
- A metadata extraction tool (like JHOVE, or with the combined power of FITS) that can recognise and process all of the many file formats we deal with. The existing tools are great, but we want to be able to collect that level of information for all the weird and wonderful types of file that our depositors give us.
- A tool that reliably and seamlessly batch-processes PDF files into PDF/A… now wouldn't that save a whole lot of time?
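The core mechanism behind format-identification tools such as JHOVE, DROID and FITS is signature matching: checking a file's leading bytes against known "magic numbers". A toy sketch of that idea, using a handful of well-known signatures - real tools go much further, validating structure and extracting technical metadata:

```python
# Toy illustration of signature-based format identification -- the
# mechanism underlying tools such as JHOVE, DROID and FITS, reduced to
# a few well-known magic numbers. Real identifiers consult large
# signature registries and also validate internal structure.
SIGNATURES = [
    (b"%PDF",              "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"II*\x00",           "image/tiff"),   # little-endian TIFF
    (b"MM\x00*",           "image/tiff"),   # big-endian TIFF
]

def identify(data: bytes) -> str:
    """Return a MIME-type guess from a file's leading bytes."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return "application/octet-stream"   # unknown: flag for manual review
```

In practice you would read only the first few bytes of each file (e.g. `identify(open(path, "rb").read(16))`); the "weird and wonderful" formats Jen mentions are exactly the ones that fall through to the unknown case.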
If you could save for perpetuity just one digital file, what would it be?
On the basis of what I said earlier about archaeological excavation being non-repeatable, I think it would have to be a large excavation database such as the one created by Framework Archaeology during the excavations in advance of improvement works at Stansted airport. See: http://ads.ahds.ac.uk/catalogue/archive/stansted_framework_2009/
The stratigraphy, finds and features of this highly complex multi-period site were recorded in an extensive database which we have archived as a series of delimited text files. The database was used to drive a Geographic Information System of the site which we have replicated on-line as far as possible.
As the interpretation of archaeological data can be quite a subjective thing, the same dataset could be made to tell several different stories. Preserving a database such as this is crucially important, as it will allow future archaeologists to return to the primary data in order to test new hypotheses and carry out further analysis, leading to fresh interpretations of the site. Keeping datasets such as this one alive for further research is really what we are all about!
Finally, where can we contact you or find out about your work?
You can have a look at our website at http://ads.ahds.ac.uk (though the new website won’t be available publicly for a few months yet) or e-mail me at jlm10@york.ac.uk
One World
With thanks to Cornel Platzer and Michael Carden [NAA]; Maxine Davis, Douglas Elford, David Pearson, and Colin Webb [NLA] who provided insights into their current work. Any conclusions drawn, particularly with regard to the efficiency dividend, are mine and do not necessarily reflect those of either institution.
This brief report was written from the perspective of two cultural institutions: the National Archives of Australia and the National Library of Australia. Both organisations are digital preservation pioneers and have been practising what they preach for several years. This article provides a brief overview of some of the issues they are thinking about as they move well beyond the project phase and into fully integrating digital preservation within their respective organisations.
The National Archives of Australia has a dual role, articulated on its website as to:
- promote good records management in Australian Government agencies
- manage the valuable records of our nation, and make them accessible now and for future generations
In terms of digital preservation, the NAA sought to fulfil these roles by developing tools to manage efficiently the ingest of digital records into their archival store. One of these tools, Xena [XML electronic normalising for archives], detects the file formats of ingested digital objects and then converts digital records into open formats. The second tool, the Digital Preservation Recorder, manages the integrity and authenticity of records by recording and maintaining metadata related to preservation actions. Version 5 of Xena was recently released, and at each update NAA staff have been careful to take on board comments and feedback received from other organisations who have used Xena, as well as their own ongoing requirements. Xena is freely available, released under the GNU General Public License, and NAA staff welcome feedback on it. The rationale behind Xena is that by converting [or normalising] into open formats at ingest, digital records will have greater prospects for longevity and ease of management over the long term. The latest version also incorporates an OCR feature employing Google's Tesseract software, which enables extraction of plain text from scanned documents. This feature was designed with longer-term enhanced access for researchers in mind, but is likely to prove valuable in the short term for greater efficiency in access examination by NAA staff.
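The normalise-on-ingest workflow can be sketched in outline: keep the original bitstream, derive an open-format copy, and record the preservation action with checksums so integrity and provenance can be verified later. The sketch below is an illustration of that general pattern only - not the actual design of Xena or the Digital Preservation Recorder - and the converter and metadata fields are hypothetical:

```python
# Illustrative sketch of normalisation at ingest: convert a record to an
# open format and record the event with fixity information. The metadata
# fields and the trivial converter are invented for illustration; real
# systems (e.g. Xena + Digital Preservation Recorder) are far richer.
import datetime
import hashlib

def sha256(data: bytes) -> str:
    """Fixity value used to verify integrity later."""
    return hashlib.sha256(data).hexdigest()

def ingest(original: bytes, convert):
    """Normalise one record; return the preservation-event metadata and the copy."""
    normalised = convert(original)
    event = {
        "event": "normalisation",
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "original_sha256": sha256(original),
        "normalised_sha256": sha256(normalised),
    }
    return event, normalised

# Hypothetical converter: pretend normalisation is a text re-encoding.
event, copy = ingest(b"legacy record", lambda b: b.decode("latin-1").encode("utf-8"))
```

The recorded checksums are what later allow an archive to demonstrate that neither the original nor the normalised copy has silently changed.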
The NAA is currently investing further effort in preparing for an increasing volume of digital records from Australian government agencies by undertaking projects to test the efficient ingest and management of anticipated categories of records, such as database-backed business systems. The NAA know they will need to be able to deal with legacy systems, and their own practical experience is now enabling them to better prepare for future challenges. The need to scale operations to meet demand is a major source of effort and research on the NAA's part, but it appears that transformation on ingest is becoming an increasingly efficient strategy for managing and preserving digital records.
Under its Act, the National Library of Australia is required to develop and maintain a national collection of library materials and to make this material available. Translating this requirement to the digital world has been occupying the NLA for close to twenty years, so they have now built up considerable experience and expertise in all things digital. PANDORA (Preserving and Accessing Networked Documentary Resources of Australia), one of the first web archives in the world, began in 1996, and several other digital projects preceded that. Improving efficiency is also on the minds of NLA digital preservation staff, as their digital materials collection has just reached 0.5 PB. This is gathered from a number of different sources, including manuscript material, harvested websites, negotiated websites, a significant Oral History collection, and their own extensive digitisation programmes.
As well as volume, the diversity of digital material collected is posing some significant practical challenges. Like the NAA, the NLA has needed to develop its own tools to support its programmes. Recent tools include Prometheus, developed specifically to let NLA acquisitions and cataloguing staff transfer data efficiently from unstable physical carriers to mass storage. Another NLA-specific tool, Media Pedia, provides a knowledge base of structured descriptions of a wide range of data carriers, giving collection areas the information they need to make informed decisions about acquiring content held on various carriers. The latter is likely to have more generic utility, and indeed collaborative community development of the tool is actively encouraged. These tools were developed by the NLA in response to a detailed risk assessment exercise, in this case focussing on the specific risks associated with content held on discrete physical carriers.
A further tool currently under development is ‘Configulator’, a means of documenting the various hardware/software permutations needed to provide access to content held in a range of file and format carriers. The intention is to encourage community sharing of such information to enable indicators of effective obsolescence. NLA is also maintaining viewpath components in the pragmatic belief that it makes no sense to discard working instances of them before alternative means of access can be assured. They stress however that this is not a commitment to a technology museum strategy, simply a further risk management approach.
Collaboration is seen to be key, though it is in itself quite resource-intensive, and different funding regimes can sometimes inhibit effective collaboration globally. It is also often necessary to invest significant time and effort in building effective partnerships, during which little tangible result may be seen in the short term. The One World article contributed by Inge Angevaare in April's instalment of What's New has some interesting observations on levels of collaboration, based on four sectors.
In terms of web archiving, the NLA's membership of the International Internet Preservation Consortium (IIPC) has been an important means of pooling experience and expertise. The IIPC was formed in 2003 by 12 organisations, including the NLA; it currently has 37 members. A current IIPC Preservation Working Group project in which the NLA is participating aims to document the tools which provide effective access to live web content. An earlier IIPC project tested available tools for their usefulness in preserving meaningful access to web content and concluded that no currently available tools are capable of satisfactory transformation of web archives, or of emulating the full technical environment required to represent web archives in a different environment. Moving beyond capturing dynamic web content to ensuring continued access to it is a challenge facing all organisations involved in preserving this category of content, so it makes sense to build on collaboration and partnership to test viable strategies.
The NLA feels they are now in a strong position to understand and identify their specific requirements. This is based in part on their increasing practical experience, but also thanks to exercises such as a recent major proposal for additional funding, which provided a catalyst for refining requirements. It has become clear that moving the NLA into the next phase will require a new repository capable of streamlining inconsistencies, enabling sound preservation planning, and paving the way for a fully integrated and sustainable system.
In addition to forging a way forward, the NLA has retained a commitment to maintaining the PADI current awareness service, which originated the What’s New in Digital Preservation? updates.
Both NAA and NLA spoke of the difficulties in attracting sufficient, ongoing funding. No surprises there but digital preservation remains a tough act to sell when competing with other priorities.
An additional burden faced by Australian commonwealth government agencies is the annual application of an onerous cost-saving measure known as the efficiency dividend. Originally marketed as a short-term means to achieve a more streamlined public service, the combination of ingenuity and staff dedication in successfully coping with these cuts without obvious degradation to the level of service has made it an irresistible cost-cutting measure for successive governments for the past two decades. This year, however, cultural organisations are making a concerted effort to urge the Rudd government to revisit the efficiency dividend, drawing attention to the cumulative negative impact of the cuts. Well-argued submissions have been made, but it remains to be seen whether these will be effective in a federal election year. It is to be hoped they will be, for the sake of cultural organisations generally and digital preservation specifically. Australia made a rapid and early response to the challenges of digital preservation, resulting in one of the world's first web archives and innovative development of tools and software when there was almost nothing available off the shelf. Whether they can maintain such leadership will inevitably depend on the ability to attract adequate funding to build robust, scalable systems.