Dorota Minkiewicz is Archivist, Long-term Digital and Web Preservation, at the Publications Office of the European Union. She attended the IIPC Web Archiving Conference with support from the DPC Career Development Fund, which is funded by DPC Supporters.
On a rainy day in May, the Web Archiving community flocked to the Dutch city of Hilversum, where this year’s IIPC Web Archiving Conference took place at the stunning home of the Netherlands Institute for Sound and Vision.
It was my first time at this conference, thanks to the DPC Career Development Fund. And since I’m still a novice in the Web Archiving world, I was particularly keen to listen to the discussions around capture methods, playback tools, and promoting the active use of archives among researchers and students.
As you might have seen, Barbara Fuentes has already summed up the conference in her excellent post. So, I want to let you in on the details of one particular presentation, titled “Towards an effective long-term preservation of the web,” in which my colleague from the Publications Office of the EU, Corinne Frappart, tied together the worlds of Web Archiving and digital preservation.
Read on to find out more about what she had to say.
Why should we even consider long-term preservation for web archives?
At the Publications Office of the EU, we archive all websites in the europa.eu domain as well as a few additional websites of key interest. The tool we currently use for capture is the Internet Archive’s Archive-It, and the whole archive amounts to 60 TB of data gathered since 2013 across several collections. We also keep a copy of the whole ARC and WARC collection on hard discs in our physical archives.
And since January 2022, the new EU Legal Deposit scheme has strengthened our mandate to capture websites authored by the EU Institutions and to keep the archive open to all.
Now, as the long-term digital archive team, we preserve different types of publications authored by the EU Institutions: the Official Journal, general publications, procurement notices…, but what about websites?
Aren’t they also worth keeping? If they fall under the EU Legal Deposit, why shouldn’t they be covered by our preservation planning as well? Can we allow ourselves to wait until this information gradually fades away from the hard discs? Can we rely wholly on an organisation based outside of the European Union to preserve this content securely? And as no organisation lasts forever, what if (however unlikely it seems) Archive-It disappears, or we are otherwise forced to end the subscription?
Those were the questions that first prompted us to look into ways of not only crawling websites and giving access to our Web Archive, but also safeguarding the ARC/WARC files with proper planning of preservation actions that go beyond simple bit preservation. And as our internal research didn’t bring the expected results, we decided to commission a study that would help us investigate the current situation and best practices.
The study and its findings
Starting in July 2021, experts affiliated with our contractor reviewed the published and grey literature around the topic and conducted a series of interviews with leading institutions in the web archiving world. Here are the most important conclusions from this study:
- We have both ARC and WARC files in our collections, but there is no business case to actually migrate the older ARCs into WARCs before ingestion into the preservation system: they are stable, ubiquitous in the current generation of web archives, and access to content held in them is well supported.
- File format should be analysed after the files have been ingested into the preservation system and encapsulated within the appropriate AIP structure.
- But what does appropriate mean for web archives? It turns out the schema we use now for EU Legal Acts and Publications will probably not do, and E-ARK AIP would be a better fit. Each crawl should be considered a separate intellectual entity, so each ingested crawl should correspond to a new AIP.
- And what happens when the results of incremental crawls are spread across many WARCs in multiple AIPs? Apparently, they can be seamlessly traversed thanks to indexing and handled by access and presentation utilities.
- When it comes to metadata, the most important technical and provenance metadata are already present inside the WARCs (primary seed URL, target name, harvest date, etc.).
- Where descriptive metadata is created, it is generally done manually and most often based on Dublin Core records. It is more feasible for selective harvests than for broad ones, and is mainly done at the level of the seed or collection.
- Since the digital world is in constant flux, archival representation information (ARI) requirements are difficult to anticipate in advance. But to give future users a reasonable chance of reconstructing the context and content of the crawls, we should link them with at least the following ARI items:
  - ARC Description, WARC ISO Standard,
  - Description of CDX and DAT structures,
  - Harvester/Crawler and its source code,
  - Viewer (and its source code) and configuration details,
  - Some JPEG or PNG “images” (snapshots) of the pages to validate renderings over time,
  - Harvesting and preservation contracts.
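To make the points about embedded metadata and indexing a little more concrete, here is a minimal Python sketch of how record headers in several WARC files could be read and turned into a simple URL index. It is only an illustration, not the method used in the study or in our workflow. It assumes uncompressed WARC data held in memory (real files are usually gzip-compressed and would normally be indexed with a dedicated tool), and the function names `iter_warc_headers`, `build_index` and `make_record`, as well as the file names and sample records, are hypothetical. The demo records are also deliberately minimal and omit fields that a valid WARC record requires, such as `WARC-Record-ID`.

```python
import io
from typing import BinaryIO, Dict, Iterator, List, Tuple


def iter_warc_headers(stream: BinaryIO) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (offset, headers) for each record in an uncompressed WARC stream."""
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected data at offset {offset}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body using the Content-Length header.
        stream.seek(int(headers["Content-Length"]), io.SEEK_CUR)
        yield offset, headers


def build_index(warcs: Dict[str, bytes]) -> Dict[str, List[Tuple[str, str, int]]]:
    """Map each target URI to its captures as (date, file name, offset)."""
    index: Dict[str, List[Tuple[str, str, int]]] = {}
    for filename, data in warcs.items():
        for offset, headers in iter_warc_headers(io.BytesIO(data)):
            uri = headers.get("WARC-Target-URI")
            if uri is None:
                continue  # e.g. warcinfo records have no target URI
            capture = (headers["WARC-Date"], filename, offset)
            index.setdefault(uri, []).append(capture)
    for captures in index.values():
        captures.sort()  # oldest capture first
    return index


def make_record(uri: str, date: str, body: bytes) -> bytes:
    """Build a simplified WARC record for the demo (hypothetical helper)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two made-up crawls, each standing in for the WARC content of one AIP.
crawl_2013 = make_record("https://europa.eu/", "2013-05-01T00:00:00Z", b"old page")
crawl_2022 = (
    make_record("https://europa.eu/", "2022-05-01T00:00:00Z", b"new page")
    + make_record("https://europa.eu/about", "2022-05-01T00:00:05Z", b"about")
)
index = build_index({"crawl-2013.warc": crawl_2013, "crawl-2022.warc": crawl_2022})

for uri, captures in sorted(index.items()):
    print(uri, captures)
```

The resulting index lists, for each URL, every capture with its harvest date, the file it lives in and its byte offset. This is roughly the role a CDX index plays for playback tools, and it is why content spread across many WARCs can still be browsed as one archive.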
Where do we go from here?
The IIPC conference served as an excellent outlet to inform the community about our reflections and to get some valuable feedback on the topic. We are now planning to publish the study and hopefully get even more responses on our approach from you, the community.
Over the next year, we will design and implement a workflow to start ingesting the web archive files into our long-term digital preservation system. This, of course, is a tremendous endeavour, considering the sheer size of this new collection alone. But all things considered, we think it’s going to be worth it in the long run.
And what are your thoughts? Are we being over-careful in our approach? Or are we right to try to safeguard this content according to the best long-term preservation standards?
Acknowledgements
The Career Development Fund is sponsored by the DPC’s Supporters who recognize the benefit and seek to support a connected and trained digital preservation workforce. We gratefully acknowledge their financial support to this programme and ask applicants to acknowledge that support in any communications that result. At the time of writing, the Career Development Fund is supported by Arkivum, Artefactual Systems Inc., AVP, Ex Libris, Iron Mountain, Libnova, Max Communications, Preservica and Twist Bioscience. A full list of supporters is online here.