IIPCWAC23 – The ‘Crème de la Crème’ of Web Archiving work

Barbara Fuentes

Last updated on 4 July 2023

Career Development Fund

Barbara Fuentes is Web Archiving Officer at National Records of Scotland. She attended the IIPC Web Archiving Conference 2023 with support from the DPC Career Development Fund, which is funded by DPC Supporters.

I recently received a Career Development Fund Grant to attend the IIPC Web Archiving Conference in Hilversum. The conference was held at the colourful Institute for Sound and Vision and KB, National Library of the Netherlands.

The IIPC Web Archiving Conference brings together web archiving practitioners from all over the world to share ideas about the challenges and solutions in the future of web archiving.

Netherlands Institute for Sound & Vision | Beeld & Geluid

Stimulating and challenging concepts

This year’s theme was Resilience and Renewal. The conference offered thought-provoking talks about the very topical theme of Artificial Intelligence (AI) and its pros and cons. The director of the institute welcomed everyone by showing two speeches: one written by himself and the other by AI. Both speeches were so similar that it was impossible to guess who wrote which!

The first keynote speaker was Eliot Higgins from Bellingcat. He spoke about their work training people (especially journalists) to identify the true source of information in conflicts. They started with the war in Syria in 2011, and since then the organisation has grown and now focuses on conflicts in Ukraine and Yemen. One main aspect of their training is how to geolocate videos posted by online communities and use metadata to prove their authenticity. They provide datasets for human rights organisations so key witnesses can be protected, and they collaborate with law enforcement, making materials available to them. Eliot asked the audience who should preserve these datasets, e.g. Bellingcat or records organisations? There are still conversations to be had about legal questions, copyright issues, etc.

Makiba Foster, lead of the Archiving the Black Web project, delivered a talk on Renewal in Web Archiving: Towards More Inclusive Representation and Practices. Makiba asked why there are no Black web archivists. We all need to do more to support the participation of minority groups in web archiving and the creation of digital communities. She asked us to think about what type of institutions do web archiving and how to recruit our practitioners to make institutions more inclusive. I found this quote from the session a good takeaway: “To collect, record, and archive aspects of the world is an intentional act, one that typically benefits those who have the power to decide what should be collected.”

The final keynote speaker was Marleen Stikker from Waag Futurelab. She spoke about the lack of public values in the public domain. The internet was originally based on fair social roots. However, it’s now controlled by big tech companies with a specific business model based on extraction of value. Human values and democracy could be at stake if it’s not regulated to some level. Public values should be at the centre of the public digital domain. Technology is neither diverse nor neutral; we need to think about who is behind every tool and what biases it may carry. She said that she was not against AI but emphasised the need for greater legislation and moderation.

The practical stuff

I chose to attend the sessions below because they were directly relevant to my role as the Web Archiving Officer at National Records of Scotland (NRS). My current job focusses on the scheduling and quality assurance (QA) of our crawls.

The Auto QA Process at UK Government Web Archive with Kourosh Feissali and Jake Bickford.

Like NRS, The National Archives (TNA) considers the quality of captures a high priority, and every crawl is quality assured. The speakers explained how QA has evolved for them from a visual role to a more semi-technical one: it now involves checking log files, running macros, using tools like Screaming Frog, manipulating data, comparing captures to the live site, checking PDFs for hyperlinks and carrying out visual checks. However, this evolution has not been enough: collections keep growing, the frequency of crawls has increased, and the process had to be run locally on one person’s machine, with some manual preparation still required.

Their new tool, Open-Auto-QA, unifies multiple processes and runs them automatically in the background (in AWS) before the QA team start manual QA.

How it works:

Continuously listens for updates to JIRA issues
Updated issues are checked for triggers (each sub-process of Open-Auto-QA will run if certain conditions are met)
The outputs of each sub-process are sent to the JIRA issue, along with comments that give context for next time
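
The three steps above could be sketched roughly as follows. This is only an illustration of the trigger-and-comment pattern, not TNA's code: the `Issue` class, the status and label values, and the sub-process functions are all hypothetical, and a plain in-memory object stands in for a real JIRA issue.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Issue:
    """Hypothetical stand-in for a JIRA issue that tracks one crawl."""
    key: str
    status: str
    labels: set[str] = field(default_factory=set)
    comments: list[str] = field(default_factory=list)


# Each sub-process pairs a name with a trigger condition and the work to run.
# The conditions and outputs below are invented for illustration.
SubProcess = tuple[str, Callable[[Issue], bool], Callable[[Issue], str]]

SUB_PROCESSES: list[SubProcess] = [
    ("crawl-log-analysis",
     lambda issue: issue.status == "Crawl complete",
     lambda issue: f"log analysed for {issue.key}"),
    ("pdf-parsing",
     lambda issue: "parse-pdfs" in issue.labels,
     lambda issue: f"PDF links extracted for {issue.key}"),
]


def handle_update(issue: Issue) -> list[str]:
    """Run every sub-process whose trigger matches; record output as a comment."""
    ran = []
    for name, should_run, run in SUB_PROCESSES:
        if should_run(issue):
            issue.comments.append(f"[{name}] {run(issue)}")
            ran.append(name)
    return ran
```

Calling `handle_update` each time an issue changes would leave a trail of comments on the issue for the QA team to read before they start manual checks.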

When the team start QA, crawls have already been analysed and patched, leaving more time to investigate difficult issues. Open-Auto-QA consists of three separate processes:

Crawl log analysis on every crawl
Screaming Frog/Heritrix comparison on demand
PDF parsing on demand
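
To give a flavour of what crawl log analysis can involve, here is a minimal sketch that tallies fetch status codes and lists the URLs that failed. The talk did not describe how Open-Auto-QA does this, so the function name and the failure rule are my own assumptions. It also assumes the usual Heritrix crawl.log layout, where each line is whitespace-separated and starts with the timestamp, fetch status code, size and URI.

```python
from collections import Counter


def summarise_crawl_log(lines):
    """Count fetch status codes and collect (status, uri) pairs that failed.

    Assumes field 2 is the status code and field 4 is the URI.
    A status of zero or below, or of 400 and above, is treated as a failure;
    that threshold is an illustrative choice, not a documented rule.
    """
    status_counts = Counter()
    failed = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])
        except ValueError:
            continue  # skip lines whose status field is not numeric
        status_counts[status] += 1
        if status <= 0 or status >= 400:
            failed.append((status, fields[3]))
    return status_counts, failed
```

The function accepts any iterable of lines, so an open log file can be passed in directly.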

The Human in the Machine: Sustaining a Quality Assurance Lifecycle at the Library of Congress (LOC), Grace Bicho

Like TNA, LOC emphasised the importance of maintaining an ecosystem of quantitative and qualitative methods to assess web archive quality, particularly as collections continue to grow. LOC’s QA process is mainly based on the grounded theory for QA in three dimensions by Dr Reyes Ayala. Grace explained Dr Reyes Ayala’s principles of:

Archivability: degree to which the intrinsic properties of a website make it easier or more difficult to archive
Relevance: pertinence of the contents of an archived website to the original website
Correspondence: degree of similarity, or resemblance, between the original website and the archived website, e.g. visual, interactional, completeness

LOC mostly focus on ‘Correspondence’. They rank the issues they find from 1 to 5 depending on their severity, e.g. 1=blocker, 2=critical, 3=high, 4=medium and 5=minor.
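
A 1 to 5 scale like this is easy to model in code. The sketch below is my own illustration, not LOC's tooling: the `Severity` names follow the labels quoted above, while the `triage` helper and the issue descriptions in the usage example are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity scale as described in the talk: lower number, more severe."""
    BLOCKER = 1
    CRITICAL = 2
    HIGH = 3
    MEDIUM = 4
    MINOR = 5


def triage(issues):
    """Return (severity, description) pairs ordered most severe first."""
    return sorted(issues, key=lambda pair: pair[0])


# Hypothetical issues logged against one capture.
found = [
    (Severity.MINOR, "favicon missing"),
    (Severity.BLOCKER, "home page does not replay"),
    (Severity.HIGH, "images missing on news pages"),
]
ordered = triage(found)
```

Because `IntEnum` members compare as integers, sorting puts blockers at the top of the list without any extra mapping.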

It was reassuring to listen to Grace’s list of common issues, as they were quite similar to the ones found by NRS, e.g. missing images, AV and styling, content behind paywalls, discrepancies between dates of capture and publication, pagination not working and interactive content not replaying.

LOC have trained 51 staff members in the QA process. It differs from NRS in that they do not patch post-crawl; instead, they use the information recorded in JIRA to adjust the configuration of future crawls, so the process is more ‘front-loaded’.

Workshop: Browser-based crawling for all: Getting Started with Browsertrix Cloud, by Andrew Jackson (British Library), Anders Klindt Myrvoll (The Royal Danish Library) and Ilya Kreymer (Webrecorder)

Ilya guided us through the process of running our own Browsertrix crawl. My crawl was successful in capturing a site heavy on dynamic content and videos. We also heard from organisations taking part in a pilot using the tool, e.g. Anders spoke of the success of this crawler and his wish to make it more scalable.

Finally, at the risk of sounding a bit like an Oscar winner, I’d like to thank the IIPC for an amazing conference and the prospect of watching the recordings of the sessions I missed. I am also very grateful to the DPC Career Development Fund for awarding me the grant and, lastly, to NRS for granting the time and for their contribution to personal expenses. Thank you to the attendees, too. Nothing compares to the benefits of talking face-to-face with a colleague in a similar role at a different organisation and ‘comparing notes’ about challenges and potential solutions.

Acknowledgements

The Career Development Fund is sponsored by the DPC’s Supporters who recognize the benefit and seek to support a connected and trained digital preservation workforce. We gratefully acknowledge their financial support to this programme and ask applicants to acknowledge that support in any communications that result. At the time of writing, the Career Development Fund is supported by Arkivum, Artefactual Systems Inc., AVP, Ex Libris, Iron Mountain, Libnova, Max Communications, Preservica and Twist Bioscience. A full list of supporters is online here.
