Michael Popham

Last updated on 4 May 2023

On 15th March this year, the DPC organized a short webinar on the theme of AI for digital preservation. Recordings of each of the presentations are available via the DPC website for any registered user, but this blog post is an attempt to summarize the key points made by each of the speakers for the benefit of those who do not have access to the recordings.

The event opened with a joint presentation from Tobias Blanke and Charles Jeurgens, both from the University of Amsterdam, who offered a summary of their recently published article Archives and AI: An Overview of Current Debates and Future Perspectives (which was co-authored by their colleagues Giovanni Colavizza and Julia Noordegraaf). They undertook a survey of fifty-three articles published since 2015 which broadly examined the potential for the use of AI/machine learning in traditional recordkeeping practices within archives.

Charles began by discussing the recent changes in archives, moving from the predominantly manual processing of records to treating archives as repositories of machine-readable data (often at the risk of losing key information relating to provenance, appraisal, contextualization, transparency and accountability). Charles then went on to look at how the traditional recordkeeping lifecycle, with its focus on managing containers of information over linear time, has been replaced by an approach based on the continuous interaction within and between the information held in (digital/digitized) records. Their literature survey revealed that most recent discussions of AI and recordkeeping in archives have looked at accelerating work such as indexing and information retrieval, whilst much less has been written about the potential applications of AI in the creation and capture of records.

Tobias then looked at the practical application of AI in archival work. In the large-scale archival integration project “The European Holocaust Research Infrastructure: Archives as Big Data”, AI tools (e.g. the Stanford NLP toolkit) were used to extract not just entities such as place, time, and organization, but also relationships between those entities. This work was very promising, although it resulted in a fifteen-fold increase in the volume of data to be processed. The challenge that remained was how to transform the findings of research-led activities such as this into lasting archival theory, practice and infrastructure. Tobias noted that their literature survey had revealed the need to develop a stronger ethical framework around AI techniques, and a better understanding of their (potential) impact on research practices. He concluded by observing that we should not just be thinking in terms of “What can AI do for archives?” but also, “What can archives do for AI?” – and that archivists’ expertise in provenance, appraisal, contextualization, transparency and accountability could provide valuable input into the development of AI tools.
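
The precise pipeline used in the EHRI work is not described in the talk, but the general first step – running a named-entity recognizer over digitized archival text – can be sketched with Stanza, the Stanford NLP Group’s Python library. The sample sentence and model choice below are illustrative assumptions, not material from the project:

```python
# A minimal sketch of entity extraction over archival text, assuming Stanza
# (the Stanford NLP Group's Python library); the EHRI project's actual
# pipeline and models are not described in the talk and will differ.
import stanza

stanza.download("en")                                   # fetch English models once
nlp = stanza.Pipeline("en", processors="tokenize,ner")  # tokenization + named-entity recognition

# Hypothetical sample text, not drawn from the EHRI corpus.
text = "The International Red Cross transferred records from Geneva to London in 1943."

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.type)   # labels such as ORG, GPE (place), DATE
```

Linking the extracted entities to one another (the relationships Tobias mentioned) requires a further relation-extraction step, which is one reason the volume of derived data can grow so quickly.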

The next speaker was Yunhyoung Kim, from the University of Glasgow, who chose to look at what digital preservation can do for AI (rather than the other way around!). Yunhyoung began by describing her background: starting with digital documents and automated metadata extraction using Machine Learning (ML), then moving on to domain models and designated communities, followed by personal digital archives and issues around privacy, and most recently teaching humanities students about responsible AI. She then discussed the use of visual characterizations of musical sources, and what inferences can be drawn from visual similarity. Yunhyoung encouraged the audience to give her some definitions of “digital preservation”, and then compared these with an attempt made by ChatGPT.

Yunhyoung discussed her view that research in digital preservation appears to be declining (based on an analysis of publications in Web of Science), but suggested that this effect may be the result of research activity moving from academia to industry – which can make it difficult to find and use. She then suggested that research around AI is even more dispersed. AI draws on data, and typically uses either a learning algorithm or curated knowledge to create a model; preserving such data and curated knowledge is definitely an area of interest to digital preservation. Yunhyoung suggested that AI is no longer a choice because it is now used everywhere, and so we have to ask ourselves how digital preservation can help steer AI in the right direction. As AI becomes increasingly involved in the creation of born-digital content, we will need to consider how (if at all) we can distinguish between material generated by humans and that created by AI. She then presented some examples of poetry and music, and asked the audience if they could identify which of the examples were created by humans, and which by AI. Yunhyoung concluded by suggesting that the digital preservation community needs to contribute to AI literacy – noting that bias is not just about data, but about representation across the entire AI workflow.
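
As a purely illustrative example of what preserving the data and curated knowledge behind a model might involve, the sketch below stores a trained model alongside a small provenance record. The field names and the referenced training_set.csv are assumptions, not anything described in the talk:

```python
# Illustrative sketch only: one way to keep a trained model together with basic
# provenance so both can be preserved as a unit. Field names and the referenced
# training_set.csv are hypothetical.
import hashlib
import json
import platform
from datetime import datetime, timezone

import joblib
import sklearn
from sklearn.linear_model import LogisticRegression

def sha256_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Placeholder model standing in for whatever was actually trained.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
joblib.dump(model, "model.joblib")

provenance = {
    "model_file": "model.joblib",
    "model_sha256": sha256_of("model.joblib"),
    "training_data": "training_set.csv",        # hypothetical dataset reference
    "created": datetime.now(timezone.utc).isoformat(),
    "python_version": platform.python_version(),
    "scikit_learn_version": sklearn.__version__,
}

with open("model.provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```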

Jeanne Kramer-Smyth then talked about how the World Bank Group (WBG) has been using AI (specifically ML) in the automated appraisal of video recordings. Their Archives Video Appraiser (AVA) tool generates archival recommendations with an accuracy of approximately 85%, which has resulted in a saving of 1.5 person days per month. AVA was developed to facilitate the appraisal of internally generated born-digital moving image records, and help automate the decision of whether a record should be selected for ingest into WBG’s digital vault, or be destroyed. The hope was that the introduction of such a tool would lead to reduced decision-making and transfer times, whilst simultaneously increasing the accuracy of appraisal decisions.

Jeanne described the context within WBG, where staff can request that video conference meetings be centrally recorded and retained. Batches of these recordings (grouped by month) are then sent to the archives team for appraisal – a process which determines which videos are preserved, and which destroyed. Prior to the creation of AVA, this was a wholly manual process, and most of the appraisal decisions were based on the title of the video recording (which was typically taken from the title of the original event, and so may or may not have been very informative!).

AVA was developed following extensive collaboration between WBG archivists and IT staff. They used 1000 videos as training data, combined with the manual appraisal decisions – meaning they could subsequently compare AVA’s recommendations against the manual decisions. This was an iterative process, heavily reliant on the existing metadata. What they discovered was that they needed a larger training data set, and so they tried again with 3000 videos – which was much more successful. AVA generates preliminary appraisal decisions which are then accepted or rejected by a (human) archivist, and the video recordings are subsequently allocated to staging areas for preservation or deletion.
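
Jeanne did not go into the technical detail of how AVA is built, but the general shape of the approach – learning a keep/destroy recommendation from existing metadata (largely titles) and past manual appraisal decisions – might look something like the hedged sketch below. The titles, labels, and model choice are all illustrative assumptions:

```python
# Illustrative sketch only – not AVA's actual implementation. A simple text
# classifier trained on video titles and the archivists' past decisions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: titles plus the corresponding manual appraisal decisions.
titles = [
    "Regional portfolio review – country office",
    "Test meeting, please ignore",
    "Board briefing: climate strategy",
    "Empty recording",
]
decisions = ["preserve", "destroy", "preserve", "destroy"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(titles, decisions)

# Preliminary recommendation for a new recording; a human archivist still
# accepts or rejects it before anything is preserved or deleted.
print(model.predict(["Quarterly portfolio review – Region X"]))
```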

It only takes AVA ten minutes to generate recommendations for 200 videos, and the tool is particularly effective at detecting empty and/or soundless videos (which would not be selected for preservation). A human archivist then needs only about thirty minutes to verify the recommendations or take any corrective action, and cumulatively this results in a saving of 1.5 person days per month for WBG (whilst also reducing manual errors and increasing the accuracy of appraisal decisions).
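
Again, the talk did not describe how AVA identifies such recordings, but one plausible approach – flagging files with no audio stream, or with near-silent audio, using ffprobe and ffmpeg – is sketched below; the -60 dB threshold and the file name are assumptions:

```python
# Illustrative sketch only: flag recordings that have no audio stream, or whose
# mean volume is close to silence, using ffprobe and ffmpeg's volumedetect filter.
import re
import subprocess

def has_audio_stream(path: str) -> bool:
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=index", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True)
    return bool(out.stdout.strip())

def mean_volume_db(path: str) -> float:
    # volumedetect prints its summary (e.g. "mean_volume: -67.3 dB") to stderr.
    out = subprocess.run(
        ["ffmpeg", "-i", path, "-af", "volumedetect", "-f", "null", "-"],
        capture_output=True, text=True)
    match = re.search(r"mean_volume:\s*(-?\d+(?:\.\d+)?) dB", out.stderr)
    return float(match.group(1)) if match else float("-inf")

def looks_soundless(path: str, threshold_db: float = -60.0) -> bool:
    # -60 dB is an assumed cut-off, not a value from the talk.
    return (not has_audio_stream(path)) or mean_volume_db(path) < threshold_db

print(looks_soundless("recording.mp4"))  # hypothetical file name
```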

The WBG believes that AVA is extremely effective and has saved time, but notes that identifying a representative set of training data initially required a high investment of time. Moreover, increasing the accuracy of AVA’s decisions required ongoing, human-driven, and iterative training. Jeanne acknowledged that their use case had involved relatively straightforward criteria, and that enabling AVA to tackle more formats and more complex criteria would present greater challenges. She concluded by emphasizing the importance of remaining vigilant to reduce any bias included in their training data.

The next speaker was Sarah Higgins from Aberystwyth University, who discussed a collaborative project to scope an AI-enabled repository for Wales, involving the National Library of Wales and the Royal Commission on the Ancient and Historical Monuments of Wales (in addition to several other partner organizations). The project aimed to address the challenges arising from having a number of data silos spread across Wales by developing AI-enabled cross-repository knowledge discovery. This required an open conversation with creator and custodian communities, not least to scope a new repository for data with no obvious home, and to identify all data likely to be of interest to arts and humanities researchers.

The project began by establishing a number of focus groups – both online and in-person – to gather information which could be fed back to the developers. They considered how AI might be used to improve discovery, and in particular how supervised ML might be used for classification and regression (e.g. image classification), whilst unsupervised ML might be used for tasks such as big data visualization. When asked what they perceived as the risks of AI, many of the focus group respondents identified issues around bias, with results being potentially unreliable or untrustworthy – and so the project team needed to think about how they might address such concerns.
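
As a hedged illustration of the unsupervised side of this, the sketch below clusters a handful of invented catalogue descriptions and projects them into two dimensions – the kind of step that might sit behind a big data visualization. The descriptions, cluster count, and tooling are assumptions rather than anything the project reported using:

```python
# Illustrative sketch only: cluster catalogue descriptions and project them to
# 2-D coordinates suitable for plotting. Sample data and parameters are invented.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Chapel photographs, Ceredigion, 1930s",
    "Oral history interviews with slate quarry workers",
    "Estate maps and tithe schedules",
    "Digitized parish registers, Anglesey",
]

X = TfidfVectorizer().fit_transform(descriptions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X.toarray())

for text, label, (x, y) in zip(descriptions, labels, coords):
    print(f"cluster {label} at ({x:+.2f}, {y:+.2f}): {text}")
```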

The developers created a prototype technical architecture, using DSpace. They incorporated several ML algorithms which used audio analysis, text analysis, and image analysis to generate enhanced metadata, which could be used in combination with other sources (e.g. metadata from Wikidata) to provide an enhanced discovery service. Sarah then showed some examples of the analysis tools, and discussed how they had coped with Welsh-language sources.
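
Sarah did not detail how the Wikidata enrichment was wired into DSpace, but the basic pattern – looking up an entity mentioned in the enhanced metadata against Wikidata’s public SPARQL endpoint – can be sketched as follows; the place name and query shape are illustrative assumptions:

```python
# Illustrative sketch only: look up a place name from the enhanced metadata
# against Wikidata's public SPARQL endpoint to fetch an identifier and coordinates.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

query = """
SELECT ?item ?itemLabel ?coords WHERE {
  ?item rdfs:label "Aberystwyth"@en ;
        wdt:P625 ?coords .                       # P625 = coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,cy". }
}
LIMIT 5
"""

resp = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "metadata-enrichment-sketch/0.1"},  # the endpoint expects a User-Agent
    timeout=30,
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row["coords"]["value"], row["item"]["value"])
```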

In summary, Sarah reported that the project had amply demonstrated the need and appetite for an AI-enabled distributed repository across Wales’s HEIs and GLAM institutions. The prototype technical methodology had enabled the project to identify and test some suitable tools, and they had also established a need for additional trusted storage (notably for ‘estray’ data). Other findings included a list of five priority data types (i.e. databases, text, audio, images, and websites), the need for a mandatory Dublin Core profile, the use of FAIR data principles and Creative Commons licences, an ethical governance framework, and the need for sustainability to be built in from the outset. Overall, it was felt that an accuracy of 90% from the AI model was acceptable, although participants were keen to ensure that a clear distinction was identifiable between human-created and AI-derived data and metadata. Future development work would include the creation of a bilingual AI-enhanced interoperable discovery layer, and an API for the re-ingest of AI-enhanced metadata, which would bring a number of expected benefits – notably, auto-translation of resources and metadata, more sophisticated and intuitive discovery across collections (metadata and data), and improved name, placename, and geospatial recognition.

The final session consisted of a presentation from Abigail Potter and Meghan Ferriter, both from the Office of the Chief Information Officer of the Library of Congress (LC), who discussed “A call for a community framework for implementing ML and AI”. Abigail began by explaining the background to the framework, and reminding everyone that they are still keen to gather new perspectives and input. Meghan said that they had started by asking themselves the question, “How might we develop a shared framework that helps our organizations and teams responsibly operationalize AI?”. They had recognized that there was a lot of enthusiasm for adopting and implementing AI, but also a lot of uncertainty about such technologies and their potential impact (both positive and negative). They also acknowledged that the LC was not alone in addressing such challenges and concerns. Through their work, they had developed a number of recommendations around ML, notably: take small and specific steps (rather than adopting systems wholesale); share knowledge and lessons learned; centre people and integrate a range of knowledge and expertise; and work together with others on these challenges.

Abigail then presented the framework of planning tools for implementing AI as it is currently being used in LC Labs. They have developed an organizational profile, and working through it had helped them to be specific about their use cases and goals: enabling discovery at scale and use (by researchers), developing business use cases, and augmenting user services. This approach had been coupled with a risk and benefit analysis matrix, which they used to articulate user stories, determine different risk categories, and clarify their ideas of “quality” in their training data. More information about their work and experiments can be found at https://labs.loc.gov/work/experiments/machine-learning/

Towards the end of March 2023, the Executive Director of the DPC, William Kilbride, visited Australia and New Zealand to introduce the work of the Digital Preservation Coalition and to celebrate the opening of the Australasian and Asia-Pacific Office. This included a visit to the National Film and Sound Archive in Canberra, where a watch-party and panel session took place to explore some of the themes raised at the AI for digital preservation webinar.

The event was opened by Patrick McIntyre, CEO of Australia’s National Film and Sound Archive (NFSA), which joined the DPC in 2021. Patrick began by outlining the mission of the NFSA, and the challenges arising from their shift to handling ever-increasing volumes of digitized and born-digital materials, in particular the difficulties of preserving such data. Patrick was followed by William Kilbride, who proposed that everyone was at the event because digital materials have value, and ongoing access to such materials means overcoming the barriers that arise from technological and organizational change. William then stressed the benefits of being part of the worldwide digital preservation community, and set the scene for a discussion around AI.

The panel consisted of Yaso Arumugam (Assistant Director General for Data and Digital, National Archives of Australia), Keir Winesmith (Chief Digital Officer, NFSA), and Ingrid Mason (Co-convenor of AI4LAM regional chapter for Australia and Aotearoa New Zealand), with William chairing the discussion.

William began by posing the question: when it comes to training data, is more always better (or sometimes more problematic), and is there potentially a role for libraries and archives in providing better and more structured data? In response, Ingrid highlighted the benefits of sharing high-quality training data, and suggested that there is now a real opportunity for collaboration and the reuse of training datasets. She emphasized that she wasn’t just talking about collaboration between institutions and nations, but also the benefits that could be derived from greater interdisciplinary collaboration.

William noted that in the White Paper that accompanied the recent release of ChatGPT-4, there was surprisingly little detail given about the training data that had been used. Yaso emphasized the need to ensure that training data comes from a reliable source, especially if you want to reflect diversity in your dataset. She suggested that AI developers need to be very aware of what they are trying to achieve, and of the extent to which they may need to update and reuse their training data as AI tools develop. Yaso also noted that individuals will always have some kind of bias, so it is essential that AI developers think carefully about governance and the re-training of AI tools. Keir agreed, suggesting that we all need to be aware of potential biases in our data selection, and the implications that might follow from this.

Next, William noted that AI has been talked of as “the next Big Thing” for many years, and asked what had changed recently to bring AI back to the forefront. Was this an indication of a marked change in the capabilities of AI, or a change in the surrounding hype? The panellists acknowledged that such hype tends to come in cycles, and that AI had been talked about for many years. However, Yaso noted that AI implementations have become more accessible: the technology now exists to enable more of us to interact with large volumes of data, and is no longer restricted to those with large budgets. Keir felt that there has been a material change in the available technology (and in the ability to work with larger volumes of data), and that this is the first time that the hype and delivery cycles have aligned. He noted that the tools to develop AI have been around for a while, but now more of us have easier access to them and can work with data at scale. Yaso remarked that we will continue to become data rich, and so ML/AI technologies will continue to develop.

A member of the audience asked the panel if there might be benefits for anyone involved in advocating for digital preservation (DP) within their organization if they could attach DP concerns to other hype moments, such as those around AI. Ingrid restated her view that collaborating with others was always likely to bring fresh perspectives and new ideas. Keir noted that selling DP is hard in most contexts, and whilst linking to the current hot topics might be advantageous, it might be more beneficial to try to align DP concerns with the organizational mission and vision, and show how they relate. William remarked that advocacy work in DP is never-ending, and that the key skill is knowing when to tap into the right point of the hype cycle and explain how DP is relevant.

Another member of the audience asked whether the apparent interest in adopting AI within the cultural sector stemmed from the broader pressure “to do more with less”, and whether the rush to introduce such technologies risked ignoring the problems arising from underlying/inherent biases in our collections and training data sets – how might we balance these pressures to ensure that AI is introduced in a sensible and ethical way? Yaso suggested that AI is most effective in settings where the issues can be defined really well, and that we all risk losing trust (in ourselves and in AI) if we implement AI badly and/or get things wrong. Keir stressed the important part that good governance can play in the introduction of AI into any organization, whilst Ingrid suggested that collaborating with others and learning from their experiences can help save time and effort (and avoid mistakes).

The session ended with a brief discussion about how complex and commercialized AI seems to have become, and the potential merits of keeping AI as open as possible. Ingrid reminded everyone not to underestimate the investment required to build, maintain, and support open infrastructure, and that collaborations between sectors (e.g. cultural heritage and commercial) could be fruitful. Keir remarked that forming consortia can be a good way to bring down/share the costs of procuring commercial solutions, and is another way to foster effective collaborations. Yaso concluded by suggesting that whether you are working with an open source or commercial AI solution, you need to understand as much as possible about the datasets you are working with, and be open to learning from the experiences of others both nationally and internationally.

Comments

Rachel Tropea
1 year ago
Great post, very informative, thanks Michael.