In June 2024 I was invited to give an opening keynote on Artificial Intelligence and Digital Preservation to a workshop of around 200 librarians organized by the International Federation of Library Associations (IFLA), sponsored by the IFLA Information Technology Section, the IFLA IT Special Interest Group in Artificial Intelligence and IFLA Preservation and Conservation Section.  The workshop had the theme ‘AI and the future of digital preservation’ and my own presentation was followed by a series of case studies:

  • Herbert Menezes Dorea Filho: "Artificial Intelligence: Situational Analysis and Digital Preservation of Archives at UFBA"
  • Pablo Gobira and Emanuelle Silva: "Using AI as Part of the Recreation Strategy in Digital Preservation"
  • Holly Chan, Lau Ming Kit Jack, and Zhang Ka Ho Eric: "Beyond Pixels: AI-driven Image Processing for Enhanced Contextualization of HKUST's Digital Images (1988-2000s) through the Applications of AI Models for Image Tagging, Object Detection, and Facial Recognition"
  • Filippo Mengoni: "Like Never Before: AI for Oral Sources. How ASR and LLMs Can Revolutionize Our View of Oral History"

This blog is my working script for the presentation.  A more polished version, along with the case studies, is due to appear in a special edition of ‘TILT’ (Trends and Issues in Library Technology) scheduled for publication in 2025.


Introduction

Thank you Ray and colleagues at IFLA for the warm welcome and giving me an opportunity to speak today.  Such has been the pace of development in AI that I think we’re all on a learning curve, so I fully expect I will be gaining more than I contribute.  I am excited about the case studies and conversations that will follow. 

I feel it’s important to start by saying that I really am here, whether you are really there or not.  I have written this presentation the old-fashioned way: reading a lot (see especially Crawford 2021 and Stokel-Walker 2024) and talking with colleagues as well as with a little experimentation of my own.  I promise none of this paper is machine-generated, though you may wish it had been.   One of the undoubted benefits of artificial intelligence is that, if you don’t like what I have to say, you are at liberty to generate an alternative version of this paper in due course. 

I have taken the unusual step of scripting my presentation. I know I speak too quickly and tend to digress. The script will keep me to time and subject. I will share the text afterwards.  By all means make notes, but the full text will be available shortly. 

I have six themes for you in the next 30 minutes. 

I want to start by outlining the digital preservation challenge as I perceive it.  That might mean we cover some familiar topics but I don’t apologize for that. 

A key concern for really any new technology - especially one with a hype curve as sheer as artificial intelligence – is to ensure we focus on our objectives.  The history of technology is littered with solutions looking for problems.    Let’s get that the right way round.

I will then explore the four ways in which digital preservation and artificial intelligence intersect, and consider whether and how we might make progress:

  1. We can use artificial intelligence tools to ‘do’ digital preservation.

I am thinking here about tools which can support ongoing digital preservation challenges such as appraisal and selection, or metadata extraction or forensics.  Spoiler, I think we’re going to have some great case studies on these themes later in this workshop.

  2. Artificial intelligence also puts pressure on digital preservation and arguably makes some of our roles and functions more important.

This will lead me to discuss misinformation and disinformation, and the aspects of digital preservation which maintain and promote ‘authentic’ knowledge.  There are some useful things to report here but there are profound challenges of capacity and capability that need urgent attention.  

  3. I want to consider the contribution that digital preservation tools and services can make to the development of artificial intelligence, whether as training data, language models or algorithms.

This may seem niche but there’s a good story to tell.  There are reasons to suppose that some of the large language models have simply grown too big.  There’s an opportunity here for us.  

  4. Finally I want to explore the possibility of preserving artificial intelligence itself, at a systemic level.

I grant you this may be a flight of fancy considering the size, complexity and incomprehensibility of the task.  But I also firmly believe that reproducibility is essential for accountability and trust. So really I am challenging the confidence that is placed in some of these tools, with an implicit call to action.  

At the end I want to sketch a simple agenda to ensure digital preservation benefits from, and contributes to, the development of better and more accountable artificial intelligence. 

I will remind you all that digital preservation is a socio-technical challenge and all the technology in the world can’t solve it alone.  The ways in which AI is embedded within law, policy and business processes will also matter, and we should not allow ourselves to be overwhelmed by tech. 

The DPC (the Digital Preservation Coalition) is a friend and ally in that conversation.

I have been allocated 30 minutes.  Please do use all the usual functions to ask questions or make notes.  I am looking forward to the discussion that will follow, not simply in this formal session but in the conversation and shared thinking afterwards, and I start with an open offer of friendship and support for your own goals. 

So I want to provoke you at the very start: what would you like to know?  What would you like to do?  Take a moment to write that down and bring it back later.

What is Digital Preservation?

Let’s start at the beginning.  Digital preservation is “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.”

There are bits to emphasize here which will frame some of what follows.  It’s a series of things not an event; it fits within a policy framework; it’s about more than backup; and it’s for as long as necessary – not forever and certainly not everything.  It’s not just about data or informational content – 'materials' is pleasantly vague – and it’s not just about digitization.  And, because digital lifecycles are short, lots of agencies which are not memory institutions or libraries in the traditional sense have a digital preservation problem.

The digital preservation problem arises in all sorts of ways – really just about every single way in which you can lose access to programs or data - and some of these challenges are themselves dynamic. So there’s not a ‘once and for all’ solution.  There’s a continuing commitment which in turn means policy and resources are as important as technology. 

Over the years tools and standards have emerged and it’s fair to say that many of the issues are tractable, but the solutions are not very scalable. Contexts make a significant difference.  So, while there’s a lot of good practice and experience, oftentimes the answer to any digital preservation question begins with the slightly disconcerting phrase – ‘it depends’.

Let me give you two examples.

Metadata – or perhaps more accurately packaging information - is never not an issue in digital preservation.  Achieving independent utility of a digital object means providing a lot of context.  There’s a risk of almost infinite regression here.  In a conceptual sense you need to document the whole of human knowledge to make a single digit meaningful. 

That’s absurd – partly because human knowledge is pretty well documented elsewhere; and also because you can anticipate a degree of background knowledge from a community of users who will know more or less about the topic. 

So, how much metadata do you need for effective preservation?  It depends. 

I am labouring this point for two reasons.  There’s a school of thought which would tell you that the archival information package is the distinguishing characteristic of digital preservation. Experience shows that the process of creating archival information packages can be arduous.  Approaches like minimal effort ingest have emerged as a way to deal with this – sorting out the metadata afterwards. 

So that’s a real problem, not one we’ve made up so that we can play with fancy new tech.  And coming this afternoon, AI tools that help with metadata creation.

A second, related issue – understanding dependencies. 

Imagine a Word document and ask whether or not you need to preserve the font.  Sometimes yes, sometimes no, and sometimes it will be embedded in the file anyway; but not always, and in those circumstances you may discover a dependency on a remote service which may or may not be available. 

Scale that up to the cloud and before you know it you have services built on services and microservices which all need to be mapped and risk assessed before you can be confident about any preservation steps.

I mention this for two reasons. 

Firstly, mapping dependencies at the scale of the cloud, risk assessing them, and firing up services that can replicate broken dependencies sounds exactly the kind of tool suite that we need, especially if we want to recreate not only data but the whole execution environment of a given application. It’s the kind of on-the-fly problem solving that AI would do very well. 
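For readers of this script, here is a minimal sketch of what that dependency mapping might look like in code. The services named are entirely hypothetical; the point is simply that transitive dependencies need to be surfaced and risk-assessed before you can be confident about any preservation action.

```python
# A toy sketch of dependency mapping: not a real preservation tool, just an
# illustration of why transitive dependencies need to be surfaced before any
# preservation action. All service names here are hypothetical.

DEPENDENCIES = {
    "word-document": ["font-service", "image-cdn"],
    "font-service": ["vendor-licensing-api"],
    "image-cdn": [],
    "vendor-licensing-api": [],
}

# Services we cannot guarantee will still exist tomorrow.
EXTERNAL = {"vendor-licensing-api", "image-cdn"}


def transitive_dependencies(item: str) -> set[str]:
    """Walk the dependency graph and collect everything `item` relies on."""
    seen: set[str] = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for dep in DEPENDENCIES.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    deps = transitive_dependencies("word-document")
    at_risk = deps & EXTERNAL
    print(f"'word-document' depends on: {sorted(deps)}")
    print(f"At-risk external dependencies: {sorted(at_risk)}")
```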

But it’s not all just technology. Dependence is not just technical; it’s also an economic reality.  Digital preservation teaches us that, especially on cloud platforms, data loss is an outcome of business processes and economic realities which are way beyond our control. Creating dependencies on enterprise-level AI at a time of great market turbulence should set off alarm bells.  Remember the central paradox of digital preservation systems and tools: they are susceptible to the same kinds of obsolescence as the materials they are designed to preserve.

AI Tools that help us 'do' digital preservation

Okay, you’re here for the AI, and that’s what I want to turn to now. 

In truth I have said a lot of what I need to say here already: there are some wonderful, emerging tools that are making simple but practical contributions that make digital preservation tractable. 

In a moment you will hear case studies of metadata extraction and classification in music, in image processing and audio transcription.  You will hear also about digital restoration.  These are all tools that will help and they sit within an emerging AI toolkit for digital preservation.  It’s worth sharing a few others.  These will all be linked from the text.

There’s the very practical example from the World Bank Group.  They have a long tradition of automatically recording video conferences, generating massive quantities of data.  But the reality is that not every meeting goes ahead, many meetings start late and some even finish early. So the collection of recordings includes a significant amount of dead air which clutters up the servers and makes it hard to find the right material.  Jeanne Kramer-Smyth has demonstrated simple recommender services based on Machine Learning that speed up selection and disposal, flagging empty recordings to ensure the archive is not inundated with dead air.
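This is not Jeanne's actual tool, but for readers of this script here is a minimal sketch of the underlying idea, assuming 16-bit PCM WAV recordings and a hand-picked silence threshold: flag recordings that are mostly dead air so that a human can confirm their disposal.

```python
# A minimal, hypothetical sketch of "dead air" screening for WAV recordings.
# It is not the World Bank's recommender service; it simply shows the idea of
# flagging recordings that are mostly silence for human review and disposal.

import wave
import numpy as np

SILENCE_RMS = 500       # amplitude threshold below which a chunk counts as silent
CHUNK_SECONDS = 1.0     # analyse the recording one second at a time
DEAD_AIR_RATIO = 0.95   # flag files where >=95% of chunks are silent


def dead_air_fraction(path: str) -> float:
    """Return the fraction of one-second chunks whose RMS falls below threshold."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * CHUNK_SECONDS)
        silent = total = 0
        while True:
            raw = wav.readframes(frames_per_chunk)
            if not raw:
                break
            samples = np.frombuffer(raw, dtype=np.int16).astype(np.float64)
            rms = np.sqrt(np.mean(samples ** 2)) if samples.size else 0.0
            if rms < SILENCE_RMS:
                silent += 1
            total += 1
    return silent / total if total else 1.0


if __name__ == "__main__":
    fraction = dead_air_fraction("meeting_recording.wav")  # hypothetical file
    if fraction >= DEAD_AIR_RATIO:
        print(f"Candidate for disposal: {fraction:.0%} dead air")
```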

AI has enabled advances in handwritten text recognition.  There are probably colleagues on the call better placed to describe, for example, the use of AI within Transkribus, which has been trained on digitized texts.  Joe Nockels has described using AI at the National Library of Scotland to help with the transcription of the childhood diaries of Marjory Fleming.  This is not simply a technical accomplishment: it is hard to hear the voices of female Scottish children from the early nineteenth century, so it gives access to an authentic and overlooked voice.  There are so many voices from so many digitization projects that are just waiting to be heard.

I am a fan of work at the White House Historical Association, which owns a massive but largely uncatalogued photographic collection produced from years and years of official White House photography.  By using Amazon’s ‘Rekognition’ tool, Stephanie Tuszynski has been able to identify the subjects of many thousands of photos, opening up access to a collection of around 300,000 images of daily life at the White House. Essentially everyone who visited the White House in the second half of the twentieth century is there.
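By way of illustration only, and without claiming this is the Association's actual pipeline, the kind of call involved looks something like this sketch using the boto3 client for Rekognition; the filename is hypothetical.

```python
# A hedged sketch of face identification with Amazon Rekognition's
# RecognizeCelebrities API applied to a scanned photograph. Illustrative only;
# this is not a description of the White House Historical Association's workflow.

import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")


def tag_photo(path: str) -> list[str]:
    """Return the names of recognisable public figures found in one image."""
    with open(path, "rb") as image_file:
        image_bytes = image_file.read()
    response = rekognition.recognize_celebrities(Image={"Bytes": image_bytes})
    return [celebrity["Name"] for celebrity in response["CelebrityFaces"]]


if __name__ == "__main__":
    print(tag_photo("white_house_scan_0001.jpg"))  # hypothetical filename
```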

Also supporting access, but at a different kind of scale, Leontien Talboom, previously at the National Archives in London, has sketched out a practical guide to computational access for digital archives.  This is important because it moves the dial at a service level, over and above individual case studies.  As with the Marjory Fleming case, we also need to provide access to collections if we are to move on.  Leontien’s work helps set and manage the expectations and needs of readers who don’t simply want access to individual documents in the reading room. 

Access like this will be the future: when users bring their APIs to our AIPs.

These are just a few examples and I am looking forward to having a deep dive in a moment with our other speakers. 

There’s a lot more we could be doing.  For example, there are well-developed tools and processes to assess fixity.  Checksums will quickly tell you that a file has changed, but they make no comment on how egregious that change is or what might have caused it.  So it’s your worst nightmare as a manager – and the question is whether we can use AI tools to do better: to identify and assess the scale of the change, and to make an informed assessment of how that change came about. 
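To make that limitation concrete, here is a minimal sketch of a conventional fixity check: it can report that a file has changed, and nothing more.

```python
# A minimal illustration of the limitation described above: a fixity check with
# SHA-256 can tell you *that* a file changed, but nothing about how serious the
# change is or how it happened. That gap is where smarter tooling could help.

import hashlib


def sha256(path: str) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(path: str, recorded_checksum: str) -> None:
    current = sha256(path)
    if current == recorded_checksum:
        print(f"{path}: fixity intact")
    else:
        # This is all a checksum can say: changed, cause and severity unknown.
        print(f"{path}: CHANGED (expected {recorded_checksum[:12]}..., got {current[:12]}...)")
```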

So these emerging tools are welcome, and there’s a lot more we can be doing, provided we focus clearly on real digital preservation objectives and the obstacles that prevent us doing a better job.  

It’s worth remembering something of the context of our work, especially as it relates to the scale of the problem.  Time and money turn out to be the biggest challenges to preserving our digital heritage.  There are lots of examples, but here’s a library example you may find familiar in your own context.

The UK Web Archive at the British Library is gathered under statute by an agency established for that purpose. In 2017 the DPC studied the impact of changed regulations on the web archive, which we noted had grown from 365,000 collections in 2013 to almost 14,000,000 by 2018: a thirty-eight-fold increase which has not been met with an equivalent increase in funding.  We need all the tools we can get just to stand still.

Digital Preservation and the Outputs of AI

It seems entirely obvious that digital preservation’s interaction with AI will also be about the preservation of content which AI has helped to generate.  There’s a wall of content coming our way and how we handle it could have profound implications.

There are numerous examples, comic and tragic, of AI being used to simulate text and images that seem convincing, which are then lifted out of context. If we’re not careful these will find their way into official and legal processes and form a completely erroneous historical record. 

You should look up what happened when Roberto Mata sued Avianca Airlines in March 2023.

Mata’s lawyers filed a suit which, unbeknown to lead counsel, had been compiled with a little help.  It quoted key legal opinions and judgements demonstrating that Mata’s was an open-and-shut case, and that the airline should settle before the case reached court.  It was a surprising reversal when the judge and the defence team asked Mata’s team for copies of these previous rulings, which they couldn’t find because they did not exist. Mata’s lawyer was basing his case on bogus decisions with bogus quotes and bogus citations. ChatGPT had simply guessed the cases into existence.  

Judge Kevin Castel fined Mata’s legal team, ruling that they had abandoned their responsibilities to the court.  His judgement described the ‘gatekeeping role’ of attorneys to ensure the accuracy of their filings.  It’s not an isolated case, and later that year senior judges had to issue renewed guidance to attorneys about the use of generative AI in court. 

The Mata case is a text-book example precisely because the judge spotted the error and stepped in before it went any further.  But that’s not always going to happen.  Bogus records will be cited and sooner or later, probably already, the courts and other regulators will accept them onto the official record.

This is misinformation – perhaps a lazy or over-worked assistant reaching for a cheap solution and not fully understanding the implications of their actions.  But they made no attempt to cover their tracks and made an immediate and fulsome apology. 

Imagine what will happen when someone sets out deliberately to create and then insert false records into archive and library systems.  It may be years before such fraud is detected.  So my point is that we have to be working to protect the record.

The question arises: how do you detect AI-generated nonsense, whether deliberate fakes or lazy simulacra?

I have three reasons to be pessimistic about the prospects of protecting the record from these kinds of disinformation and misinformation. 

For one thing, there’s just the brutal reality that digital preservation roles are not funded and not taken seriously.  We have extraordinary investment in cybersecurity, and rightly so, but laughably poor investment in information security.  We already face a crisis of public confidence in government and the transparency of its processes. 

So my strong sense is that the weaknesses of our ingest and transfer processes need to be declared a public emergency.

Even if our appraisal and selection of records were properly funded we would still struggle, because the volume of fake news and the ease with which it is created have been transformed.  The deluge is here already.

Is there an AI solution to this AI-generated problem? 

For sure it is possible to code fakes in such a way as to make detection easy, and that might be a direction of travel for regulation.  But that sounds like an invitation to an arms race, or at least a way to lower our guard: because as surely as there are safety features built in, so there will be workarounds.

And there are AI tools, like GPTZero, that already claim to spot AI-generated text.  These next-generation plagiarism detection tools misidentify about 1 in 20 examples of human-generated text as machine-generated: that proportion needs to come down if they are to be genuinely useful.  Even more alarmingly, they have a tendency to reverse engineer hidden biases within the large language models, so non-native speakers are reportedly ten or twelve times more likely to be suspected of plagiarism.
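A quick back-of-the-envelope calculation shows why a 1-in-20 false positive rate matters at scale. The prevalence and detection figures below are illustrative assumptions, not measurements of any particular tool.

```python
# A back-of-the-envelope illustration of why a 1-in-20 false positive rate is a
# problem at scale. The detection rate and prevalence figures are purely
# illustrative assumptions.

documents = 10_000
ai_share = 0.10             # assume 10% of submissions are machine-generated
detection_rate = 0.80       # assume the detector catches 80% of those
false_positive_rate = 0.05  # roughly "1 in 20" human texts wrongly flagged

true_positives = documents * ai_share * detection_rate                # 800
false_positives = documents * (1 - ai_share) * false_positive_rate    # 450

precision = true_positives / (true_positives + false_positives)
print(f"Share of flagged documents that really are machine-generated: {precision:.0%}")
# About 64%: roughly one flag in three points at an innocent human author.
```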

Digital Preservation Supporting AI

Can digital preservation support the development of AI?  It may be less obvious, but I think yes: digital preservation also has a role in the development of AI, and it’s not unconnected to the need to verify authentic content. I am particularly thinking about language models here.

Large language models underpin text simulators like ChatGPT.  They are voracious eaters of content and it’s hard not to be impressed, if a little scared, by the pace of their growth.  They provide the statistical basis for transformer tools that then essentially make informed guesses about which word should come next in a sequence. The larger the sample of text they can access, the better, or at least the more convincing, their performance will be at guessing which string of words to deliver.  That’s the basic model of the Generative Pre-trained Transformer.
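For readers of this script, here is a toy sketch of that 'informed guess': a handful of invented scores turned into probabilities, from which the next word is sampled. A real model does this over tens of thousands of tokens with billions of parameters; everything below is hand-made for illustration.

```python
# A toy sketch of the "informed guess" at the heart of a generative
# pre-trained transformer. The candidate words and scores are invented
# purely to show the sampling step; a real model learns them at vast scale.

import numpy as np

candidate_next_words = ["preservation", "archive", "banana", "library"]
scores = np.array([3.1, 2.4, -1.0, 2.8])  # hypothetical model outputs (logits)

# Softmax turns raw scores into a probability distribution...
probabilities = np.exp(scores) / np.exp(scores).sum()

# ...and the next word is sampled from that distribution.
rng = np.random.default_rng()
next_word = rng.choice(candidate_next_words, p=probabilities)
print(dict(zip(candidate_next_words, probabilities.round(3))), "->", next_word)
```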

The companies behind the systems are circumspect about how much data they have crunched and where it comes from, but you can reverse engineer some estimates to give you an idea of their consumption patterns. 

In late 2022, ChatGPT-3.5 was reported to work through around 175 billion parameters as it simulated text. ChatGPT-4, released in March 2023, is reported to have around 1 trillion parameters.   That’s more than a five-fold increase in less than a year, which is an incredible rate, and significantly faster than the growth of the Internet. That hints at a coming issue.  The Internet is very, very big, but it’s not infinite.  There is an upper limit to the amount of human-generated text which it can supply.

Some industry analysts have suggested that LLMs may begin to run out of data, perhaps as soon as 2026. It’s hard to see how they can continue to improve beyond the point that they’ve consumed the whole of machine readable human text.

I’ll plant with you now the idea that access to our digital collections might help with that.  It’s not exactly as if the big tech companies have been paying much attention to the restrictions that libraries place on use, but it’s possible they will start to pay more attention. 

Here’s why: LLMs are indiscriminate consumers of text and their GPTs are profligate producers of text.   This may be their downfall.

Sooner or later LLMs will start to consume their own simulated outcomes.  In statistical terms they will revert to the mean and instead of becoming better at producing text they will become better at producing the average. 

At the same time, they will inevitably consume misinformation and disinformation.   So, over time the outputs become less creative and less reliable: a sort of system collapse.  That’s the kind of thing that happens when you start to breathe your own exhaust fumes. 
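Here is a toy statistical cartoon of that dynamic, not a simulation of any real language model: fit a distribution, sample from the fit, refit on the samples, and repeat.

```python
# A toy statistical cartoon of the "breathing your own exhaust" problem.
# Fit a distribution, sample from the fit, refit on those samples, repeat:
# each generation is trained only on the previous generation's output, so
# errors compound and, in expectation, the spread of the data shrinks a
# little every time. Illustration only; not a simulation of a real LLM.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=50)   # stand-in for human-written text

for generation in range(1, 21):
    mu, sigma = data.mean(), data.std()
    data = rng.normal(loc=mu, scale=sigma, size=50)  # "train" on own output
    if generation % 5 == 0:
        print(f"generation {generation:2d}: estimated spread = {sigma:.3f}")

# There is no route back to the original distribution: whatever detail is
# lost in one generation stays lost in every generation that follows.
```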

As the saying almost goes: garbage out, garbage in.

So there will come a point where LLMs cannot improve, except with access to more high-quality, human-generated text, and they will need to know this information content is valid and not just more of the same.  We keep being told that digital content from the GLAM sector is locked away and hard to access.  So, it’s just possible that the solution to the LLM conundrum is better access to and better prioritization of digital repositories. 

Preserving AI

I want to turn my attention, briefly, to what might be the biggest challenge of all:  can we preserve Artificial Intelligence at the system level? 

The arguments as to why we should want to make AI reproducible seem unimpeachable.  There’s an old maxim reported from IBM that a computer cannot be held accountable and therefore should not be allowed to make decisions. 

We seem to have forgotten this rule.

Some of the earliest developments of AI were in expert systems which were designed for diagnostic decision support, but lately there are all sorts of uses that seem to go beyond decision support and move directly to action. 

Facial recognition might be a case in point.  It is notoriously unreliable, failing to recognise faces up to 80% of the time, but still deployed routinely by police forces in the UK.  AI is supporting immigration decisions, insurance premiums, university admissions, healthcare plans and credit ratings; AI is analysing long range population demographics, condition monitoring for vital infrastructure, and targeting political and social messaging to millions. You would hope that such systems could be held accountable and challenged to reproduce their outcomes.

But AI systems typically embed their own learning through reinforcement: this is one of their most important characteristics.   They are dynamic, responding to their own outputs such that every iteration is subtly different.  There is no steady state and, with complex dependencies across a staggering range of tools and services, the changes are hard to trace.  So the prospects for reproducibility – and thus accountability – are inherently weak. 
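As a sketch of what even minimal reproducibility would demand, imagine recording the random seed, a fingerprint of the exact model artefact and the environment alongside every decision. The file names below are hypothetical; the point is how little of this is routinely captured today.

```python
# A hedged sketch of minimal record-keeping for reproducibility: pin the seed,
# fingerprint the exact model weights used, and log the environment alongside
# every decision. All file names are hypothetical.

import hashlib
import json
import platform
import random
from datetime import datetime, timezone


def fingerprint(path: str) -> str:
    """SHA-256 of the model weights actually used for this decision."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def provenance_record(weights_path: str, seed: int) -> dict:
    random.seed(seed)  # any stochastic step must start from a recorded seed
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_weights_sha256": fingerprint(weights_path),
        "random_seed": seed,
        "python_version": platform.python_version(),
    }


if __name__ == "__main__":
    record = provenance_record("model_weights.bin", seed=2024)
    print(json.dumps(record, indent=2))
```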

It’s not all bad news.  There may be some hope in the context of open source. 

One of the most telling developments of the last year has been a sudden glut of open-source AI tools based on open-source LLMs.  It’s partly happened by accident, with Facebook’s inadvertent release of LLaMA and the emergence of some surprising, low-cost but highly effective competitors like Koala or Vicuna. 

There’s a sense that systems like Bard or ChatGPT have been built with billions of dollars of investment but, as Luke Sernau of Google admitted, without a moat and without defences.  In recent months the open-source community seems to have wrested control from the one or two major corporations who have invested so much over the years, and it seems to be running off with capabilities that have cost billions to develop.

Why does this matter for digital preservation and reproducibility? It is harder to hide open-source architectures within black boxes, and harder to conceal open-source workflows with non-disclosure agreements.

The secret sauce is just sauce and that makes it one step closer to being reproducible.

What is to be done: an agenda

We’ve covered a lot of ground in this presentation and there’s so much more we haven’t covered: no mention of copyright, no mention of environmental impacts, no mention of the human cost of faux-tomation, no mention of the problematic military and coercive use-cases which have sustained and underpinned so much research. 

I’ve not raised regulation, lobbying or the apparent ease with which high officials act in the interests of those whom they should be regulating rather than the publics they’re committed to protect; I’ve not mentioned interference in markets or elections or the extractive exploitation of data about vulnerable and marginalized groups.   I’ve not mentioned obscene profits or the feudal relationships that emerge around new kinds of rent-seeking behaviours.  Ethics have hardly surfaced at all.

It’s not that they don’t matter, it’s just that I’ve only got so much time.

I want to propose a simple 4-point research and development roadmap for AI and digital preservation.  We should:

  • Investigate, benchmark and share best practice for artificial intelligence-based tools that support and enhance digital preservation workflows, such as but not limited to appraisal, metadata extraction, forensics and access.
  • Establish frameworks for the assessment of content generated by artificial intelligence, alerting governments and others in authority to the coming threats to authenticity and accountability. We should declare a public emergency.
  • Be prepared to engage with the artificial intelligence industry, making sure the ethics and competencies of librarianship are represented at the highest level, and making the case for existing, authentic but inaccessible content as a contribution to better AI.
  • Create urgency and momentum towards the preservation of artificial intelligence at a systemic level, and recognise that reproducible AI is a core component of accountability.

This agenda is not limited by the 30-minute time frame of a presentation so it absolutely must take place in the context of all the other ethical, legal and socio-economic matters which I have overlooked. 

To repeat what many of you have heard me say before, and which the street signs here in Glasgow almost say: people make digital preservation.  If we concentrate too much on technology then we will never fully succeed. Digital preservation is not for the sake of the bits and the bytes: the files will not thank us.  So this research agenda has to be framed not simply around what the technology needs, but around what will be most effective at empowering people and ensuring opportunities are handed on through generations, independent of technology.

That should keep us plenty busy.  And if you hadn’t noticed already, the Digital Preservation Coalition is a ready-made global partner in this conversation.

Thank you again for your time today.  I hope you have found something in this which is useful to you and I am grateful to IFLA for their warm welcome and generous invitation to speak.


Further Reading

I have found the following two books invaluable in developing this paper and recommend them to anyone looking for an overview of the field.   

Crawford, K. (2021) Atlas of AI: Power, Politics and the Planetary Costs of Artificial Intelligence. Yale University Press.

Stokel-Walker, C. (2024) How AI Ate the World: A Brief History of Artificial Intelligence and Its Long Future. Canbury Press.

Acknowledgements

I am grateful to colleagues at IFLA for their invitation to speak.  I am grateful to Michael Popham who reviewed an early draft of this text, to Prof Maria Economou and Dr Lynne Verschuren who prompted me to organize my thoughts, and to Prof Andrew Cox who introduced me to the organizing committee.

 

