In February 2023 I was invited to speak at a workshop organized by the AEOLIAN Network entitled ‘New Horizons in AI and Machine Learning’. Circumstances, including a postponement of the workshop on account of industrial action, meant I was not able to attend and present in person. I have therefore shared this text of my presentation for publication afterwards, with the consent of the organizers.
It’s time we built the basis for a new digital preservation. Emerging technologies, including AI, invite us to rethink what we have been taking for granted for too long.
This may sound like a dramatic development but to those familiar with both disciplines it’s probably a statement of the obvious. Artificial intelligence is, of course, the next big thing in computing: you cannot hide from the hype cycle. Also, AI has been the next big thing for at least four decades. So perhaps this time it will get over the inflated expectations and head for something productive and routine: perhaps it really is the next big thing.
Regardless of how the current headlines about AI pan out, we have also, already, and always needed to imagine a new digital preservation.
I am sorry that I cannot be with you in person today, even remotely. Some of you will recall that the workshop was originally scheduled for 23rd February, which coincided with a day of industrial action. The event, which I was very much looking forward to, has been moved to 24th March instead, coinciding with the launch of the DPC’s offices in Australia. By the time you are hearing my voice or reading this I will be safely tucked up in bed on the other side of the world. So I am grateful to Paul and Lise for their patience – they had a very good opportunity to ditch me from the schedule. I am grateful also to my fellow panellists, who might want to take a firm view on what I am about to say. They may or may not also be grateful to Paul and Lise too. I have taken the step of scripting this presentation, partly to help with the discussion so that colleagues can reflect on it beforehand. Paul and Lise also know my bad habit of talking too long, so this text will keep me to time. It also affords me the luxury of posting the text to the DPC blog in due course, assuming you are happy for me to do so.
I worry that digital preservation is broken, and I wonder if AI is the nail in the coffin. More particularly I am concerned that the repository model of digital preservation is facing its own obsolescence.
Sic Transit Gloria Repositoriae
Of all the people here today, it is the digital preservationists that should be aware of the risks of obsolescence. But this workshop is about new horizons for libraries and archives. It’s been more than 20 years since the idea of the Trusted Repository was first proposed. It can hardly be argued that trusted repositories are the future of libraries and archives. After so many years they are, or ought to be, the here and now.
At a previous event I outlined the four ways in which I think archives and libraries might intersect usefully with AI.
- We can use artificial intelligence to do digital preservation better. That’s coming into focus in a lot of the presentations through the AEOLIAN network and elsewhere. My current favourite is the work which the World Bank has been doing around video classification, which I recommend you look out for.
- We can become better at capturing, receiving and arranging the inputs and outcomes of artificial intelligence, knowing and documenting the variables in play and the dependencies that exist between them. This might be naively characterized as using forensic tools and techniques to spot the deep fakes as well as the shallow ones.
- We can disrupt or take over artificial intelligence - creating, monitoring and maintaining the kinds of services on which AI depends.
- We can take steps to preserve artificial intelligence at a systemic level. That’s something we’ve barely even begun to think about. We’re used to worrying about data and our usual complaint is that as the amount of data grows, so the job gets larger.
In my last presentation I emphasized the opportunity, even the need, for archivists and librarians to disrupt AI with ethics at a systemic level by training AI tools on complex document sets that reveal the deeper complexity of real life rather than the exaggerated regularity and fauxtomation on which many systems depend.
In this presentation I will turn again to the question of preserving AI.
So what are the prospects for the preservation of AI? Is there even a remote possibility of building an infrastructure around AI in which transparency and reproducibility can be demonstrated over the longer term?
DP101
So firstly, let me offer a brief definition of digital preservation.
It’s “the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.”
I know you know this so forgive me, but there’s a lot packed into this short statement which is worth holding in mind. The key words are important. Digital preservation is a process not an event; it fits within a managed framework of an organization and its mission; it’s about access, which means more than just backup or storage; and it’s for as long as necessary – not forever and certainly not for everything.
This definition includes all sorts of decisions and actions that impact on the design, creation and use of digital materials. So key digital preservation decisions happen long before we realise the need for preservation.
An awful lot therefore falls into scope for digital preservation: really, any digital object whose lifecycle and use case outlast the infrastructure on which it was created.
I’m labouring the point. Perhaps it would be simpler to say that I am aiming at “the series of managed activities necessary to ensure continued access to Artificial Intelligence for as long as necessary.”
Why preserve AI?
Put in those terms the preservation argument seems robust. You need preservation because system safety is hard and dangers are real.
I don’t need to talk to this audience about the extensive reach that AI could have – already does have - into our daily lives, nor about the ethical, legal or environmental concerns that have arisen. Barely a day goes by without some headline suggesting that ChatGPT might put essay mills out of business.
I can afford to be tongue in cheek here because it seems so entirely obvious that systems which make and inform important decisions and intervene in so many aspects of our lives and work should be open to scrutiny. We also know that systems embed all manner of stereotypes and biases. In so doing they give a gloss of objectivity which can disguise and obfuscate all manner of privilege and prejudice.
Some are funny and mostly harmless – just try a voice activated system with a Glasgow accent and being told, literally, that your voice doesn’t fit. Three decades of regional accents on the telly, but here’s an Anglo-normative elevator ready and willing to other you. Buttons were never so judgey.
Some of them are profoundly serious. AI is the Third Offset. State actors and non-state actors of all kinds are busy harvesting and measuring massive aggregations of data so that individuals – I mean you and me – can be understood, tracked and controlled. Systems which were extra-legal by design to enable infrastructural warfare are now available on demand to every level of municipal government and law enforcement. As Benjamin Bratton has argued, states have turned to surveillance computing because surveillance computing has started to subvert the jurisdiction and accountabilities of the state.
You’d want the outcomes to be reproducible; and you’d want processes to be stable enough to do meaningful side-by-side comparisons with different inputs.
Are you charging Abdul more for his car insurance because he’s a terrible driver, or because there’s a racist plug-in to your scoring system? Are you targeting this person for stop and search because a poorly designed emotion detection system in the CCTV matches them to some generic ideal of danger, or because they have a disability? What about expert systems which decide on cancer treatments or school placements or benefit applications, or exam grades?
These systems need to be regulated. My broad argument is that transparency needs preservation – perhaps better described as reproducibility. This is not some nice-to-have luxury. It’s essential. We should not have proceeded without it.
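To make that concrete, here is a deliberately toy sketch, in Python, of the kind of side-by-side test a regulator, or an archive acting on their behalf, might want to rerun years later. Everything in it is invented: the quote_premium function, its postcode weighting and the two applicants are illustrations, not any real insurer’s system.

```python
# A toy, hypothetical pricing model: imagine this function is the archived
# 2023-03 version of an insurer's scoring system, preserved alongside its
# documentation.
def quote_premium(applicant: dict) -> float:
    base = 400.0
    base += 150.0 * applicant["claims_last_5_years"]
    # A hidden postcode weighting like this is exactly what a side-by-side
    # comparison is meant to surface.
    if applicant["postcode"].startswith("G"):
        base *= 1.3
    return round(base, 2)

# Two applicants, identical except for one attribute.
applicant_a = {"name": "Abdul", "claims_last_5_years": 0, "postcode": "G42 8PL"}
applicant_b = {"name": "Brian", "claims_last_5_years": 0, "postcode": "OX1 2JD"}

for applicant in (applicant_a, applicant_b):
    print(f"{applicant['name']}: £{quote_premium(applicant)}")

# Only if the model version, the test inputs and the recorded outputs are all
# preserved together can this comparison be rerun later with any confidence.
```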
Killer Apps
Back in digital preservation, we’ve been so concerned with data that we’ve not been thinking enough about the business processes and transactional systems in which that data lives.
We have all sorts of tools for managing the data, and some tools for reconstituting the processes, but nothing that works at anything like this scale. We seem to have forgotten that, although we can parse and validate files till the cows come home, the challenge is more than just a question of data volumes. The ranking algorithms and personalization of view are more important to the shaping of public discourse than the file-based contents they purvey.
Search algorithms are in the highest risk categories for preservation, but not because they have been abandoned at the back of the cupboard. It’s because the work needed to preserve them is so specific that by the time we figure it out the object has already changed.
Search is dynamic, and highly proprietary which means it is already a difficult beast to tame. It is also highly personalized, accessing the shadow persona of cookies and apps that track your habits and preferences. It’s unlikely that tech firms would ever surrender the finer details of their algorithms or share your social graph. They benefit from the continuous invisible improvement cycle in which every click is harvested into a feedback loop of optimisation. It generates a massive commercial advantage and a massive impediment to competition.
You might be wondering why I have gone off on this tangent about search and personalisation. Search was the killer app around the turn of the millennium, at about the same time as the Trusted Digital Repository was in development. Twenty years later and it remains incredibly hard to see what’s going on with search rankings.
If that’s true of search twenty years on, then it’s going to be even more true with AI systems.
Night in the Repository
Let’s imagine digital preservation tools and approaches when presented with AI. Here are two common if generic ways to think about digital preservation.
Around 20 years ago, maybe more, Nancy McGovern described digital preservation as a three-legged stool which needs technology, resources and policy. You can measure success by how far you have progressed with each of these. My guess is that, if you asked how advanced we are on each of these when it comes to AI, you’d draw a blank: no, no and no. That’s discouraging, but it’s tractable.
At the same time the space science community were developing their model for the Open Archival Information System (see Brian Lavoie's Introductory Guide from 2014). That’s made a lot of the running since, and it lays six responsibilities on anyone doing digital preservation.
- Negotiate for and accept appropriate information from producers;
- Obtain sufficient control in order to meet long-term preservation objectives;
- Determine the scope of the user community;
- Ensure that the preserved information is independently understandable to the user community;
- Follow documented policies and procedures to ensure the information is preserved against all reasonable contingencies;
- Make the preserved information available to the user community.
This is where we get into more serious difficulties. Making something independently understandable has always been the trickiest part of implementing the OAIS, and it is why the information model, rather than the functional model, is the core.
Let’s look a bit more closely. The core requirements of an OAIS include the need to 'Ensure that the preserved information is independently understandable to the user community, in the sense that the information can be understood by users without the assistance of the information producer’. This requirement is resolved by placing information packages in front of a designated community. The OAIS becomes a broker in an information exchange between producers and this special group of consumers who, through the deployment of implicit knowledge, prevent an almost infinite regression of representation information.
Even so, the independent utility of a submission information package is only possible with significant additional packaging, and arguably it is this packaging of representation information which marks out digital preservation as distinct from other forms of content management.
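For anyone less steeped in the OAIS information model, here is a much simplified sketch, in Python and purely for illustration, of what an information package looks like once representation information is bundled with the content. The field names are mine rather than the standard’s; the point is only that the bitstream on its own is not independently understandable.

```python
from dataclasses import dataclass, field

@dataclass
class RepresentationInformation:
    """What a member of the designated community needs to interpret the bits."""
    file_format: str       # ideally a registry identifier, e.g. a PRONOM PUID
    structure_notes: str   # how the bitstream is organised
    semantic_notes: str    # what the content actually means

@dataclass
class InformationPackage:
    content_data: bytes                                   # the bitstream itself
    representation_info: RepresentationInformation        # needed to make it understandable
    provenance: list[str] = field(default_factory=list)   # who did what, and when
    fixity: str = ""                                      # e.g. a SHA-256 digest

package = InformationPackage(
    content_data=b"%PDF-1.4 ...",   # placeholder, not a real document
    representation_info=RepresentationInformation(
        file_format="PDF/A",
        structure_notes="single self-contained file",
        semantic_notes="minutes of a board meeting",
    ),
    provenance=["received from producer", "format validated on ingest"],
    fixity="sha256:<digest would go here>",
)
print(package.representation_info.file_format)
```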
What would that be like in the context of AI? A number of questions arise:
- Is it possible, meaningfully and practically, to ingest something as large, distributed and complex as an AI engine into the sorts of preservation architectures described by OAIS?
- If OAIS is not the correct reference model to describe the preservation functions that would make AI systems reproducible, what other model do we have? Shouldn’t we be starting work on that soon?
- Is it possible to obtain control over the system without setting limits that could in turn be harmful to reproducibility?
- How would we scope a designated community when addressing a question like this?
- Would it ever be possible to obtain the representation information needed to ensure independent utility?
There’s quite a long discussion to be had about what level of reproducibility is sufficient. But on first inspection my sense is that the only way we could answer yes to these questions would be if the system were small and self-contained, if we had very precise requirements for reproducibility, and if those requirements were very stable over time.
Let’s dig a bit more deeply here into AI systems. You might have heard that GPT-4 was released to an unsuspecting world last week, somewhat delayed but with some nice new features. It scores in the top 10% on a simulated bar exam for lawyers and can now accept image inputs as well as text. It continues to be a source of amazement.
But also amazing are the terms under which it has been released. OpenAI, the company behind GPT-4, has published a technical report alongside it. For all the hype, the supporting documentation is surprisingly opaque about how the tools work and how they have been trained.
‘Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.’ This seems like a bigger story than the extraordinary results it has produced. It is vaguely reminiscent of the time Amazon released its ‘Mechanical Turk’ which gave the impression of Artificial Omniscience while hiding the real labour out of sight.
Granted, GPT-4 is a chatbot – a language simulator which is not the only or even the most important form of artificial intelligence. But I am not seeing a lot of transparency from other providers either. As a general rule AI has emerged from military and surveillance computing, so opacity and avoidance of regulation are not a bug, they are a feature.
It’s for others to point out the ethical, legal and economic concerns that arise from this. For the record though, I don’t think we’re going to have much luck asking about the inner workings of the system in order that we can better preserve it.
More likely, in the distant future (or next year whichever comes first) we would achieve a form of reproducibility by asking newer forms of AI to answer ‘in the style of GPT-4’. Building AI to capture AI.
That would be an approach to reproducibility, but it would arguably move us farther from transparency: in addition to understanding how the old system understood the world, we’d need to understand the limits of how the new system understood how the old system understood the world. And thus the puzzle is now wrapped in an enigma.
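If we did go down that road, the most we might preserve is something like the sketch below: a set of archived prompts and responses from the original system, against which a future imitator is scored. The ask_new_model function is a placeholder of my own invention, not any vendor’s API, and the similarity measure is the crudest one available. Even a perfect score would tell us nothing about why either system answered as it did.

```python
import difflib

# Archived behaviour of the original system: prompts and responses captured
# while it was still running. This becomes the only ground truth we keep.
archived = {
    "Summarise the OAIS reference model in one sentence.":
        "OAIS describes the responsibilities of an archive that preserves "
        "information for a designated community.",
}

def ask_new_model(prompt: str) -> str:
    """Placeholder for a future model asked to imitate the old one."""
    return ("OAIS sets out what an archive must do to keep information "
            "understandable for its designated community.")

for prompt, original in archived.items():
    imitation = ask_new_model(f"Answer in the style of the 2023 system: {prompt}")
    score = difflib.SequenceMatcher(None, original, imitation).ratio()
    print(f"similarity = {score:.2f}")
    # A high score suggests the behaviour is roughly reproduced; it says
    # nothing about why either system answered the way it did.
```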
Either way there’s a big conversation to be had about what kinds of reproducibility are required. It’s entirely possible that we’re asking the wrong questions. Established models in digital preservation, like OAIS, offer a particular view on preservation which works for documents and for data. This has served well for a computing paradigm of the last century, and it will continue to be useful in contexts where documents and data are self-contained. But it offers little to meet the emerging challenges in preservation.
This might all seem a little unfair: I am asking OAIS to preserve something it was never designed to preserve, probably because it didn’t exist when OAIS was described, and then I’m complaining that OAIS can’t preserve it. But that’s what obsolescence is always like. We need to start imagining what digital preservation and reproducibility mean after the repository.
Conclusion
To conclude: we need a wider discussion of digital preservation and reproducibility. I am by no means the first person to have suggested this, nor are the practical limitations of OAIS unfamiliar. Efforts to make AI reproducible, a necessary sub-plot in its regulation, will make these issues disconcertingly apparent. To be fair, the Trusted Digital Repository has always been a misnomer: it’s the people and their processes we trust, not the repository itself.
It’s time we built the foundations of a new digital preservation.
Acknowledgements
I am grateful to Jen Mitcham and Michael Popham who reviewed an early draft of this text prior to publication and to Paul Gooding and Lise Jaillant and partners in the AEOLIAN project for inviting my contribution to their workshop.