Sara Day Thomson is Research Officer at the DPC |  Graham Purnell is Digital Preservation Assistant at the National Library of Scotland


1: Getting Our Feet Wet in Emulation

World Digital Preservation Day celebrates all the inspiring work being done in digital preservation around the world. It celebrates innovative tools and techniques, effective advocacy and awareness-raising, and collaboration among fellow practitioners. This summer, I got to experience all of those things at the National Library of Scotland (NLS) as part of a digital preservation skills upgrade.

It all started when Lee and Graham - the digital preservation team at NLS - agreed to let me help them with some speculative research into emulation for digital preservation and access. I learned a lot about emulation as a technology but also about the real-life, facepalming frustrations of trying to coax 20-odd-year-old software to work properly on a legacy OS built in VirtualBox.

While the research was limited to a few test cases just to get our feet wet, I came away with a much deeper understanding of the opportunities and limitations introduced by emulation as a digital preservation strategy for heritage institutions. With patience and good humour, Graham taught this young dog some old tricks.

But before I get to the juicy bits, I thought I’d paint the scene a little with a few basics about emulation for those readers who, like I did, need a refresher.

If you’re all up to date on emulation as a digital preservation strategy, feel free to skip to section 3: ‘It works! Wait, is it working?’

2: A Brief Romp Through Emulation for Digital Preservation

The Oxford LibGuides defines an ‘emulator’ as ‘software which mimics the behaviour of another computer environment’. ‘It is used in digital preservation’, the guide explains, ‘to access software and digital files which require obsolete technological environments to run. For example, an organization could use a Windows 3.1 emulator to access a WordPerfect file from 1994 on the document editing software which originally created it (Corel WordPerfect version 7.x)’.

The Digital Preservation Handbook describes emulation as ‘an alternative solution to migration that allows archives to preserve and deliver access to users directly from original files’. The Handbook explains that emulation ‘attempts to preserve the original behaviours and the look and feel of applications, as well as informational content. It is based on the view that only the original programme is the authority on the format and this is particularly useful for complex objects with multiple interdependencies, such as games or interactive apps’.

Emulation is achieved using a Virtual Machine (VM) - a ‘machine’ that lives inside a modern machine and mimics the hardware and behaviour of an older machine (or different machine). A VM is basically the disembodied spirit of a computer - it acts the same but has no physical existence. A ghost machine, if you will, that brings digital corpses back to life. The modern, physical machine used to run the VM is called a host machine (so there are plenty of excellent alien metaphors as well).

A similar strategy to emulation is virtualization, which performs much the same function using a VM but executes some instructions directly on the host machine’s hardware rather than translating all of them in software. But as David S. H. Rosenthal states, ‘the difference between emulation and virtualization is not absolute’. If Rosenthal doesn’t think I should overly worry about the difference, then I’m not going to overly worry about the difference.

In his 2015 paper ‘Emulation & Virtualization as Preservation Strategies’, Rosenthal outlines some of the main barriers to widespread adoption of emulation as a preservation and access strategy. He lists the main barriers as: 1) inadequate tools for creating preserved system images and 2) lack of legal clarity and, where there is clarity, highly restrictive legislation. Ultimately, as Rosenthal puts it, the greatest barrier to widespread use of emulation is economic. There are not adequate resources to build the technology or to purchase the necessary software licenses (or lobby for the needed legislative change). But the difficulties of embracing emulation extend beyond these main challenges, as noted by Rosenthal four years ago.

For instance:

  • The creation and maintenance of emulators themselves face serious challenges because the major developers of emulation software are motivated by business needs rather than digital preservation. Support for emulators for legacy content therefore tapers off after a short period of development, when developers move on to emulators for new applications.
  • The momentum behind progress in emulation technology (in other words, emulators and frameworks to make them usable for the public) relies heavily on a handful of passionate enthusiasts dedicated to this work. While the work done by these champions is remarkable, it is not enough to institutionalise emulation for the world at large.  
  • Collecting institutions do not commonly collect preserved system images. Large collections of preserved system images, as Rosenthal points out, are ‘the key to emulation’s usefulness as a preservation strategy’. This challenge comes with an ever approaching expiration date as legacy content degrades and disappears.  
  • Common metadata extractors, such as DROID and JHOVE, don’t meet the requirements for emulation. Though work is underway to create workflows to produce the required technical metadata, these new extraction techniques are not mainstream enough for most heritage organisations to just pick up and use.  
  • Emulation does not necessarily deliver an authentic experience of legacy content for end users, partly because the modern machines running the emulators are much faster than the original machines and in part because, quite simply, they’re different machines. Therefore curators and archivists still have a job to do to recreate faithful experiences of old software and old machines.

Despite these challenges, emulation has made significant progress towards widespread use. In the early days, emulation was accessible only to developers and to the gaming community, who energized the development of emulation in order to play legacy games. Now, however, due to developments such as the bwFLA framework from the University of Freiburg, it is possible for users to interact with emulators using Web browsers, removing a significant barrier to wider implementation.

National libraries such as the British Library and the German National Library have implemented the bwFLA framework as a solution to preserving born-digital content on physical media. The Flashback project at the British Library has successfully tested a sample of born-digital legacy content, particularly disk-based content acquired as inserts or attachments to books or magazines, using EaaS (Emulation as a Service, based on bwFLA). The German National Library has also implemented a bwFLA-based service that provides access to legacy content in the Library’s reading rooms, particularly complex or interactive content such as scientific databases.

Since Rosenthal’s report was published in 2015, the EaaSI service, led by the Digital Preservation Services team at Yale University, has further improved the usability of web-based emulation. Among other advances, EaaSI enables users to find the computing environment they need for a particular digital object from a pool of environments shared across a network of institutions. This progress helps to remove the barrier of scale: creating an emulation environment (hardware, operating system, drivers, etc.) to render each individual object or disk image in a collection is not feasible.

The ability to access a large corpus of environments through a shared service is a bingo moment for many collecting institutions looking to implement emulation. However, despite the progress the EaaSI project has made, in partnership with the Software Preservation Network and others, the legal restrictions on performing preservation actions (copying, sharing) on licensed software in Europe and other parts of the world are still a prohibitive barrier.

Despite continued barriers to widespread use, Rosenthal asserts the increasing importance of emulation as a preservation strategy: ‘The evolution of digital artefacts means that current artefacts are more difficult and expensive to collect and preserve than those from the past, and less suitable for migration. This trend is expected to continue’. And that was in 2015, so I’d say indeed, that trend has continued.

For the National Library of Scotland, like many heritage sector organisations, the focus of the emulation research was to open and review legacy content and explore options for a potential digital preservation and access strategy. In Lee’s own words, these initial investigations into using emulation were about ‘[...] some hands-on trying out of software and services, and some brains-on thinking about how this could be implemented for the Library and some of the obstacles and potential.’ One goal of the project was ‘to get some data from [the] Digital Archivist to emulate through a browser or dedicated desktop terminal as if we were presenting it to a researcher in our reading rooms.’

So we began.

3: It works! Wait, is it working?

My first day on-site at the Library, Lee, Graham, and I went for tea and cake (obviously, because you can’t do digital preservation without cake these days) to make some plans.

I only had 6 days over 8 weeks, so we set out some tiered goals:

  • ‘Bronze’: Emulate legacy CDROMs (internal Library software, not collection content), including Microsoft Works 4 and OmniPage Pro (OCR software).
  • ‘Silver’: Create disk images of software and data to mount into virtual machines and use emulation to open and render collection data that has not yet been seen.
  • ‘Gold’: Investigate issues and roadblocks encountered during emulation and evaluate restrictions to using emulation as a service and viability of future emulation solutions at the Library.

Graham and I were then set loose on a dedicated Library-issue, non-networked laptop and a couple of old CDROMs. For our experiments (and based on metadata from the CDROM covers) we needed to emulate three main environments to begin with: Windows 3.11, Windows 95 and Linux Mint. We built these environments on VirtualBox by Oracle (VB), which is open-source, free-to-use software.

Downloading and running VB was pretty straightforward. The complicated bit is knowing how to configure your VMs on VB so that the legacy systems will work properly (apologies for all the V acronyms). A great deal of guidance for building particular VMs exists, much of it, not surprisingly, out of the gaming community. For example, we used some instructions from this post by John Greenfield on Medium as a reference guide.

Based on this experience, I would say creating a VM in VB is definitely doable for a complete novice who only learned about VMs just now from this blog post, as long as that novice has some basic computing skills (mainly Googling) and is motivated and patient. However, we did run into a few initial roadblocks.

For example:

  • When we built the Windows 95 VM, we set the RAM too high, which prevented it from running properly. We removed the faulty machine from VB and rebuilt it using the default RAM setting of 256MB (in other words, we scrapped it and started over, using the default options... ). 
  • The Windows 95 VM’s colour depth was restricted to 16 colours, which made all the graphics seem grainy. We solved this problem by installing the graphics driver SciTech Display Doctor. 
  • We were not able to connect to the internet from our Windows 95 VM, which we eventually solved by changing the protocol in the Network Neighborhood settings.

In blog reality, that process took a few bullet points. In emulation reality, it took an entire afternoon. Our sample set for the first session (2 days) was only 2 CDROMs, and it still took an entire afternoon to build a couple of the environments we needed.
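For anyone who would rather script the setup than click through the VB interface, here is a minimal sketch in Python of what building a bare Windows 95 VM might look like using VirtualBox's command-line tool, VBoxManage. To be clear, this is not how Graham and I did it, and the VM name, disk path, sizes and the controller name "IDE" are all made-up illustration values. The script only prints the commands so you can review them first, and the flags are worth checking against the VBoxManage help for your own version of VirtualBox.

```python
import shlex
from typing import List


def win95_vm_commands(name: str, memory_mb: int, vram_mb: int,
                      disk_path: str, disk_mb: int) -> List[List[str]]:
    """Build (but do not run) VBoxManage calls for a bare Windows 95 VM.

    All argument values are supplied by the caller; nothing here is a
    recommended setting.
    """
    return [
        # Create and register an empty VM with the Windows 95 guest type.
        ["VBoxManage", "createvm", "--name", name,
         "--ostype", "Windows95", "--register"],
        # Set RAM and video memory (setting RAM too high tripped us up).
        ["VBoxManage", "modifyvm", name,
         "--memory", str(memory_mb), "--vram", str(vram_mb)],
        # Create a virtual hard disk; --size is given in megabytes.
        ["VBoxManage", "createmedium", "disk",
         "--filename", disk_path, "--size", str(disk_mb)],
        # Add an IDE controller (the name "IDE" is an arbitrary label).
        ["VBoxManage", "storagectl", name, "--name", "IDE", "--add", "ide"],
        # Attach the new disk as the first IDE device.
        ["VBoxManage", "storageattach", name, "--storagectl", "IDE",
         "--port", "0", "--device", "0", "--type", "hdd",
         "--medium", disk_path],
    ]


if __name__ == "__main__":
    # Hypothetical values; 256 MB echoes the RAM figure mentioned above.
    for cmd in win95_vm_commands("Win95-test", 256, 16, "win95-test.vdi", 2048):
        print(shlex.join(cmd))
```

Printing rather than executing is deliberate: if a setting turns out to be wrong, it is much cheaper to edit a line of text than to scrap a VM and start over.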

But on to the next thing! The following day, we needed to figure out how to get stuff onto the VM - how to load programmes and run software and render some data.

We experimented with adding a CD drive to the Windows 95 VM through VB so that we could mount one of the physical CDROMs into the host computer’s CD drive and open it in the emulated environment. After encountering an initial error (that required enabling the host computer’s hardware virtualisation), we successfully installed MS Works 4 on Windows 95 straight from the disk (and there was much celebrating). After all that, it was very gratifying to be able to create a document in MS Works 4 (with a little help from Graham who remembers how the programme works from the first time round...).

Once we established we could open the CDROMs using the host machine’s CD drive, we moved on to creating disk images that could be mounted virtually into the VM through VB. For other newbies to emulation, this is the principal way of accessing legacy content on optical disks (CDROMs) using emulators. Lots of software exists to generate disk images, usually IMGs or ISOs, but we used WinCDEmu, which runs from a right-click shell extension rather than a GUI app. We also used ImDisk, which allows an ISO to be mounted using a shell extension.
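As a rough illustration of what happens after imaging, the Python sketch below does two things: it checks whether a file looks like an ISO 9660 image (the standard puts the signature "CD001" one byte into sector 16), and it builds the VBoxManage command that would attach the image to a VM's virtual CD drive. This is an assumption-laden sketch rather than our actual workflow: the VM name, controller name and file name are hypothetical, and a failed signature check does not mean an image is bad, since audio CDs and some Mac-formatted disks are not ISO 9660 at all.

```python
from pathlib import Path
from typing import List

SECTOR_SIZE = 2048            # usual data sector size for ISO 9660 images
FIRST_DESCRIPTOR_SECTOR = 16  # volume descriptors begin at sector 16
ISO9660_SIGNATURE = b"CD001"  # standard identifier inside each descriptor


def looks_like_iso9660(image: Path) -> bool:
    """Return True if the file carries the ISO 9660 signature."""
    with image.open("rb") as fh:
        # Skip the one-byte descriptor type, then read the identifier.
        fh.seek(FIRST_DESCRIPTOR_SECTOR * SECTOR_SIZE + 1)
        return fh.read(len(ISO9660_SIGNATURE)) == ISO9660_SIGNATURE


def attach_iso_command(vm_name: str, controller: str, image: Path) -> List[str]:
    """Build (but do not run) the call that mounts an image in a VM.

    `controller` must match the name of a storage controller that
    already exists on the VM.
    """
    return ["VBoxManage", "storageattach", vm_name,
            "--storagectl", controller,
            "--port", "1", "--device", "0",
            "--type", "dvddrive", "--medium", str(image)]


if __name__ == "__main__":
    image = Path("works4.iso")  # hypothetical file name
    if image.exists() and looks_like_iso9660(image):
        print(" ".join(attach_iso_command("Win95-test", "IDE", image)))
    else:
        print(f"{image} is missing or not an ISO 9660 image")
```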

After we established the basics and successfully built the first three VMs - Windows 3.11, Windows 95 and Linux Mint - we took a field trip over to Manuscripts to see about getting some more legacy content from the Digital Archivist. He had a nice little bundle of old stuff on disks for us, requiring a few different types of emulators from an array of vintages.

Using some of our existing VMs as well as a Windows 98 VM on VB and the Basilisk emulator for running Mac OS 8.1, we resurrected some old content. In addition to the new VMs we built for the legacy content from the Digital Archivist, we also spun up a variety of other emulators for learning purposes, including DOSBox for running MS-DOS and BlueStacks for emulating Android environments. Each VM and emulator came with some wrinkles to iron out, often due to configuration settings, drivers, and support for other dependencies.

For example, one legacy application required support for plugins for old versions of Shockwave Flash (which, in the end, Graham discovered was supported by Internet Explorer 6 with service pack 1, which he found at the Internet Archive, bless them). In some cases, as with the application that required old Shockwave Flash, these phased-out dependencies were critical to opening and viewing content. Without that version of Shockwave, that particular object just didn’t work.

One overarching theme across all these objects was the relative idea of ‘working’. There were many moments I exclaimed ‘it’s working!’, only to find a few minutes later, after more jiggery pokery, that I wasn’t in fact sure if it was really working. It was often unclear if a programme was really functioning the way it was meant to, because I didn’t have knowledge of how some of these programmes were meant to work in the first place.

Take OmniPage Pro (the OCR software), for example: the application booted up and we could upload an image of printed text, but the OCR software could not render it as digital text. Was this because we didn’t have an adequate sample of text to upload into the application? Was the accuracy of the programme always quite poor?

It’s very difficult to validate the effectiveness of an emulator without knowing how the original content (for example, software, database, etc.) worked at the time it was created. It’s one thing to open the software and click around. It’s another thing entirely to have an authentic experience using an application the way it was used at the time it was released. 

4: Anyone else out there?

To round out these experiments, and start working towards ‘gold’ - evaluating the viability of future emulation solutions - Graham and I reached out to other organisations who have implemented emulation as a solution in a Library context.

The digital preservation team at the British Library took time to chat with us about their work on the Flashback project (mentioned earlier). In particular, we talked through their extensive disk imaging initiative and the tools and workflows they set up to get legacy (or soon to be legacy) content on servers. Using an adaptation of BitCurator, the team made considerable progress, though multisession CDROMs (recordable CDs that allow users to record in more than one session) created a bit of a challenge. 5.25-inch floppy disks required a bit of extra effort as well, but KryoFlux handled these fairly well. (Thanks Simon and Kevin!)

The University of Freiburg and German National Library (DNB) (also mentioned earlier) kindly answered our questions about their strategy for obtaining licenses (or permission) to provide access to the software running in their emulation service. Luckily, German legal deposit legislation extends to software applications, so the DNB can provide access to applications in reading rooms without further licencing. However, gaining licenses for operating systems required a considerable effort (and persistence). (Thanks Klaus and Tobias!)

These two national libraries provide really useful blueprints for how emulation might be implemented in a Library for both digital preservation and access. The main takeaway from the conversation with the BL was the importance of imaging all content on disks, starting with the older and more difficult content (whether emulation is in place yet or not!). The main takeaway from the conversation with Freiburg and the DNB was that even with legal deposit scope to provide access to software applications, legal restrictions make the implementation of emulation an arduous and resource-heavy solution.

Graham and I hoped for some positive news from the EaaSI team at Yale, who agreed to chat with us about their current work (thanks Euan!). The EaaSI shared service definitely overcomes many of the barriers to emulation, namely the prohibitively resource-heavy job of building emulation environments for every legacy item. The EaaSI project has also made notable progress in empowering US institutions with the legal mandate to make software available for access to legacy content in heritage and research collections. Unfortunately, this progress largely relies on the Fair Use principles in US copyright law, so it is not as much of a leap forward for the rest of the world. The EaaSI team is looking to address the issue of legal restrictions in Europe and other parts of the world, though, so hopefully their success will rub off on the rest of us soon!

5: Lessons Learned

I’ve already touched on the lessons I learned through this emulation research at NLS, but there are a few overarching lessons I think worth summarizing.

  1. Knowledge of legacy content and how it’s meant to operate is crucial both to validating the quality of an emulator and to the ability of future users to open and manipulate the content. For example, my limited command line skills sufficed to get MS-DOS up and running (with some prompts from Graham). But how many generations into the future will Library users remember how to use a command line operating system? What other documentation will be required to ensure users in the future - beyond a niche group of select specialists - will be able to use and appreciate these legacy systems? 
  2. Even though the research at NLS was a learning experience with lots of time spent explaining and demonstrating, it was still clear that manually building emulation environments for individual items would be an enormous task. This approach was not really a viable option for preserving or accessing the Library’s legacy collection materials at scale.
  3. Shared emulation services, namely EaaSI, address the challenge of scale, but the legal restrictions still remain for countries outside the US. Ultimately, the resources required to advocate for legislative change are beyond the capability of any one institution. For those of us in the EU, a great deal of work remains to establish the full extent of current rights and freedoms for heritage and research institutions within the legal framework as it exists. A great deal more work then remains to establish more precisely what exceptions and allowances are required and then to advocate to bring them about.

I don’t want to end with a list of challenges. Though any exercise in problem-solving often focuses on - well - problems, the emulation research at NLS also conveyed to me the undeniable benefits and opportunities provided by emulation.

I got to experience the nostalgia of Windows 95 booting up (slowly) in all its pixelated glory, an experience I hadn’t had since I was a kid. It brought back a flood of memories of getting our first family desktop so my mom could write her PhD thesis (and so my brother and I could play games, obviously). I turned in homework assignments printed from floppy discs in the school library, where there were three whole computers. As a teenager, I stayed up until the wee hours making ‘mix tapes’ from Limewire.

I have a living memory of how computers and then the internet altered the patterns of modern life - how I did my homework, how I interacted with my friends, how I shopped for books and then clothes, how I got directions to new places, the list goes on. The shift in society attributed to the rise in digital technology was not just a wave of abstract, digital innovations, it was physical and human. Emulation revives those memories and relates those lived historical experiences in a way migration completely sanitizes.

In my opinion (as a child of the 90s), it would be a mistake to dismiss emulation as a digital preservation and access strategy due to the challenges it poses. Small but meaningful steps are the way forward: collaborative working groups across institutions, more experimentation for even small institutions, more coordinated monitoring and responding to calls for consultation on new legislation. 

Perhaps above all, there is a role for the DPC to amplify the excellent work being done in emulation across the Coalition and wider community, as well as to facilitate joint efforts to advocate for more conducive legislation.

A huge thanks to Graham Purnell, Lee Hibberd, and the NLS for letting me come tinker with your toys!  

Comments

Lee Hibberd
5 years ago
What a great read. Thanks for giving us all of your help Sara and you are welcome back any time. To the community: What can we do to remove the legal barriers around licensing? Lee