The Email Explosion: safeguarding the literary correspondence of the twenty-first century
The University of Manchester Library holds outstanding eighteenth- and nineteenth-century literary correspondence collections, relating to Samuel Johnson, Elizabeth Gaskell and others. These are a testament to the golden age of letter-writing. The Carcanet Press Email Preservation Project has ensured that some of the fruits of email’s golden age are similarly safeguarded for future generations.
The archive of premier poetry publishers Carcanet Press is one of the largest and most significant archives held by the Library. The Carcanet list includes poets, translators, editors and artists from across the world including national and Nobel Laureates and even one former Archbishop of Canterbury. The archive fills 1,200 boxes and has formed the basis of numerous research projects, seminars, talks and exhibitions.
Since the late 1990s the quantity of hard copy correspondence acquired in annual accruals to the archive has steadily diminished. Most correspondence is now conducted by email, often including manuscripts and proofs as attachments. As Michael Schmidt (Carcanet’s Editorial and Managing Director) points out, ‘the publishing back room, from commissioning through editing and production, now lives entirely in the digital sphere’. This digital archive was residing on hard drives at the Carcanet office, increasingly at risk of loss or obsolescence.
The Library recognised the importance of rescuing this invaluable primary research material, and JISC funding enabled us to kick-start this work, tackling one of the most complex record types to preserve. We built on the achievements of our initial seven-week project with a further programme of internally funded work.
Basing our workflows on traditional archival practice and digital preservation standards, our key achievements were:
- the preservation of 215,000 emails and 65,500 attachments, covering a twelve-year period, with full metadata and in conformance with digital preservation standards. We believe that we are the only UK institution to have preserved email at such a granular level;
- the ingest of 282,375 digital objects into our institutional repository (five types of object, from collection level down to the individual message), each accompanied by full technical, preservation, descriptive and structural metadata. The objects are comprehensively indexed, providing the foundation for different ways of exploiting the material.
Key work packages contributing to this success were:
1. Consulting our audience
Our focus was on ingest and preservation, and the archive is currently closed for data protection and copyright reasons. However, we realised that decisions taken at an early stage can influence how researchers will access and use the material in future. Nine researchers, from a variety of disciplines, were interviewed about how they might envisage using email archives. Topics discussed included:
- Cataloguing and descriptive metadata
- Handling hybrid collections (digital and paper)
- Emulating original email environments
- Graphical representations of email archives
The results of this exercise fed into our later work.
2. Defining ‘significant properties’
We identified the salient characteristics, or ‘significant properties’, of our email archive in order to:
- ensure that the emails remain accessible and meaningful over time and through any format migrations;
- assure future users of the archive’s authenticity.
Some properties are generic, but some are quite specific to this archive generated by a poetry publisher, e.g.:
- line breaks and indentation, where the text of a poem is included in the message body;
- use of font colour in extracts from proofs which are pasted into the message body – red text indicating printer’s errors and blue indicating authorial emendations.
A representative test set of messages was compiled, against which format migrations could be verified.
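As an illustration of how one such property might be checked against a test set, the following minimal Python sketch compares line breaks and indentation between an original message body and a migrated copy. It is not the project's own code: the function names and the sample text are hypothetical, and it assumes that both bodies are already available as plain text.

```python
def layout_signature(body: str) -> list[tuple[int, str]]:
    """Return (indent width, text) for every line of a message body."""
    signature = []
    for line in body.splitlines():          # handles \n, \r\n and \r alike
        expanded = line.expandtabs(4)       # assumption: a tab counts as 4 columns
        stripped = expanded.lstrip(" ")
        indent = len(expanded) - len(stripped)
        signature.append((indent, stripped.rstrip()))
    return signature


def layout_preserved(original: str, migrated: str) -> bool:
    """True if line breaks and indentation match in both renderings."""
    return layout_signature(original) == layout_signature(migrated)


# Hypothetical test message: a short poem with an indented line and a stanza break.
poem = "Line one\n    indented line\n\nnew stanza"

# A change of line-ending convention alone does not alter the layout.
same = layout_preserved(poem, poem.replace("\n", "\r\n"))

# A migration that collapses the text onto one line does.
collapsed = layout_preserved(poem, "Line one indented line new stanza")
```

A check of this kind only covers layout; properties such as font colour would need a comparison of the formatted (HTML or rich text) rendering instead.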
3. Migrating our emails to preservation formats
The Carcanet email was acquired as several Microsoft PST (Personal Storage Table) files, each containing thousands of messages and attachments. We preserved these files and will track technological developments, taking action when they are at risk of obsolescence.
We also ‘broke down’ these huge files to individual message level, and preserved each message in several formats:
- MSG: the native Microsoft format for single messages.
- EML: a more neutral format which can be rendered using several email clients.
- XML: retains all formatting information, and provides the basis for text-mining and visualisation experiments.
- MHT: web-friendly format with potential for future access purposes.
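To show why a neutral format such as EML is convenient for later processing, here is a minimal sketch that reads a single EML message using only the Python standard library. It is an illustration rather than part of the project's workflow: the sample message, the addresses and the `summarise` helper are all invented for the example.

```python
from email import message_from_string, policy

# Hypothetical message, invented for this example.
SAMPLE_EML = (
    "From: editor@example.org\n"
    "To: poet@example.org\n"
    "Subject: Proofs\n"
    "Date: Mon, 05 Mar 2007 10:15:00 +0000\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "First line\n"
    "    an indented second line\n"
)


def summarise(eml_text: str) -> dict:
    """Extract basic header fields, body text and attachment names."""
    msg = message_from_string(eml_text, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": msg["Date"].datetime.isoformat(),
        "body": body_part.get_content() if body_part else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


info = summarise(SAMPLE_EML)
```

For files on disk, `email.message_from_binary_file` with the same policy would be the natural equivalent.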
4. Developing workflows and code
We developed workflows covering acquisition, processing, creation/extraction of metadata, and ingest into our institutional repository. This involved using off-the-shelf tools, some designed for digital preservation and some more general, and writing our own code to string these together and facilitate ingest of the archive. Tools used included:
- File Information Tool Set (FITS): for validating formats and creating technical metadata.
- Paraben’s Email Examiner: for viewing and appraising the archive.
- Aid4Mail: for migration and metadata extraction.
- PST Reporter: for metadata extraction.
We also used several schemas for recording different types of metadata, including PREMIS for preservation metadata and EAD for descriptive metadata.
5. Visualisation experiments
Several researchers expressed an interest in graphical representations, so we undertook some visualisation experiments. Visualisations can offer meaningful access to the archive without releasing full message content.
These simple bar charts provide a quantitative summary of Michael Schmidt’s correspondence with two different individuals, with his outgoing messages shown above the line and incoming messages below. They reveal peaks and troughs which may be immediately meaningful to a researcher working on a specific writer or publication. They also reveal degrees of mutuality in correspondence which, as illustrated below, can sometimes be lacking.
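The figures behind a chart of this kind can be derived from message metadata alone, without exposing message content. The sketch below is a hypothetical illustration, not the code used in the project: it assumes each message has been reduced to a (date, sender, recipient) tuple, and all addresses and dates are invented. Outgoing counts are returned as positive numbers and incoming counts as negative numbers, so that they can be plotted above and below a common baseline.

```python
from collections import Counter
from datetime import date


def monthly_counts(messages, owner, correspondent):
    """Return (month, outgoing, -incoming) rows for one pair of correspondents."""
    outgoing, incoming = Counter(), Counter()
    for sent_on, sender, recipient in messages:
        month = sent_on.strftime("%Y-%m")
        if sender == owner and recipient == correspondent:
            outgoing[month] += 1
        elif sender == correspondent and recipient == owner:
            incoming[month] += 1
    months = sorted(set(outgoing) | set(incoming))
    # Incoming totals are negated so they plot below the line.
    return [(m, outgoing[m], -incoming[m]) for m in months]


# Invented sample data: three outgoing messages and one reply.
messages = [
    (date(2005, 1, 3), "owner@example.org", "poet@example.org"),
    (date(2005, 1, 9), "owner@example.org", "poet@example.org"),
    (date(2005, 1, 20), "poet@example.org", "owner@example.org"),
    (date(2005, 2, 2), "owner@example.org", "poet@example.org"),
    (date(2005, 2, 5), "owner@example.org", "other@example.org"),
]

rows = monthly_counts(messages, "owner@example.org", "poet@example.org")
```

A month with many outgoing messages and few or no replies shows up as a tall bar above the line with little beneath it, which is the lack of mutuality described above.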
In the network graphs, the nodes are individual correspondents, with the lines representing both direct and indirect relationships between them.
The second example aptly illustrates the email ‘explosion’ of recent years.
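The data underlying a network graph of this kind can likewise be built from sender and recipient fields. The following sketch is an assumption-laden illustration rather than the project's method: it covers only direct relationships (one person writing to another), treats them as undirected, and uses invented names. Each edge is weighted by the number of messages exchanged.

```python
from collections import Counter


def correspondence_edges(messages):
    """Count messages exchanged between each pair of correspondents.

    `messages` is an iterable of (sender, [recipients]) pairs; each edge is
    an unordered pair, so A->B and B->A contribute to the same edge.
    """
    edges = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient != sender:          # ignore self-addressed copies
                edges[frozenset((sender, recipient))] += 1
    return edges


# Invented sample data.
sample = [
    ("editor", ["poet"]),
    ("poet", ["editor"]),
    ("editor", ["poet", "translator"]),
    ("translator", ["translator"]),
]

edges = correspondence_edges(sample)
nodes = sorted({name for pair in edges for name in pair})
```

The resulting nodes and weighted edges could then be passed to any graph-drawing tool; indirect relationships, such as two people who are both copied into the same message, would need an additional rule.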
6. Disseminating our work
We produced a detailed project report and a 117-page manual, both of which are available online. Our software code will be made available on GitHub. We intend to augment these workflows in future as we tackle email archives in alternative formats.
We have disseminated our work to various audiences including:
- Carcanet’s literary journal, PN Review
- The John Rylands Library Special Collections Blog
- The Archives and Records Association
We are building on our achievements to date, and are currently developing a user-friendly curatorial tool for managing the ingested objects.