Heather Tompkins

Last updated on 4 November 2020

Heather Tompkins is a Project Officer at Library and Archives Canada.


When the call came out for blog posts for this year’s World Digital Preservation Day, we in the Digital Preservation and Migration Division of Library and Archives Canada (LAC) wondered what we could discuss and what might be of interest to the external community. Today, we are opting to blog about our Pre-Ingest workflow.

Our Pre-Ingest review is part of an essential workflow for preserving digital archives. We’re looking forward to sharing what we are doing and hearing from you about how your own work may be similar or different. So far, this work hasn’t been greatly impacted by COVID-19: we have continued to do Pre-Ingest while working from home, with only minor network speed issues. Our built infrastructure (specifically, a 20TB server to which we can connect via VPN) and our ability to message, share screens, and video chat have all been put to good use!

As is often the case with digital archival transfers, we don’t always have the opportunity to review the content or gather much information prior to transfer. As a result, what is transferred isn’t always what we intended to acquire or preserve. LAC’s Pre-Ingest workflow helps to address this challenge. Initiated in 2013 with only two staff members, this function has grown over the past seven years to include five Digital Archivists from the Digital Integration section who bring both archival experience and a digital preservation mindset to the work.

So what is Pre-Ingest?

Pre-Ingest entails the staging and review of transferred digital records in order to proactively identify potential digital archival and preservation issues. It is conducted by Digital Archivists and occurs prior to archival selection, arrangement and description, which are then performed by a separate team of archivists at LAC.

Pre-Ingest includes the following main tasks:

  • Copying transferred digital content to our infrastructure with specialized software (Pinpoint Labs SafeCopy)

    • This is to ensure no data loss or changes to metadata (e.g. time/date stamps)

  • Creating a Digital Object Inventory to document all incoming digital objects regardless of whether they are selected and preserved

  • Performing file format identification, analysis and triage

  • Identifying encrypted and password protected files

  • Weeding of non-archival digital objects such as temporary, application and system files

    • We don’t hit delete but instead segregate such content so that the responsible archivist will be left with a subset of potential archival records to process

  • Completing a Pre-Ingest report to document our analysis and next steps
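
To make the inventory and weeding tasks above more concrete, here is a minimal Python sketch. It is not LAC's tooling (the copying itself is done with SafeCopy, which this does not replicate), and the file-name patterns in `WEED_NAMES` and `WEED_SUFFIXES` are hypothetical examples rather than actual weeding rules. It records every file, including weeding candidates, and only flags files for segregation; nothing is deleted.

```python
import hashlib
from pathlib import Path

# Hypothetical examples of system/temporary file names; not LAC's actual weeding rules.
WEED_NAMES = {"thumbs.db", ".ds_store", "desktop.ini"}
WEED_SUFFIXES = {".tmp", ".lnk"}


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large objects are never loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(transfer_root):
    """Return one row per file under transfer_root, including weeding candidates."""
    root = Path(transfer_root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        name = path.name.lower()
        rows.append({
            "relative_path": path.relative_to(root).as_posix(),
            "size_bytes": path.stat().st_size,
            "sha256": sha256_of(path),
            # Flag only: the file stays in the inventory and on disk.
            "weed_candidate": name in WEED_NAMES or path.suffix.lower() in WEED_SUFFIXES,
        })
    return rows
```

Keeping weeding candidates in the same inventory, rather than dropping them, mirrors the point that all incoming digital objects are documented regardless of whether they are selected and preserved.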

What is the goal of Pre-Ingest?

The goals of Pre-Ingest are several:

  • To establish control over transferred data and ensure that the authenticity and integrity of the records are maintained

  • To better document the transferred digital objects - characterization allows us to gauge our confidence in our institutional ability to preserve and provide access to the content, while identifying any gaps

  • To provide the responsible archivist with strategic advice related to our findings, in order to aid their selection, arrangement, and description decisions

  • And ultimately, to ensure that a well understood SIP is handed over to our Digital Preservation section for long-term preservation (i.e. we know what we have and how to access it)

This work is necessary for LAC to fulfill its mandate, which includes preserving Canada’s documentary heritage for present and future generations as well as acting as the continuing memory of the Government of Canada and its institutions. Thus, by establishing control over transferred content, which includes inventorying and integrity checking (i.e. checksums), we can demonstrate that what LAC ultimately preserves in its digital archives are the records that donors and departments initially transferred to us.
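
As a rough illustration of how checksums support that demonstration, the sketch below compares two manifests, each a mapping from relative path to checksum: one recorded when the transfer arrived and one computed later. The function name and the manifest shape are assumptions made for this example, not a description of LAC's systems.

```python
def compare_manifests(at_transfer, later):
    """Compare two {relative_path: checksum} mappings and report discrepancies."""
    missing = sorted(set(at_transfer) - set(later))
    unexpected = sorted(set(later) - set(at_transfer))
    changed = sorted(
        path for path in set(at_transfer) & set(later)
        if at_transfer[path] != later[path]
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


# Illustrative values only; real checksums would be full digests such as SHA-256.
received = {"a.doc": "11aa", "b.jpg": "22bb", "c.pdf": "33cc"}
preserved = {"a.doc": "11aa", "b.jpg": "ffff", "d.txt": "44dd"}
result = compare_manifests(received, preserved)
# result["missing"] lists c.pdf, result["changed"] lists b.jpg,
# and result["unexpected"] lists d.txt.
```

Three empty lists mean that every object received is still present and unchanged, and that nothing has been added.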

File Format Analysis – DROID reports & knowing your data!

Since we normally only have a general sense of the anticipated formats in a transfer, one of the most important functions of our Pre-Ingest work is file format analysis.  So we’ll focus on this part of the workflow for the rest of our blog post.

At LAC, we employ several software tools to establish what file formats and content categories are present in a transfer. Initially, we use a software program called TreeSize Pro to provide us with a high-level view of the content categories, which can tell us how many images, office files, system/application files etc. may be present. It can also tell us the total volume and the number of folders and files, both per category and overall. However, TreeSize can only give us a rough idea since its analysis is based on a digital object’s extension, and we all know that extensions may be misleading or may not even be present! Extensions that are unknown to TreeSize are grouped into a rather unhelpful category called “Miscellaneous”. In the example transfer below, over 73% of the file formats could not be identified by TreeSize Pro alone.

[Image: TreeSize Pro results for the example transfer]
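
The limits of extension-based categorization described above can be shown with a short Python sketch. The extension table is hypothetical and does not reproduce TreeSize Pro's real mapping; the point is only that anything with a missing or unrecognized extension falls into "Miscellaneous".

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-category table, for illustration only.
CATEGORY_BY_EXTENSION = {
    ".jpg": "Images", ".png": "Images", ".tif": "Images",
    ".doc": "Office files", ".docx": "Office files", ".xls": "Office files",
    ".dll": "System/application files", ".exe": "System/application files",
}


def categorize_by_extension(file_names):
    """Count files per category using nothing but the file extension."""
    counts = Counter()
    for name in file_names:
        extension = PurePosixPath(name).suffix.lower()
        counts[CATEGORY_BY_EXTENSION.get(extension, "Miscellaneous")] += 1
    return counts


def miscellaneous_share(counts):
    """Fraction of files that the extension table could not place."""
    total = sum(counts.values())
    return counts["Miscellaneous"] / total if total else 0.0


counts = categorize_by_extension(["photo.jpg", "Memo.DOCX", "README", "scan.001"])
share = miscellaneous_share(counts)  # two of the four names are uncategorized
```

A file with no extension ("README") and one with an unfamiliar extension ("scan.001") both end up uncategorized, which is why signature-based identification is needed as a second step.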

So our next step is to run all transferred data through DROID as well, which is a much more robust tool for file format identification (thanks to the UK National Archives for continuing to develop and support this much needed and appreciated tool!). Our output is an MS Excel based DROID Report in which we triage the file formats identified or, in some cases, those which are unidentifiable. The aim of the DROID report is to further categorize the digital objects by file format and flag those that may pose preservation or access issues (e.g. formats we cannot identify or access, or formats for which we have no known migration pathway). We also use the DROID results to identify file formats that we would normally weed as non-archival material, such as temporary, system and application files: files that donors or departments likely did not intend to transfer as archival records. Below is an example of a triaged DROID report as well as a summary sheet that we provide for more complex transfers.

[Images: example triaged DROID report and summary sheet]
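
As a rough sketch of what a first triage pass over DROID results might look like in code, the Python below groups rows by PUID and labels each group. It assumes a CSV export with NAME, TYPE, PUID and FORMAT_NAME columns (check the column names against your own DROID export), and the `watch_list` of PUIDs is a hypothetical stand-in for an institution's own list of formats needing review. It is not the Excel-based report described above.

```python
import csv
import io
from collections import defaultdict


def triage_droid_rows(csv_text, watch_list=frozenset()):
    """Group file rows by PUID and label each group for review."""
    groups = defaultdict(lambda: {"format_name": "", "count": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("TYPE") == "Folder":
            continue  # folders are structure, not digital objects to triage
        puid = (row.get("PUID") or "").strip()
        group = groups[puid]
        group["format_name"] = (row.get("FORMAT_NAME") or "").strip()
        group["count"] += 1

    report = []
    for puid, group in sorted(groups.items()):
        if not puid:
            status = "unidentified"  # no signature match: needs manual research
        elif puid in watch_list:
            status = "needs review"  # on the (hypothetical) watch list
        else:
            status = "ok"
        report.append((puid, group["format_name"], group["count"], status))
    return report


# Made-up sample rows; the watch-list entry is arbitrary and only shows the labelling.
SAMPLE = """NAME,TYPE,PUID,FORMAT_NAME
letters,Folder,,
memo.pdf,File,fmt/18,Acrobat PDF 1.4 - Portable Document Format
notes.txt,File,x-fmt/111,Plain Text File
notes2.txt,File,x-fmt/111,Plain Text File
mystery.bin,File,,
"""
report = triage_droid_rows(SAMPLE, watch_list={"fmt/18"})
```

Each tuple in `report` holds the PUID, format name, file count and triage status, so well-understood formats can be passed over quickly and attention goes to the unidentified and flagged groups.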

All of our findings are bundled into a higher-level Pre-Ingest Report, which summarizes the content that was transferred (e.g. total number of digital objects and volume), what was ultimately weeded as system/application files etc., and those content categories that may be on hold and/or require further research or information. 

Below is an example of a Pre-Ingest Report. The bottom half of the report contains information on content weeded during Pre-Ingest, while the top half contains information on various content categories that may require special consideration with regard to preservation and access. This report is provided to the responsible archivist, who performs selection, arrangement and description. It’s not unusual for a Digital Archivist to explain this report in person (or via a video call) and show examples of the content, in order to discuss with the responsible archivist what might make sense for their next processing steps. We discuss which formats or digital objects may require further information from a donor/department (e.g., to determine how they could be accessed), and which may require consultation with a LAC business area (such as the Web Archiving and Social Media Program) to determine whether the formats provided have a migration pathway and/or high confidence for long-term preservation. This is a critical step in the processing workflow. It provides an opportunity to give the responsible archivist strategic digital advice that bears in mind digital curation principles, and it also aims to function as a knowledge transfer opportunity to help grow LAC’s digital capacity overall.

[Image: example Pre-Ingest Report]

Next steps for Pre-Ingest?

While the Pre-Ingest workflow has been refined over the past several years, there’s always room for improvement, and the next step would be to develop ways to better automate the work overall. Digital Archivists will always be needed to aid in troubleshooting unusual or unidentified formats, but the less time spent on identifying formats that pose no issues, the better.

Additionally, the more information LAC can obtain prior to transfers, the better positioned we will be to handle and efficiently process content once it arrives. Not surprisingly, we’ve found that upstream conversations with donors and departments are worthwhile and often yield information about how the content was created and used, which can aid us in preserving and providing access to the content. In one such case, a government institution and LAC staff, including a Digital Archivist, engaged in pre-transfer conversations over a period of two months. During this time, the Digital Archivist guided the institution in learning how to identify the file formats to be transferred and how to package the content for transfer. Because of this upstream work, Pre-Ingest only took 30 minutes and the responsible archivist was able to complete their selection and description work in one day. Equally significant, this department has gained subject matter expertise in digital archives and in relevant LAC policies governing transfer, including preferred/acceptable file formats. Furthermore, this type of pre-transfer work can help the government institutions and private donors that LAC works with on an ongoing basis to grow their own digital skills in information management and personal record keeping, with the goal of facilitating a more sustainable future for digital archives.

Follow up with us!

Does your institution do some sort of Pre-Ingest review of content? If so, does it cover some of the same functions we describe? We’d welcome learning more about your work to prepare content for preservation. Feel free to contact us at bac.capacitenumerique-digitalcapacity.lac@canada.ca or me directly at heather.tompkins@canada.ca.

Also, many thanks to my colleagues Angela Beking, Tom Smyth and Kevin Palendat for their feedback and contributions to this post.

 

