Keith Pendergrass

Last updated on 10 November 2021

Keith Pendergrass is Digital Archivist at Baker Library Special Collections, Harvard Business School.


This is a companion post to the Environmentally sustainable digital preservation - moving from theory to practice webinar.

Since the 2019 article and workshop protocol on environmentally sustainable digital preservation that I wrote with Walker Sampson, Tessa Walsh, and Laura Alagna, I have been using our framework to improve the sustainability of Baker Library’s digital archives program. I have written previously about our efforts to integrate sustainability into policies and workflows. For this post, I am going to look at a recent software development project as an example of how we can embed sustainability into our design and use of digital preservation systems and tools.

First, some background. We have been collecting born-digital materials at Baker Library Special Collections since the mid-2000s and began formalizing our approach around 2017. At that point, we acknowledged that the mostly manual digital object management processes we used for our appraisal and archival processing workflows could not handle the scale of our born-digital acquisitions. After several years of advocacy, we received support to build a workflow implementation and digital object management tool with our IT department (HBS IT). The result is ADAPT: The Adaptive Digital Appraisal and Processing Tool, which we deployed in January 2021. It consists of a web service and web application that integrate with pre-existing HBS storage and database infrastructure. For workflow implementation, users create collections, add deposits, and design custom (or select pre-set) workflows made up of repeatable tasks. Some of those tasks (such as file packaging, copying, and validation; file fixity tracking; and virus scanning) are automated. For digital object management, ADAPT creates a storage volume per collection with a unique directory per deposit. Users then add files to the deposit directory and execute the workflow, with the automated tasks taking care of inventory control, fixity tracking, and file packaging. ADAPT is specifically designed for appraisal and archival processing workflows; it is not intended as a preservation repository and is complementary to Harvard’s existing repository, the Digital Repository Service (DRS).
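To make the automated inventory and fixity tracking more concrete, here is a minimal sketch of what such a task might look like. It is an illustration only: the function names, the choice of SHA-256, and the dictionary-based inventory are my assumptions for this post, not a description of ADAPT's actual code.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(deposit_dir):
    """Map each file's path (relative to the deposit directory) to its checksum."""
    root = Path(deposit_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(deposit_dir, inventory):
    """List files that were added, removed, or altered since the inventory was built."""
    current = build_inventory(deposit_dir)
    names = inventory.keys() | current.keys()
    return sorted(n for n in names if inventory.get(n) != current.get(n))
```

Building the inventory once when files are added, and comparing against it later, covers both inventory control (missing or unexpected files) and fixity (altered content) in a single pass.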

Now, on to sustainability. With the privilege of being able to design a system from the ground up, we wanted to ensure that it helped us meet our sustainability goals, which include reducing greenhouse gas (GHG) emissions and raw material use, improving employee well-being, and using our financial resources most efficiently and effectively. We met these goals by enabling selective appraisal practices, reducing the number of copies and fixity check frequency, and making ongoing maintenance easier for our HBS IT colleagues.

Appraisal

In the DPC webinar Enacting Environmentally Sustainable Preservation, I mentioned our efforts to dedicate more staff time to appraisal at multiple points in our workflows. This is ADAPT’s most significant sustainability benefit: By automating digital object management, ADAPT frees up staff time so that we can focus on those areas of the workflow that can most benefit from our expertise, such as appraisal and archival processing. This makes it more feasible to do the in-depth work required for selective appraisal while maintaining our current staffing levels and workloads. By being more selective in our appraisal, we focus our resources on those materials of most value for our community and reduce the extent of materials that we steward long term, with a correlated reduction in environmental impacts and financial costs. Another benefit is that by eliminating our appraisal and processing backlogs (and addressing new acquisitions as they come in so that we do not create new backlogs), we relieve future archivists of these burdens.

Number of copies

To meet our business requirements of disaster recovery and data restoration after loss or alteration, our initial storage design called for three full-bit copies. However, after discussions with our HBS IT colleagues in which we learned the capabilities of existing storage infrastructure, we settled on two full-bit copies plus file system backups. ADAPT’s primary storage is an on-premises NetApp storage cluster with nodes in high-availability pairs and node storage consisting of RAIDs. It also has NetApp’s Snapshot functionality, which creates point-in-time file system backups. These snapshots act as restore points while using only a fraction of the storage that would be required for a full-bit backup copy. As long as the NetApp storage cluster remains online, we can use the file system backups to restore files, fulfilling our need for data restoration from loss or alteration. For disaster recovery, we synchronously mirror ADAPT storage volumes to AWS S3.

We would have liked to select another AWS availability zone or another cloud service provider to increase the percentage of clean energy used for our second copy. However, HBS IT had already vetted and approved AWS for storage of high-risk confidential information, which was a requirement for ADAPT, and existing Harvard and HBS IT processes constrained us to a particular availability zone. We look forward to benefitting from AWS’s continued sustainability improvements and, as Matthew Addis notes in his 2020 DPC blog post, the greater efficiencies and hardware utilization percentages that cloud service providers achieve go a long way toward reducing the impact of our work. With an on-premises copy on nodes in high-availability pairs using RAIDs, a mirrored off-premises copy with AWS’s intra-zone replication, and the file system backups, we were confident that our “two plus” replication approach could meet our business requirements while improving ADAPT’s financial and environmental sustainability.

Fixity check frequency

For file fixity checking, we implemented a configurable fixity check frequency policy that controls the maximum time between fixity checks for each deposit in ADAPT. Once a deposit exceeds the policy period, ADAPT displays a notification for users to edit the workflow and run another check. We designed this as a notification instead of an automatic action so that users have the flexibility to run the fixity check during off-peak times on the electricity grid when the emissions intensity is lower due to higher percentages of non-emitting generators. Our initial policy was set to 90 days, but we are now extending it to 120 days based on system performance and fixity check results thus far, and we may increase it to 180 days in the future. While our fixity check frequency policy sets the maximum time between checks, most deposits will receive checks more frequently because we run fixity checks after tasks that could alter the files.
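The policy logic described above can be sketched in a few lines. This is a hedged illustration rather than ADAPT's implementation: the `Deposit` record, the `deposits_needing_check` function, and the sample dates are all hypothetical, and the only assumption carried over from the text is that a deposit is flagged once the time since its last check exceeds the configured period.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Deposit:
    # Hypothetical record; ADAPT's real data model is not described in this post.
    name: str
    last_fixity_check: date


def deposits_needing_check(deposits, policy_days, today):
    """Return deposits whose last fixity check is older than the policy period."""
    limit = timedelta(days=policy_days)
    return [d for d in deposits if today - d.last_fixity_check > limit]


deposits = [
    Deposit("deposit-a", date(2021, 6, 1)),   # last checked 162 days before `today`
    Deposit("deposit-b", date(2021, 9, 15)),  # last checked 56 days before `today`
]
today = date(2021, 11, 10)

# The same deposits are evaluated under a 90-day and a 180-day policy.
overdue_90 = [d.name for d in deposits_needing_check(deposits, 90, today)]
overdue_180 = [d.name for d in deposits_needing_check(deposits, 180, today)]
print(overdue_90)   # ['deposit-a']
print(overdue_180)  # []
```

Because the function only reports which deposits are due, the decision about when to run the check stays with the user, which matches the notification-based design.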

To reduce the impact of fixity checks and other automated tasks in ADAPT, we explored adding the ability to schedule computationally intensive processes, but decided that the benefits did not merit the required development time. We may revisit our decision once offshore wind generation increases on the New England electricity grid, making daily and seasonal low-emissions periods more regular and predictable.

Maintenance

To improve the ease of maintaining ADAPT, we used a micro-services model to implement automated tasks. We developed discrete Linux or Python scripts that then call open-source tools stored on the web server. This approach allows our HBS IT colleagues to quickly identify and focus on a discrete amount of code when troubleshooting or enhancing functionality. It also reduces our QA burden as we need to conduct in-depth testing only of a particular script or tool after changes. Additionally, it allows us to update or swap out tools over time with limited impact on the rest of ADAPT. Using standard HBS storage and database infrastructure was another way to reduce the maintenance burden, allowing us to rely on existing expertise and processes to support significant portions of ADAPT.
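As a rough sketch of this pattern, a discrete task script might do little more than call one external tool and report the outcome. The `run_task` function and its result format are hypothetical, and the command below uses the Python interpreter as a stand-in for an open-source tool, since ADAPT's actual scripts and tools are not shown here.

```python
import subprocess
import sys


def run_task(name, command):
    """Run one external tool as a discrete task and report the outcome."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "task": name,
        "ok": result.returncode == 0,      # a zero exit code is treated as success
        "output": result.stdout.strip(),
        "error": result.stderr.strip(),
    }


# Stand-in command: the Python interpreter plays the role of an external tool.
demo = run_task("demo", [sys.executable, "-c", "print('scan complete')"])
print(demo["ok"], demo["output"])
```

Keeping each wrapper this small means that swapping a tool changes only the command passed to one task, and testing after a change can focus on that task alone.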

These cases, while specific to ADAPT, provide examples of how we can embed sustainability into our tools and systems. We should consider not only how our design and implementation decisions affect our electricity and resource use, but also how they can streamline workflows, processes, and maintenance, leading to improved outcomes for employee well-being and the environment while enhancing our ability to achieve our organizational missions.

Comments   

#1 Gabby Samuel 2022-03-15 06:52
This work is really interesting. I'm exploring the environmental impacts associated with biobanking and health data repositories in the UK. Similar to what you wrote in your article, sustainability in biobanking is ordinarily conceptualised in terms of financial/economic sustainability. We have recently argued that this needs to be expanded to environmental sustainability (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8881066/). I'm now interviewing health researchers using data in the UK and am finding synergies with some of what you have written about. The issue of Jevons' paradox is something a group of us have been thinking about a lot (we're holding a workshop next month as part of a UK digital sustainability grant). Please drop me an email if you would like to chat about overlaps in interests. I'm interested in any frameworks that might be able to be modified for other fields (such as biobanking/health data research that wants to just collect more and more data).