Matthew Addis

Last updated on 31 August 2022

Matthew Addis is Chief Technology Officer at Arkivum.


Introduction

Have you ever wondered what the actual carbon footprint of doing digital preservation is?  For example, what are the CO2-equivalent emissions associated with archiving and preserving a TB of data for 10 years?

There’s been some brilliant discussion and ideas in the digital preservation community around the environmental sustainability of long-term digital preservation (LTDP).  The DPC have a page that rounds up many of the resources, reports and blog posts that are available, including my previous blog post on the topic.  But there is a real paucity of quantitative data to support this discussion and debate.

Arkivum has measured the actual carbon footprint of doing real digital preservation in the cloud.  I don’t think anyone has put numbers out there for the kgCO2eq emissions per TB when ingesting and processing different types of data in a real-world digital preservation system (please tell me if I’m wrong!).  This blog post is our attempt to change that.

Much of the work was done as part of the ARCHIVER project, and kudos goes to the ARCHIVER project team for ensuring environmental sustainability was high on the project agenda and for supporting work in this area.

More details of our approach, including the methodology we used, and some of its limitations, are available in a report that we’ve released here: https://doi.org/10.6084/m9.figshare.20653101

Scope

The full carbon footprint of LTDP needs to consider all emission scopes, including in upstream and downstream supply chains.   The concepts of direct and indirect emissions and whether they are upstream or downstream are covered by the Greenhouse Gas Protocol, which includes a very helpful diagram of how all these things fit together!

The work we have done on carbon footprint is limited to the Scope 3 gross emissions from the energy used to run a digital preservation system in the cloud using Google Cloud Platform (GCP).  Google is notable for being carbon net zero for energy emissions when clients use their GCP services (compute, storage etc.) and they have further commitments to use 100% carbon-free energy by 2030.  Google also publish which of their datacentres use a cleaner energy mix than others so an informed choice can be made on which Google region(s) to use.  This is based on the local energy mix, which varies day to day, and whether Google themselves have their own local renewable energy sources or invest in local renewable energy production and storage.

For a more complete estimate of the lifetime carbon footprint of LTDP, we would need to go beyond the gross emissions from energy use in GCP.  For example, in Arkivum’s case, because we do LTDP in the cloud and we use GCP facilities, we would also need to consider a proportion of the carbon footprint of datacentre construction, the footprint of manufacturing the ICT equipment in GCP and so on.  This is known as the ‘embodied footprint’.

Arkivum does attempt to minimise our use of ICT within cloud providers such as Google, which reduces both energy consumption and the embodied footprint.  For example, we spin up and consume resources in the cloud, mostly ‘spot instances’, and do so only when needed to execute digital preservation processes; no idle servers are left running when there’s nothing to be done.  More on that can be found in our iPRES paper on scalable and sustainable preservation in the cloud.

There are the direct emissions that come from Arkivum’s own staff and offices to consider too, for example company facilities (Scope 1 direct emissions), and the energy used by these facilities (Scope 2 indirect emissions).   We do have measures in place to keep this low, such as using shared office spaces with good public transport links as well as allowing working from home. 

There are downstream emissions that should also be considered in LTDP.  If data is retrieved from our service, then it will be moved over networks, downloaded to devices, and then used in different ways.  For example, when content is downloaded and viewed on a mobile device by an end-user, that will generate emissions (Scope 3 indirect emissions).

And there are emissions associated with all the steps that precede the use of our solution, including prior activities undertaken by our clients such as the acquisition, selection and appraisal of digital content and then upload of this content to our service. 

The carbon footprint of LTDP in its totality is really, really hard to measure!  Therefore, the scope of our analysis is limited to what we could measure: the Scope 3 gross emissions resulting from Google’s use of energy when our system is run in GCP.  It doesn’t cover everything, but it’s a start! 

Show me some numbers!

The numbers below are a small subset of results from processing and storing several PBs of real-world data over a 6-month period using Arkivum’s software system deployed on Google’s Cloud Platform (GCP).   Thankfully, the corresponding net emissions were zero because Google is already carbon neutral.

First, the gross emissions resulting from energy use for LTDP of Astronomy research datasets in GCP as part of the ARCHIVER project:

                                   GCP Frankfurt (measured)   GCP Finland (estimated)
1 PB data stored for 1 year        7800 kgCO2eq               3500 kgCO2eq
1 PB ingest of large image files   1600 kgCO2eq               730 kgCO2eq

The dataset used in the scenario above consisted of Astronomy images supplied by one of the ARCHIVER end-users: PIC.  The data was provided in the form of image and metadata files inside of bagit bags that in turn were within large tar files. 

When ingested into the Arkivum solution, the ingest process included: extracting the image and metadata files from the tar containers; validating the bagit bags; generating further checksums for each file, including SHA256 and SHA512; performing file format identification; processing and indexing the metadata; replicating the image and metadata files to create two copies; checking that all of the files had been stored correctly using fixity checks; and finally, recording all steps in an audit trail and system database.   As we all know, LTDP is a lot more than just storing data and this is just a basic set of actions!
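To give a flavour of two of the steps listed above (checksum generation and fixity checking), here is a minimal Python sketch.  It is not Arkivum’s implementation: the function names `compute_checksums` and `fixity_ok` are hypothetical, and it assumes files on a local filesystem rather than cloud object storage.  It only uses Python’s standard `hashlib` module.

```python
import hashlib


def compute_checksums(path, algorithms=("sha256", "sha512"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for one file.

    The file is read once, in chunks, and every hasher is fed the same
    chunk, so large files never need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def fixity_ok(path, expected):
    """Return True if the file's current digests match the recorded ones.

    `expected` is a dict such as the one returned by compute_checksums
    at ingest time (an assumption of this sketch, not a documented format).
    """
    return compute_checksums(path, tuple(expected)) == expected
```

Reading each file once and updating both hashers per chunk is a deliberate choice here: for PB-scale ingests, avoiding a second pass over the data saves I/O, and therefore energy.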

The carbon footprint was measured when the system was running in GCP Frankfurt, which was the GCP region used for the ARCHIVER testing.  As the table also shows, the footprint can also be estimated for other GCP regions, such as GCP Finland.  We do this by looking at the relative grid carbon intensity of different GCP datacentre locations.   A comparison of GCP Frankfurt with GCP Finland immediately shows how gross carbon footprint depends significantly on exactly where LTDP takes place and the local energy mix.   We are doing LTDP in the cloud, but as the saying goes, ‘the cloud is just someone else’s computers’, and these computers do physically live somewhere and that does matter.
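The scaling described above can be sketched in a few lines of Python.  This assumes a simple linear relationship between gross emissions and grid carbon intensity, which is a simplification; the function name `estimate_for_region` is hypothetical, and the full methodology is in the report.  Rather than inventing intensity values, the example uses the Finland-to-Frankfurt ratio implied by the storage row of the table above and applies it to the ingest row.

```python
def estimate_for_region(measured_kg, measured_intensity, target_intensity):
    """Scale measured gross emissions to another region.

    Assumes emissions are proportional to grid carbon intensity and that
    energy use is the same in both regions (assumptions of this sketch).
    The two intensities only need to be in the same units.
    """
    if measured_intensity <= 0:
        raise ValueError("measured_intensity must be positive")
    return measured_kg * target_intensity / measured_intensity


# Relative intensity implied by the storage row: 3500 / 7800 (Finland vs Frankfurt).
implied_ratio = 3500 / 7800

# Apply the same ratio to the measured Frankfurt ingest figure (1600 kgCO2eq).
# Frankfurt is given a relative intensity of 1.0, Finland the implied ratio.
ingest_finland_kg = estimate_for_region(1600, 1.0, implied_ratio)
print(round(ingest_finland_kg))  # prints 718
```

The result of about 718 kgCO2eq is close to the 730 kgCO2eq shown in the table, which suggests the two rows were scaled by roughly the same factor.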

Notice that the carbon footprint for 1 year of storage (GCP standard buckets) is higher than the footprint from the initial ingest of the data.  In this case, the data volume is large (1 PB).  The storage used was low-latency, high-speed cloud object storage that allowed immediate retrieval of data at all times.  The carbon footprint of deep-archive storage would likely be lower, but we didn’t get the chance to measure whether this is the case.  The ingest was not computationally intensive: relatively little processing is required for each file (other than a few steps such as generating some checksums).

The balance of the carbon-cost between ingest and storage can be very different for other types of data and other types of preservation!

Below are numbers for ingesting and processing 1 million Office files into our system in GCP. 

                                   GCP Frankfurt (measured)   GCP Finland (estimated)
1M office files stored for 1 year  5.5 kgCO2eq                2.2 kgCO2eq
Ingest of 1M office files          140 kgCO2eq                63 kgCO2eq

The files were a mix of file types and included MS Word and WordPerfect documents, PowerPoint presentations, scanned images, and other file formats commonly found in corporate environments or special collections and archives.  The total data volume was only 700GB.  The average file size is small at 700kB, which is typical for office files, but there were a lot of them!

The processing applied included a full digital preservation workflow that applied file format normalisation to create additional versions of the files for both preservation and access.  This increases file counts, workload and storage volumes.  We used Archivematica for normalisation with up to 40 instances running in parallel inside Kubernetes pods on GCP.  These instances are transient and auto-scaled, i.e. they are created on-demand.  None existed before the ingest was started and none were left running after they finished processing the files.  We also extended Archivematica to do extra file format conversions, for example to convert Office file formats into PDF/A.  Archivematica runs several extra services such as characterisation and validation that aren’t run in the simpler workflow used for the Astronomy data use case (which did not use Archivematica). 

All this extra processing amounts to a bigger carbon footprint when ingesting the data compared to storing the data.  This is in stark contrast to the Astronomy data use case which is the other way around.  

This just goes to show that carbon footprint depends on what sort of data you want to preserve, what preservation actions are taken, and what tools and systems are used to do it. 
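One way to make the contrast between the two use cases concrete is to normalise the measured GCP Frankfurt figures per TB.  The sketch below does that in Python.  It makes two assumptions that are not stated in the post: that 1 PB is 1000 TB, and that the 700GB quoted for the office files is the right volume to divide by (normalisation creates extra copies, so the stored volume may be larger).  The `Scenario` class and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    volume_tb: float            # data volume in TB (assumed, see lead-in)
    ingest_kg: float            # gross kgCO2eq for the ingest
    storage_kg_per_year: float  # gross kgCO2eq for one year of storage

    def ingest_kg_per_tb(self):
        return self.ingest_kg / self.volume_tb

    def storage_kg_per_tb_year(self):
        return self.storage_kg_per_year / self.volume_tb

    def storage_years_equal_to_ingest(self):
        """Years of storage that emit as much as the one-off ingest."""
        return self.ingest_kg / self.storage_kg_per_year


# Measured GCP Frankfurt figures from the two tables in this post.
astronomy = Scenario("Astronomy images", 1000.0, 1600.0, 7800.0)
office = Scenario("Office files", 0.7, 140.0, 5.5)

for s in (astronomy, office):
    print(
        f"{s.name}: ingest {s.ingest_kg_per_tb():.1f} kg/TB, "
        f"storage {s.storage_kg_per_tb_year():.1f} kg/TB/year, "
        f"ingest equals {s.storage_years_equal_to_ingest():.1f} years of storage"
    )
```

Under these assumptions the storage figures per TB per year come out similar for both datasets (a little under 8 kgCO2eq), while the ingest figures per TB differ by roughly two orders of magnitude, which is the point being made above about processing-heavy workflows.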

Putting LTDP emissions into context

It’s worth putting these numbers in the context of something more tangible such as driving a car or flying on a plane, which seem to be the usual comparisons that people make.

The average car on UK roads emits 14.3 kg of CO2 per gallon of fuel consumed, which at a fuel efficiency of 55 mpg means that a car, on average, emits 280g CO2 per mile driven.  

Ingesting and storing 1M office documents for 1 year has a gross emission equivalent to driving a car for just over 500 miles.   After the initial ingest, assuming files are not accessed very often or reprocessed, then ongoing storage carbon cost is equivalent to driving another 20 miles per year.   In the grand scheme of things that doesn’t seem too bad.   Especially considering that Google already offset their gross emissions, so the net carbon footprint is already zero.  Happy days! 

PB-scale preservation is a different matter – the emissions in this case, even when minimal processing is done, are more comparable to flying a jumbo jet across the Atlantic and back several times.

Gross emissions are measured in tonnes not kgs and the choice of Google region has a significant impact.  Even though the net emissions from energy consumption are zero, there are still significant amounts of physical ICT kit involved to process and store the data.  That has an additional footprint that comes from its manufacture, transport and eventual disposal.
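For anyone wanting to redo the car comparison with their own numbers, the arithmetic is sketched below in Python.  It takes the 280 g CO2 per mile used in this post as given (dividing the quoted 14.3 kg per gallon by 55 mpg alone would give a slightly lower value, about 260 g per mile).  The function name `car_miles` is hypothetical.

```python
KG_CO2_PER_MILE = 0.28  # this post's figure of 280 g CO2 per mile


def car_miles(kg_co2eq, kg_per_mile=KG_CO2_PER_MILE):
    """Convert gross emissions in kgCO2eq to equivalent miles driven."""
    return kg_co2eq / kg_per_mile


# 1M office files, GCP Frankfurt (measured): ingest plus first year of storage.
first_year_miles = car_miles(140 + 5.5)

# Each further year of storage only.
each_year_miles = car_miles(5.5)

print(round(first_year_miles), round(each_year_miles))  # prints 520 20
```

The same function can be applied to the PB-scale figures, where the results run into tens of thousands of miles and the comparison with flying becomes the more natural one.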

Summary

This blog post has provided some examples of quantified gross carbon emissions resulting from the energy use of our LTDP system when deployed and run in GCP. 

I want to stress again that the net emissions are zero thanks to Google’s use of mostly renewable energy and offsetting what’s left.  As I’ve discussed before, cloud providers such as Google are often greener and more transparent than perhaps some people give them credit for. 

But pick a different preservation system, a different cloud provider, a different type of data, a different preservation process, a different deployment location and the numbers could vary substantially! 

And that’s the point.

Carbon emissions from LTDP vary hugely depending on what you are trying to do and how and where it gets done – which is why it’s so important to be able to measure, quantify and compare actual emissions.

By sharing our approach and by providing some of the numbers and metrics that we’ve generated, I’m hoping that this post will advance, even if just a little, some of the really important work that is going on in the community around environmental sustainability and impact of LTDP.

For more information, please see the report here: https://doi.org/10.6084/m9.figshare.20653101