Jamie Shiers is the Data Preservation in High Energy Physics Project Leader at CERN
From the early days of planning for the Large Hadron Collider (LHC) it was known that it would generate an unprecedented amount of data. As we come to the end of the second multi-year run (Run 2) of the LHC, the CERN data archive has broken through the 300PB barrier. When the LHC restarts for Run 3 in 2021, it and its planned upgrades, such as the High-Luminosity LHC, will continue to take data for between one and two more decades – around three decades of operation from start to finish.
All of this data – past, present and future – will need to be preserved for at least the data-taking period of the LHC, if not for an extended period thereafter.
For comparison, the data from the former Large Electron-Positron (LEP) collider, which took data from 1989 to 2000, is still preserved and re-used close to two decades after the end of data taking and three decades after LEP started up. (Indeed, publications are still being made, and there are strong scientific arguments for continuing to be able to compare results from the four different experiments.)
At the beginning of LEP – housed in the same tunnel that now hosts the LHC – responsibility for managing tape storage lay with the experiments themselves (even if the volumes were stored centrally), whereas by the time of the LHC we had moved to centrally managed storage in high-capacity tape robots. “Bit preservation” – minimizing, though not totally eliminating, even tiny occurrences of data loss or corruption – has been offered to the LHC and all other current experiments since around the beginning of this millennium.
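To make the idea of “bit preservation” a little more concrete, the sketch below shows one common pattern: record a checksum for every file when it enters an archive, then periodically re-read the files and flag any whose checksum no longer matches. This is purely illustrative – the directory layout and manifest format are invented for the example and do not describe the actual CERN tape services.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path: Path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Compute a checksum of a file, reading in chunks so very large files are handled safely."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(data_dir: Path, manifest: Path) -> None:
    """At archive time, record a checksum for every file (hypothetical 'archive/' layout)."""
    entries = {str(p): file_checksum(p) for p in sorted(data_dir.rglob("*")) if p.is_file()}
    manifest.write_text(json.dumps(entries, indent=2))


def verify_manifest(manifest: Path) -> list:
    """Later, re-read every archived file and report those whose checksum has changed."""
    entries = json.loads(manifest.read_text())
    return [name for name, digest in entries.items()
            if file_checksum(Path(name)) != digest]


if __name__ == "__main__":
    build_manifest(Path("archive"), Path("manifest.json"))
    corrupted = verify_manifest(Path("manifest.json"))
    print("corrupted files:", corrupted or "none")
```

In practice such checks run continuously against the tape archive, which is what allows data loss to be kept to tiny (but, as noted above, never quite zero) levels.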
Preserving the bits may be necessary, but it is far from sufficient to ensure meaningful re-use of data, even after short periods of time. Building on the pioneering work of the Study Group on Data Preservation and Long-Term Analysis in High Energy Physics, more commonly known as DPHEP, several strategies were proposed to the 2012/13 update of the European Strategy for Particle Physics (which is itself now due for a revision). DPHEP was initiated around a decade ago, initially at the Deutsches Elektronen-Synchrotron laboratory (DESY) in Hamburg, and rapidly grew to cover all of the main HEP labs worldwide.
These strategies included not only “bit preservation” but also well-established services for storing and preserving documentation (also known as “digital libraries”), as well as a revolutionary approach for preserving not only the software needed to process and (re-)use the data, but also the environment in which that software had run and for which it had been validated. It is now widely agreed across data preservation activities in HEP that these are the three pillars on which our data preservation services are built.
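As a purely illustrative sketch of that third pillar – capturing the environment an analysis ran in – the snippet below records basic platform and package-version information alongside a dataset, so that a compatible environment could be reconstructed later. The fields and the file name are assumptions made for the example; they are not the metadata schema actually used by the experiments, which rely on much richer environment-capture machinery.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages: list) -> dict:
    """Collect a minimal description of the software environment an analysis ran in."""
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {},
    }
    for name in packages:
        try:
            env["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            env["packages"][name] = "not installed"
    return env


if __name__ == "__main__":
    # "numpy" is just a stand-in for whatever the analysis actually depends on.
    snapshot = capture_environment(["numpy"])
    with open("environment.json", "w") as f:
        json.dump(snapshot, f, indent=2)
    print(json.dumps(snapshot, indent=2))
```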
Such services have now been offered in production for several years and are considered mature and stable.
However, the story does not stop there: extensive efforts are ongoing to capture all of the data and “knowledge” necessary to allow analyses to be repeated in the future. This is complemented by ongoing “Open Data” releases of subsets of the data from the LHC experiments – and others – together with the software, environment and documentation needed to re-use the data.
Whilst few people would have been bold enough to predict that LEP data would still be both available and usable three decades after first data taking, this is exactly the expectation – and even requirement – for LHC data today, even though the LHC data set is already close to three orders of magnitude greater than that of LEP (which was roughly 100TB for each of the four experiments, including the raw data) and is set to grow by possibly a factor of a hundred, up to tens of EB by the end of the 2030s!