Eve Wright is a Digital Archivist at National Records of Scotland
At National Records of Scotland, our current approach to digital preservation is to look outwards. We take small incremental steps to improve our preservation activities gradually, so we are continually dependent upon (and grateful to!) our colleagues who make open source digital preservation options available.
One challenge we recently faced was how to implement a regular process of integrity checking without any system automation. Our Digital Repository is file-system based, so we would need some kind of tool to look through the digital objects and compare their checksums against those we produced during ingest.
Additionally, the process for integrity checking would need to be relatively scalable. We currently have over 1.7 million digital objects in the Digital Repository, with a significant increase expected over the next couple of years.
While we were realistic that this would be a manual process to an extent, we needed some kind of tool to help us manage this.
Research
My first port of call when looking for an open source tool to solve a problem is my digital preservation comfort blanket: COPTR.
This led me to a blog post from David Underdown on using the DROID report as a basis for collection integrity checks. Using CSV Validator alongside DROID reports for integrity checking, as described in David’s blog, definitely ticked a lot of our boxes.
We had already produced DROID reports for each of our accessions during ingest, so the data was pretty much ready to go. CSV Validator also had a simple and easy-to-use GUI, which made it usable for us as we are currently unable to use command line operations in the Digital Repository.
Also, The National Archives (TNA), who developed this tool, have produced really extensive and helpful documentation around it. There was even a ready-prepared schema file for checking DROID reports available on their GitHub. Trialling the use of this tool was a no-brainer.
Implement
CSV Validator can do lots of things, but the only check we needed was the “checksum” check. This compares the checksum (we use MD5) listed in the CSV file against a freshly calculated checksum of the file in the Digital Repository.
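For readers who want to see the idea behind that check, here is a minimal Python sketch of the comparison. It is an illustration only, not how CSV Validator is implemented and not part of our workflow. The column names `FILE_PATH` and `MD5_HASH` are assumptions about the report layout, and `md5_of_file` and `find_mismatches` are hypothetical helper names.

```python
import csv
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(report_csv: Path, path_column: str = "FILE_PATH",
                    hash_column: str = "MD5_HASH") -> list[tuple[str, str]]:
    """Compare each recorded checksum with a freshly computed one.

    Returns (path, reason) pairs for rows that fail; an empty list means
    every checked file matched. Column names are assumptions.
    """
    failures = []
    with report_csv.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            recorded = (row.get(hash_column) or "").strip().lower()
            if not recorded:
                continue  # rows without a checksum (e.g. folders) are skipped
            target = Path(row[path_column])
            if not target.is_file():
                failures.append((str(target), "missing"))
            elif md5_of_file(target) != recorded:
                failures.append((str(target), "checksum mismatch"))
    return failures
```

Calling `find_mismatches(Path("report.csv"))` on a report would return an empty list when nothing has changed, which is the outcome you hope for on every run.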
To prepare for our first integrity check, we batched our full DROID listing of the Digital Repository into CSV files of 100,000 rows. This had a couple of benefits. Firstly, it meant that if an error was detected in one file, we wouldn’t need to run the entire check again. It also meant we could open multiple instances of CSV Validator and run them at the same time. This reduced the time taken for the full check and meant we could complete it in one day.
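The batching step can be done in many ways. As a rough sketch (not the method we actually used), the Python below splits one large CSV into numbered files of a fixed number of data rows, repeating the header row in each so that every batch can be validated on its own. The function name `split_csv` and the output file naming are hypothetical.

```python
import csv
from pathlib import Path


def split_csv(source: Path, out_dir: Path, rows_per_file: int = 100_000) -> list[Path]:
    """Write `source` out as numbered batch files, repeating the header in each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return written  # empty input: nothing to split
        out_handle = None
        writer = None
        try:
            for index, row in enumerate(reader):
                if index % rows_per_file == 0:
                    # Start a new batch file every `rows_per_file` data rows.
                    if out_handle:
                        out_handle.close()
                    batch_path = out_dir / f"{source.stem}_{len(written) + 1:03d}.csv"
                    out_handle = batch_path.open("w", newline="", encoding="utf-8")
                    writer = csv.writer(out_handle)
                    writer.writerow(header)
                    written.append(batch_path)
                writer.writerow(row)
        finally:
            if out_handle:
                out_handle.close()
    return written
```

With the default of 100,000 rows, a listing of 1.7 million objects would come out as 17 batch files, each small enough to re-run individually if one of them reports an error.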
We had a few hiccups when we first started, mainly around white space and unusual characters in some file names. After we logged a quick issue on GitHub, TNA advised that the URI field is more effective to use than FILE_PATH, and it ran very smoothly from then on. In keeping with the theme of World Digital Preservation Day 2024, this very much resembled a “concerted effort” towards success! Thanks to TNA colleagues for their support.
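One way to see why a URI copes better with awkward file names is that spaces and other special characters are percent-encoded, so the value is unambiguous. The short Python sketch below illustrates that round trip using only the standard library. It assumes absolute POSIX-style paths (Windows drive letters are not handled), the helper names are hypothetical, and it says nothing about how DROID or CSV Validator handle URIs internally.

```python
from urllib.parse import quote, unquote, urlparse


def path_to_file_uri(posix_path: str) -> str:
    """Percent-encode an absolute POSIX-style path into a file: URI."""
    return "file://" + quote(posix_path)


def file_uri_to_path(uri: str) -> str:
    """Decode a file: URI back to the path it names."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URI: {uri}")
    return unquote(parsed.path)


# A made-up path with spaces and a '#', both awkward in raw form.
original = "/repo/accession 001/report #2 (final).pdf"
uri = path_to_file_uri(original)
restored = file_uri_to_path(uri)
```

Here `uri` contains no literal spaces, and `restored` is identical to `original`, so nothing is lost by storing the encoded form.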
BAU
We have been using CSV Validator for our collections integrity checks since December 2021, and it has really helped to give us assurance that everything is unchanged since we received it. We run this check every six months: a frequency that was decided based on how much we could realistically deliver with our small team.
CSV Validator is not the first open source tool we have included in our workflows and as we scale up, it will definitely not be the last.
Doing digital preservation in a bespoke way outwith an all-singing, all-dancing solution can seem daunting at times. We are continually grateful to the developers of open source software for making simple, yet incredibly effective tools so easily available.