Adam Harwood is a Research Data & Digital Preservation Technologist at the University of Sussex.
The University of Sussex Library Special Collections is taking a whole new approach to digital preservation this year, embracing a do-it-yourself, no-budget philosophy. The goal is to use a suite of open source tools and the University's existing infrastructure to create a digital archive.
I’ve been testing out a few different open source tools recently and putting together a workflow to process Special Collections' digital archives. I’ve been meaning to test Bagger for creating AIPs for a while now, and last week I spent a morning giving it a test run. I find it very helpful to read about how other practitioners have been using digital preservation tools, so I wanted to share my experience of testing Bagger. It was a relatively painless experience…
What is Bagger?
Bagger is a graphical user interface for packaging a group of data files together according to the BagIt specification. The BagIt specification, developed by the Library of Congress, is a set of hierarchical file layout conventions for storing and transferring digital content in containers called ‘bags’. These bags contain manifests of the packaged files along with checksums to verify the data during transfer and while in storage. BagIt is used in the Archivematica open source digital preservation system to store AIPs.
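To make that concrete, a minimal bag on disk looks something like this (the tag file names come from the spec; the payload file is invented for illustration):

```
mybag/
    bagit.txt             <- declares the BagIt version and text encoding
    bag-info.txt          <- human-readable metadata about the bag
    manifest-md5.txt      <- one 'checksum  path' line per payload file
    tagmanifest-md5.txt   <- checksums of the tag files themselves
    data/
        report.pdf        <- the payload: the content being preserved
```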
Installation
I downloaded the Bagger application from the GitHub repo linked above. No installation was necessary: I simply unzipped the file archive into my Program Files folder. It doesn’t have to go there, it can run from anywhere, but I’d recommend putting it somewhere you’ll remember. To launch the application you need to find the bagger.bat file in the bin folder.
Create a new bag
Bagger creates bags. Bags are folders that contain your data plus several .txt files, known as ‘tag’ files, that contain metadata. Following the user guide included in the download, I created my first bag. Creating a new bag copies the selected files into a bag at a location of your choosing, leaving the original files as they were. Bagger can also create a ‘bag in place’, which converts a collection of files into a bag in their current location in the file system.
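Bagger does all of this through the GUI, but the Library of Congress also maintains a companion Python library, bagit-python, that performs the same ‘bag in place’ conversion. A minimal sketch, assuming the library is installed (the directory path is hypothetical):

```python
import bagit

# Convert an existing directory into a bag in place: the payload files
# are moved into a data/ subfolder and the tag files are written alongside.
bag = bagit.make_bag("/path/to/accession-001")

print(bag.info)  # the bag-info.txt metadata as a dictionary
```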
Create your own bag profile
After you create a new bag you need to select a profile, which determines some of the metadata fields included in your tag files. You can select from a few default profiles, but you can also create your own. This allows you to enter your own metadata fields into the tag file bag-info.txt to identify each individual bag in an automatic and consistent way. Standard fields include ‘Source-Organization’, ‘Bagging-Date’, ‘Organization-Address’, ‘External-Identifier’ and so on. Creating a profile requires a bit of skill editing JSON files. Thankfully, a few editable JSON profiles are included with the latest version of Bagger, so there is no need to create one from scratch.
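Because the exact profile schema can vary between Bagger versions, the shipped JSON examples are the best template to copy and edit. For comparison, here is how the same kind of bag-info.txt fields could be set when bagging with bagit-python instead; the path and all the field values are invented for illustration:

```python
import bagit

# Each bag_info entry becomes a field in bag-info.txt.
bag = bagit.make_bag(
    "/path/to/accession-001",
    bag_info={
        "Source-Organization": "University of Sussex Library",
        "Organization-Address": "Falmer, Brighton, UK",
        "External-Identifier": "uos-sc-2020-001",
    },
)
```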
Building your bag
After selecting your profile, you need to add your files (the ‘payload’ in Bagger terminology) to your bag. Rather than testing with actual archives I used the content of my downloads folder. This folder is full of different file types and has multiple folder hierarchies. According to its system properties it contains 515 files and 122 folders, and is 1.1GB in size. (I know, I should have a bit of a spring clean!) There is nothing in there I need that isn’t stored elsewhere, so it is safe to lose. If you don’t have any files to use as a test, you could visit the Digi Pres Commons or the Open Preservation Foundation for test corpora. It takes a couple of seconds for Bagger to add all the files to a new folder in the bag called ‘data’.
You can then name and save your new bag to your chosen location. It took about two and a half minutes to create the bag. Along with the data folder, Bagger creates some ‘tag’ files. These tag files are .txt files containing metadata about your data, including a list of the payload files and their checksums. I chose to create MD5 checksums, but you can choose other hash algorithms as well.
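If you were scripting this step with bagit-python rather than the GUI, the hash choice is just a parameter; a sketch, again with a hypothetical path:

```python
import bagit

# Write both an MD5 and a SHA-256 manifest for the payload.
bag = bagit.make_bag("/path/to/downloads-test", checksums=["md5", "sha256"])
```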
The profile I selected at the beginning determined the fields added to the bag-info.txt file. These fields were: Bag-Size, Bagging-Date and Payload-Oxum. Payload-Oxum is a machine-readable summary of the payload: the total size in octets (8-bit bytes), followed by a full stop and the number of files.
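As a sketch of what Payload-Oxum encodes, it can be recomputed from any bag's data/ directory with a few lines of Python (the directory path is hypothetical):

```python
import os

def payload_oxum(data_dir):
    """Return Payload-Oxum as '<total bytes>.<file count>' for a bag's payload."""
    total_bytes = file_count = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
            file_count += 1
    return f"{total_bytes}.{file_count}"

print(payload_oxum("/path/to/accession-001/data"))
```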
Verify bag
After saving the bag, you can verify each checksum against the corresponding file. Verification took about 15 seconds for my whole downloads folder test. On my first attempt to verify the bag I received an error message saying ‘Profile Compliant failed as the Awardee-Phase’. Awardee-Phase was a field created by the profile template I had loaded at the beginning of the process.
I wasn’t sure what this field meant, but I entered 1234 and tried again. This time, validation failed with the following error: ‘tagmanifest-md5.txt contained invalid files [bag-info.txt]’. I suspected this was because entering the 1234 value into the Awardee-Phase field had changed the bag, and Bagger was validating this new version against the old saved one. So I saved the bag, overwriting the existing one, and tried validation again. It worked! I then read in the guide that you should always save the bag after editing it. Always read the instructions first!
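Outside the GUI, the same verification can be scripted; a minimal sketch with bagit-python (the bag path is hypothetical):

```python
import bagit

bag = bagit.Bag("/path/to/accession-001")

try:
    # fast=True only checks Payload-Oxum; fast=False re-hashes every file.
    bag.validate(fast=False)
    print("bag is valid")
except bagit.BagValidationError as err:
    print(f"validation failed: {err}")
```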
Holey Bags
This isn’t an option that I need at present, but it may be of more use in the future. A holey bag doesn’t contain any data; instead it contains URLs pointing to where the data does exist. These URLs are kept in a separate file called fetch.txt. The Wellcome Library are using this feature in an ingenious way to create versioning in AIPs. I thoroughly recommend you read this post for a great explanation of holey bags.
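For illustration, each line of fetch.txt gives a URL, the expected size in bytes (or ‘-’ if unknown) and the path the file would occupy in the payload; the URLs and paths here are invented:

```
https://store.example.org/v1/IMG_0001.jpg 2048576 data/photos/IMG_0001.jpg
https://store.example.org/v1/report.pdf - data/report.pdf
```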
Using the simple features of Bagger means that we can create our very own standardised AIPs, giving us the building blocks to establish a checksum-checking schedule in the future. We plan to use Bagger in conjunction with an application called TeraCopy, which preserves modification and creation dates during file transfer. We also plan to use DROID for file format profiling and Fixity for fixity checking, to round out our digital archive toolkit. These tools and Bagger are all suggested applications in the DPC/TNA online training resource Novice to Know How. It launched at the beginning of May and trains digital preservation practitioners in implementing a “simple and proactive digital preservation workflow within their organisation”. Definitely worth checking out.
Something that strikes me about all these tools is that most of them create and verify their own checksum files but don’t check existing ones (although I’ve not looked at Fixity much yet). This raises the question: why save checksum files if no other system will pay them any attention? Also, each application names its checksum files differently, e.g. checksum.md5 (TeraCopy) and manifest-md5.txt (Bagger). Should we be keeping all these files? I’m sure there is a simple, logical, technical explanation for all of this. I’d be interested to know readers’ thoughts in the comments.
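Re-checking an existing manifest is, at least, not hard to script yourself. As a rough sketch, assuming a standard MD5 manifest with UTF-8 encoded paths (the bag path is hypothetical):

```python
import hashlib
import os

def verify_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Re-check each 'checksum  path' line of a BagIt manifest against disk."""
    with open(os.path.join(bag_dir, manifest_name), encoding="utf-8") as manifest:
        for line in manifest:
            expected, rel_path = line.strip().split(None, 1)
            digest = hashlib.md5()
            with open(os.path.join(bag_dir, rel_path), "rb") as payload:
                for chunk in iter(lambda: payload.read(1 << 20), b""):
                    digest.update(chunk)
            status = "OK" if digest.hexdigest() == expected else "MISMATCH"
            print(f"{status}  {rel_path}")

verify_manifest("/path/to/accession-001")
```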