John Beaman is the Preservation Repository Manager at the British Library.
Many institutions have significant digital content stored outside of a fully-fledged preservation repository system. The content may be waiting for a backlog of other material to be cleared, for content-specific pre-ingestion processes to be developed, or even for a preservation repository system to be implemented at the institution. Often the content is stored on standard network storage made available by the institution’s IT department, or on offline storage such as external USB hard drives. This content is inherently at risk because it is not protected by the full range of digital preservation processes, such as file fixity checking, that a preservation repository system provides. The longer content remains in this state, the greater the risk of it coming to harm. In response, the British Library’s Digital Preservation Team has developed the Minimum Preservation Tool (MPT), a collection of utilities written in Python that can be used to create an interim preservation storage solution. It provides a basic minimum level of file preservation in order to reduce risks to content currently stored outside of a preservation repository system.
MPT, artwork by Valene Jouvet
MPT overview
The MPT has no special requirements and has been designed to make use of existing network storage and compute resources available at most institutions. It is designed to provide greater protection to digital content than a standard network storage offering. The key differences between an MPT solution and standard network storage are replication of content across two or more storage “nodes” and regular fixity checking (checksum validation) of all files on all nodes to help ensure content remains authentic and unchanged.
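To make the idea of fixity checking concrete, here is a minimal sketch of recording checksums for every file on a node and later re-validating them. It is not taken from the MPT code: the function names `build_manifest` and `check_fixity`, the in-memory manifest, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def build_manifest(node_root: Path) -> dict[str, str]:
    """Map each file's path (relative to the node) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(node_root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file into memory; acceptable in a
            # sketch, but large files would be better hashed in chunks.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(node_root).as_posix()] = digest
    return manifest


def check_fixity(node_root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose checksum has changed."""
    failures = []
    for relative, expected in manifest.items():
        path = node_root / relative
        if not path.is_file():
            failures.append(relative)
        elif hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            failures.append(relative)
    return failures
```

An empty list from `check_fixity` means every recorded file is still present and unchanged on that node; any entry in the list is a file that needs investigating.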
How it works
An implementation of the MPT comprises two or more MPT storage “nodes”. Each storage node should reside on physically separate hardware, and risk can be reduced further, where possible, by placing the nodes at separate geographic locations. The storage nodes are accompanied by one or more MPT servers, which can be provisioned as Virtual Machines (VMs), removing the need to allocate dedicated server hardware for the MPT. As with the storage nodes, each server should reside on physically separate hardware. Ideally, there should be one MPT server for each storage node. This is especially important if the storage nodes reside at separate geographic locations. The storage nodes are presented to the MPT servers so that the MPT scripts running on the servers can access the MPT storage content.
Digital content can be divided into “collections”, with each collection residing in its own storage area within each MPT storage node. New content is copied into one storage node and immediately replicated to the other nodes. File checksums are generated during the copying and replication processes. Checksums are regularly validated at each node to ensure files have not changed, and checksums are regularly compared between nodes to ensure all copies of each file match.
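The replication and comparison steps described above can be sketched as follows. This is an illustrative outline, not the MPT's actual implementation: the helper names (`sha256_of`, `replicate`, `nodes_agree`), the use of SHA-256, and the representation of nodes as local directory paths are assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, node_roots: list[Path], relative: Path) -> str:
    """Copy source to the same relative path on every node, verifying each copy."""
    expected = sha256_of(source)
    for root in node_roots:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        # Re-read the copy so a corrupted transfer is caught immediately.
        if sha256_of(target) != expected:
            raise IOError(f"Checksum mismatch after copy: {target}")
    return expected


def nodes_agree(node_roots: list[Path], relative: Path) -> bool:
    """Return True if every node holds an identical copy of the file."""
    digests = {sha256_of(root / relative) for root in node_roots}
    return len(digests) == 1
```

In this sketch the checksum returned by `replicate` would be stored alongside the file so that later scans have a reference value, while `nodes_agree` corresponds to the regular comparison between nodes.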
Each MPT process produces log files in CSV format, and selected information can be emailed to designated system administrators. The log files can also be used to produce MPT activity reports (using PowerBI, for example) which can be made available to content stakeholders and collection owners. These reports summarise MPT activities in the preceding three months, with details including the number of objects in each collection, the number of checksum scans and any associated errors.
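As a rough illustration of how CSV logs could feed a summary report, the sketch below counts log rows per collection and status. The column names (`timestamp`, `collection`, `process`, `status`) and the sample rows are hypothetical and are not the MPT's actual log layout, which should be checked against the real log files.

```python
import csv
import io
from collections import Counter


def summarise_log(csv_text: str) -> dict[str, Counter]:
    """Count log rows per collection, broken down by status value."""
    summary: dict[str, Counter] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        summary.setdefault(row["collection"], Counter())[row["status"]] += 1
    return summary


# Hypothetical sample data, for illustration only.
SAMPLE_LOG = """timestamp,collection,process,status
2024-01-01T02:00:00,maps,fixity,ok
2024-01-01T02:00:05,maps,fixity,error
2024-01-01T02:01:00,sound,fixity,ok
"""

for collection, counts in summarise_log(SAMPLE_LOG).items():
    print(collection, dict(counts))
```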
Ingestion of content into the MPT
Initial ingestion of content into the MPT normally requires a one-off bulk copying process. For non-static digital collections (i.e. collections for which new content is continually being acquired), a “staging” folder can be created at an agreed location on the storage network. Any new content copied into the staging folder by the user is transferred automatically into the MPT by a staging process that runs at regular intervals.
The destination of new content deposited into the MPT via staging can be specified during the initial configuration of the MPT. In this way, new content can either be added to existing content from the initial ingestion, or saved to a different folder to keep it separate. Files already in the MPT cannot be overwritten by the staging process, so new files deposited in the staging folder will not be transferred into the MPT if a file of the same name already exists in the MPT at the same location.
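The no-overwrite rule for staged content can be sketched like this. The function name `transfer_staged` and the decision to copy files while leaving the originals in the staging folder are assumptions for illustration; the MPT's own staging script may behave differently.

```python
import shutil
from pathlib import Path


def transfer_staged(staging: Path, destination: Path) -> tuple[list[str], list[str]]:
    """Copy staged files into destination without overwriting existing files.

    Returns (transferred, skipped) lists of paths relative to the staging folder.
    """
    transferred, skipped = [], []
    for source in sorted(staging.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(staging)
        target = destination / relative
        if target.exists():
            # A file of the same name already exists at the same location,
            # so the staged file is left untouched and reported as skipped.
            skipped.append(relative.as_posix())
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        transferred.append(relative.as_posix())
    return transferred, skipped
```

Returning the skipped paths gives a scheduled job something to log or email, so that a depositor learns when a file was not transferred because of a name clash.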
MPT sizing options
The MPT was not designed as a large-scale solution; it is intended as a short-term, interim solution for smaller, discrete collections. The size of a digital collection therefore influences the most effective way to implement the MPT for it. For example, small collections (typically < 1 TB) may be best stored within a designated “small collection” area of the MPT. Larger collections may require their own dedicated storage at each of the MPT storage nodes. The largest collections may also require additional MPT servers (VMs) to carry out the content management and preservation processes (replication, fixity checking etc.). Actual thresholds depend on a range of factors, so it is recommended that tests are carried out to confirm the best option in each case.
Accessing collection content inside the MPT
Where access to digital content is required, the MPT storage nodes should be configured to give relevant users read-only access, so that content integrity is preserved and users cannot modify the data. Normally it is only necessary to provide read-only access to one of the MPT nodes, since all content is replicated at each node.
Expectation management
The limitations of the MPT are predominantly dictated by the amount of storage and compute resources available at the institution. This inevitably means some digital collections may not be suitable for the MPT. There are many factors involved in assessing the suitability of a collection for the MPT, but some key things taken into consideration are:
- total size (data volume) of the collection
- total number of files in the collection
- volatility of the content (i.e. is it constantly changing?)
- the number of different locations where the content currently resides
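The first two factors in the list above, data volume and file count, are straightforward to measure before deciding whether a collection suits the MPT. The following is a minimal sketch; the function name `collection_stats` is hypothetical and it is not part of the MPT.

```python
import os
from pathlib import Path


def collection_stats(root: Path) -> tuple[int, int]:
    """Return (file_count, total_bytes) for all files under root."""
    file_count = 0
    total_bytes = 0
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(directory, name))
    return file_count, total_bytes
```

Running this against each location where a collection currently resides gives figures that can be compared with the storage available at each node.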
The MPT is NOT a digital preservation repository system. It does not manage or create metadata, nor is it intended to provide general access to stored content. It is therefore important that users understand the MPT is not a substitute for an institutional preservation repository system. Instead, the MPT provides a minimum level of file preservation until the digital content can be stored more safely in a fully-equipped preservation system.
More information
The MPT was originally designed as an internal interim storage solution at the British Library. However, we appreciate that it may also be of interest to other institutions, so we intend to make the code available more widely. Work is still being carried out to improve the MPT software package for use by other institutions, especially the implementation and configuration documentation. In the meantime, an early release of the code can be found at https://github.com/britishlibrary/mpt. For more information about the MPT, please contact John Beaman (john.beaman@bl.uk) at the British Library.