Heikki Helin is Senior Technology Coordinator for Digital Preservation Services at CSC - IT Center for Science Ltd in Espoo, Finland.
The Finnish national digital preservation service, based on the OAIS reference model, has been in production since 2015. It serves the cultural heritage and research data sectors and is funded by the Ministry of Education and Culture of Finland. Currently, we have more than 1.3 million Archival Information Packages (AIPs) in preservation, amounting to more than 450 terabytes. We have defined common national preservation specifications, which describe in detail how digital assets must be prepared before they are ingested into the preservation service. This includes detailed requirements for metadata and file formats.
Detailed requirements on both file formats and metadata are necessary for a fully automated ingest process. However, preparing and ingesting digital assets in an appropriate format according to the requirements can be a demanding task, especially when the producer is not familiar with the various preservation standards and metadata formats. The process requires know-how and can be very time-consuming. It is therefore costly, particularly for organizations without sufficient in-house IT expertise. The growing demand to make this process easier for our partner organizations is why we have developed tools that reduce the burden of creating valid Submission Information Packages from scratch. We introduce the two main tools in this blog post.
Firstly, the Pre-Ingest Tool makes it easier to create Submission Information Packages (SIPs) programmatically. The tool is a set of modular software components that produce a METS document containing all the necessary metadata conforming to our national preservation specifications. The tool can create the descriptive and administrative sections of a METS document, a structural map and a file section. It can automatically extract technical metadata from files into the PREMIS metadata format and digitally sign the SIP. Finally, it can compress the newly created SIP into a TAR or ZIP package. The Pre-Ingest Tool is in production in partner organizations, helping them integrate their back-end systems with the national digital preservation service. We received good feedback when testing it with a representative sample of partner organizations. The Pre-Ingest Tool is now used more widely, and as new partner organizations deploy our digital preservation service, they almost invariably integrate the tool into their own systems rather than creating their own solutions.
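To give a feel for the kind of work this automates, here is a minimal sketch in Python, using only the standard library, of building a skeletal METS document with a file section and a structural map and packaging it with the content files into a TAR archive. It is not the Pre-Ingest Tool's interface, and the result would not pass our specifications: the descriptive and administrative sections, PREMIS metadata and the digital signature are all left out. The function names build_mets and package_sip, the file name mets.xml and the "directory" division type are assumptions made for this illustration.

```python
import hashlib
import io
import tarfile
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(files: dict) -> bytes:
    """Build a skeletal METS document for {relative_path: content_bytes}.

    Hypothetical helper: only a file section and a structural map are
    written; a real SIP also needs descriptive and administrative metadata.
    """
    root = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    struct_map = ET.SubElement(root, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", TYPE="directory")

    for index, (path, content) in enumerate(sorted(files.items())):
        file_id = f"file-{index}"
        # Each file entry records a checksum so the receiver can verify it.
        file_el = ET.SubElement(
            file_grp,
            f"{{{METS_NS}}}file",
            ID=file_id,
            CHECKSUMTYPE="SHA-256",
            CHECKSUM=hashlib.sha256(content).hexdigest(),
            SIZE=str(len(content)),
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": f"file://{path}"},
        )
        # The structural map points back to the file by its ID.
        ET.SubElement(div, f"{{{METS_NS}}}fptr", FILEID=file_id)

    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def package_sip(files: dict) -> bytes:
    """Return an uncompressed TAR archive holding the files plus mets.xml."""
    members = dict(files)
    members["mets.xml"] = build_mets(files)  # assumed file name
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path, content in sorted(members.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

Calling package_sip({"data/a.txt": b"hello"}) returns the bytes of an archive with two members, the content file and the generated METS document. Most of the effort the real tool saves lies in the parts this sketch omits.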
Secondly, File Scraper is a tool for identifying files, collecting metadata from them and checking their well-formedness. The tool uses third-party software to validate files and extract metadata from them, and it normalizes the results into a uniform structure. It outputs a Python object containing a Python dictionary of technical metadata, a Python dictionary describing the software used, including its output, and the validation result as a Boolean value. The tool can also be used without the sometimes time-consuming validation, in order to just identify the file and collect the technical metadata. File Scraper is used by the Pre-Ingest Tool described above and in the ingest validation of the Finnish national digital preservation service. The third-party software it uses includes, for example, FFmpeg, FIDO, file, Ghostscript, ImageMagick, JHOVE, LibreOffice, MediaInfo, Pillow, pngcheck, v.Nu, veraPDF, and warc-tools. The majority of the file formats listed in our specifications are currently supported, and more will be supported in the future.
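The normalization idea can be illustrated with a small, self-contained Python sketch. It does not use File Scraper's own interface or any of the third-party software listed above: two tiny stand-in functions play the role of the external tools, and it only knows about PNG. The names ScrapeResult, streams, info, well_formed, scrape and check_wellformed are assumptions chosen for this illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


@dataclass
class ScrapeResult:
    """Hypothetical uniform result shared by all tools."""
    streams: dict = field(default_factory=dict)  # technical metadata
    info: dict = field(default_factory=dict)     # per-tool output
    well_formed: Optional[bool] = None           # None: not validated


def identify_magic(data: bytes) -> dict:
    """Stand-in identifier: recognise PNG by its 8-byte signature."""
    if data.startswith(PNG_SIGNATURE):
        mimetype = "image/png"
    else:
        mimetype = "application/octet-stream"
    return {"metadata": {"mimetype": mimetype}, "errors": []}


def validate_png_header(data: bytes) -> dict:
    """Stand-in validator: the first chunk must be a 13-byte IHDR."""
    metadata, errors = {}, []
    if not data.startswith(PNG_SIGNATURE):
        errors.append("missing PNG signature")
    elif data[8:16] != b"\x00\x00\x00\rIHDR":
        errors.append("first chunk is not IHDR")
    elif len(data) < 24:
        errors.append("truncated IHDR")
    else:
        metadata["width"] = int.from_bytes(data[16:20], "big")
        metadata["height"] = int.from_bytes(data[20:24], "big")
    return {"metadata": metadata, "errors": errors}


# Validators are selected by the MIME type found during identification.
VALIDATORS = {"image/png": validate_png_header}


def scrape(data: bytes, check_wellformed: bool = True) -> ScrapeResult:
    """Run the stand-in tools and merge their output into one structure."""
    result = ScrapeResult()
    identified = identify_magic(data)
    result.streams.update(identified["metadata"])
    result.info["identify_magic"] = {"errors": identified["errors"]}

    validator = VALIDATORS.get(result.streams["mimetype"])
    if check_wellformed and validator is not None:
        validated = validator(data)
        result.streams.update(validated["metadata"])
        result.info[validator.__name__] = {"errors": validated["errors"]}
        result.well_formed = not validated["errors"]
    return result
```

In this sketch, passing check_wellformed=False skips the validator entirely, so the result carries only the identified MIME type and well_formed stays None. The same None value is used when no validator exists for the identified format, which keeps "not checked" distinct from "checked and invalid".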
Both tools are available under the LGPLv3 license on GitHub, among other tools and libraries. Although these tools were developed specifically for our digital preservation service, we firmly believe that they can be beneficial for the digital preservation community in general.