Yvonne Tunnat is Preservation Manager at the ZBW Leibniz Information Centre for Economics
I was fresh from university when I started my job as a preservation manager at the ZBW in October 2011. Having taken a module named “Digital Preservation” during my studies of library and information science, and after a nine-week internship at the Digital Preservation Department of the University of Utah, I obviously was the best they could find for the job, although I knew next to nothing and they knew it.
Only, I did not know it. I felt self-confident and well-prepared. I had seen the OAIS slides several times, I knew our ingest was more or less solved and I did not need to think about access as we run a dark archive, so preservation planning was the one big task left on my desk.
There was this software, JHOVE, which miraculously was able to decide if a PDF was ok, flagging the bad ones for later preservation actions. As I knew nothing (like Jon Snow), I took all JHOVE findings for granted.
My preservation plan was as follows:
- Gather all bad PDFs
- Migrate them to good PDFs
- Check that they still look the same as the originals
Thanks to JHOVE, the first step was easy. I left the second step to our IT guy, who quickly built a small Java program which transformed all the bad PDFs into good ones. At least, after the migration JHOVE could not find anything wrong with them anymore.
But I had to rack my brain about the third step. Somehow I needed to compare the new PDF version with the original to see if there were any changes that would make the data producer angry (like layout changes, missing content, etc.).
I could not find suitable software which compared two PDF versions with each other. There were some tools out there, but they did not offer what I was looking for.
I heard about a solution the NLNZ had chosen: convert each PDF page to a JPEG and then compare the JPEGs, as there were several usable tools for image comparison out there. That sounded cumbersome but feasible and, most importantly, automatable and therefore scalable. Nevertheless, most of the tools that compare pictures were built to detect pages that have been digitized twice and give a warning if two pages are too similar. If just a few characters are different or the font is not quite the same, they won’t alert anybody.
In 2014, when I finally decided that my life would be easier if I learned at least some Java myself, I managed to write a little program which compared all the words in two different PDF files, listing all the differences. The performance was poor, most likely due to my poor coding skills. I decided to skip the third step for now, writing it on the “plans”-agenda.
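For readers who want to picture such a word comparison, here is a minimal sketch in Java. It is not my original program. It assumes the text has already been extracted from both PDFs into plain strings (the extraction itself needs a PDF library and is not shown), and the class and method names `WordDiff`, `countWords` and `diff` are made up for this illustration. It counts how often each word occurs in both versions and reports every word whose count changed.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordDiff {

    // Counts how often each whitespace-separated word occurs in the text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns, for every word whose count changed, the difference
    // (count in migrated text minus count in original text).
    // Positive: word was added; negative: word went missing.
    static Map<String, Integer> diff(String original, String migrated) {
        Map<String, Integer> delta = new TreeMap<>();
        countWords(migrated).forEach((w, n) -> delta.merge(w, n, Integer::sum));
        countWords(original).forEach((w, n) -> delta.merge(w, -n, Integer::sum));
        delta.values().removeIf(n -> n == 0); // drop words that are unchanged
        return delta;
    }

    public static void main(String[] args) {
        // Hypothetical extracted text; real input would come from the two PDFs.
        String original = "The quick brown fox";
        String migrated = "The quick brown box";
        System.out.println(diff(original, migrated)); // prints {box=1, fox=-1}
    }
}
```

Counting words keeps the comparison linear in the size of the text, but it ignores word order, so a paragraph that moved to another page would go unnoticed. It also says nothing about layout or fonts.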
In the meantime, the first two steps had proven to be more difficult. As I went to conferences, talked to people and did some tests myself, I learned that JHOVE can be wrong. In fact, a great deal of my working life has been spent on JHOVE benchmarking and testing, JHOVE presentations, papers and blog posts, and JHOVE working groups. By now, I must be one of the best-known JHOVE experts in Germany, or at least people think I am.
I married and changed my name. Still, everybody knew it was me when “some Yvonne sent a mass mail about JHOVE” to the German-speaking digital preservation community.
The bad-to-good migration tool built by my co-worker turned out to be too basic, just putting the bad PDF pages into a new PDF, which then would be considered ok by JHOVE, which only checks the overall structure. So we started to look for a suitable PDF/A migration tool, for that would also solve our problems with non-embedded fonts, which are a big issue for us.
As of today, we still don’t do preservation planning in our production system. More interestingly, nobody seems to be surprised or shocked about it. People I talk to usually do not even consider doing preservation planning in the near future.
I have left out all the other things I had to struggle with while trying to solve the preservation planning problem: writing several preservation policies for different purposes and getting certified twice.
Over the years, of course, the plethora of file formats in our archive has grown. So while I am at it, I am making plans not only for PDF files, but for TIFF, JPEG and GIF as well. Plans are my reality.