This blog post is an expansion and update of a reply I made to the digipres listserv in response to a query from Bernadette Houghton. It's also part of a response I began making at the Apres iPRES unconference to colleagues who were concerned I wasn't being disruptive enough at this year's iPRES (bless you all). The spirit of this blog post is for them.
This is what Bernadette asked:
"I’m currently doing some research into file format validators (e.g. http://description.fcla.edu/) and during my testing noted that anomalies are very common; at least, for those random files I’ve been checking.
In the real digital preservation world, what’s the best practice in regard to files with anomalies detected? Are the files preserved as they are, or are they resaved (either into the same or a different format) to get rid of the anomalies?
Of course, it may be that the anomalies are a result of a bug in the validator rather than the files themselves. But that isn’t always going to be apparent."
This is an issue many will be familiar with. Applying validation and attempting to make it work for us is not straightforward. In fact, it seems so far from working for us that I'd like to step back a little and consider whether validation is meeting our digital preservation needs, and whether there's another direction we could take.
What is validation doing for us?
Many validators are far from perfect and will sometimes provide a variety of incorrect or misleading reports. Conversely they may also provide a series of very precise and accurate reports that are almost impossible to understand unless you are deeply engaged with the relevant format’s construction and specification. Furthermore, interpreting the relevance of those reports to any sensible digital preservation intervention is usually very difficult. And finally, a validator may provide a variety of superficially impressive reports, whilst completely failing to check on issues that might be of great interest to a digital preservationist.
There is a significant disconnect between validation and the digital preservation answers we need. Validators operate at the microscopic level of sub-file inspection. Our policy-based preservation planning operates (at best) at the file level, but ideally at a higher level of abstraction. We've struggled to bridge that gap.
Considering the example of PDF, I suspect I could count on one hand the number of candidates who might have the knowledge and experience necessary to interpret specific PDF validation results into definitive preservation actions. I think there's a gulf between our validation tools and our practitioners attempting to use them. And it's not the fault of the practitioners! Surely if this so-called solution to digital preservation problems is so bad, we should consider trying something different?
Validate because it's part of digipres lore...?
Why validate? We seem to have fallen into the trap of treating validation and file format identification services as ends in themselves rather than as a means to an end.
In Marco Klindt's thoughtful analysis of the preservation worthiness of the PDF format at iPRES2017, he opens a discussion on validation by stating "Digital preservation workflows require some sort of checking whether files adhere to the specification of the file format they claim to be", but does not justify why. After digging into the options for validating PDF files he concludes by saying (emphasis is mine):
"This helps a lot but does not address the question whether the content of a PDF file is truly (human and/or machine) accessible and usable with regard to the aspects mentioned above. Being able to validate a file is a necessary condition but it gives no comprehensive answer about potential risks concerning future usability."
Wow! That's a powerful and damning statement about validation that leaves me wondering what purpose it is supposed to fulfil. Not to mention why it is "necessary".
At this year's iPRES Juha Lehtonen presented an in-depth experiment into validation and subsequent fixing of validation errors in PDF files, which builds on the validation work of Johan van der Knijff as well as more recent JHOVE work by Micky Lindlar and OPF colleagues. There's lots of detail worth reading that expands on these brief summarising quotes: "...the most common error messages disappear totally after reconstruction and conversion. Some errors remain but the number of problematic files has reduced significantly" and "Our pilot study shows that reconstructing PDFs can be an effective method to reduce the amount of problematic files, but also that other “new” errors may appear this way". But for the purposes of this blog post I'm more interested in the introduction. The authors state (again, emphasis is my own):
"Our preservation service validates all digital objects in a Submission Information Package (SIP) during the ingest process and rejects the SIP if the validation of any digital object fails. We have found out that invalid PDF files are common reasons for digital object validation failures. Thus, it would be better to validate files before ingesting them and fix potential problems beforehand. Partner organizations may believe, however, that if PDF files can be opened using tools like Acrobat Reader, they are valid. That is why many of them consider the “pre-validation” unnecessary. We have a need to support our partner organizations by defining an effective process to minimize the amount of invalid PDF files sent our service for ingest."
So validation has not only become inexplicably baked into our preservation thinking, but also inexplicably baked into our workflows. This sets off a number of alarm bells for me. Firstly, the previously established preservation process appears to be influencing, or possibly even driving, preservation policy. A potentially dangerous prospect. Furthermore, the solution this paper explores is to implement preservation actions that may damage content, prior to ingest. If complex and possibly damaging actions are to be performed, surely they should occur after ingest and within our carefully controlled preservation environments where they can be monitored, documented and, if necessary, reversed (Minimum Effort Ingest where are you hiding, please come back and forgive us)?
Just to be perfectly clear, this is not intended to be a criticism of Marco and Juha or their organisations. On the contrary, what they are doing appears to be the norm. It's standard practice. But is it best practice?
Unfortunately all of this doesn't surprise me that much. I can speak from personal experience of an organisation whose preservation bods requested validation of PDF files on ingest and retention of the resulting metadata. The implementors found that this process generated "errors" and decided to "fix" them without consulting on whether this would be a good idea. When said preservation bods discovered this quite complex process had been devised, despite it being based on no written requirements, the response was that it would be too expensive to remove, but that at least the original files would be retained. The modified files became the access copy. Some years later when users repeatedly came across content broken in the "fixing" process, the workflow was finally amended to remove the re-baking of PDFs and point users back to the original files. Yes: arg!
There seems to be a cultural IT issue at least partly at play here, and heck, I'm a computer scientist, I'll take this on the chin... When a software tool reports errors, we just can't stop ourselves from trying to fix them! But a validation tool does not have a direct line of communication with absolute truth, and our response to what it says needs to factor in a bit more analysis and a bit more subtlety.
What do we actually want to achieve?
I'll now attempt to put aside this digipres lore, that we must validate the heck out of files until they are well formed and/or valid, and have a think about what we actually want to achieve in a risk assessment of digital content - and I should say, I would love to see a proper discussion about this. Here's my shot at it:
Does this digital object render without error, accurately and usefully for the user, and is it likely to render in the future? If not, should we do something about that?
Directly answering those questions is difficult, and it’s even more difficult to answer with an automated and time efficient solution. I think that we've used file format ID and validation to attempt to provide us with some information that might help. But they have become automated proxies for actually opening or rendering each of our files in a viewer, because opening and visually inspecting them is a terribly inefficient manual process. These proxies probably won’t directly answer the questions above, but they might get us a bit closer if we tailor them carefully. They might also help to highlight the (hopefully) small number of our files that warrant further investigation via manual assessment. This is what, in my opinion, digipres validation should be about.
The things that I think are important to get out of a risk assessment process, that might include validation steps, are as follows (a rough sketch of such a check appears after the list):
- Is the file encrypted (and do we have the key) – this would definitely stop us rendering the file
- Is the file dependent on data external to the file itself, such as a non-embedded font, that we have failed to capture? – external dependencies are getting more common and could prove to be significant obstacles to successful or accurate rendering, now or in the future
- Is the file significantly damaged or incomplete, perhaps such that it returns a large number of validation errors, or specific errors that indicate incompleteness (e.g. EOF marker not found), or perhaps dramatic bit damage – if we’ve not got all the bits (in the right order), we’ve got a real problem
- Is the file really of the same format that our file format identification tool thinks it might be - file format identification might only consider the first few bytes of the file in order to spot the presence of a short identifying string, but sometimes provides uncertain or ambiguous results
To summarise: ensure it's not encrypted, verify it's complete, check it's not badly broken, and see if it's what we think it is. This isn't the same as my questions in bold above (i.e. does it render?) but it covers the most likely obstacles to a file successfully rendering.
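To make that a little more concrete, here is a minimal, illustrative sketch of what such a triage check might look like for PDF files, using nothing more than crude byte-level heuristics. The specific checks are my assumptions for illustration only: a real check for encryption, or for external dependencies such as non-embedded fonts, would need a proper PDF parser, and the filenames are hypothetical.

```python
import os

def triage_pdf(path):
    """Crude triage of a single PDF against the risk questions above:
    is it what we think it is, is it complete, and is it encrypted?
    (External dependencies such as non-embedded fonts need a real
    parser, so that check is deliberately left out of this sketch.)"""
    findings = []
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(1024)
        # Is it what we think it is? A PDF should begin with "%PDF-".
        if not head.startswith(b"%PDF-"):
            findings.append("does not start with a %PDF- header")
        # Is it complete? A PDF should carry an %%EOF marker near the end.
        f.seek(max(0, size - 2048))
        tail = f.read()
        if b"%%EOF" not in tail:
            findings.append("no %%EOF marker near end of file (possibly truncated)")
        # Is it encrypted? The trailer of an encrypted PDF references an
        # /Encrypt dictionary; searching the tail for it is a rough hint only.
        if b"/Encrypt" in tail:
            findings.append("possible encryption (/Encrypt found in trailer region)")
    return findings

# Flag files for manual follow-up rather than automatic "fixing".
for name in ["report.pdf", "thesis.pdf"]:  # hypothetical filenames
    problems = triage_pdf(name)
    if problems:
        print(name, "->", "; ".join(problems))
```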
What we probably don’t need to worry about is the kind of very typical minor file format infractions that are so common that typical rendering software expects them and copes with them just fine. Whilst we could consider “fixing” these kinds of issues, we are then introducing significant risk by changing the digital object and possibly damaging it (as was found in the PDF work referenced above). Worst case, we damage it without realising it. Intervention here is also highly speculative. In a world where we never have sufficient funds to do digital preservation as comprehensively as we would like, doing potentially unnecessary work in one area could starve us of essential resources needed to mitigate other risks. This is essentially the hardest question in digital preservation – to which risks do I prioritise my limited resources? To my mind, consideration of this conundrum directs us to only take action to “fix” digital objects if it is deemed absolutely necessary.
To put it another way, the developers of rendering tools have done us a massive favour by implementing a huge amount of digital preservation work for us. What? Well, they implemented loads of workarounds in their tools so that they gracefully cope with all sorts of file format infractions in the files we want to preserve. But despite that, our reaction has been to try and repeat this work at a different stage of the lifecycle. A task that is both incredibly difficult and most likely way beyond our meagre resources. Yes. To quote renowned philosopher Dizzee Rascal: it's bonkers!
Yeah, well, that's just like, uh, your opinion, man
Some would argue that to ensure we can make sense of a file many years into the future it should obey the exact letter of the law with regards to a file format specification. Presumably this is so we can re-create a rendering tool (which we allowed to become obsolete despite having the source code) from the file format specification (which we did preserve). This does however assume that the specification is sufficiently detailed and accurate to support that, and in most cases it isn't. It also assumes rendering software aligns closely with the file format specification rather than with what is encountered in typical files of the format(s) it supports, and in many cases it doesn't. As Sheila Morrissey described in "The Network is the Format": notes in the PDF specification on the tolerances of Acrobat beg the question as to "what we are to consider authoritative with respect to PDF format instances: the specification, or the behaviour of the Acrobat reader application".
If this is something we're really worried about, why not apply our software preservation capabilities to preserve the source code of our key rendering tools?
If validation isn't the answer, what is?
I speculate that it might be more useful for this community to almost completely drop its interest in validation and invest in adapting rendering tools so we could use them to automatically parse files and discover problems in that parsing (and actually report these exceptions rather than suppressing them as most rendering tools do) without actually rendering the file on screen. Configure them to a kind of "preservation mode": report errors to the command line, don't display the rendering. This would be much closer to what we actually want to find out – can we parse this file, can we render it without error?
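As a rough illustration of what a "preservation mode" could look like today, without modifying any renderer at all, one could drive an existing open source renderer headlessly and harvest whatever it complains about. The sketch below assumes Ghostscript is installed and uses its null output device so nothing is actually drawn; treating the exit code and any captured messages as a proxy for "rendering problems" is my assumption for illustration, not an established measure.

```python
import subprocess

def headless_render_check(path):
    """Ask Ghostscript to interpret the file against a null output device:
    the file is parsed and 'rendered', but nothing is drawn or written,
    and any complaints the interpreter makes are captured for review."""
    result = subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=nullpage", path],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code, or anything on stderr, is treated here as
    # "did not render cleanly" – a crude proxy, not a definitive verdict.
    clean = result.returncode == 0 and not result.stderr.strip()
    return clean, (result.stdout + result.stderr).strip()

clean, log = headless_render_check("example.pdf")  # hypothetical file
print("renders cleanly?", clean)
if not clean:
    print(log)
```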
It's unclear to me whether this would be a realistic ask - I think we'd need to try it out, and explore whether rendering blips could really be trapped and usefully reported. And of course every renderer is different. On the plus side, we might be able to ride on the back of work to develop and support existing rendering tools. Our community has not performed well at sustaining its own tool output. This approach might just move us closer to the most important preservation tools in existence - the ones that render and make sense of the 0s and 1s in our archives.
Ok, stop messing around and give us a more practical answer
At the very least I think we should be completely re-evaluating how we implement validation. Our focus should be on cherry-picking a very small number of validation results that answer very specific risk-oriented questions such as: is it encrypted and is it complete (1 and 2, above). Our focus should also be on otherwise ignoring the minutiae of individual validation results and finding ways to perform crude meta-analyses of these (imperfect) results in their entirety in order to answer more general questions such as: is it catastrophically broken and is it what we think it is (3 and 4, above).
This suggests that a very different approach to new research and development work is required. Let's not even try to fix up our validators so that they work more effectively. Let's assume they're always going to be a bit imperfect, and instead devote our efforts to making them answer real digital preservation questions.
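To give a flavour of what that crude meta-analysis might mean in practice, here is a minimal sketch. It deliberately ignores individual messages and only asks the risk-oriented questions above. The input format (a list of message strings per file), the message patterns and the "catastrophically broken" threshold are all invented for illustration, and would need tuning against the output of a real validator such as JHOVE or veraPDF.

```python
# Assumed input: a list of validator message strings for one file,
# however they were produced (e.g. harvested from JHOVE or veraPDF output).
RISK_PATTERNS = {
    "encrypted": ["encrypt"],                                # question 1
    "incomplete": ["eof", "unexpected end", "truncated"],    # question 2
}
BROKEN_THRESHOLD = 25  # invented: "lots of errors" as a crude proxy for question 3

def meta_assess(messages):
    """Collapse a pile of validation messages into a handful of
    risk-oriented flags, instead of trying to interpret each one."""
    lowered = [m.lower() for m in messages]
    flags = {
        risk: any(pattern in msg for msg in lowered for pattern in patterns)
        for risk, patterns in RISK_PATTERNS.items()
    }
    flags["catastrophically_broken"] = len(messages) >= BROKEN_THRESHOLD
    return flags

# Hypothetical example for a single file:
print(meta_assess(["Unexpected end of file", "Invalid date in Info dictionary"]))
```

The point is not these particular patterns, but that the unit of analysis becomes the preservation question rather than the individual validator message.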
Come on, fess up! Validation *can* help us, but indirectly...
I just wanted to add a final note relating to my thinking on veraPDF. My experiences in working on the project certainly helped to shape some of my thoughts in this post. I should say however that despite my attempt to be provocative with some of my words in this post, I *do* believe there is a place for validation in our digital preservation armoury, and veraPDF illustrates this well. By clarifying the "target" of a well constructed PDF/A file, veraPDF will hopefully improve the quality of PDF generators and consequently the quality of PDFs in circulation - clearly a big issue with this format. Ultimately this should make preservation of those PDFs a little easier.
Where next?
So what do you think? I would love to see some deeper debate on this topic. What have I missed from my requirements for file assessment? Should we scrap validation? Could we realistically put rendering tools into action as a substitute for validators?
Acknowledgements
Thanks and acknowledgements go to a whole host of people who I've talked to about these issues over the years, particularly Andy Jackson, Sheila Morrissey, Yvonne Tunnat, Johan van der Knijff, Carl Wilson and many more. I'd also like to thank Libor Coufal who replied to my original listserv post and reminded me of an important omission: the need to follow up on file format identification that is uncertain - I've added this to the blog post above. Thanks to William Kilbride for coming up with the title for this blog. And thanks to Bernadette Houghton for her question on the digipres listserv that sparked off my rant.
Comments
>What Paul really asks for is authoritative validation, the results of which are presented
>in a manner that's well understood and useful to the digital preservation community.
What my blog post aimed to do was ask what we are trying to achieve when we apply a validation tool in a typical digital preservation ingest workflow. My suggestion is that we have completely lost sight of what we set out to achieve in the first place. Therefore, after many years of trying to get validation to work well and failing (in my estimation), we should stop and consider if another approach would be better.
To re-iterate: I think that we don't actually want to know whether a file matches the rules in a file format specification (i.e. format validation). What we do want to know is: does the file render now, and is it likely to render in the future? This is a different question to the one that validation answers.
>Authoritative validation is best provided by a reference implementation of the "parse
>and report" renderer, or indeed - a validator.
>Once you're happy you have an authoritative validator
Can we dip into what you mean by this term? Authoritative in what way? Gary McGath, original author of JHOVE, has observed that the notions of well formed and/or valid were merely broadly interpreted from their XML definitions. As Gary described in a blog post response to mine "With most formats, I never knew exactly what these terms meant..." And he goes on "With most other formats, there’s no definition of these terms. JHOVE applies them anyway. (I wrote the code, but I didn’t design JHOVE’s architecture. Not my fault.) I approached them by treating “well-formed” as meaning syntactically correct, and “valid” as meaning semantically correct. Drawing the line wasn’t always easy. If a required date field is missing, is the file not well-formed or just not valid? What if the date is supposed to be in ISO 8601 format but isn’t? How much does it matter?"
https://madfileformatscience.garymcgath.com/2018/10/15/format-validation-problem/
But that's just about our interpretation of validity. What are we to consider authoritative? The format specification? The renderer/viewer reference implementation? Do either of these things even exist (many don't)? Is the specification complete (are any?)?
So we're back to the question of "what is a format?", "what is a valid format?" and of course "is this file valid?". If these questions are virtually impossible to answer, then my response is "do we care what the answer is anyway?" and "maybe we're asking the wrong question?" and eventually "my head hurts, is it time to go to the pub yet?", followed by the inevitable "why are we in the pub and still talking about file formats?".
> it's a question of interpreting the
>results. This is not a trivial problem and is both technical and dependant on preservation
>needs and policies. Collaboration between format experts and preservationists is the best
>way of ensuring that the necessary expertise is applied.
As you say "not a trivial problem". I don't see any evidence that this problem has been solved. Do you think it has? Are we close? Do we have the resource to solve this problem for the formats we care about?
>It would be a ‘brave’ DP professional who relied on a proprietary, closed source, GUI based
>renderer for a long term sustainable solution.
I broadly agree with this statement, but I'm not sure why it's relevant here. If we were to look at applying/modifying renderers to see if they render our files without error (as I suggest in my post), I'd start with open source ones, of which there are plenty.
Authoritative validation is best provided by a reference implementation of the "parse and report" renderer, or indeed - a validator.
Once you're happy you have an authoritative validator it's a question of interpreting the results. This is not a trivial problem and is both technical and dependant on preservation needs and policies. Collaboration between format experts and preservationists is the best way of ensuring that the necessary expertise is applied.
It would be a ‘brave’ DP professional who relied on a proprietary, closed source, GUI based renderer for a long term sustainable solution.