File Formats

File formats define how information is encoded in a digital file. File formats can be standardised, open, well documented and possibly associated with a reference implementation for how software should interact with files of that format. But file formats are not always as clearly defined, and format specifications are not always closely followed by the software that implements them. Understanding file formats, and how we interact with them in practice, can therefore be critical to ensuring effective digital preservation. This page provides some guidance on the best sources of further information on file formats. For a broad introduction to file formats and digital preservation, see the DPC Handbook:

See also, the DPC Technology Watch Reports:

Understanding the broader challenges associated with file formats

A number of pieces of work have sought to develop methods of assessing the appropriateness of particular file formats for preservation, typically based on high-level criteria. This includes the now somewhat dated DPC Tech Watch report. More recent thinking has begun to move away from this approach, due to the need to base decisions on practical experience of working with file formats and software:

Precision and completeness are not qualities that can always be associated with file format specifications, and this problem lies at the root of many preservation challenges:

Examples from the Information Security community, while not typical of the preservation challenges we are likely to experience, illustrate the flexibility in many file format specifications:
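The gap between a specification and the software that reads a format can be shown with a small sketch. It assumes, purely for illustration, a strict reader that expects the `%PDF-` marker at the very start of a file and a tolerant reader that accepts it anywhere in the first 1024 bytes (a tolerance that has been reported for some PDF viewers). The function names and the window size are assumptions made for this example, not part of any particular tool.

```python
# Marker that opens a PDF file.
HEADER = b"%PDF-"

# Assumption for illustration: a tolerant reader searches this many bytes.
LENIENT_WINDOW = 1024


def strict_pdf_header(data: bytes) -> bool:
    """Accept only files whose first bytes are the PDF marker."""
    return data.startswith(HEADER)


def lenient_pdf_header(data: bytes) -> bool:
    """Accept files with the PDF marker anywhere in the leading window."""
    return HEADER in data[:LENIENT_WINDOW]


# A file that begins like a JPEG but carries a PDF marker a few bytes in.
ambiguous = b"\xff\xd8\xff" + b"padding" + b"%PDF-1.4\n"
print(strict_pdf_header(ambiguous), lenient_pdf_header(ambiguous))
```

The two readers disagree about the same bytes, which is the kind of flexibility that allows a single file to be treated as more than one format.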

File format identification

Applying a specialist software tool to identify the formats of files to be preserved is typically one of the first steps in a digital preservation workflow. Read more about File format identification here...
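As a rough illustration of the idea behind identification, the sketch below matches the leading bytes of a file against a handful of well-known signatures. It is a minimal example under stated assumptions: the `SIGNATURES` table is a tiny hand-picked subset, the function names are hypothetical, and specialist tools rely on far larger signature registries and more sophisticated matching.

```python
# Illustrative subset of leading-byte signatures; not a complete registry.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
]


def identify_bytes(header: bytes) -> str:
    """Return the name of the first signature that matches the leading bytes."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first few bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))


print(identify_bytes(b"%PDF-1.7\n"))
```

A result of "unknown" only means that none of the listed signatures matched; it says nothing about whether the file is valid or renderable.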

Seeking reference information and guidance on specific formats

There are a number of excellent sources of information to assist digital preservationists. Wikipedia remains a good place to start for high-level information about a particular file format. The associated Wikidata is also the focus of the latest effort to build a collaborative registry of file format information.

A small number of libraries and archives have been developing their own preservation-focused assessments of particular file formats. These provide useful guidance on the risks associated with common file formats, and approaches for addressing them. They are located in different places on the web, but are linked from the home of a loose collaboration between these organisations on the DPC Wiki:

The Just Solve wiki provides a community-driven site for gathering information about different file formats and is particularly good for discovering information on more obscure file formats:

Child Tags

PDF, PDF/A, JPEG2000

Parent Tags

Issues

Documents

pdf Directory of Digital Repositories and Services in the UK June 2005


pdf Institutional Repositories


Articles

PDF/Eh? redux: putting veraPDF into practice. Or how I rediscovered my inner geek

Ancient history: how we got here. Way back in 2013 the DPC collaborated with the OPF on a project called SPRUCE. Following on from the success of another little project called AQUA, and with some very handy funding from the Jisc, we ran a bunch of mashup events and got hands on with all sorts of digital preservation challenges. The management of PDF files, and particularly risk assessment, was a recurring theme. In response, the SPRUCE project held a hackathon in Leeds where...



VeraPDF

The veraPDF consortium will deliver a definitive validator for PDF/A: an authoritative corpus of test files establishing the objective frame of reference for validation of all parts and conformance levels of PDF/A, and an open-source, purpose-built validator and policy checker to implement the collecting policies of memory institutions. We expect a vibrant community will develop to sustain these efforts. The veraPDF consortium brings together a unique network of stakeholders with...



Re:Format - What is file format obsolescence and does it really exist?

Digital preservation literature identifies file format obsolescence as one of the main threats, if not *the* threat, to the longevity of our digital data. Files must be migrated or emulated as they become obsolete, to ensure that they can still be rendered and used in the future. As Jeff Rothenberg famously put it at the end of the 1990s: "digital information lasts forever—or five years, whichever comes first". More recently however, the community has grown more sceptical. Luminaries such as...



Preserving Documents Forever: When is a PDF not a PDF?

Presentations: An introduction to PDF (Sarah Higgins, Aberystwyth University); Understanding PDF risks in preservation (Johan van der Knijff, National Library of the Netherlands); PDF: Myths vs facts (Ange Albertini, Corkami); Preserving PDF at the coalface (Tim Evans, Archaeology Data Service); Introducing veraPDF (Carl Wilson, Open Preservation Foundation). The Digital Preservation Coalition and the Open Preservation Foundation, with support from the European Commission and the...



Current Trends and Future Directions for Digital Imaging in Libraries and Archives

Introduction: Issues of validation, compression and preservation become more important in image management as collections grow in size and complexity. On the one hand, compression is seen as a necessary requirement to deal with the scale of the collection in order to make preservation a practical reality, but preservation advice generally discourages compression, which is seen as a preservation risk. Validation is essential for quality assurance in the development of large collections and is a...



Digital Preservation with Portable Documents: a workshop to introduce and discuss the PDF/A version

Introduction The portable document format (PDF) is ubiquitous, easily-produced and is widely used in a diverse range of environments.  A variant of the standard – PDF/A in which ‘A’ stands for archive – was published in 2005.  This version of the standard, also published as ISO 19005, minimises the dependencies between the contents of a file and the system on which it is rendered.  This self-contained characteristic of PDF/A makes it particularly attractive for those...



JPEG 2000 for the Practitioner

Introduction: A free seminar to explore and examine the use of JPEG 2000 in the cultural heritage industry was held at the Wellcome Trust. The seminar included specific case studies of JPEG 2000 use. It examined technical issues that have an impact on practical implementation of the format, and explored the context of how and why organisations have chosen to use JPEG 2000. Although the seminar had an emphasis on digitisation and digital libraries, the papers are relevant to a range...



DPC/BL Joint JPEG 2000 Workshop

Introduction The JPEG2000 image compression technique has been cited by experts as a new archiving format for digital images. It is both a preservation and delivery format, and has been seen as a possible alternative to the TIFF format which most institutions use as a long-term archiving standard. Produced by both imaging experts and the Joint Photographic Experts Group, it is now a recognised ISO standard. The standard JPEG file format which is so widely in use is not yet an ISO...

