
Unless otherwise stated, content is shared under CC-BY-NC Licence

Analysing PDFs with the PyMuPDF library

Edith Halvarsson

Last updated on 7 June 2022

This blog post is by Sebastian Lange, Software Engineer with the Bodleian Digital Library Systems and Services (BDLSS) department and Edith Halvarsson, Digital Preservation Officer with Bodleian Libraries’ department of Open Scholarship Support.  

Like many heritage institutions, Bodleian Libraries holds a vast collection of PDFs, created in various flavours and software over the past 20 years. These documents have come to the libraries from diverse sources – such as digitization suppliers, academic depositors, and born-digital personal archives.

We wanted a quick and dirty way of scanning our PDF collections for particular features, tailoring these to the needs of the Libraries’ vast and diverse collections. Using the PyMuPDF library, we created a small tool which helps us gather more information about the current state of our PDFs, especially, but not exclusively, regarding their accessibility. While our PDF analysis tool is less detailed than validation tools (like veraPDF), using the PyMuPDF library can be a good first step for analysing PDFs and flagging potential high-level digital preservation risks.
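
To give a flavour of what such a first-pass scan might look like, here is a minimal sketch. It is not the Bodleian tool itself: the function names `summarise_pdf` and `summarise_file`, and the particular features collected, are assumptions made for illustration. It relies only on PyMuPDF's document metadata, page count, encryption flag and per-page text extraction.

```python
from typing import Any, Dict


def summarise_pdf(doc: Any) -> Dict[str, Any]:
    """Collect a few high-level features from an opened PyMuPDF document.

    Hypothetical helper: the features chosen here are illustrative only.
    """
    metadata = doc.metadata or {}
    if doc.is_encrypted:
        # Pages cannot be read until the document is authenticated,
        # so skip the text scan for encrypted files.
        pages_without_text = None
    else:
        # A page with no extractable text is often an image-only scan,
        # which is a possible accessibility concern.
        pages_without_text = sum(
            1 for page in doc if not page.get_text().strip()
        )
    return {
        "pdf_version": metadata.get("format"),
        "producer": metadata.get("producer"),
        "has_title": bool(metadata.get("title")),
        "encrypted": doc.is_encrypted,
        "page_count": doc.page_count,
        "pages_without_text": pages_without_text,
    }


def summarise_file(path: str) -> Dict[str, Any]:
    """Open a PDF with PyMuPDF and summarise it (hypothetical wrapper)."""
    import fitz  # PyMuPDF; imported here so summarise_pdf works without it

    doc = fitz.open(path)
    try:
        return summarise_pdf(doc)
    finally:
        doc.close()
```

Calling `summarise_file` on each PDF in a directory and writing the resulting dictionaries to a CSV file would give a rough overview of a collection. Files that stand out could then be passed to a fuller validator.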

Digitisation of The Scotsman Collection: Digital Access and Preservation at Historic Environment Scotland

Christopher Viney

Last updated on 17 May 2022

Christopher Viney is Archive Digitisation Officer at Historic Environment Scotland

Over the past decade, issues in access to archival collections have been thrown sharply into focus, demonstrating the value of digitising analogue archival material. This provides a greater level of access and mitigates potential barriers that can often shut people out of our institutions. Indeed, digitisation and digital preservation provide a key tool in the work of Historic Environment Scotland to deliver on its vision ‘to make sure Scotland’s heritage is cherished, understood, shared and enjoyed with pride by everyone’. After the fantastic work of the Archive Digital Project, showcased in the online exhibition “Beyond the Physical”, Historic Environment Scotland has continued to improve access to and preservation of our collections. One such recent project has been the digitisation of The Scotsman Collection.


No time to waste: what’s ticking at CLOCKSS

Alicia Wise

Last updated on 11 May 2022

Alicia Wise is Executive Director of CLOCKSS

Time is a thief of memory, even for formal publications, unless long-term digital preservation arrangements are in place. It takes a community to safeguard the scholarly record. It is too big a job for any single organisation, and too horrific for our species if done badly.

Reducing the pain of procurement

Michael Popham

Last updated on 6 May 2022

Michael Popham is Digital Preservation Analyst at the DPC and Jenny Mitcham is Head of Good Practice and Standards at the DPC.

Perhaps you’ve been given the go-ahead to procure a “digital preservation system”, or you’re trying to work out what differentiates such a system from the applications and infrastructure that you already have in place? How do you decide what you really need, especially in light of the rapidly evolving marketplace of commercial and open source preservation solutions? The DPC has recently launched a set of resources designed to help.

First steps to a guide for computational access to digital repositories

Leontien Talboom

Last updated on 4 May 2022

Leontien is a collaborative PhD student at The National Archives, UK and University College London; her research is about access to born-digital material.

Within the digital preservation community, the term computational access is popping up more and more frequently. It is often linked to other terms such as artificial intelligence, data mining and deep neural networks. However, there is often little understanding of what these terms actually mean and how they relate to each other. 

A #DPClinic chat about persistent identifiers

Jenny Mitcham

Last updated on 3 May 2022

On Friday last week, our latest #DPClinic chat delved into the topic of persistent identifiers (PIDs). As I remarked at the start of the session, persistent identifiers pop up as an example of accepted good practice in DPC RAM, our Rapid Assessment Model. They are mentioned at the managed level of the metadata section with the example “Persistent unique identifiers are assigned and maintained for digital content.”

Capturing and preserving practice based research

Holly Ranger

Last updated on 27 April 2022

Holly Ranger is Research Data Management Officer in the Research & Knowledge Exchange Office at the University of Westminster.

Practice Research Voices (PR Voices) is an Arts and Humanities Research Council funded project led by the University of Westminster. The project is scoping the development of an Open Library of Practice Research for the dissemination and preservation of practice research, building on existing software and standards and guided by open research principles.

‘Practice research’ is ‘an umbrella term that describes all manners of research where practice is the significant method of research conveyed in a research output’ (Bulley and Sahin, 2021). Practice research outputs are typically multi-component portfolios or collections of non-text file formats which are disseminated and hosted in separate places such as personal websites, institutional repositories, archives, and commercial video-sharing platforms. These factors pose a significant challenge to the preservation and reuse of practice research and practice research data.

Digital Preservation at BT Archives

BT Archives

Melanie Peart

Last updated on 5 April 2022

Melanie Peart is Archives Specialist at BT Heritage & Archives.

BT Archives was set up over 35 years ago to preserve the history of the then recently privatised company, British Telecom.  Now a global telecommunications company, BT has its origins in mid-19th century telephone and telegraph businesses.  Its history includes, amongst many other things, the iconic red telephone box (or more accurately telephone kiosk), undersea cable-laying and memorable advertising campaigns.  So there is a wide and important history which we in the archives team are very keen to preserve and make available.  This includes digital archives as well as physical records and we have been collecting digital material since the 1990s and digitising physical archives for almost as long.

What’s up with using WhatsApp?

Jenny Mitcham

Last updated on 31 March 2022

Last week one of our monthly #DPClinic sessions focussed on the topic of preserving WhatsApp, an interesting subject that drew in a good crowd of people from the digital preservation community.


The session was triggered by a question from a DPC Member who was interested to find out how other organizations were tackling the challenge of WhatsApp preservation, in particular where it has been used for business functions and needs to be captured as a record. It was clear that this issue is a shared one and an emerging challenge for those working in digital preservation. There had previously been some discussion on Twitter on this topic with some really helpful replies to the question I posed.

Using data to support digital preservation practices: An NFSA case study

Lauren Curless

Last updated on 15 March 2022

Lauren Curless is the Data Integrity, Analytics & Information Management Manager at National Film and Sound Archive of Australia.

As an audio-visual archive, the NFSA has always been interested in storytelling. Australian culture is showcased in every item held in our collection, across a huge number of formats, in the stories of Australians from all walks of life. We’re in a unique position amongst cultural institutions: our collection is primed for digital preservation due to the nature of our content and existing curatorial, digitisation and access programmes.
