A Note from the Editor, Sara Day Thomson

As the coordinator of the DPC’s Web Archiving & Preservation Working Group, it has been my absolute pleasure to work with some of the most enthusiastic, creative, and persevering professionals in the field. The community of archivists, curators, librarians, researchers, and enthusiasts who do the work of capturing and preserving web resources has always displayed a collaborative spirit and a willingness to try new approaches and learn from each other.

The coronavirus pandemic has truly and profoundly put that spirit to the test, and the web archiving community has not disappointed.

‘The speed, scale, and level of interest in participating in this collective effort have been remarkable and have no comparison to previous collaborative endeavours,’ Jefferson Bailey from Internet Archive attests. ‘It is a great testament to the community's ability to work together.’

Over the last couple of weeks I’ve been in touch with a handful of the professionals at the frontlines of this effort to archive the global experience of coronavirus. In a series of blog posts, I’ll share their insights into this urgent undertaking to capture the world’s response to coronavirus (Covid-19) online.

 

Capturing the UK Government Response to the Coronavirus (COVID-19) Pandemic at The National Archives UK

By the UK Government Web Archiving Team at TNA

With more than 25 combined years of web archiving experience between us, the UK Government Web Archiving team has previously dealt with plenty of major events, ranging from the London 2012 Olympic and Paralympic Games to Brexit, with several General Elections between. However, the Coronavirus (COVID-19) pandemic is on a different scale. Although we started undertaking targeted captures of content related to the UK Government response in February 2020, the situation in the UK accelerated much more quickly than anticipated.

 

[Image: UKGWA homepage]

 

On the evening of Monday 16 March 2020, the Prime Minister announced that everyone should work from home if possible. The following morning The National Archives announced that the building would close to the public that evening and to staff by the end of the week. By Friday our team was all working remotely for the first time, as were our contractors MirrorWeb. Our systems are all web/cloud-based, so can be accessed remotely. It has taken some time and effort, however, to find and install tools which enable us to work on different machines from usual. We are also holding morning virtual ‘stand up’ meetings to make sure we stay in touch.

 

[Image: GOV.UK advice]

 

The major events we have dealt with previously have almost all had a predictable, limited duration or have impacted on a relatively small part of government. As the COVID-19 pandemic impacts all UK Government work and its duration is unclear, we have had to adapt accordingly. Additionally, government is creating many new web resources at high speed on a number of different platforms. We are keeping a close watch on the news and our main central government website GOV.UK to identify new services.

 

We have adopted a 3-pronged approach to gain maximum coverage and quality, at the required speed of capture.

1. Bulk capture of sites

The team has deployed a script that checks for specific words (including “covid” and “coronavirus”, among other terms) on the homepages of all the (2000+) government websites in scope for UKGWA. Those that contain any such words are crawled to a limited depth in order to capture content relating to the pandemic.

Over 500 sites were identified for the first production crawl, which was launched on 20 March, and has now finished. We intend to re-run the script and crawl once per week for the duration of the pandemic.
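As an illustration only, the keyword check described above could be sketched along the following lines. This is not the team's actual script: the function names, the keyword tuple (limited to the two terms named in this post), and the choice to look only at visible text rather than scripts or stylesheets are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumed keyword list: the post names only these two terms "among other terms".
KEYWORDS = ("covid", "coronavirus")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def mentions_keywords(html, keywords=KEYWORDS):
    """Return True if the page's visible text contains any keyword (case-insensitive)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return any(keyword in text for keyword in keywords)


def select_sites(homepages, keywords=KEYWORDS):
    """Given {url: homepage_html}, return the URLs to queue for a shallow crawl."""
    return [url for url, html in homepages.items() if mentions_keywords(html, keywords)]


def fetch_homepage(url, timeout=10):
    """Hypothetical helper: download a homepage and decode it as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In this sketch, a weekly run would fetch each in-scope homepage with `fetch_homepage`, pass the results to `select_sites`, and hand the returned list to the crawler. Matching on substrings keeps the check simple but will also match longer words that contain a keyword, which may or may not be desirable.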

2. In-depth capture of sites with specific relevance

We have identified a small but growing number of sites/sub-sites that contain detailed information about the response to the pandemic. Using Heritrix, our default crawler, we are instructing MirrorWeb to crawl these sites, or the relevant sub-sections of those sites, regularly.

Frequently updated sites/sub-sites will be crawled on all weekdays and augmented with Webrecorder crawls of pages if significant updates are made over weekends.

Less frequently updated sites/sub-sites are being monitored using Visualping (a web page monitoring resource) and crawls will be ordered when changes are identified. The monitoring runs each morning.
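Visualping is a hosted service, and this post does not describe how it detects changes. Purely to illustrate the general idea of ordering a crawl only when a page has changed, here is a minimal sketch that compares SHA-256 digests of page content between runs. The function names and the digest-based comparison are assumptions for the example, not a description of Visualping or of the team's workflow.

```python
import hashlib


def content_digest(page_bytes):
    """Return a SHA-256 hex digest of the raw page content."""
    return hashlib.sha256(page_bytes).hexdigest()


def pages_needing_crawl(previous_digests, current_pages):
    """Compare today's pages against yesterday's digests.

    previous_digests: {url: hex digest} saved from the last run
    current_pages:    {url: bytes} fetched in this run
    Returns (changed_urls, new_digests); new_digests is stored for the next run.
    """
    changed_urls = []
    new_digests = {}
    for url, body in current_pages.items():
        digest = content_digest(body)
        new_digests[url] = digest
        # A URL seen for the first time also counts as changed.
        if previous_digests.get(url) != digest:
            changed_urls.append(url)
    return changed_urls, new_digests
```

A byte-level comparison like this flags any difference at all, including rotating banners or timestamps, so a real monitoring tool would typically need some way to ignore such noise.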

3. Capturing interactive web content using Webrecorder

Webrecorder enables us to capture content more complex than we can capture using our usual Heritrix-based approach. This mainly, but not exclusively, consists of interactive content (maps, forms, animations, etc.) and embedded video.

TNA staff capture the content using Webrecorder and send the WARC files to MirrorWeb, who integrate them with the main UKGWA collection.

We have identified a number of resources relating to COVID-19 which we can only capture using Webrecorder. The most frequently updated resources are captured daily, including the arcgis.com cases dashboard.

However, we have also identified a number of very significant resources relating to COVID-19 which it is not possible for us to capture regularly using Webrecorder either because the process is manual and too time consuming or because it would require us to input data which we do not have to hand. We will liaise with colleagues in other teams to help ensure information about these resources is captured in other ways – e.g. through the digital transfer process.

[Image: Webrecorder, GOV.UK]

 

Business as Usual

In addition to the newly instated 3-pronged approach to capture content related to coronavirus, we also continue to capture UK Government Twitter, YouTube and Flickr channels as usual and to operate our regular ‘business as usual’ monthly capture schedule.

We anticipate that our approach will adapt further as the pandemic continues and more web-based resources are released by government.

When the pandemic has ended, we are sure that our collection, along with those developed by colleagues around the world, will contribute greatly to the study of this unprecedented event.

 

In the next instalment, I’ll share the strategies web archivists in different countries, working for different types of organisations, have employed to capture the particular way coronavirus affects their own communities. I’ll look at how these resources might be used and discuss why web archivists hope their efforts can help reduce the spread of false information in the future.  

