Colin Armstrong is a Disc Imaging Technician at the British Library and attended iPRES 2018 with support from the DPC's Leadership Programme, which is generously funded by DPC Supporters.
It is important to state fae the off that web preservation and internet archiving are not something I am particularly familiar with, so I was pleased to attend the morning paper presentations on web preservation and get tae grips with the terminology, learn about some practical approaches to identifying and measuring web page similarity, understand how different web archive collections are created, and learn about some of the tools used to detect when collections go off-topic.
The first paper, titled ‘Measuring News Similarity Across Ten U.S. News Sites’, was presented by Grant Atkins and aimed to measure the similarity of daily news across multiple websites. The idea behind this work is that certain news stories will be included and emphasised over others due to editorial input, and this editorial input will differ between news websites and over the course of the day. Given that a significant amount of online news content can be used to record historical events, it is important to consider the barriers, tools and methods involved in identifying, recording and preserving news content.
Some initial challenges in mining archived news were that JavaScript would prevent playback on certain news websites, paywalls (e.g. the Wall Street Journal) would give only partial stories, and broken stylesheets (e.g. the Washington Post) would lead to a lack of structure; so memento selection and timing of collection were already important. Over a three-month period, mementos from ten U.S. news websites were collected at around the same time each day to ensure synchronicity, with the prominent lead or ‘hero’ stories then being empirically identified by a custom parser through cues like layout and central placement, larger fonts, and image size. The take-away results were that, where there is high variability and a large number of hero stories, it can be difficult to identify significant events such as national holidays; however, the highest similarity between news sites actually peaked in the days after a significant event (such as ‘Election Day’ or the ‘Travel Ban’), and so the methods and tools outlined may be used to identify synchronous stories outside of national events.
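To make the cross-site comparison concrete, here is a minimal toy sketch (the site names, headlines, and the use of plain Jaccard overlap are all my own illustration, not the paper's actual method): given each site's hero-story headline terms captured at the same hour, the average pairwise overlap gives a crude similarity signal, and a spike in that signal suggests a shared, significant story.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two term sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical hero-story headlines captured from three sites at the same hour:
hero_terms = {
    "site_a": set("election results announced tonight".split()),
    "site_b": set("election results still undecided".split()),
    "site_c": set("local team wins championship".split()),
}

# Average pairwise similarity across all site pairs:
pairs = list(combinations(hero_terms.values(), 2))
avg_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Here only sites A and B share a story, so the average sits well below 1; tracked over time, days when many sites converge on one event would stand out as peaks.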
The next speaker was introduced as a ‘twofer’ since Shawn Jones was presenting two papers: the ‘Off-Topic Memento Toolkit’ and ‘The Many Shapes of Archive-It’. The first of these papers tackled the issue of web archive collections whose original seeds, recrawled over time by an automated archival system, can drift off-topic, resulting in mementos that no longer reflect the curator’s original intentions for the collection. Transferring domain ownership, website redesign, a change in language, hacking or technical (maintenance) issues are some reasons this may happen; and so the Off-Topic Memento Toolkit (OTMT) was created to identify such off-topic mementos within web archive collections and allow for their removal or exclusion at a later date. Using a gold standard dataset and building on previous work from AlNoamany et al., eight different similarity measures for detecting off-topic mementos were evaluated, with the most successful being word count and the cosine of TF-IDF vectors. More on the OTMT and an initial release can be found here.
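The cosine-of-TF-IDF-vectors measure can be sketched in a few lines of plain Python (this is my own simplified illustration with invented memento text, not the OTMT's actual implementation or weighting scheme): each memento is turned into a term-weight vector, and later mementos are compared against the first capture, which is taken to represent the seed's original topic.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (raw term frequency, log IDF) per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy mementos of one seed over time; the last capture has drifted off-topic:
mementos = [
    "hurricane relief donations flood victims".split(),
    "hurricane victims relief fund donations".split(),
    "domain for sale contact registrar".split(),
]
vecs = tfidf_vectors(mementos)
scores = [cosine(vecs[0], v) for v in vecs[1:]]
```

A memento whose score against the first capture falls below a chosen threshold would be flagged as off-topic; here the "domain for sale" page scores near zero while the on-topic capture scores highly.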
The last presentation paper (the second by Shawn Jones) focused on collections within Archive-It, using structural metadata and domain diversity to understand crawling behaviour and curatorial decisions. By analysing growth curves of collections (beyond metadata and content), an insight into seed curation and collection behaviour was possible: for example, highlighting collections which start with many original sources but see waning interest later on, collections with sporadically renewed interest, or collections under constant curatorial control throughout their lifetime. Domain diversity was another metric used in the analysis, and successfully indicated whether a curator pulled from many sources or one, and whether from ‘top-level’ pages or in-depth URIs. Four ‘semantic categories’ were recognised and then mapped to these structural behaviours:
- Self-Archiving, accounting for 54.1% of surveyed collections, where an archiving institution or organisation archives its own web presence.
- Subject-based Archiving, where the seeds of a collection share a particular topic.
- Time bounded-Expected are collections that focus on planned, expected events or a specific time period (e.g. 2008 Olympics).
- Time bounded-Spontaneous are collections that start after an unplanned event (e.g. a major natural disaster) or the beginning of a movement (Black Lives Matter).
In essence there are a number of ways to understand and categorise web archive collections within Archive-It, beyond the metadata and collection content themselves!
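As a rough illustration of the domain-diversity idea (this simple ratio of unique hostnames to total seeds is my own sketch, not the paper's exact formula, and the seed URIs are invented): a self-archiving collection tends toward one domain crawled in depth, while a subject-based collection tends toward many distinct sources.

```python
from urllib.parse import urlparse

def domain_diversity(seed_uris):
    """Ratio of unique hostnames to total seeds:
    near 0 -> one source crawled in depth, near 1 -> many distinct sources."""
    domains = {urlparse(uri).hostname for uri in seed_uris}
    return len(domains) / len(seed_uris) if seed_uris else 0.0

# A self-archiving collection: one institution, several in-depth URIs...
self_archiving = [
    "https://www.example.ac.uk/",
    "https://www.example.ac.uk/library/",
    "https://www.example.ac.uk/archives/collections/",
]
# ...versus a subject-based collection drawing on many sources:
subject_based = [
    "https://news-one.example.com/story",
    "https://blog.example.org/post",
    "https://gazette.example.net/article",
]
```

Combined with a collection's growth curve, even a crude signal like this hints at whether seeds came from a single curated web presence or a spread of independent sites.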
The entire session was enlightening and engaging, with some excellent research being done and scope for future work; and I’d like tae thank the DPC Leadership Programme for the scholarship which allowed me to attend iPRES 2018.