Jefferson Bailey is Director of Web Archiving Programs for The Internet Archive in the USA


Archival collections have always been incomplete. Being homogenous, selective groups of records preserved through time, they support attestation and evidentiary consideration only through their longitudinal availability. Multiple appraisal, selection, and processing strategies have developed over the history of the archival endeavor to address the ways in which the archival collection is, by nature, a partial or symbolic representation. From documentation strategy to the study of “archival silences,” archivists and users alike have grappled with the challenges of incompleteness inherent in the archive. The emergence of born-digital records, and the ease of their creation, alteration, and publication, has compounded these challenges by introducing a documentary environment that is at once richer and more easily preserved, but also more dynamic, more ephemeral, and more partial. Furthermore, digital records have introduced their own characteristics of incompleteness: bit corruption, format migration, rendering and technological dependence, and other vulnerabilities that can impede recreation or interpretation.

These complexities have accelerated even as our ability to archive records for permanent preservation has vastly improved, both technically and in resource allocation. A basic web crawl with minimal administration can capture tens or hundreds of millions of born-digital documents. A single accessioned hard drive can contain a lifetime’s worth of personal papers or an era’s worth of official records. A single cultural heritage organization can preserve millions, even billions, of born-digital documents with relative ease and realistic costs. Digital preservation exists in a historical moment in which we can archive vastly more, yet that increase represents a diminishing portion of the available whole. We are blessed by a growing surfeit, and cursed by a diminishing fragment. Lossy accelerant.

While surfeit introduces challenges around storage costs, management through time, and discovery and access, these are relatively well-known challenges, or ones of scale rather than of implementation. Less conceptualized, however, is the role of elision and fragment in purely born-digital archives, especially web-harvested archives. Archived web content can be intellectually addressed from a variety of perspectives. Web archives are both individual digital objects (a PDF document, a CSS file) and a technical recreation (a web “page”) based on both static and dynamic elements served from different locations yet rendered in a single user experience. As well, web archives can be experienced both as content (page text, metatags) and as environment (rendering in a browser). The personalization and interactivity of much web content also complicate the notion of a unique instantiation. Crawling, replay technologies, robots.txt exclusions, permission policies -- all can introduce additional entanglements to the task of defining extent, authenticity, and comprehensiveness.

The ease of acquiring material from the web has introduced an inherent paradox: web-scale harvesting can create archives exponentially larger and broader than similarly oriented analog collections, thus greatly expanding the idea of completeness on a topical, thematic, or evidentiary basis. Yet at the same time, the sheer scale of many of these web archives, as well as the fragmentary nature of their hosting and composition, makes it challenging to assure quality and to manage the completeness of individual pages or their recreated replay. And the vast scale of the web itself, as a publication and communication platform, means that even massive archives will be, at best, a small fraction of the whole.

We as the heritage and preservation community -- the trusted custodians providing ongoing access to the historical record regardless of format or origin -- will need to ensure that our theories, practices, and, especially, public-facing explication and contextualization, take into account the increasing divergence between the surfeit and fragment that characterize the digital collections we work so hard to preserve.

