Handbook

Decision tree

Clearly defined selection policies save costs in two ways: they reduce the time taken to decide whether or not to select a resource, and they avoid the later cost of re-assessing digital resources which are in danger of becoming, or have already become, inaccessible.

This Decision Tree may be used as a tool to construct or test such a policy for your organisation. Your policy for selecting digital materials for the long term should address the decision process represented in the tree.

Assuming a digital resource is being considered for selection, the questions and choices reflected here will assist the ultimate decision to accept or reject long-term preservation responsibility. The flow of the questions represents a logical order of evaluation. If the response to early questions is not favourable, there is little point in accepting preservation responsibility for the resource or continuing its evaluation. For example, if the content does not meet your collection policy, then the response to questions on the technical format will be irrelevant. The structure of the tree aims to reflect this process.
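
The ordered, stop-at-first-failure evaluation described above can be sketched in Python. This is a minimal illustration only: the two questions echo the example in the text, but the key names (`meets_collection_policy`, `format_is_manageable`) and the `evaluate` function are hypothetical and are not taken from the Decision Tree itself.

```python
from typing import Callable, Optional

# Each entry pairs a question with a test. The wording and the dictionary
# keys are hypothetical placeholders, not the Decision Tree's own questions.
Question = tuple[str, Callable[[dict], bool]]

QUESTIONS: list[Question] = [
    ("Does the content meet the collection policy?",
     lambda r: r.get("meets_collection_policy", False)),
    ("Can the technical format be managed?",
     lambda r: r.get("format_is_manageable", False)),
]


def evaluate(resource: dict) -> tuple[str, Optional[str]]:
    """Ask the questions in order and stop at the first unfavourable answer."""
    for text, test in QUESTIONS:
        if not test(resource):
            # Later questions are irrelevant once an early one fails.
            return "reject", text
    return "accept", None
```

Returning the question that caused the rejection keeps a record of why a resource was turned down, which may help when the policy is reviewed.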

Remember...

When a policy is in place, to be effective it must also be:

  • Endorsed by senior management
  • Actively promulgated throughout the organisation
  • Reviewed at regular intervals
  • Allocated appropriate resource commitment



Contents

This contents page provides an "at a glance" view of the major sections and all their component topics.

You can navigate the Handbook by clicking and expanding the "Explore the Handbook" navigation bar or by clicking links in this contents page.

The contents are listed hierarchically and indented to show major sections and sub-sections. Landing pages provide overviews and information for major sections with many sub-sections.

Maintenance and additions to the new Handbook will be ongoing. Any new sections agreed for the next DPC publications plan will be shown as "coming soon".

Status key: a tick marks a complete section; sections still to come are marked "Coming soon". Every section listed below is marked complete.

Digital Preservation Handbook [landing page]
  Introduction
  How to use the Handbook
  Development and acknowledgements
Digital preservation briefing [landing page] (PDF of this section)
  Why digital preservation matters
  Preservation issues
Getting started (PDF of this section)
Institutional strategies [landing page] (PDF of this section)
  Institutional policies and strategies
  Collaboration
  Advocacy
  Procurement and third party services
  Audit and certification
  Legal compliance
  Risk and change management
  Staff training and development
  Standards and best practice
  Business cases, benefits, costs, and impact
Organisational activities [landing page] (PDF of this section)
  Creating digital materials
  Acquisition and appraisal
  Decision tree
  Retention and review
  Storage
  Legacy media
  Preservation planning
  Preservation action
  Access
  Metadata and documentation
Technical solutions and tools [landing page] (PDF of this section)
  Tools
  Fixity and checksums
  File formats and standards
  Information security
  Cloud services
  Digital forensics
  Persistent identifiers
Content-specific preservation [landing page] (PDF of this section)
  e-Journals
  Moving pictures and sound
  Web-archiving
Glossary

 


Geospatial data

 

This page is under construction.

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark

 


Email

 

 

This page is under construction.

 

Resources

 

Case studies


Society of American Archivists campus case studies

Partnering with IT to Identify a Commercial Tool for Capturing Archival E-mail of University Executives at the University of Michigan

http://files.archivists.org/pubs/CampusCaseStudies/CASE-14-FINAL.pdf

Aprille Cooke McKay, Bentley Historical Library, University of Michigan, examines the challenges and opportunities of partnering with IT to issue a Request for Proposal (RFP) for commercial e-mail archiving software. 2013. 53 pages.

Will They Populate the Boxes? Piloting a Low-Tech Method for Capturing Executive E-mail and a Workflow for Preserving It at the University of Michigan

http://files.archivists.org/pubs/CampusCaseStudies/CASE-15-FINAL.pdf

Aprille Cooke McKay, Bentley Historical Library, University of Michigan. The first part of the paper describes a pilot study testing whether university executives and leaders would flag e-mail messages of long-term value to transfer to the archives. The second part describes the steps taken to move from an ad hoc approach to digital records transfer and processing to one much more routinized. 2013. 91 pages.


eBooks

 

This page is under construction.

 


Documents and PDF/A

 

This page is under construction.

 

Resources

 

Case studies


Newspaper e-prints

http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_CaseStudy_NewspaperEPrints.pdf

The US National Digital Stewardship Alliance (NDSA) examines the value, opportunities and obstacles for selective preservation of the PDF printmasters for newspaper e-prints. February 2013, 3 pages

 


Computer-aided design

 

This page is under construction.

 


Complex objects and software

 

This page is under construction.

 


Glossary


Introduction

Acronyms and initialisms are a feature of any specialised discipline. In an emerging discipline such as digital preservation, another major difficulty is the lack of a precise and definitive taxonomy of terms. Different communities use the same terms in different ways, which can make effective communication problematic. The following working set of definitions and acronyms is the one used throughout the Handbook and the DPC Technology Watch Reports and website. It is intended to assist in the Handbook's use as a practical tool.



 

 A

Access As defined in the Handbook, access is assumed to mean continued, ongoing usability of a digital resource, retaining all qualities of authenticity, accuracy and functionality deemed to be essential for the purposes the digital material was created and/or acquired for.

ADS Archaeology Data Service. A UK based service active in digital preservation. http://ads.ahds.ac.uk

AIP Archival Information Package. An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS (OAIS term).

AMIA Association of Moving Image Archivists, an organisation active in the field of moving image archiving. http://www.amianet.org

ARC Container format for websites devised by the Internet Archive, superseded by WARC.

ASCII American Standard Code for Information Interchange, standard for electronic text. https://en.wikipedia.org/wiki/ASCII

Authentication A mechanism which attempts to establish the authenticity of digital materials at a particular point in time. For example, digital signatures.

Authenticity The digital material is what it purports to be. In the case of electronic records, it refers to the trustworthiness of the electronic record as a record. In the case of "born digital" and digitised materials, it refers to the fact that whatever is being cited is the same as it was when it was first created unless the accompanying metadata indicates any changes. Confidence in the authenticity of digital materials over time is particularly crucial owing to the ease with which alterations can be made.

 B

Bit A bit is the basic unit of information in computing. It can have only one of two values, commonly represented as either 0 or 1. The two values can be interpreted as any two-valued attribute (yes/no, on/off, etc).

Bit Preservation A term used to denote a very basic level of preservation of a digital resource as it was submitted (literally, preservation of the bits forming a digital resource). It may include maintaining onsite and offsite backup copies, virus checking, fixity-checking, and periodic refreshment to new storage media. Bit preservation is not digital preservation but it does provide one building block for the more complete set of digital preservation practices and processes that ensure the survival of digital content and also its usability, display, context and interpretation over time.

Born-Digital Digital materials which are not intended to have an analogue equivalent, either as the originating source or as a result of conversion to analogue form. This term has been used in the Handbook to differentiate them from 1) digital materials which have been created as a result of converting analogue originals; and 2) digital materials, which may have originated from a digital source but have been printed to paper, e.g. some electronic records.

BWF Broadcast WAV format, the European Broadcasting Union standard for a WAV file, with extra metadata. http://www.digitalpreservation.gov/formats/fdd/fdd000003.shtml

Byte (B) A unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit of memory in many computer architectures.

 C

CCSDS Consultative Committee for Space Data Systems, the body responsible for the OAIS Reference Model. http://public.ccsds.org/default.aspx

Chain of Custody A key concept in forensics whereby the custody and provenance of digital hardware, media and files are safeguarded through, for example, the appointment of evidence custodians. The purpose of the Digital Evidence Bag (DEB) is to hold digitally, along with the evidential digital objects, provenance metadata that can be updated as required: a concept that is familiar to digital preservation practitioners.

Checksum A numerical signature derived from a file which is, for practical purposes, unique to its contents. Used to compare copies.

CLIR Council on Library and Information Resources. US based organisation active in digital preservation. http://www.clir.org

CNI Coalition for Networked Information. US based organisation active in digital preservation. http://www.cni.org

Continuing Access refers to the right of a subscriber to an electronic publication and their users to have on-going permanent access to electronic materials which have already been leased and paid for by the subscriber from a publisher. It is a term used, along with its synonyms perpetual access and post-cancellation access, in the information industry to describe the ability to retain access to electronic materials by the subscriber/licensee after the contractual licensing agreement with the publisher/licensor for those materials has ended, whatever the reason for the cessation. It may also cover as appropriate arrangements for digital preservation needed to guarantee some elements of continuing access.

COPTR Community Owned digital Preservation Tool Registry hosted by The Open Preservation Foundation. http://coptr.digipres.org

Crawl The act of browsing the web automatically and methodically to index or download content and other data from the web. The software to do this is often called a web crawler.

 

 D

Dark Archive is an archive that cannot be accessed by any current users but may be accessible at future dates subject to the occurrence of specific pre-defined events ('trigger event'). Access to the data is either limited to a few set individuals or completely restricted to all.

DCC Digital Curation Centre. A UK based organisation active in digital preservation. http://www.dcc.ac.uk

DDI Data Documentation Initiative. A de facto international metadata standard for describing data from the social, behavioral, and economic sciences. http://www.icpsr.umich.edu/DDI

Designated Community an identified group of potential consumers who should be able to understand a particular set of information from an archive. These consumers may consist of multiple communities, are designated by the archive, and may change over time (OAIS term).

Digital Archiving This term is used very differently within sectors. The library and archiving communities often use it interchangeably with digital preservation. Computing professionals tend to use digital archiving to mean the process of backup and ongoing maintenance as opposed to strategies for long-term digital preservation. It is the richer meaning, as defined under Digital Preservation, which has been used throughout this Handbook.

Digital Forensics The application of scientific technical methods and tools toward the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital information derived after-the-fact from digital sources.

Dim Archive provides bit preservation for the content plus digital preservation planning and actions for long-term perpetual access, and also limited current access (perhaps limited to on-site users or previous subscribers post-cancellation, etc.).

DigCurV Digital Curator Vocational Education Europe. A project funded by the European Commission to establish a curriculum framework for vocational training in digital curation. http://www.digcurv.gla.ac.uk/

Digital Materials A broad term encompassing digital surrogates created as a result of converting analogue materials to digital form (digitisation), and "born digital" for which there has never been and is never intended to be an analogue equivalent, and digital records.

Digital Preservation Refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary. Digital preservation is defined very broadly for the purposes of this study and refers to all of the actions required to maintain access to digital materials beyond the limits of media failure or technological and organisational change. Those materials may be records created during the day-to-day business of an organisation; "born-digital" materials created for a specific purpose (e.g. teaching resources); or the products of digitisation projects. This Handbook specifically excludes the potential use of digital technology to preserve the original artefacts through digitisation. See also Digitisation definition below.

  • Short-term preservation - Access to digital materials either for a defined period of time while use is predicted but which does not extend beyond the foreseeable future and/or until it becomes inaccessible because of changes in technology.
  • Medium-term preservation - Continued access to digital materials beyond changes in technology for a defined period of time but not indefinitely.
  • Long-term preservation - Continued access to digital materials, or at least to the information contained in them, indefinitely.

Digital Preservation Management Workshop and Tutorial An intensive training workshop and online tutorial developed and maintained by Cornell University Library, 2003-2006; extended and maintained by ICPSR, 2007-2012; and now extended and maintained by MIT Libraries, 2012-on. http://dpworkshop.org/

Digital Publications "Born digital" objects which have been released for public access and either made available or distributed free of charge or for a fee. They may consist of networked publications, available over a communications network or physical format publications which are distributed on formats such as floppy or optical disks. They may also be either static or dynamic.

Digital Records See Electronic Records

Digital Resources See Digital Materials

Digitisation The process of creating digital files by scanning or otherwise converting analogue materials. The resulting digital copy, or digital surrogate, would then be classed as digital material and then subject to the same broad challenges involved in preserving access to it, as "born digital" materials.

DIP Dissemination Information Package. An Information Package, derived from one or more Archival Information Packages (AIPs), and sent by Archives to the Consumer in response to a request to the OAIS (OAIS term).

DLF Digital Library Federation. A US based organisation active in digital preservation. http://www.diglib.org

Documentation The information provided by a creator and the repository which provides enough information to establish provenance, history and context and to enable its use by others. See also Metadata.

DOI Digital Object Identifier. A technical and organisational infrastructure for the registration and use of persistent identifiers widely used in digital publications and for research data. The DOI system was created by the International DOI Foundation and was adopted as International Standard ISO 26324 in 2012. http://www.doi.org

DPC Digital Preservation Coalition. A UK and Ireland based organisation active in digital preservation and responsible for the Digital Preservation Handbook. http://www.dpconline.org

DPTP Digital Preservation Training Programme, an intensive training course run by the University of London Computer Centre. https://dptp.london.ac.uk/

DRAMBORA Digital Repository Audit Methodology Based on Risk Assessment. A set of risk assessment tools developed by the Digital Curation Centre. http://www.dcc.ac.uk/resources/repository-audit-and-assessment/drambora

DROID A file profiling tool developed and distributed by TNA to identify file formats. Based on PRONOM. http://www.nationalarchives.gov.uk/information-management/manage-information/policy-process/digital-continuity/file-profiling-tool-droid/

 E

Electronic Records Records created digitally in the day-to-day business of the organisation and assigned formal status by the organisation. They may include for example, word processing documents, emails, databases, or intranet web pages.

Emulation A means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on future generations of computers.

Escrow A widespread legal practice of the deposit of content or software source code with a third party. Escrow takes place in a contractual relationship, formalized in an escrow agreement, between at least three parties: the provider, the customer, and the third party providing the escrow service.

 F

FIAF International Federation of Film Archives, an association of the world's leading film archives. http://www.fiafnet.org

FIAT International Federation of Television Archives, a professional association for those engaged in the preservation and exploitation of broadcast archives. http://fiatifta.org

File Format A file format is a standard way that information is encoded for storage in a computer file. It tells the computer how to display, print, process, and save the information. It is dictated by the application program which created the file, and the operating system under which it was created and stored. Some file formats are designed for very particular types of data; others can act as a container for different types. A particular file format is often indicated by a file name extension containing three or four letters that identify the format. http://en.wikipedia.org/wiki/File_format
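
A file name extension is only a hint about the format, which is why identification tools also look inside the file. The Python sketch below compares an extension with the leading bytes of the content. The two signatures used (`%PDF-` for PDF and the eight-byte PNG header) are well known; the function name `looks_like` and the tiny signature table are assumptions made for this example and are no substitute for a registry-based identification tool.

```python
from pathlib import PurePath

# A deliberately tiny signature table, for illustration only.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".png": b"\x89PNG\r\n\x1a\n",
}


def looks_like(filename: str, content: bytes) -> bool:
    """Return True if the content starts with the signature expected for the extension."""
    extension = PurePath(filename).suffix.lower()
    signature = SIGNATURES.get(extension)
    if signature is None:
        raise ValueError(f"no signature known for {extension!r}")
    return content.startswith(signature)
```

A mismatch does not prove that a file is damaged; it may simply have been renamed. It does indicate that the extension alone should not be trusted.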

Fixity Check a method for ensuring the integrity of a file and verifying it has not been altered or corrupted. During transfer, an archive may run a fixity check to ensure a transmitted file has not been altered en route. Within the archive, fixity checking is used to ensure that digital files have not been altered or corrupted. It is most often accomplished by computing checksums such as MD5, SHA1 or SHA256 for a file and comparing them to a stored value. http://en.wikipedia.org/wiki/File_Fixity
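
A fixity check of the kind described above can be sketched with Python's standard `hashlib` module. This is a minimal example rather than a recommended workflow: the function names `sha256_of` and `fixity_ok` are hypothetical, SHA-256 is chosen only because it is one of the algorithms named in the definition, and the stored checksum is assumed to be a hexadecimal string.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path: Path, stored_checksum: str) -> bool:
    """Compare a freshly computed checksum with the stored value."""
    return sha256_of(path) == stored_checksum.strip().lower()
```

The same comparison serves both cases in the definition: after a transfer, the stored value is the checksum supplied by the sender; within the archive, it is the value recorded at ingest.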

 G

GIF Graphics Interchange Format, an image format which uses lossless (LZW) compression but is limited to a palette of 256 colours per image. http://en.wikipedia.org/wiki/GIF

Gigabyte (GB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Megabytes (MB).

GIS Geographical Information System, a system that processes mapping and data together.

 H

HTML Hypertext Markup Language, a format used to present text and other information on the World Wide Web. Since 1996, versions of the HTML specification have been maintained by the World Wide Web Consortium (W3C). http://en.wikipedia.org/wiki/HTML

 I

IASA International Association of Sound and Audiovisual Archives, an association for archives that preserve recorded sound and audiovisual documents. http://www.iasa-web.org

IIPC The International Internet Preservation Consortium. http://www.netpreserve.org

Information Assurance An aspect of digital security, specifically directed at ensuring that the quality of the information is demonstrably safeguarded, that it has not been tampered with or accessed inappropriately.

Ingest the process of turning a Submission Information Package (SIP) into an Archival Information Package (AIP), i.e. putting data into a digital archive (OAIS term).

InterPARES project International Research on Permanent Authentic Records in Electronic Systems. http://www.interpares.org

ISO International Organization for Standardization. http://www.iso.org/iso/home.html

 J

JHove2 A characterisation tool for digital objects. Characterisation comprises four elements: identifying the object's format; validating that the object conforms to its format's technical norms; extracting technical metadata from the object; and assessing whether the object should be accepted into a repository, based on policies set by the curator. https://bitbucket.org/jhove2/main/wiki/Home

JPEG Joint Photographic Experts Group, a committee that oversees international standards for compression and processing of digital photographs. The majority of JPEG formats are lossy. http://www.jpeg.org/

JPEG 2000 a revision of the JPEG format which can use lossless compression.

 K

Kilobyte (KB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Bytes.

 L

Life-cycle Management Records management practices have established life-cycle management for many years, for both paper and electronic records. The major implications for life-cycle management of digital resources, whatever their form or function, is the need actively to manage the resource at each stage of its life-cycle and to recognise the inter-dependencies between each stage and commence preservation activities as early as practicable. This represents a major difference with most traditional preservation, where management is largely passive until detailed conservation work is required, typically, many years after creation and rarely, if ever, involving the creator. There is an active and inter-linked life-cycle to digital resources which has prompted many to promote the term "continuum" to distinguish it from the more traditional and linear flow of the life-cycle for traditional analogue materials. We have used the term life-cycle to apply to this pro-active concept of preservation management for digital materials.

Lossless Compression A mechanism for reducing file sizes that retains all original data.

Lossy Compression A mechanism for reducing file sizes that typically discards data.
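
The defining property of lossless compression, that decompression returns every original byte, can be checked directly. The sketch below uses Python's standard `zlib` module; the sample text is arbitrary and chosen only because repetitive input compresses well.

```python
import zlib

original = b"digital preservation " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original.
assert restored == original
# Repetitive input compresses to fewer bytes than it started with.
assert len(compressed) < len(original)
```

No equivalent round-trip check can succeed for lossy compression, because the discarded data cannot be recovered from the compressed file.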

LOTAR (LOng Term Archiving and Retrieval) a digital preservation standard for 3D CAD models and product data management information developed by LOTAR International, an industrial consortium of aerospace and defence companies from the US and Europe. http://www.lotar-international.org

 M

Megabyte (MB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Kilobytes (KB).

Metadata Information which describes significant aspects of a resource. Most discussion to date has tended to emphasise metadata for the purposes of resource discovery. The emphasis in this Handbook is on what metadata are required successfully to manage and preserve digital materials over time and which will assist in ensuring essential contextual, historical, and technical information are preserved along with the digital object. The PREMIS Data Dictionary for Preservation Metadata has become a key de facto standard in digital preservation.

METS Metadata Encoding and Transmission Standard, a standard for presenting metadata using XML. http://www.loc.gov/standards/mets/

Migration A means of overcoming technological obsolescence by transferring digital resources from one hardware/software generation to the next. The purpose of migration is to preserve the intellectual content of digital objects and to retain the ability for clients to retrieve, display, and otherwise use them in the face of constantly changing technology. Migration differs from the refreshing of storage media in that it is not always possible to make an exact digital copy or replicate original features and appearance and still maintain the compatibility of the resource with the new generation of technology.

MIME Multipurpose Internet Mail Extensions. A protocol for including non-ASCII information in email messages. Email software typically includes interpreters that convert MIME content to and from its native format, as necessary. http://en.wikipedia.org/wiki/MIME
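
As an illustration of MIME carrying non-ASCII text, the following sketch uses Python's standard `email` package to build a message, serialise it, and parse it back. The addresses and subject line are invented for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

message = EmailMessage()
message["Subject"] = "Transfer of records"
message["From"] = "sender@example.org"      # hypothetical address
message["To"] = "archive@example.org"       # hypothetical address
# Non-ASCII text: the library records the charset in the MIME headers.
message.set_content("Pérennisation des informations numériques")

raw = message.as_bytes()                    # serialised form of the message

# Parsing the bytes back recovers the original text from the MIME headers.
parsed = message_from_bytes(raw, policy=policy.default)
body = parsed.get_content()
```

The `Content-Type` header written by `set_content` is what allows the receiving software to decode the body correctly.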

MPEG Moving Picture Experts Group. A committee responsible for the development of international standards for compression, decompression, processing, and coded representation of moving pictures, audio and their combination. https://mpeg.chiariglione.org/

 N

NCDD The Netherlands Coalition for Digital Preservation. http://www.ncdd.nl/en/

NDSA National Digital Stewardship Alliance, a US based organisation active in digital preservation. http://www.digitalpreservation.gov/ndsa/

NESTOR The German competence network for digital preservation. http://www.langzeitarchivierung.de/Subsites/nestor/EN/Home/home_node.html/

 O

Open Archival Information System (OAIS) An Archive, consisting of an organization, which may be part of a larger organization, of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community. It meets a set of responsibilities, as defined in section 4 of the OAIS standard, that allows an OAIS Archive to be distinguished from other uses of the term ‘Archive’. The term ‘Open’ in OAIS is used to imply that the OAIS standards are developed in open forums, and it does not imply that access to the Archive is unrestricted. The OAIS abbreviation is also used commonly to refer to the Open Archival Information System reference model standard which defined the term. The standard is a conceptual framework describing the environment, functional components, and information objects associated with a system responsible for long-term preservation. As a reference model, its primary purpose is to provide a common set of concepts and definitions that can assist discussion across sectors and professional groups and facilitate the specification of archives and digital preservation systems. It has a very basic set of conformance requirements that should be seen as minimalist. OAIS was first approved as ISO Standard 14721 in 2002 and a 2nd edition was published in 2012. Although produced under the leadership of the Consultative Committee for Space Data Systems (CCSDS), it had major input from libraries and archives.

OPF Open Preservation Foundation, formerly the Open Planets Foundation. http://openpreservation.org

 P

PAIMAS Space Data and Information Transfer Systems - Producer-Archive Interface - Methodology Abstract Standard. This ISO 20652:2006 standard covers the first stages of the ingest process defined by OAIS reference model. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=39577

PDF Portable Document Format, a set of formats and open standards for producing and sharing electronic documents, originally developed by Adobe Systems. The original page description format has been elaborated over successive versions to enable the embedding of such complex objects as image, audio, and moving image files, hyperlinks, embedded XML metadata, and updatable forms. Specifications for the various versions and profiles of the format are now maintained by the International Organization for Standardization (ISO). http://www.adobe.com/uk/products/acrobat/adobepdf.html

PDF/A Versions of the PDF standard intended for archival use. http://www.aiim.org/Research-and-Publications/Standards/Committees/PDFA

PDI Preservation Description Information. The information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, Context, and Access Rights Information (OAIS term).

Perpetual Access see Continuing Access.

Petabyte (PB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Terabytes (TB).

PIN Pérennisation des Informations Numériques, the French national interest group for digital preservation. http://pin.association-aristote.fr/doku.php

Post-cancellation Access see Continuing Access.

PREMIS Preservation Metadata: Implementation Strategies. A de facto standard for digital preservation metadata. http://www.loc.gov/standards/premis/

PRONOM A database of file formats, software products and other technical components required to support long-term access to electronic records and other digital objects of cultural, historical or business value. Used with DROID. http://apps.nationalarchives.gov.uk/PRONOM/Default.aspx

PST Personal Storage Table is a file extension for local 'personal stores' written by the program Microsoft Outlook. PST files contain email messages and calendar entries using a proprietary but open format, and they may be found on local or networked drives of email end users. Several tools can read and migrate PST files to other formats. http://en.wikipedia.org/wiki/Personal_Storage_Table

 Q

 R

Reformatting Copying information content from one storage medium to a different storage medium (media reformatting) or converting from one file format to a different file format (file re-formatting).

Refreshing Copying information content from one storage medium to another storage medium of the same type.

 S

Sandbox Containment A secure computing environment for running novel, unattested or experimental code or changes in code, including potentially malicious code. The environment is self-contained with tightly controlled resources and is characteristically virtual.

SGML Standard Generalized Markup Language, an ISO standard for how to specify a document markup language or tag set. http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language

Significant properties Characteristics of digital and intellectual objects that must be preserved over time in order to ensure the continued accessibility, usability and meaning of the objects and their capacity to be accepted as (evidence of) what they purport to be. https://www.archives.gov/files/era/acera/pdf/significant-properties.pdf

SIP Submission Information Package. An Information Package that is delivered by the Producer to the OAIS for use in the construction or update of one or more Archival Information Packages (AIPs) and/or the associated Descriptive Information (OAIS term).

SMPTE Society of Motion Picture and Television Engineers, a professional organisation and technical standards body for television and motion picture. https://www.smpte.org

 T

TDR Trusted Digital Repository. A trusted digital repository has been defined as having “a mission to provide reliable, long-term access to managed digital resources to its designated community, now and into the future”. The TDR must include the following seven attributes: compliance with the reference model for an Open Archival Information System (OAIS), administrative responsibility, organizational viability, financial sustainability, technological and procedural suitability, system security, and procedural accountability. The concept has been an important one particularly in relation to certification of digital repositories.

Terabyte (TB) A unit of digital information often used to describe data or data storage size, equates to approximately 1,000 Gigabytes (GB).

Three-Legged Stool A conceptual approach to digital preservation that suggests a fully implemented and viable preservation programme addresses organisational issues, technological concerns, and funding questions, balancing them like a three-legged stool. Developed as part of the Digital Preservation Management Workshop and Tutorial.

TIFF Tagged Image File Format, a common format for images, typically lossless. http://en.wikipedia.org/wiki/Tagged_Image_File_Format

TRAC Trusted Repository Audit and Certification, a toolkit for auditing a digital repository. http://www.crl.edu/sites/default/files/d6/attachments/pages/trac_0.pdf

Trigger Event A term used when specific conditions relating to an electronic publication and its continued delivery to users are met. If the publication is no longer available to users from the publisher or any other source, for whatever reason, a trigger event is said to have occurred. A trigger event can set in motion access for users via an archive where the electronic publication may be digitally preserved.

 

 U

UKWA UK Web Archive. http://www.webarchive.org.uk/ukwa/

 V

 W

WARC The WARC (Web ARChive) format is a container format for archived websites, also known as ISO 28500:2009. It is a revision of the Internet Archive's ARC File Format used to store web crawls harvested from the World Wide Web. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717

WAV Waveform Audio File Format, the standard file wrapper for audio; see BWF (Broadcast WAV Format) for the professional variant. http://en.wikipedia.org/wiki/WAV

Writeblockers Tools that prevent an examination computer system from writing or altering a collection or subject hard drive or other digital media object. Hardware writeblockers are generally regarded as more reliable than software writeblockers.

 X

XML Extensible Markup Language, a widely used standard (derived from SGML) for representing structured information, including documents, data, configuration, books, and transactions. It is maintained by the World Wide Web Consortium (W3C). http://www.w3.org/XML/

 Y

 Z




Web-archiving

 

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark

Overview

 

This case study provides a brief novice to intermediate level overview summarised from the DPC Technology Watch Report on Web-Archiving. Three "mini case studies" are included to illustrate the different operational contexts, drivers, and solutions that can be implemented. The report itself provides a "deep dive" discussing a wider range of issues and practice in greater depth with extensive further reading and advice (Pennock, 2013). It is recommended to readers who need a more advanced level briefing on the topic and practice.

 

Introduction

 

The World Wide Web is a unique information resource of massive scale, used globally. Much of its content will likely have value not just to the current generation but also to future generations. Yet the lasting legacy of the web is at risk, threatened in part by the very speed at which it has become a success. Content is lost at an alarming rate, risking not just our digital cultural memory but also organizational accountability. In recognition of this, a number of cultural heritage and academic institutions, non-profit organizations and private businesses have explored the issues involved and lead or contribute to development of technical solutions for web archiving.

 

Services and Solutions

 

Business needs and available resources are fundamental considerations when selecting appropriate web archiving tools and/or services. Other related issues must also be considered: organizations considering web archiving to meet regulatory requirements must, for example, consider associated issues such as authenticity and integrity, recordkeeping and quality assurance. All organizations will need to consider the issue of selection (i.e. which websites to archive), a seemingly straightforward task which is complicated by the complex inter-relationships shared by most websites that make it difficult to set boundaries. Other issues include managing malware, minimizing duplication of resources, temporal coherence of sites and long-term preservation or sustainability of resources. International collaboration is proving to be a game-changer in developing scalable solutions to support long-term preservation and ensure collections remain reliably accessible for future generations.

The web archiving process is not a one-off action. A suite of applications is typically deployed to support different stages of the process, though they may be integrated into a single end-to-end workflow. Much of the software is available as open source, allowing institutions free access to the source code for use and/or modification at no cost.

 

Integrated Systems for Web-archiving

 

A small number of integrated systems are available for those with sufficient technical staff to install, maintain and administer a system in-house. These typically offer integrated web archiving functionality across most of the life cycle, from selection and permissions management to crawling, quality assurance, and access. Three are featured here.

 

PANDAS

PANDAS (PANDORA Digital Archiving System) was one of the first available integrated web archiving systems. First implemented by the National Library of Australia (NLA) in 2001, PANDAS is a web application written in Java and Perl that provides a user-friendly interface to manage the web archiving workflow. It supports selection, permissions, scheduling, harvests, quality assurance, archiving, and access. PANDAS is not open source software, though it has been used by other institutions (most notably the UK Web Archiving Consortium from 2004 to 2008). It is used by the NLA for selective web archiving, whilst the Internet Archive supports their annual snapshots of the Australian domain.

Web Curator Tool (WCT)

The Web Curator Tool is an open source workflow tool for managing the selective web archiving process, developed collaboratively by the National Library of New Zealand and the British Library with Oakleigh Consulting. It supports selection, permissions, description, harvests, and quality assurance, with a separate access interface. WCT is written in Java within a flexible architecture and is publicly available for download from SourceForge under an Apache public licence. The WCT website is the hub for the developer community and there are active mailing lists for both users and developers. The highly modular nature of the system minimizes system dependencies.

NetarchiveSuite

NetarchiveSuite is a web archiving application written in Java for managing selective and broad domain web archiving, originally developed in 2004 by the two legal deposit libraries in Denmark (Det Kongelige Bibliotek and Statsbiblioteket). It became open source in 2007 and has received additional development input from the Bibliothèque nationale de France and the Österreichische Nationalbibliothek since 2008. It is freely available under the GNU Lesser General Public License (LGPL). The highly modular nature of the system enables flexible implementation solutions.

 

Third party and commercial services

 

Third party commercial web archiving services are increasingly used by organizations that prefer not to establish and maintain their own web archiving technical infrastructure. The reasons behind this can vary widely. Often it is not simply about the scale of the operation or the perceived complexity, but the business need and focus. Many organizations do not wish to invest in any skills or capital that is not core to their business. Others may use such a service to avoid capital investment. Moreover, organizations are increasingly moving their computing and IT operations into the cloud, or using a SaaS (Software as a Service) provider. Web archiving is no exception. From a legal and compliance perspective, third party services are sometimes preferred as they can provide not just the technology but also the skills and support required to meet business needs. This section introduces some of the third party services currently available. The list is not exhaustive, and inclusion here should not be taken as recommendation.

 

Archive-It

Archive-It is a subscription web archiving service provided by the Internet Archive. Customers use the service to establish specific collections, for example about the London 2012 Olympics, government websites, human rights, and course reading lists. A dedicated user interface is provided for customers to select and manage seeds, set the scope of a crawl and crawl frequency, monitor crawl progress and perform quality assurance, add metadata and create landing pages for their collections. Collections are made public by default via the Archive-It website, with private collections requiring special arrangement. The access interface supports both URL and full text searching. Over 200 partners use the service, mostly from the academic or cultural heritage sectors. The cost of the service depends on the requirements of the collecting institution.

Archivethe.Net

Archivethe.Net is a web-based web archiving service provided by the Internet Memory Foundation (IMF). It enables customers to manage the entire workflow via a web interface to three main modules: Administration (managing users), Collection (seed and crawl management), and Report (reports and metrics at different levels). The platform is available in both English and French. Alongside full text searching and collection of multimedia content, it also supports an automated redirection service for live sites. Automated QA tools are being developed though IMF can also provide manual quality assurance services, as well as direct collection management for institutions not wishing to use the online tool. Costs are dependent upon the requirements of the collecting institution. Collections can be made private or remain openly accessible, in which case they may be branded as required by the collecting institutions and appear in the IMF collection. The hosting fee in such cases is absorbed by IMF.

The University of California's Curation Centre (UC3)

As part of the California Digital Library, UC3 provides a fully hosted Web Archiving Service for selective web archive collections. University of California departments and organizations are charged only for storage. Fees are levied for other groups and consortia, comprising an annual service fee plus storage costs. Collections may be made publicly available or kept private. Around 20 partner organizations have made collections available to date. Full text search is provided and presentation of the collections can be branded as required by collecting institutions.

Private companies

Private companies offer web archiving services particularly tailored to business needs. Hanzo Archives, for example, provide a commercial website archiving service to meet commercial business needs around regulatory compliance, e-discovery and records management. Hanzo Archives emphasize their ability to collect rich media sites and content that may be difficult for a standard crawler to pick up, including dynamic content from SharePoint, and wikis from private intranets, alongside public and private social media channels. (More details about the possibilities afforded by the Hanzo Archives service can be found in the Coca-Cola case study.) Similarly, Reed Archives provide a commercial web archiving service for organizational regulatory compliance, litigation protection, eDiscovery and records management. This includes an 'archive-on-demand' toolset for use when browsing the web. In each case, the cost of the service is tailored to the precise requirements of the customer. Other companies and services are also available and readers are encouraged to search online for further options should such a service be of interest.

 

 Case study 1: The UK Web Archive

 

The UK Web Archive (UKWA) was established in 2004 by the UK Web Archiving Consortium. It was originally a six-way partnership, led by the British Library in conjunction with the Wellcome Library, Jisc, the National Library of Wales, the National Library of Scotland and The National Archives (UK).

UKWA partners select and nominate websites using the features of the web archiving system hosted on the UK Web Archive infrastructure maintained by the British Library. The British Library works closely with a number of other institutions and individuals to select and nominate websites of interest. Selectively archived websites are revisited at regular intervals so that changes over time are captured.

The technical infrastructure underpinning the UK Web Archive is managed by the British Library. The Archive was originally established with the PANDAS software provided by the National Library of Australia, hosted by an external agency, but in 2008 the archive was moved in-house and migrated into the Web Curator Tool (WCT) system.

A customized version of the Wayback interface developed by the Internet Archive is used as the WCT front end and provides searchable access to all publicly available archived websites. Full text searching is enabled in addition to standard title and URL searches and a subject classification schema. The web archiving team at the library have recently released a number of visualization tools to aid researchers in understanding and finding content in the collection.

Special collections have been established on a broad range of topics. Many are subject based, for example the mental health and the Free Church collections. Others document the online response to a notable event in recent history, such as the UK General Elections, Queen Elizabeth II's Diamond Jubilee and the London 2012 Olympics.

Many more single sites, not associated with a given special collection, have been archived on the recommendation of subject specialists or members of the public. These are often no longer available on the live web, for example the website of UK Member of Parliament Robin Cook or Antony Gormley's One & Other public art project, acquired from Sky Arts.

 

 Case study 2: The Internet Memory Foundation

 

The Internet Memory Foundation (IMF) was established in 2004 as a non-profit organization to support web archiving initiatives and develop support for web preservation in Europe. Originally known as the European Archive Foundation, it changed its name in 2010. IMF provides customers with an outsourced fully fledged web archiving solution to manage the web archiving workflow without them having to deal with operational workflow issues.

IMF collaborates closely with Internet Memory Research (IMR) to operate a part of its technical workflows for web archiving. IMR was established in 2011 as a spin off from the IMF. Both IMF and IMR are involved in research projects that support the growth and use of web archives.

IMR provides a customizable web archiving service, Archivethe.Net (AtN). AtN is a shared web-archiving platform with a web-based interface that helps institutions to easily and quickly start collecting websites including dynamic content and rich media. It can be tailored to the needs of clients, and institutions retain full control of their collection policy (ability to select sites, specify depth, gathering frequency, etc.). Quality control services can be provided on request. Most quality control is done manually in order to meet high levels of institutional quality requirements, and IM has a dedicated QA team composed of QA assessors. IM has developed a methodology for visual comparison based on tools used for crawling and accessing data, and is also working on improving tools and methods to deliver a higher initial crawl quality.

Partner institutions, with openly accessible collections for which the IM provides a web archiving service, include the UK National Archives and the UK Parliament.

Access to publicly available collections is provided via the IM website. IM provides a full text search facility for most of its online collections, in addition to URL-based search. Full text search results can be integrated on a third party website and collections can be branded by owners as necessary.

Following the architecture of the Web Continuity Service by The National Archives (The National Archives, 2010), IM implemented an 'automatic redirection service' to integrate web archives with the live web user experience. When navigating on the web, users are automatically redirected to the web archive if the resource requested is no longer available online. Within the web archive, the user is pointed to the most recent crawled instance of the requested resource. Once the resource is accessed, any link on the page will send the user back to the live version of the site. This service is considered to increase the life of a link, to improve users' experience, online visibility and ranking, and to reduce bounce rates.
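As a rough illustration of the redirection logic described above, the following Python sketch resolves a requested URL either to the live page or to the most recent archived capture. The lookup tables, the archive base URL and the function name are hypothetical stand-ins chosen for this example; they are not taken from the IM or National Archives services, and the sketch does not model the links that lead back to the live site.

```python
from typing import Dict, List, Optional, Set

# Hypothetical stand-ins for "is the page still live?" and for the archive's
# index of capture timestamps (YYYYMMDDhhmmss) per URL.
LIVE_URLS: Set[str] = {"http://example.gov.uk/"}
ARCHIVE_INDEX: Dict[str, List[str]] = {
    "http://example.gov.uk/old-report": ["20090301120000", "20100615093000"],
}
ARCHIVE_PREFIX = "http://archive.example.org"  # assumed archive base URL


def resolve(url: str) -> Optional[str]:
    """Return the live URL if it still exists, else the latest archived capture."""
    if url in LIVE_URLS:
        return url  # resource is still online: no redirection needed
    captures = ARCHIVE_INDEX.get(url)
    if not captures:
        return None  # neither live nor archived
    # Fixed-width timestamps sort lexicographically, so max() is the newest.
    latest = max(captures)
    return f"{ARCHIVE_PREFIX}/{latest}/{url}"
```

In a real deployment the live check would be an HTTP request whose "not found" response triggers the redirect, rather than a lookup in a set.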

Web archiving collections are available for public browsing from the IM website, a combination of both domain and selective collections from its own and from partner institutions.

 

 Case study 3: The Coca-Cola web archive

 

The Coca-Cola Web Archive was established to capture and preserve corporate Coca-Cola websites and social media. It is part of the Coca-Cola Archive, which contains millions of both physical and digital artefacts, from papers and photographs to adverts, bottles, and promotional goods. Coca-Cola's online presence is vast, including not only several national Coca-Cola websites but also, for example, the Coca-Cola Facebook page and Twitter stream, and other Coca-Cola owned brands (500 in all). The first Coca-Cola website was published in 1995.

Since 2009, Coca-Cola has collaborated with Hanzo Archives and now utilizes their commercial web archiving service. Alongside the heritage benefits of the web archive, the service also provides litigation support where part or all of the website may be called upon as evidence in court and regulatory compliance for records management applications.

The Coca-Cola web archive is a special themed web archive that contains all corporate Coca-Cola sites and other specially selected sites associated with Coca-Cola. It is intended to be as comprehensive as possible, with integrity/functionality of captured sites of prime importance. This includes social media and video, whether live-streamed or embedded (including Flash). Artefacts are preserved in their original form wherever possible, a fundamental principle for all objects in the Coca-Cola Archive.

Hanzo Archives' crawls take place quarterly and are supplemented by occasional event-based collection crawls, such as the 125th anniversary of Coca-Cola, celebrated in 2011. Hanzo's web archiving solution is a custom-built application. Web content is collected in its native format by the Hanzo Archives web crawler, which is deployed to the scale necessary for the task in hand.

Quality assurance is carried out with a two-hop systematic sample check of crawl contents that forces use of the upper-level navigation options and focuses on the technical shape of the site.
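One possible reading of a "two-hop" check is sketched below in Python: starting from the home page, follow navigation links at most two hops deep and report which of those pages are absent from the crawl. This is an illustrative interpretation only, not Hanzo Archives' actual procedure; the link graph, page names and function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def pages_within_hops(links: Dict[str, List[str]], start: str, max_hops: int = 2) -> List[str]:
    """Breadth-first walk of the link graph, stopping max_hops links from start."""
    seen: Set[str] = {start}
    order: List[str] = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


def missing_from_crawl(expected: List[str], captured: Set[str]) -> List[str]:
    """Pages that should be reachable within the hop limit but were not captured."""
    return [page for page in expected if page not in captured]
```

For example, with a home page linking to "/about" and "/products", and "/about" linking to "/team", the check would cover those four pages and ignore anything linked only from "/team".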

The Archive is currently accessible only to Coca-Cola employees, on a limited number of machines. Remote access is provided by Hanzo using their own access interface. Proxy-based access ensures that all content is served directly from the archive and that no 'live-site leakage' is encountered. The archive may be made publicly accessible in the future inside The World of Coca-Cola, in Atlanta, Georgia, USA.

The Coca-Cola web archive collection contains over six million webpages and over 2TB of data. Prior to their collaboration with Hanzo, early attempts at archiving resulted in incomplete captures, so early sites are not as complete as the company would like. The collection also contains information about many national and international events for which Coca-Cola was a sponsor, including the London 2012 Olympics and Queen Elizabeth II's Diamond Jubilee.

 

Conclusions

 

Web archiving technology has significantly matured over the past decade, as has our understanding of the issues involved. Consequently we have a broad set of tools and services which enable us to archive and preserve aspects of our online cultural memory and comply with regulatory requirements for capturing and preserving online records. The work is ongoing: for as long as the Internet continues to evolve, web archiving technology must evolve to keep pace.

Alongside technical developments, the knowledge and experience gained through practical deployment and use of web archiving tools has led to a much better understanding of best practices in web archiving, operational strategies for embedding web archiving in an organizational context, business needs and benefits, use cases, and resourcing options. Organizations wishing to embark on a web archiving initiative must be very clear about their business needs before doing so. Business needs should be the fundamental driver behind any web archiving initiative and will significantly influence the detail of a resulting web archiving strategy and selection policy. The fact that commercial services and technologies have emerged is a sign of the maturity of web archiving as a business need, as well as a discipline.

 

Resources

Pennock, M., 2013. Web-Archiving, DPC Technology Watch Report 13-01 March 2013

http://dx.doi.org/10.7207/twr13-01

This report is intended for those with an interest in, or responsibility for, setting up a web archive. It introduces and discusses the key issues faced by organizations engaged in web archiving initiatives, whether they are contracting out to a third party service provider or managing the process in-house and provides a detailed overview of the main software applications and tools currently available.

ISO, 2012, ISO 28500:2009 Information and Documentation – the WARC file format

http://www.iso.org/iso/catalogue_detail.htm?csnumber=44717

The WARC (Web ARChive) format is a container format for archived websites, also known as ISO 28500:2009. It is a revision of the Internet Archive's ARC File Format used to store web crawls harvested from the World Wide Web.
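To give a feel for what "container format" means here, the Python sketch below writes and reads back a single simplified WARC-style record: a version line, a set of named header fields, a blank line, and then the captured content. It is a minimal illustration under simplifying assumptions (one record, no compression, only a handful of fields) and is not a conformant implementation of the standard; the function names are invented for this example.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Tuple


def build_warc_record(target_uri: str, payload: bytes,
                      record_type: str = "resource",
                      content_type: str = "text/html") -> bytes:
    """Serialise one simplified record: version line, header fields, blank line, block."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record is terminated by two CRLF pairs after the content block.
    return head + payload + b"\r\n\r\n"


def parse_warc_record(record: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a single record back into its header fields and content block."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])  # skip version line
    length = int(fields["Content-Length"])
    return fields, rest[:length]
```

A WARC file is a sequence of such records, which is how a crawl can store many captured pages, together with metadata about each capture, in one file. Production archives would normally rely on an established WARC library rather than hand-written code like this.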

ISO, 2013 ISO/TR 14873:2013 Information and Documentation – Statistics and quality issues for web archiving

http://www.iso.org/iso/catalogue_detail.htm?csnumber=55211

This technical report defines statistics, terms and quality criteria for Web archiving. It considers the needs and practices across a wide range of organisations such as libraries, archives, museums, research centres and heritage foundations.

Meyer, E., 2010. Researcher Engagement with Web Archives: State of the Art Report, JISC

http://ie-repository.jisc.ac.uk/544/

This report summarizes the state of the art of web archiving in relationship to researchers and research needs focussing primarily on individual researchers and institutions.

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations

Zittrain, Jonathan and Albert, Kendra and Lessig, Lawrence, Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations (October 1, 2013). Harvard Public Law Working Paper No. 13-42. Available at SSRN: http://ssrn.com/abstract=2329161 or http://dx.doi.org/10.2139/ssrn.2329161

This article from the Perma project team documents a serious problem of reference rot: more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs found within United States Supreme Court opinions, do not link to the originally cited information. It proposes a solution for authors and editors of new scholarship that involves libraries undertaking the distributed, long-term preservation of link contents.

Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot

http://dx.doi.org/10.1371/journal.pone.0115253

This large-scale study looked into approximately 600K links extracted from over 3M scholarly papers published between 1997 and 2012. Those were links to so-called web-at-large resources, i.e. not links to other scholarly papers. It found one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.

The National Archives, 2010. Government Web Archive: Redirection Technical Guidance for Government Departments, version 4.2, The National Archives (UK)

http://www.nationalarchives.gov.uk/documents/information-management/redirection-technical-guidance-for-departments-v4.2-web-version.pdf

This guidance describes an innovative service that provides URL rewriting and redirection functionality for UK Government web pages by setting up redirection to the UK Government web archive where a requested URL no longer exists on a departmental web site.

MEMENTO and the Time Travel Service

http://www.mementoweb.org/

Memento is a tool which allows users to see a version of a web resource as it existed at a certain point in the past. It is now used in several web archives. The Time Travel service based on Memento checks a range of servers including many web archives and tries to find a web page as it existed around the time of your choice.
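The idea of asking for a page "around the time of your choice" can be sketched in a few lines of Python. The Memento protocol expresses the desired time in an Accept-Datetime request header; the first function below formats that header, and the second picks the capture closest to the requested time from a list. The list of capture times and the function names are hypothetical, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, List


def accept_datetime_header(when: datetime) -> Dict[str, str]:
    """Build the Accept-Datetime header for a timezone-aware datetime."""
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def closest_capture(captures: List[datetime], when: datetime) -> datetime:
    """Pick the capture whose time is nearest to the requested time."""
    return min(captures, key=lambda capture: abs(capture - when))
```

For instance, a request for 18 October 2012 against captures from January 2011, 1 October 2012 and June 2013 would select the 1 October 2012 capture.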

Archive-It

http://www.archive-it.org/

Hanzo Archives

http://www.hanzoarchives.com/

Wayback

http://www.sourceforge.net/projects/archive-access/files/wayback/

Netarchive Suite

https://sbforge.org/display/NAS/NetarchiveSuite

PANDAS

http://pandora.nla.gov.au/pandas.html

UC3 Web Archiving Service

https://cdlib.org/services/uc3/about/

Web Curator Tool

http://webcurator.sourceforge.net/

International Internet Preservation Consortium

http://www.netpreserve.org

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. There are many valuable resources on the website including excellent short videos such as the example below.

Why Archive the Web?

https://www.youtube.com/watch?v=pU32rjTaMFE

 

A short video published on 18 Oct 2012 introducing the challenges of web-archiving and the IIPC. (2 mins 53 secs).

What is a Web Archive?

https://youtu.be/ubDHY-ynWi0

This short video explains 'Web Archiving' and why it is important that the UK Legal Deposit libraries support it. It was produced as part of the Arts and Humanities Research Council funded 'Big UK Domain Data for the Arts and Humanities' project. (2 mins 31 secs)

What do the UK Web Archive collect?

https://youtu.be/1QLMPIRwJEo

This video for users explains what they can expect to find and where they might go to access the three collections that the UK Web Archive hold. It was produced as part of the Arts and Humanities Research Council funded 'Big UK Domain Data for the Arts and Humanities' project. (2 mins 55 secs)

 

Further case studies

NDSA Website content case studies

The US National Digital Stewardship Alliance (NDSA) examines the value, opportunities and obstacles for selective preservation of the following specific web content types:

Science, Medicine, Mathematics, and Technology forums

http://www.digitalpreservation.gov/ndsa/working_groups/documents/ScienceForums_CaseStudy_public_v2.pdf

December 2013 (3 pages).

Science, Medicine, Mathematics, and Technology blogs

http://www.digitalpreservation.gov/ndsa/working_groups/documents/ScienceBlogs_CaseStudy_public_v2.pdf

December 2013 (3 pages).

Born‐Digital Community and Hyperlocal News

http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_CaseStudy_CommunityNews.pdf

February 2013 (3 pages).

Citizen Journalism

http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_CaseStudy_CitizenJournalism.pdf

February 2013 (3 pages).

 

On the Development of the University of Michigan Web Archives: Archival Principles and Strategies

http://files.archivists.org/pubs/CampusCaseStudies/Case13Final.pdf

Michael Shallcross, Bentley Historical Library, University of Michigan details the strategies and procedures the University Archives and Records Program (UARP) followed to develop its collection of archived websites, and how it initiated a large-scale website preservation project as part of a broader effort to proactively capture and maintain select electronic records of the University. 2011 (29 pages).

 

References

 

Pennock, M., 2013. Web-Archiving, DPC Technology Watch Report 13-01 March 2013. Available: http://dx.doi.org/10.7207/twr13-01

The National Archives, 2010. Government Web Archive: Redirection Technical Guidance for Government Departments, version 4.2, The National Archives (UK). Available: http://www.nationalarchives.gov.uk/documents/information-management/redirection-technical-guidance-for-departments-v4.2-web-version.pdf

 
