Bulk dataset/Downloads

In this type of computational access, an organization makes its material available as a data download. The organization creates a dataset from its material (this may be all of its collections, or a subset of them), processes it and then lets users download it through an online interface or portal. These datasets are normally available in CSV (Comma Separated Values) and JSON formats, as both are ubiquitous and easy for humans and computers to read. Read more about the CSV and JSON formats here: Our Friends CSV and JSON.
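As a minimal sketch of why these two formats are convenient, the following Python snippet parses the same two records from a CSV string and from a JSON string using only the standard library. The field names and values (`id`, `title`, `year`) are invented for illustration and do not come from any real dataset.

```python
import csv
import io
import json

# Hypothetical sample records; the fields are illustrative only.
CSV_TEXT = "id,title,year\n1,Untitled,1950\n2,Study in Blue,1962\n"
JSON_TEXT = (
    '[{"id": "1", "title": "Untitled", "year": "1950"},'
    ' {"id": "2", "title": "Study in Blue", "year": "1962"}]'
)


def read_csv_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


csv_records = read_csv_records(CSV_TEXT)
json_records = read_json_records(JSON_TEXT)

# Both formats yield the same list of records here.
print(csv_records == json_records)
```

Note that CSV values always arrive as strings, whereas JSON can also carry numbers, booleans and nested structures, so a real dataset may need type conversion after loading.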

This type of access gives an organization a lot of control over the material, as it sets the parameters of what is being made available. However, this approach can also require a lot of maintenance, as the dataset often needs to be updated and uploaded manually. Versioning also needs to be taken into consideration. For single files or downloads this may be less of a problem, but when working with large amounts of data it can be important, as results may differ from version to version, depending on what has changed and why. Even if providers cannot retain all versions of the data, users should be encouraged to cite the version they have used correctly, to aid reproducibility.
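One way a user might record which version of a dataset they worked with is to store a checksum of the downloaded file alongside the provider's version label and the retrieval date. The sketch below assumes this workflow; the function name `describe_download`, the record fields and the example version labels are hypothetical rather than any provider's convention.

```python
import hashlib
from datetime import date


def describe_download(data, version_label, retrieved=None):
    """Build a small citation record for a downloaded dataset.

    data is the raw bytes of the file; version_label is whatever
    version or time-stamp the provider publishes (assumed here).
    """
    retrieved = retrieved or date.today()
    return {
        "version": version_label,
        "retrieved": retrieved.isoformat(),
        # SHA-256 changes if even one byte of the file changes.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Two hypothetical releases of the same dataset.
release_a = describe_download(
    b"id,title\n1,Untitled\n", "2024-01", date(2024, 1, 15)
)
release_b = describe_download(
    b"id,title\n1,Untitled (revised)\n", "2024-02", date(2024, 2, 15)
)

# Differing checksums show that the two releases are not the same data.
print(release_a["sha256"] != release_b["sha256"])
```

Keeping such a record next to the analysis makes it possible to tell later whether a differing result comes from a change in the data rather than in the method.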

Once users have downloaded the file, they have to set up their own environment and decide what they want to do with the material. Archivists may see this as the easiest way to make data computationally accessible, as it fits well with existing concepts of access and use: they are used to packaging and storing information for users to request and access, and the bulk datasets approach can be seen as a very similar process.
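To give a sense of what setting up such an environment can involve, here is a minimal Python sketch that fetches a CSV file and takes a first look at it. The helper names `download_dataset` and `count_by_field`, the `medium` column and the sample rows are all assumptions made for illustration. To keep the example self-contained, a local `file://` URL stands in for a provider's portal address.

```python
import csv
import shutil
import tempfile
import urllib.request
from collections import Counter
from pathlib import Path


def download_dataset(url, dest):
    """Stream the file at url to dest and return the destination path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def count_by_field(csv_path, field):
    """Count how often each value of field occurs in a CSV file."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return Counter(row[field] for row in csv.DictReader(handle))


# Stand-in for a provider's portal: a local file reached via a file:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("id,medium\n1,print\n2,painting\n3,print\n", encoding="utf-8")

local_copy = download_dataset(source.as_uri(), workdir / "dataset.csv")
medium_counts = count_by_field(local_copy, "medium")
print(medium_counts.most_common())
```

With a real dataset, the `file://` URL would be replaced by the download link published by the organization, and the column name by one that actually exists in the file.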

The diagram below illustrates the simplicity of the bulk datasets approach. An organization makes a dataset available through an interface or portal. A user can then download this dataset to work with.

[Diagram: bulk datasets]

There are different ways of providing access through bulk datasets. The type of material made available as datasets may differ; some organizations will only make their metadata available in bulk, whereas others include both the data and metadata. The hosting of the datasets does not necessarily have to be done by the organization itself. It may decide to upload this material to a third-party provider. For example, large datasets from the Museum of Modern Art (MoMA) in New York are hosted through GitHub; these files are automatically updated monthly and include a time-stamp for each dataset.

A similar approach is taken by Pittsburgh’s Carnegie Museum of Art (CMoA), which also has a GitHub repository; however, this one is updated less regularly.

OPenn is another repository for datasets, specifically archival images. It is managed by a cultural heritage institution which provides access to its own material as well as material from contributing institutions.

A slightly different approach was taken by the International Institute of Social History (IISH) in the Netherlands, which uses open-source software to make its datasets accessible. A slightly modified version of Dataverse is hosted on its website.

The table below shows how different organizations have applied the bulk dataset model, comparing them on questions including the following:

  • Are data or metadata (or both) made available?

  • Is the data updated and versioned?

  • Which file formats are available?

  • Where is it hosted?

| Organization/Project | Data/Metadata | Versioning | Updated | Terms of use | Downloadable format | Type of data | Hosted on |
| --- | --- | --- | --- | --- | --- | --- | --- |
| HathiTrust – extracted feature datasets | Both | Yes | Yes | Yes | Available through rsync | Unstructured book files | Own website |
| MoMA Collection | Metadata | Yes | Yes | Yes | Several formats through GitHub | Metadata of collection | GitHub |
| IISH Data Collection | Both | Yes | Yes | Yes | Depends on the dataset | Structured research datasets | Own website with use of Dataverse |
| Carnegie Museum of Art | Both | No | No | Yes | CSV and JSON | Object from museum | GitHub |
| OPenn | Both | No | Yes | Yes | CSV, TIFF and TEI | High Resolution archival images | Own website |

