The Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) holds records from 1,370 small languages, mainly audio recordings made by linguists and musicologists since the 1950s, and mainly from the Pacific. The collection comprises 16,000 hours of audio and 3,000 hours of video: 230 terabytes of material in 428,000 files. Many of these records are the only records in these languages available online. The collection serves both a research and a community purpose, and a major aim of this work is to curate the digital files for return to, and access by, the speakers of these languages.

The initiative undertaken this year is to move all items to Research Object Crate (RO-Crate) format and to store the collection in Amazon S3. RO-Crate is a standard that uses JSON-LD. It allows the whole collection to consist of self-describing items, which makes the collection more durable over time and less reliant on a catalog that is at risk of failure, with consequent metadata loss. Every time an item's catalog entry is saved, a new version of the item's RO-Crate is written, to ensure that it is current.

A version of this was trialed earlier, writing an XML file to the same directory as the item and then using a purpose-built app to write a catalog of any arbitrary set of items in order to deliver them to a specific location. A test version of this is fully functional [2] and illustrates a viewer built by harvesting the data from the items themselves, not from a catalog. The same viewer can be re-used for any set of RO-Crate files, so the tooling can be repurposed for a range of source files.
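To make the idea of a self-describing item concrete, the following is a minimal sketch of writing the kind of JSON-LD metadata that RO-Crate 1.1 defines. It is not PARADISEC's implementation: the function name, the item identifier, the file name and all property values are hypothetical placeholders, and a fully conformant crate would carry more properties (for example a license) than are shown here.

```python
import json


def build_crate_metadata(item_id, name, description, date_published, files):
    """Return a minimal RO-Crate 1.1 style metadata document as a dict.

    `files` is a list of dicts with hypothetical keys "path" and "mime".
    """
    # The metadata descriptor points at the root dataset of the crate.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root dataset describes the item itself and lists its files.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "identifier": item_id,
        "name": name,
        "description": description,
        "datePublished": date_published,
        "hasPart": [{"@id": f["path"]} for f in files],
    }
    # Each file gets its own entity so the item carries its own file listing.
    file_entities = [
        {"@id": f["path"], "@type": "File", "encodingFormat": f["mime"]}
        for f in files
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root] + file_entities,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    metadata = build_crate_metadata(
        "ITEM-001",
        "Example item",
        "An illustrative item, not a real catalog record",
        "2024-01-01",
        [{"path": "ITEM-001-A.wav", "mime": "audio/x-wav"}],
    )
    print(json.dumps(metadata, indent=2))
```

Because the metadata travels in the same directory as the media files, a tool that has never seen the catalog can still read the item's name and file list from this one JSON file.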

An example of the viewer, built with Elasticsearch over RO-Crate pilot data, showing a search result that finds text in an audio transcript and plays just that chunk within a 45-minute file.

There are many locations in the world where internet access is difficult, and often expensive, so access to heritage materials in or near their source communities remains problematic. A solution tested in a few locations (a village in Vanuatu, an Australian Western Desert community, workshops in Honiara and in Papeete) is to load relevant items from the PARADISEC collection onto a Raspberry Pi, a small low-powered computer that includes a Wi-Fi transmitter. A USB drive is plugged in, holding a catalog of just those files. The catalog is collected from the RO-Crates and written in simple HTML that can be retrieved on a mobile phone from a local signal transmitted by the Raspberry Pi, independent of any internet access. The catalog is created by a service written by the project, called Dataloader [1], that harvests the RO-Crate file stored in each item and then creates the catalog of just that small collection. This catalog includes services that allow video files to be viewed, audio files to be heard, and PDF files to be scrolled through. All files can then be downloaded to the mobile phone without being impeded by bandwidth considerations.
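The harvesting step described above can be sketched roughly as follows. This is not the Dataloader code: the directory layout (one folder per item, each holding a `ro-crate-metadata.json`), the function names and the HTML layout are all assumptions made for illustration.

```python
import html
import json
from pathlib import Path
from urllib.parse import quote


def as_list(value):
    """JSON-LD allows a single object where a list is expected; normalise it."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_root(graph):
    """Locate the root dataset via the metadata descriptor's `about` link."""
    by_id = {entity.get("@id"): entity for entity in graph}
    descriptor = by_id.get("ro-crate-metadata.json", {})
    root_id = descriptor.get("about", {}).get("@id", "./")
    return by_id.get(root_id, {})


def harvest(items_dir):
    """Read every item's RO-Crate file and collect its name and file list."""
    records = []
    for meta_path in sorted(Path(items_dir).glob("*/ro-crate-metadata.json")):
        crate = json.loads(meta_path.read_text(encoding="utf-8"))
        root = find_root(crate.get("@graph", []))
        item_dir = meta_path.parent.name
        records.append({
            "dir": item_dir,
            "name": root.get("name", item_dir),
            "files": [part["@id"] for part in as_list(root.get("hasPart"))],
        })
    return records


def render_index(records):
    """Write a plain HTML page with one heading and file list per item."""
    sections = []
    for record in records:
        links = []
        for filename in record["files"]:
            href = html.escape(quote(record["dir"] + "/" + filename), quote=True)
            links.append('<li><a href="' + href + '">' + html.escape(filename) + "</a></li>")
        sections.append("<h2>" + html.escape(record["name"]) + "</h2><ul>" + "".join(links) + "</ul>")
    return "<!DOCTYPE html><html><body>" + "".join(sections) + "</body></html>"
```

Calling `render_index(harvest("/path/to/usb"))` and saving the result as `index.html` next to the item folders would give a static page that any small web server could serve over a local network. The path is a placeholder.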

Indicators of location of collections in the PARADISEC catalog

Recall that these are recordings made in the past few generations. They were analog tapes of people from villages in many locations, stored inaccessibly, and so they have not been available to the people recorded or their families.

The use of RO-Crate also allows any user of PARADISEC to search the text of items in the collection, including transcripts in the standard format used by linguists (ELAN XML, or .eaf), which are played along with the media. A citation to a point in the media can be resolved directly to the archival file [3].
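As an illustration of how a text search over a transcript can lead to a point in the media, here is a minimal sketch that reads an ELAN file and returns the time span of each matching annotation. It is not the PARADISEC viewer's code. The sample transcript is invented, the function names are hypothetical, and the `#t=start,end` link form follows the W3C Media Fragments convention rather than any URL scheme confirmed by PARADISEC.

```python
import xml.etree.ElementTree as ET

# Invented sample transcript using the element names of the ELAN .eaf format.
SAMPLE_EAF = """
<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="3992"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="6500"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcript" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>an invented example sentence</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def search_eaf(eaf_xml, query):
    """Return time-aligned annotations whose text contains `query`."""
    root = ET.fromstring(eaf_xml.strip())
    # ELAN stores time slot values in milliseconds; unaligned slots have none.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    hits = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            text = ann.findtext("ANNOTATION_VALUE") or ""
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without a resolved time span
            if query.lower() in text.lower():
                hits.append({
                    "tier": tier.get("TIER_ID"),
                    "text": text,
                    "start": start / 1000,  # seconds
                    "end": end / 1000,
                })
    return hits


def media_fragment(media_url, hit):
    """Build a link to just the matching chunk, Media Fragments style."""
    return f"{media_url}#t={hit['start']},{hit['end']}"


if __name__ == "__main__":
    for hit in search_eaf(SAMPLE_EAF, "example"):
        print(hit["tier"], hit["text"], media_fragment("audio.mp3", hit))
```

A full-text index would normally replace the linear scan here; the point of the sketch is only that each annotation carries the start and end times needed to cite, and play, a single segment of a longer recording.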

Using a Raspberry Pi at Erakor village, Vanuatu

Once these recordings are back in the relevant communities, they can inspire revitalisation of oral traditions and prompt additional contextual information that can enrich the catalog.

PARADISEC works with cultural centres and museums in the Pacific to digitize audio tapes in their collections, and then to return the items together with their metadata. This will be done with RO-Crate, using suitable tools (like Describo [4] or Crate-O [5]) to read and edit metadata in this format.


[1] Dataloader – https://language-archives.services/about/data-loader/ (a new version is being written now to take advantage of the standard RO-Crate JSON format)

[2] https://mod.paradisec.org.au/view/NT1/98007?ocfl_version=v1&type=audio#NT1-98007-98007A, https://mod.paradisec.org.au/transcription-search?page=1

[3] In a similar example, this link can be resolved if you are logged in to the PARADISEC catalog: https://catalog.paradisec.org.au/viewer/#/NT1/98007/media/NT1-98007-98007A?transcription=NT1-98007-98007A.eaf&segment=3.992

[4] Describo – https://describo.github.io/

[5] Crate-O – https://github.com/Language-Research-Technology/crate-o
