Last month I emailed William Kilbride at DPC with a query about file formats for quantitative data for long term preservation and, as a result of that email and the ensuing conversation, I appear to have agreed to write a blog post about the topic. Here is that blog post.

A little bit of background: My name is Jenny O’Neill and I am Data Manager at UCD Library in Dublin. Part of my job involves providing local services to UCD researchers around the area of Research Data Management and part of my job is running the Irish Social Science Data Archive (ISSDA). ISSDA is in the process of becoming a member of the Consortium of European Social Science Data Archives (CESSDA). As part of that process we are applying for the Data Seal of Approval (DSA). ISSDA accepts data in a range of formats including SPSS, SAS and Stata. An initial review by CESSDA of our draft DSA application noted that we are lacking a technology watch for these formats of quantitative data. Hence my email to William.

Really all I was looking for was a link to work being done in this area that we could cite – ‘these guys are keeping an eye on this and we’re keeping an eye on them’ kind of thing. But life is rarely that simple. William kindly provided me with a list of people who had an interest in this area and also asked for comment on the DPC mailing list. I also used the CESSDA internal Basecamp to ask how other members of CESSDA who deal with similar data types tackle the issue of preservation formats.

This is what I have learned. There are three issues here: formats for ingest, dissemination and preservation.

Formats for ingest and dissemination are relatively straightforward and are based on the needs of our Designated Community, both Data Producers and Data Consumers. ISSDA’s own file format policy is based on our knowledge of what formats our Data Producers want to give us and those that our Data Consumers want to receive. We have also looked at the file format recommendations of some of the larger social science data archives including DANS, GESIS, ICPSR and the UK Data Archive. Because we will be using NESSTAR to provide online access to data we recommend that data are provided in SPSS together with other formats including Stata and SAS. We additionally recommend that data are provided as a tab-delimited file (.tab) with setup files for SPSS, Stata and SAS. But realistically, what we receive is SPSS, SAS and Stata.

What I found when formulating our guidelines is that there is no international consensus on which formats are considered ‘preferred’ and which are considered ‘acceptable’. For example, DANS ‘prefer’ SPSS Portable (.por), SPSS (.sav), STATA (.dta), DDI (.xml), data (.csv) together with setup files and (.txt), while SAS (.7bdat; .sd2; .tpt) is considered ‘acceptable’, with R being acceptable and under consideration. Similarly GESIS ‘prefer’ SPSS Portable (.por), SPSS (.sav), STATA (.dta), tab- or comma-delimited text files (.csv) with setup file and DDI-XML. However GESIS include SAS Transport (.sas) and SAS (.sas7bdat) as also ‘preferred’ rather than ‘acceptable’. ICPSR’s Collection Development Policy doesn’t recommend any specific formats, instead discussing ‘readily useable formats’ that ‘promote easy access and use without compromising research value’ among other features. For tabular data with extensive metadata the UK Data Archive ‘prefer’ SPSS portable format (.por) or delimited text with a setup file for SPSS, Stata, SAS, etc. SPSS (.sav), Stata (.dta) and MS Access (.mdb/.accdb) are considered ‘acceptable’ and SAS isn’t mentioned at all.

I also received a reply from David Clipsham at the National Archives about PRONOM. He let me know that PRONOM’s most recent signature release (DROID signature file version 90) included entries for SPSS Portable (.por) and most versions of Stata's .dta format, plus greatly expanded the number of SAS data format versions it holds. PRONOM previously held entries for SPSS .sav format, and version 9.1 of SAS only. This at least means that DROID and other applications that use PRONOM's data will now positively identify these formats, which helps Archives that use such tools to verify file formats on ingest.
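To make the idea of signature-based identification a little more concrete, here is a minimal sketch in Python. It is not how DROID works internally and it does not use PRONOM's signature file. It simply checks the first few bytes of a file against two leading byte sequences that I am assuming for illustration: `$FL2` at the start of an SPSS system file (.sav) and `<stata_dta>` at the start of newer Stata .dta files (format 117 onwards). The names `SIGNATURES`, `identify` and `identify_file` are made up for this example, and a real tool would cover many more formats and versions.

```python
from typing import Optional

# Illustrative leading-byte signatures only. These are assumptions made for
# this sketch, not entries copied from the PRONOM/DROID signature file.
SIGNATURES = {
    b"$FL2": "SPSS system file (.sav)",
    b"<stata_dta>": "Stata .dta (format 117 or later)",
}


def identify(header: bytes) -> Optional[str]:
    """Return a format label if the leading bytes match a known signature."""
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return None  # unknown: fall back to a proper identification tool


def identify_file(path: str) -> Optional[str]:
    """Read the first few bytes of a file and try to identify its format."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

The point is only that identification relies on the content of the file rather than its extension, so a .sav file that has been renamed would still be recognised.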

File formats for preservation are a bit more complex. What you might notice about all of these formats (SPSS, Stata and SAS) is that they are all proprietary and require expensive software to open and use them. Digital Preservation 101 tells us that proprietary formats are not always suitable for long term preservation as they are “susceptible to upgrade issues and obsolescence if the owner goes out of business or develops a new alternative” (DPC Handbook). A secondary concern for us at ISSDA is that while most researchers within the Higher Education system will likely have access to these software packages through their institutions, government researchers, policy makers and independent researchers may not have this access, or may have access to only one of the three options, because of the cost involved.

Some of the Archives that replied to my query, including RODA, the UK Data Archive and ICPSR (and most likely others), use a non-proprietary format for archival storage. From this standards agnostic format the data can be converted into current dissemination formats and also any future formats required by the Designated Community. For example Hervé L’Hours at the UK Data Archive responded to my query via the DPC mailing list to say that “for qualitative and quantitative data our ultimate preservation solution is the non-proprietary tab-delimited format”. They prefer TAB over CSV as CSV “does not allow for the occurrence of commas in text, either in string variables in quantitative files or in qualitative outputs. In turn, forward migration is easier with tab-delimited files than fixed width ASCII, as it allows for smoother load into a variety of packages, although we do generate a fixed-width ASCII version of each file for archival purity”. Similarly ICPSR normalise SPSS, Stata, and SAS files into raw ASCII data and syntax files, which form part of both the AIP and the DIP. This is outlined in ICPSR Meets OAIS: Applying the OAIS Reference Model to the Social Science Archive Context. They also create DDI XML files, which are part of the AIP only. Jared Lyle from ICPSR mentioned a few tools available to help convert statistical formats into syntax, including StatTransfer and Sledgehammer, but didn’t endorse any particular tool.

My understanding is that to transform data from SPSS, SAS or Stata into such an archival format would require a lot of specialised work that would need to be carried out with caution. Ed Pinsent used the phrase “interventions by skilled ‘data specialists’” when describing work carried out at the National Digital Archive of Datasets (NDAD).
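As a rough illustration of the tab-delimited preservation format mentioned above, the following Python sketch writes a tiny, made-up dataset as tab-delimited text and reads it back. It uses only Python's standard `csv` module. The function names and the sample rows are invented for this example, and it is not a description of any Archive's actual workflow. In particular it says nothing about the hard part, which is carrying variable labels, value labels and missing value definitions across into setup or syntax files.

```python
import csv
import io


def to_tab_delimited(header, rows):
    """Serialise a header and rows of strings as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def from_tab_delimited(text):
    """Parse tab-delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


# Made-up sample data: one string variable contains a comma.
header = ["id", "age", "comment"]
rows = [
    ["1", "34", "Lives in Dublin, works in Cork"],
    ["2", "51", ""],
]

text = to_tab_delimited(header, rows)
round_trip = from_tab_delimited(text)
```

Here the comma inside the string variable needs no special handling because the delimiter is a tab. A value that itself contained a tab would still need quoting, which the `csv` module adds by default.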

At ISSDA we have neither the resources nor the technical expertise to transform data at this time, so I was heartened to read this recent report from the Open Science and Research Initiative (ATT). It identified SPSS as one of the most popular statistical analysis packages, whose file formats SAV (.sav) and SPSS Portable (.por) have become de facto standards in the field of social sciences research. ATT found that SPSS Portable (.por) is downwards and upwards compatible and can therefore be recommended for preservation, so for the moment this is the guidance we are using internally and also providing to Depositors.

As others noted in their replies to me, in terms of ‘technology watch’, ISSDA will be guided by our Designated Community. When they begin to offer or request data in different or emerging formats we will need to weigh up how we respond to these formats. For example we are aware that the use of R for statistical analysis is becoming increasingly popular but as our preferred formats can be read into R they largely meet the current needs of our Designated Community using this language.

I still have a lot left to learn, so if anyone has further suggestions, ideas, research or resources to point to, please get in touch.

