Brian Lavoie is Senior Research Scientist for OCLC Research.


The scholarly record is evolving to incorporate a broader range of research outputs, moving beyond traditional publications like journal articles and monographs. Research data is a salient and well-documented example of this shift, and many universities are now investing considerable resources in developing research data management (RDM) services for their campuses, as we document in our recent Realities of Research Data Management report series. These services sit alongside much of the research life cycle, from support in developing data management plans prior to commencing research (think of DMPOnline or DMPTool), to computing and storage resources for storing, working with, and sharing data during the research process (often called active data management; for example, the DataStore service at the University of Edinburgh), to data repository services for storage, discovery, and access to final data sets (like the University of Illinois Data Bank).

But one part of the research life cycle has been relatively slow to this point in attracting RDM service support, best practices, and policies: long-term preservation. Here I mean what the UK Digital Curation Centre describes in its research life cycle model as preservation action: “actions to ensure long-term preservation and retention of the authoritative nature of data”.

I do not wish to suggest that RDM practitioners have not thought about long-term preservation; indeed they have. Nevertheless, this topic has yet to find much traction in the RDM service space. One explanation for this is that resourcing for RDM, especially staffing, is limited and therefore prioritization is essential. And one of the most immediate priorities is steering a culture change among researchers that promotes an understanding of why good data management practices are a worthy investment of researcher attention. Moreover, there is an understandable emphasis on developing RDM services that meet researchers’ immediate RDM needs, like active data management or compliance with government or funder data mandates. RDM services of this kind often operate within a fairly limited time frame in regard to data retention commitments.

But beyond issues of prioritization, bringing the tail end of the research cycle firmly within the scope of the RDM service space also requires sorting out several noteworthy challenges to forming a robust data preservation policy. For example, which data sets should be stewarded with a view toward long-term preservation? While the aspirational answer might be “all of them”, or at least, “as many as possible”, practical considerations will likely intervene to temper actual practice, and some criteria for identifying high value data sets for long-term retention will need to be specified.

An interesting example of this can be found at the University of Illinois, where they have developed a preservation review policy in support of the Illinois Data Bank data repository. The policy operates on the principle that not all data has equal long-term value; in light of this, the Data Bank commits to retain all data sets for a minimum of five years, after which a data set may be subject to review. The review focuses both on data sets that have received little use or interest, or have no projected future value (and may therefore be good candidates for deaccession), and on data sets that are accessed frequently and have been cited or otherwise made use of in many papers (here the review would determine whether additional measures should be taken to ensure continued usability). The RDM staff worked closely with the University archives to develop the preservation review policy, leveraging the skills and experience of the archiving profession. As Illinois’ experience suggests, RDM practitioners would benefit from collaboration and consultation with colleagues in the archives community, whose knowledge and perspective will be invaluable in developing long-term preservation policies for research data.

Differing data management practices and protocols across disciplines pose another challenge to the development of data preservation strategies. Science Europe, an association of European research organizations and funders, recently published a guidance document on discipline-specific RDM, noting that research data supporting scientific publications should be retained for a period suitable to the needs of the relevant scientific community. The document further notes that “[i]n many cases, this period is defined as at least ten years. However, it has to be decided what period is scientifically and/or socially appropriate. There are cases where it has to be indefinite.” To complicate matters further, some data, such as medical data, might have to be destroyed after a certain period, due to privacy concerns. The potential for significant variation in RDM protocols across disciplines may, in some circumstances, make the development of a general data preservation strategy difficult, if not impracticable, especially for repositories that host data sets from multiple disciplines.

Add into the mixture the increasingly complex web of government and funder policies and guidelines on data retention and preservation, and the data preservation landscape becomes still more challenging. Development of infrastructure and services to support long-term preservation of research data would be helped by convergence on disciplinary or even international standards on data retention protocols. Moreover, moving from an environment of recommendations and guidelines for data retention to one of concrete policies that are consistently and effectively monitored and enforced might also remove some of the uncertainty researchers and institutions face in making decisions about the long-term stewardship of research data.

With grateful acknowledgment to my colleague Rebecca Bryant, our case study partners in the Realities of RDM project, and participants in the OCLC Research Library Partnership RDM Interest Group, all of whom helped inspire and shape my thinking on this topic.

