This blog post is derived from a series of emails between Jez Cope, Research Data Manager at the University of Sheffield, and Martin Donnelly of the Digital Curation Centre (DCC), University of Edinburgh, in early January 2017.
MD – From Open Access (OA) publications to research data management (RDM), over the past decade or so scholars and researchers – as well as the people who support them, such as librarians and IT professionals – have had to get used to constantly increasing responsibilities and expectations to prepare their outputs for long-term preservation. The next thing on the horizon is software, with open workflows and methodologies close behind. How does software preservation differ from its predecessors in this chain?
JC – I think there are some very high expectations to live up to, based on the success of OA and RDM. OA is seen by a lot of people as an overnight success, but like most overnight successes it’s been over a decade in the making. We’re already dealing with expectations that RDM can be solved quickly now that OA has been ‘solved’ (which in reality it hasn’t been), and I think that problem will only be greater with software.
Although all researchers produce publications, and most make use of data in some form, a much smaller proportion are involved in making research software. That’s growing rapidly now, but a lot of people are very proprietorial about the software they create, and preparing software for wider consumption is often seen as prohibitively time-consuming.
Software has an interesting and unique set of challenges around licensing because of the special relationship between executable code, source code and the compiler/interpreter. The licences we use for data and publications don’t work for software, but thankfully organisations like the FSF [Free Software Foundation] have done some great work in this area and there are some really solid, tried and tested software licences available.
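To make that concrete, here is a minimal, purely illustrative sketch (the filename and script are hypothetical, not drawn from any real project) of how a researcher might declare one of those tried and tested FSF licences at the top of a source file, with the full licence text shipped alongside the code:

```python
# SPDX-License-Identifier: GPL-3.0-or-later
#
# analyse.py -- a hypothetical research script, used here only to illustrate
# declaring a well-established free software licence (the FSF's GPL) in the
# source itself. The full licence text would normally accompany the code in a
# COPYING or LICENSE file at the top level of the repository.

def main():
    print("This code can be reused under the terms of the GPL, version 3 or later.")


if __name__ == "__main__":
    main()
```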
Software is also much more context-dependent: it’s not sufficient to preserve just the software itself; you also need the support libraries, compilers, operating systems and sometimes even the specific hardware used to run it.
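As a small sketch of what capturing that context might look like in practice (the filenames and the level of detail recorded here are illustrative assumptions rather than a recommended standard), a short script can at least record the interpreter, operating system and installed libraries alongside the software itself:

```python
"""Record the execution context of a piece of research software.

An illustrative sketch only: a real preservation workflow would capture far
more (compiler versions, container images, hardware details), but even this
much context makes later reuse considerably easier.
"""
import json
import platform
import sys

try:
    # Available in the standard library from Python 3.8 onwards.
    from importlib.metadata import distributions
    packages = {dist.metadata["Name"]: dist.version for dist in distributions()}
except ImportError:
    packages = {}

context = {
    "python_version": sys.version,
    "implementation": platform.python_implementation(),
    "operating_system": platform.platform(),
    "machine": platform.machine(),
    "installed_packages": packages,
}

# Store the context next to the software so it travels with the archive.
with open("environment.json", "w") as handle:
    json.dump(context, handle, indent=2, sort_keys=True)
```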
Do you see any parallels with existing established areas of digital preservation that suggest ways in to some of these problems?
MD – There’s certainly common ground. You mentioned the FSF earlier. If I remember correctly, one of Eric S. Raymond’s maxims is that “many eyes make light bugs”, or something like that. [“Given enough eyeballs, all bugs are shallow”, dubbed ‘Linus’ Law’ by ESR.] The parallel with data is that one of the frequently voiced concerns about making it more readily accessible is that it’s messy, and won’t necessarily have been created in a structured, well-described way with an eye on posterity. Researchers (and others) are nervous about letting people behind the curtain. From the modest amount of coding I’ve done personally, and the slightly greater amount that I’ve project managed, I know that code (like science itself!) doesn’t always go according to plan. It’s messy (with lots of failed methods commented out) and people understandably don’t want to air their dirty washing in public, so if they’re obliged to share it they will want to spend time and effort tidying it up. And that comes with a cost.
Having said all that, if people know that their workings (in addition to the finished “product”) are likely to be scrutinised by others, they’ll probably be more diligent, or rigorous, in producing them. Perhaps rigorous isn’t the word I’m chasing here… conscientious? Vigilant? Apprehensive?!
One tactic in RDM and OA that might usefully be ported to software would be the ‘champions’ approach. Organisations like SPARC-Europe have found success in encouraging the message to come from (or at least to appear to come from) within the community itself, to avoid the perception that this new expectation is a diktat from a distant and uncaring Olympus. I would anticipate that code would have a real advantage in this regard, as there are already huge networks of people working on Open Source projects, and their enthusiasm is plain to see…
JC – Yes, I think the most persuasive arguments are those that come from peers and describe benefits to the individual as well as the community. “Because you really ought to” convinces no-one! I think in that regard software preservation and sharing has an advantage, in that the people joining us on this journey are those who want to be here, and we can work with those people to demonstrate the benefits of spending time and effort on looking after software. Ultimately that’s what will help convince those who are (rightly) wary of these changes.
It’s interesting that you use the word “apprehensive”: people experience a lot of stress around this because it’s new and I think that puts them off trying things that could really help them and their research. The concerns people have are very real and can’t simply be dismissed with a wave of the hand or the right infrastructure or a clever argument. I think it’s important to recognise that it takes patience and compromise to make a permanent, positive change in culture.
I’m increasingly seeing a lot of these areas through a lens of “scholarly communication”, and in that context code is an incredibly precise way of communicating certain types of idea. Academic success requires other people to understand, refine and build upon your ideas, and by communicating in a variety of ways (prose, data, code) you create more opportunities for them to do that and hence for the value of your own work to be recognised. Ultimately I’d like to reach a point where all forms of scholarly communication were equally valued, and the choice between writing a paper, publishing a dataset or releasing some code (or a combination) was based entirely on what was most practical or effective for the research in question.
One way of getting there is to establish the practice of citing software (and data) in conventional bibliographies. Do you think this will solve that problem or are there other things we need to do too?
MD – You mentioned licensing earlier, and I wonder whether that might be the (or at least ‘an’) elephant in the room here. Citing data in a bibliography is one thing, in that the creator info is usually clear-cut, but citing software is another, especially when it’s a group effort (as with many Open Source programs)… or a commercial package. I’ve spoken with researchers recently who literally don’t know where to begin with this. Their funders have policies which require (or strongly encourage) them to make available all of the tools and information necessary to reproduce findings, but that can be extremely difficult in practice, especially when proprietary/commercial software has been used. At the DCC we tend to encourage people to use open formats and software where possible, but the reality is that some academic disciplines rely to a huge degree on de facto proprietary standards, and there’s no easy way around that.
I guess the temptation we always face is to try to solve this all at once, but in reality it’s always going to be a process of incremental steps, and as much about awareness raising as problem solving. The Software Sustainability Institute has done a lot of good work in this area. And how do you eat an elephant (in a room)? One bite at a time!
Links and references
- Digital Curation Centre, http://www.dcc.ac.uk/
- Free Software Foundation, http://www.fsf.org/
- Research Data Management at the University of Sheffield, https://www.sheffield.ac.uk/library/rdm
- Software Sustainability Institute, https://www.software.ac.uk/