Rhiannon Lewis is a PhD researcher at the School of Advanced Study, University of London.


As someone embarking on a PhD that will use digital images and accompanying data from social media as its primary research data set, attending a day that investigated different approaches to doing just that was an excellent point of reflection at an early stage! My research will investigate the (re)use of digital images of collection objects from the Science Museum Group, and how different contexts on social media provide new understandings of those objects. Both the digital images and the data that accompanies them will form the primary evidence for my research. I was therefore keen to find out about the methods used by national memory institutions for archiving social media platforms. What were the main considerations when archiving different platforms? What were the best practice standards? How can I (and others) apply these to current research?

There were some recurring themes among the speakers, whose archiving practices shaped their approaches to researching and archiving social media. Firstly, and perhaps obviously, different platforms need different methods of data collection. What is more, various approaches can be taken to the same platform, and these will produce very different data. Underpinning all these approaches are the ethical considerations involved in archiving and collecting from social media platforms.

The first conclusion to take away from the day might be obvious, but it should not be underestimated: different social media platforms call for different data collection methods. As Sara Day Thomson, from the Digital Preservation Coalition, highlighted, these can include web crawlers, platform self-archiving services, API-based tools, third-party services and data resellers. I was comforted to learn that institutions like the British Library were encountering the same complications as me when archiving platforms such as Facebook, with data collection only possible on a qualitative scale. An introduction to web scraping, primarily on Twitter through APIs, by Laura Wrubel and Dan Kerchner (researchers from George Washington University) highlighted that this method requires a specific set of skills and can therefore be hard to do. Collecting in this way means getting data in a format very different from how it was originally encountered, sometimes returning “incomplete” data in JSON. For someone primarily interested in images, aggregating data in this way could not be the end point of data collection. Yet, just as there is more than one platform, there are multiple ways to collect data from them.

Capturing social media data doesn’t necessarily mean capturing it the way that material was initially encountered. There is a need to understand, if not capture, the context. Sara Day Thomson’s presentation noted that content can only be accessed through the current format of the platform, regardless of when it was created, and therefore might not appear the same as when originally posted. She also articulated some of the challenges of capturing the broader conversation, or context, which is crucial for understanding social media posts. They are not singular but part of a broader networked conversation. Independent researcher Anisa Hawes made the same point: accessing historical content directly through a platform requires using the current technological interface. Hawes introduced Rhizome’s Webrecorder, a high-fidelity tool able to capture and review image files, such as memes or “graphic events”, across numerous platforms by documenting a browsable encounter. It therefore collects what other web crawling technologies struggle to harvest. In her study ‘Collecting & Curating Digital Posters Using Mixed Approaches’, Hawes used the example of a meme of a poster from a David Cameron campaign ad to better understand the networked environments on which a meme depends to become established and to further its meaning. This approach does, however, potentially introduce an element of subjectivity from the curator or archivist who is directing the navigation through, and collection of, online content.

The DPC briefing day made clear that social media is an important research dataset but must be navigated with a duty to protect people’s privacy. Sara Day Thomson addressed the ethics of data collection on social media platforms, noting that the terms and conditions of a platform may not be enough in themselves to fulfil personal and ethical research obligations. They need to be followed, but more stringent measures may need to be taken as well. Equally important was the way in which Nicola Bingham, from the British Library, described their archiving of social media pages and the accessibility of that data. In their archiving practices they consider people’s current understanding of social media spaces in order to decide what information people regard as public and what as private. They are also selective about what they archive, and only a very small data set is made publicly available, to make sure people’s information is protected. Even if posts are anonymized, this does not necessarily hide a person’s identity, because content can be searched and traced back, so this needs to be taken into consideration. Although I am a long way from publishing the findings of my own research, it is very important to be conscious throughout of how it will be published.

To conclude, the key takeaway points, or points for consideration, from the day for me as a researcher were that:

  • Data collection methods should be influenced by the functions and set-up of the platforms, as well as by the research questions.
  • Just because you have captured the data doesn’t mean that you have captured it the way it was initially encountered.
  • Social media is an important source of knowledge, but any data collection needs careful consideration and must be undertaken ethically, with an emphasis on protecting people’s identities.

For more on Rhiannon’s research see: https://research.sas.ac.uk/search/student/1304/ms-rhiannon-lewis/

