It’s been just over a year since Elon Musk took over control of X (formerly Twitter), and the major changes that he has made to this popular social media platform have been hard to miss – especially for any organization hoping to archive and preserve posts. So for the last DP Clinic of 2023, a one-hour focussed discussion forum that is open to all, we thought it might be interesting to revisit the whole topic by exploring the question: “What are we going to do about Social Media?”
About 40 people joined the session, and we began with a few warm-up questions to establish experiences and areas of interest amongst the group. There was a strong sense that collecting social media has become harder, or much harder, over the past few years, and several possible causes were suggested. Many respondents felt that there are growing technical challenges to collecting social media (such as the changes made to Twitter/X’s API for research), but also growing concerns about legal aspects (from increasingly restrictive Ts&Cs, to questions of copyright and IPR), and about how to cope with the increasing scale and scope of social media posts. Responses to the statement “When I think of Social Media, I think…” generated the word-cloud that readers can see at the top of this blog post, and it is apparent that some of the most frequent terms reflect these difficulties.
Many of the attendees reported that they had previously had some success with collecting social media posts, most notably via the Twitter/X API, but that these efforts had stalled following the changes introduced in the past year. Where appropriate permissions had been obtained, some reported a modicum of success collecting posts to particular Facebook accounts. But there was widespread recognition that legal issues (e.g. around data protection) and ethical concerns were as much of a barrier to collecting social media posts as any of the technical challenges. Several attendees reported that whilst their organization was interested in collecting social media, they hadn’t yet taken the plunge because of the significant difficulties involved, whilst others who had begun collecting felt that their efforts were effectively stalled for the foreseeable future.
Some attendees suggested that the recent surge in the development of Large Language Models for training AI tools was perhaps one reason why many social media platforms seemed to be tightening up their Ts&Cs of use – which had had the indirect effect of making it harder (both technically and legally) for organizations with a mission to collect and archive social media posts. Another attendee reported that issues around copyright and IPR remained one of the biggest challenges to collecting social media, and that the rapid spread of social media content generated by AI tools was likely to further complicate matters.
Others noted that the sheer popularity (and variety) of social media platforms is creating difficulties for some large organizations – many of which have a huge number of official, semi-official, and personal social media accounts which can be too resource-intensive (and expensive) to collect and preserve at scale. In addition, it was observed that collecting official social media at scale often demands extra effort in appraisal and deduplication, to minimize the impact of commonplace actions such as retweeting or reusing content. A few attendees had previously had some success gathering social media posts via web-archiving tools and techniques (e.g. MirrorWeb or even screen scraping), but these were now failing due to technological changes or new access controls. One response to this had been to encourage account holders to archive their own social media posts and then offer these for deposit, but this had met with very limited success.
There was general agreement that it would be helpful to have some more case studies from people and organizations that have attempted to collect social media – notably WhatsApp (building on earlier work, such as Jingwen Yang’s 2018 blog post “WhatsApp Records Capture”, or the guidance for government organizations provided by the Nationaal Archief of the Netherlands). Some participants suggested that the DPC could usefully conduct a survey of its Members in 2024, to summarize activities within the digital preservation community.
One attendee drew everyone’s attention to the EU’s Digital Services Act (DSA), which is due to come fully into force on 17th February 2024, suggesting that this might provide an opportunity for libraries and archives to provide access for researchers to certain data held by major social media platforms. However, some people felt that the views of libraries and archives were not (yet) sufficiently reflected in the implementation of the DSA, and that events such as next year’s conference at the Deutsche Nationalbibliothek would be crucial.
It was suggested that the current situation around social media is in such turmoil (especially technically and legally) that, for the foreseeable future, our efforts might be better spent focussing on advocacy rather than attempting to collect and preserve social media posts.
And with that, our hour of discussion was up. Some attendees said that they had found it reassuring that they weren’t “missing something”, and were relieved to hear that others were also finding it (increasingly) difficult to collect social media content. Hopefully the prospects for collecting social media content, especially at scale, will improve during 2024 – and we’ll have a clearer idea of how to answer the question “What are we going to do about Social Media?” by this time next year.