Euan Cochrane

Last updated on 5 November 2019

Euan Cochrane is Digital Preservation Manager at Yale University Library


Online web services are used by billions of people every day. They affect our lives and society in myriad ways. How they present data to us, and how they manipulate and transform the data we store in them, has the potential to change our behaviour and our understanding of the world. And this is all happening at a scale without historical precedent. These services have changed greatly over time, and many of those changes are not publicly documented or even known to the general public. I’ve outlined a few of the changes we do know about below:

Facebook

Facebook’s interface has changed substantially since it was widely launched in 2006. The Like button was added in 2007, the News Feed started prioritizing popular posts in 2009, view counts were added to videos in 2014, and in 2016 “fake news” sites were banned from advertising on the platform.

Gmail

Gmail was launched to the public in 2004, and it took another year or so, until 2005/6, before rich text formatting was introduced in emails. In 2008 themes were added to Gmail, followed in 2012 by custom themes. 2013 saw the introduction of tabs in the inbox, and in 2018 Google introduced “Smart Compose”, which automatically suggests the text of emails while the user composes them, or before, based on the message they are replying to.

Google Docs/Drive

Google Docs was launched by Google in 2006 after its acquisition of Writely. The Google Docs viewer added support for 12 additional formats in 2011, the ability to suggest edits was added in 2014, and in 2019 Google added the ability to edit Microsoft Office files within the Google suite.

Office 365

Office Online was announced in 2008 as the lightweight “Office Web Apps” but only saw public release late in 2009. Co-authoring was added to all the Office applications in 2012, followed by real-time co-authoring in 2013. Office Online currently (2019) differs in a number of ways from the offline product, including being unable to run macros in Excel, lacking reference and mailing functions in Word, and lacking various views[1].

Online Software as a Service (SaaS) tools and sites such as these have had, and continue to have, a huge reach and influence across the globe. By the end of 2013 Facebook had over 1.3 billion users, and by the end of 2018 it had gained another billion. In 2016 Gmail had over 1 billion active users. Changes to these interfaces affect all of those users in different and sometimes profound ways. Being able to view user data in the different versions, or to see how the experience changed over time from different perspectives (user, administrator, advertiser, etc.), could be very beneficial. Even a subtle change in the user experience or Application Programming Interfaces (APIs) of any of these tools has the potential to have a huge impact on the world. Such changes have certainly upended businesses that relied on stable APIs and/or interfaces provided by online services such as these.

Unfortunately, it is virtually impossible to track those changes, to experience the different versions as they were presented when they were in active use, or to undertake research into the impacts of interface changes over time, because no public archives of those interfaces appear to exist.

So what could be done about this? Can we preserve versions of web services for future researchers?

My first recollection of debate about this problem in the digital preservation community was a panel I took part in, with David Rosenthal among others, at the iPres 2016 conference, titled “Software Sustainability and Preservation: Implications for the Long-Term Access to Digital Heritage”. I mention David specifically because he and I disagreed on what could be done to save versions of services like those discussed above. My recollection is that David felt it was a lost cause; I was more optimistic. However, I do agree there are a lot of hugely challenging potential barriers, for example:

  1. Getting the data archived
    1. Getting access to the files that provide the web services would likely require engagement from the service providers themselves. Some might be acquired in the event that a vendor goes out of business, but even then privacy concerns would likely stand in the way
    2. Most of these services rely heavily on open source software but may also have a lot of intellectual property protected as trade secrets, which could be disclosed by archiving with a third party
  2. Replicating complex experiences involving multiple different services, or “Bound, Blurry, and Boundless” objects[2]
    1. As Espenschied and Rechert superbly describe it: “Even a small to mid scale web service under the control of the organization that seeks to preserve it might turn out to be spread across several virtual machines or making use of external microservices, requir[ing] new strategies and concepts.”
  3. Replicating experiences without user data would be less useful, but archiving a representative set of user data would be challenging due to:
    1. Privacy issues with sharing personal information
    2. User data is often meant to be live and constantly changing, so a static copy would not be enough to replicate this experience (as was seen during the Preserving Virtual Worlds projects).
  4. Replicating services that are under constant live development
    1. Facebook has a reputation for extensive A/B testing, meaning that at any point in time the experience of one subset of users may be very different from that of another, something Espenschied and Rechert discuss at greater length in [2], referenced below.
  5. Replicating experiences without third-party services such as ad networks
    1. Third-party services such as ad networks, also discussed by Espenschied and Rechert, add to the full experience of the object but may be even more difficult to preserve and even more fraught with legal and ethical issues.

These are really difficult problems, so why am I at all optimistic?

Personally, it is because I’m continually learning to be more patient and persistent. I’ve been using emulators since the 1990s, when I ran WinUAE on Windows 95 to emulate my old Amiga 500 so I could play its games. I’ve been an advocate for the use of emulation in digital preservation for over a decade, and many folks (some of whom we referenced in our recent iPres paper) have been working on it for decades, but it is only recently that it has started to become more accepted within the digital preservation community. We may not have all the solutions yet, but we can make progress over time. And we will never have them if we don’t start trying.

From a practical perspective, Espenschied and Rechert lament that:

“When it comes to web-based services such as YouTube, Twitter, etc, technical complexity and size poses limits. These objects are to be regarded as boundless, there is no way to preserve them while ensuring the continuous availability of all provided interfaces and potentials. Even if the technical infrastructure to create a complete copy of YouTube would be available, the main purpose of preservation—reducing the actively maintained surface and maintenance frequency of an object by abstracting its complexity—would be economically unattainable. YouTube the service requires YouTube the organization to provide its full performative potential.”

Despite this, as we heard recently at the iPres 2019 conference (in a paper co-authored by Rechert), nascent technical progress towards such an end is being made. Capturing and securely preserving networks of servers and desktops connected to external data sources is now a realistic proposition. In the linked paper “Preservation strategies for an Internet-based artwork yesterday, today and tomorrow” by C. Roeck et al., the authors describe how they emulated an obsolete Google API in order to replicate a legacy website’s experience as authentically as possible: “A further problem, which can be solved by the presented approach, is the usage of ancient Google search Web APIs (utilizing SOAP) for the TraceNoizer system. Google has stopped supporting this API, but it can be emulated in a virtual network environment and, consequently, allows the TraceNoizer environment to remain unchanged.”
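
To give a rough sense of what “emulating” a retired API can look like in practice, here is a minimal, hypothetical sketch: a small stand-in service that answers the legacy endpoint with a canned, period-appropriate response, run inside the emulated network so the old site’s requests resolve as they once did. The element names and response shape below are my own assumptions for illustration, not the contract of the actual (long-retired) Google SOAP Search API or the specific approach used in the Roeck et al. paper.

```python
# Minimal sketch: a stand-in for a retired SOAP search API, served inside an
# emulated network so a legacy site's requests resolve as they once did.
# Endpoint details and response shape are illustrative assumptions only.
from http.server import BaseHTTPRequestHandler, HTTPServer

CANNED_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <doSearchResponse>
      <resultCount>1</resultCount>
      <results>
        <item><title>Archived result</title><url>http://example.org/</url></item>
      </results>
    </doSearchResponse>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

class LegacySearchStub(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and discard the legacy client's SOAP request envelope.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Reply with a fixed, period-appropriate SOAP response.
        body = CANNED_RESPONSE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/xml; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Inside the emulated network, DNS for the retired API's hostname would
    # simply be pointed at this stub.
    HTTPServer(("0.0.0.0", 8080), LegacySearchStub).serve_forever()
```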

So technical progress is being made. But there is a lot more work to complete before we could even come close to replicating something like the experience of Facebook in 2016, and it’s worth exploring what that program of work might include.

So what might it take to begin to address the barriers outlined above and start archiving versions of these much larger scale online services? There are a few things that could help:

  1. Engagement from the service providers
    1. It would help enormously if the service providers engaged with the digital preservation community. I know we would love to engage with them in the EaaSI program of work[3]
  2. Memory institutions being willing to show vision and leadership and start to tackle this challenge
    1. This is currently a Research and Development (R&D) problem. That makes it a risky proposition: those who try to solve it will likely fail multiple times before succeeding. It will take great vision and perseverance to make progress on this challenge.
  3. Appropriate resourcing to address the problems
    1. As an R&D challenge, this needs funding, and the funding will likely need to be long-term and substantial. However, there is a small spark of promise in the possibility that some of the more well-funded online service providers could allocate some of their own resources to this problem.
  4. Greater understanding of the legal implications
    1. As discussed, there are many legal challenges here, and resolving them will take time and resources. Perhaps it will require archives to become comfortable running black-box services for which they have no access to the underlying code, or to sign NDAs limiting their access to the back ends of the services they preserve in order to keep supporting the front ends they supply to users. Perhaps all data presented in these services will have to be redacted or restricted for a century or more before research is allowed.
  5. New tooling approaches and concepts are required.
    1. Espenschied and Rechert did a wonderful job of outlining the challenges in just deciding what should be captured to support endeavours like this. Capturing and replaying a live experience of one, or multiple, versions of Facebook from a point in time would be a huge challenge, but there are some potential approaches that could be explored. Perhaps machine learning and AI tools could be used to simulate real users, or streams of donated real user data could be played back in real time while a user interacted with Facebook, to recreate the experience of the service as it was at that time. We could see Memento-like sliders on service websites or in web archives that move the user interface, databases and algorithms back in time (see the sketch after this list).
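
To make the “slider” idea concrete: the Memento protocol (RFC 7089) already lets a client ask a web archive for a page as it appeared near a chosen datetime. The sketch below uses the Internet Archive’s public Memento TimeGate as a stand-in for the kind of time-aware endpoint a service provider could one day offer itself; note that a web archive only holds the rendered surface, not the databases and algorithms behind it, which is exactly the gap the approaches above would need to fill.

```python
# Minimal sketch of a Memento-style "time slider" lookup (RFC 7089):
# ask a TimeGate for a page as it appeared near a chosen datetime.
# Uses the Internet Archive's public TimeGate; a future, service-run
# TimeGate could answer the same kind of request from live back ends.
import urllib.request

TIMEGATE = "https://web.archive.org/web/"   # public TimeGate prefix
TARGET = "https://www.facebook.com/"        # the page we want to "rewind"
AS_OF = "Thu, 31 Mar 2016 12:00:00 GMT"     # where the slider is set

request = urllib.request.Request(
    TIMEGATE + TARGET,
    headers={"Accept-Datetime": AS_OF},
)
with urllib.request.urlopen(request) as response:
    # The TimeGate redirects to the memento closest to the requested datetime;
    # the Memento-Datetime header reports which snapshot we actually received.
    print("Resolved to:      ", response.geturl())
    print("Memento-Datetime: ", response.headers.get("Memento-Datetime"))
```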

So, is this an insanely huge challenge? Perhaps, yes. Is this an at-risk, digitally endangered species? Most definitely. Is it a challenge worth at least trying to address? I think so!

As a final optimistic thought: perhaps a large part of the web 2.0+ past has been lost, but perhaps the future could be better. Instead of just looking to the past, we can also look to the future and advocate for the developers of new services to think about future researchers as they build them. What if online services had a time dimension built in from the start?
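
To illustrate that closing thought, here is a purely hypothetical sketch of what a “time dimension built in from the start” might mean in practice: nothing is overwritten in place, and every read accepts an optional as-of timestamp. The names and structure are invented for illustration and are not drawn from any existing service.

```python
# Hypothetical sketch only: a service designed with a time dimension from the
# start. Every record keeps the time and interface version it belonged to,
# and reads accept an optional "as_of" timestamp (default: now).
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class Post:
    content: str            # what the user published
    ui_version: str         # which interface build rendered it at the time
    captured_at: datetime   # when this state was recorded

class TimeAwareFeed:
    def __init__(self) -> None:
        self._history: List[Post] = []   # append-only; nothing is overwritten

    def publish(self, content: str, ui_version: str) -> None:
        self._history.append(Post(content, ui_version, datetime.now(timezone.utc)))

    def feed(self, as_of: Optional[datetime] = None) -> List[Post]:
        # Return the feed as it existed at "as_of"; omit it to get the present.
        cutoff = as_of or datetime.now(timezone.utc)
        return [p for p in self._history if p.captured_at <= cutoff]
```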


[1] https://en.wikipedia.org/w/index.php?title=Office_Online&oldid=917283550

[2] D. Espenschied and K. Rechert, “Fencing Apparently Infinite Objects”, iPres 2018 Conference Proceedings. https://osf.io/pw5dq/

[3] But see point 3 as a caveat

