Daniel Gomes is Arquivo.pt Service Manager for the Foundation for Science and Technology in Portugal.
"Collect the web to preserve it?! I don't envy that job."
That is a direct quote from my first "real-world" meeting.
I was 23 years old, had just graduated from university, and this was my first job. The year was 2000.
One year later, we had developed a running prototype to perform selective collection of online publications. It was the first effort to preserve the Portuguese web, resulting from a collaboration between the National Library of Portugal and the University of Lisbon.
Even in those early days of the Web, it became clear that acquiring and storing information from the Web before it vanished was a challenge, though a rather simple one compared to ensuring that the stored web data remains accessible over time.
Arquivo.pt: a public service for digital preservation
Most web authors have neither the resources nor the awareness to preserve their digital works. The mission of our digital preservation service is to provide web authors with the right to be remembered.
The project began in 2007, and this year we celebrated its 10th anniversary. The main milestones were:
- 2008: First crawl of the Portuguese Web (1.6 TB).
- 2010: First prototype of the search and access service.
- 2010: Start of daily crawls for 215 online publications.
- 2011: Integration of the first donated collection.
- 2014: Publication of the thesis Information Search in Web Archives.
- 2015: Publication of digital preservation recommendations for web authors.
- 2016: Release of the Arquivo.pt public service.
Arquivo.pt currently holds more than 4 billion files in several languages, preserved from the web since 1996, and provides a public search service over this information.
The service provides user interfaces for textual, URL and advanced search. It also provides Application Programming Interfaces (APIs) that enable third parties to quickly develop added-value applications over the preserved information.
Open-access rules
FCT (Foundation for Science and Technology) is the public institution that manages funding and infrastructures for research and higher-education in Portugal. Besides managing Arquivo.pt, FCT leads the national open-access initiative.
Open access is the default for the institution's activities.
The software developed for Arquivo.pt is available as a free open-source project so that it can be reused and improved by other initiatives. The Arquivo.pt team has also published over 40 open-access scientific and technical articles related to web archiving.
However, open access to the preserved information must respect authors' rights and avoid any harm to their activities. We follow 3 general guidelines in all our activities:
- Preserved sites should not compete with the online sites: Considering that 80% of the pages disappear or change within 1 year, we impose a minimum access embargo period of 1 year after data acquisition.
- Preserve the access level imposed by the authors: If information was originally published by its authors to be openly accessible online, then that accessibility level must also be preserved and respected. Restricted-access content is not preserved unless the author explicitly requests it.
- Do not take ownership of the preserved data: Respect restrictions imposed by authors on data collection (e.g. through Robots Exclusion Protocol rules) and remove preserved content on request.
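The first and third guidelines can be expressed as simple checks. The following is only a minimal sketch, not the Arquivo.pt implementation: the function names, the bot name `example-archive-bot`, the example URLs and the approximation of one year as 365 days are all assumptions made for illustration. It relies on Python's standard `urllib.robotparser` module to interpret Robots Exclusion Protocol rules.

```python
from datetime import date, timedelta
from urllib.robotparser import RobotFileParser

# Assumption: the 1-year embargo is approximated as 365 days.
EMBARGO = timedelta(days=365)


def is_publicly_accessible(acquired_on: date, today: date) -> bool:
    """Return True once the embargo period since acquisition has elapsed."""
    return today - acquired_on >= EMBARGO


def may_collect(robots_txt: str, url: str,
                user_agent: str = "example-archive-bot") -> bool:
    """Return True if the robots.txt rules allow this (hypothetical) bot to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt published by a web author.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/"

print(is_publicly_accessible(date(2016, 1, 1), date(2016, 6, 1)))
print(may_collect(ROBOTS_TXT, "http://example.pt/private/page.html"))
```

In this sketch, a page acquired five months ago is still under embargo, and a URL under a disallowed path is not collected at all.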
During 10 years of activity, Arquivo.pt preserved 11 million websites and was accessed by more than 283 thousand users. We received removal requests from 8 web authors.
Becoming an international research infrastructure
Preserving online scientific outputs
Despite being focused on the preservation of the Portuguese web, Arquivo.pt has the inherent mission of serving the scientific community.
Research and Development (R&D) projects use the Web to publish complementary scientific information such as data sets, documentation or software. However, this precious and sometimes unique information quickly disappears after the projects' funding ends.
Arquivo.pt has automatically identified online information related to national and European R&D projects and preserved it.
Arquivo.pt also accepts suggestions from the community of scientific websites worldwide to preserve.
Training on web preservation and research
It is important to train researchers to explore the preserved data and to raise awareness of the need to preserve born-digital web data. Therefore, in 2017 we launched the Investiga XXI bursaries and a training program on web preservation and research.
Making the Past as accessible as the Present
It is a common, naive belief that everything is online, which leads to the mistaken conclusion that whatever is not online does not exist or, even more worrying, never happened. The tremendously fast pace at which the Internet has penetrated societies, without proper digital preservation services, may actually have created an online amnesia about recent events.
On the other hand, for the first time in the history of Mankind, the technology exists to make information published in the Past as accessible as information about the Present.
If we achieve this, will Mankind continue to repeat the same mistakes?