Mark Schroeder is a solution architect in Iron Mountain's Digital Business Unit.
We are living in most interesting times…
(Joseph Chamberlain, 1898)
When the collapse of the Soviet Bloc precipitated the breakdown of the German Democratic Republic in 1989, the East German Secret Service (Ministerium für Staatssicherheit, or Stasi) found itself holding extensive archives of records. In the forty years of its existence, 91,000 employees of the Stasi and up to 180,000 informants had amassed thousands of linear metres of archive material.
Torn Stasi documents, image from the German Federal Archive
The precipitous speed of the regime’s fall, the large volume of the records, and the nature of the carefully-curated and cross-indexed media made thorough destruction impossible. This did not stop the last participants of the regime trying the impossible. Techniques used included fine shredding, burning, pulping, and towards the end, coarse hand-shredding.
Following an act of German Parliament that passed into law in December 1991, the Stasi Record Agency (Stasi-Unterlagen-Behörde or BStU) was established in Berlin and twelve regional locations with the mission to preserve and make accessible all the Stasi records from 1950-1990, with some archive material even predating 1950.
There continue to be many imperatives for making these files accessible to citizens, all of which are familiar to those inspired by digital preservation: historical accuracy provides a reliable foundation to establish justice for the victims and accountability for the perpetrators. In turn, the process of reconciliation and healing can continue. The research and educational benefits from authentic and reliable records help prevent “historical amnesia” and remind us of the dangers of power without accountability.
The work continues: snippet sorting in 1994. Source: dpa (Deutsche Presse-Agentur)
By the early years of the twenty-first century, after nearly thirty years of purely human effort, only six hundred sacks of paper had been successfully reassembled. This constituted less than four percent of the total.
In December 2000, the Bundestag instructed the BStU to investigate alternatives to fully manual reconstruction, which would have taken many generations (estimates vary between 600 and 800 years at current progress rates). After a successful proof of concept under laboratory conditions, the Fraunhofer Institute was engaged to develop the approach further and pilot it at a larger scale.
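The scale of those estimates can be checked with some back-of-envelope arithmetic. The sketch below uses only the figures quoted above and makes two simplifying assumptions: "nearly thirty years" is treated as 30, and "less than four percent" as exactly 4%, which makes the total a lower bound.

```python
# Figures quoted in the text. The two roundings (30 years, exactly 4%)
# are simplifying assumptions, not archive data.
sacks_done = 600
years_elapsed = 30
percent_done = 4

# If 600 sacks are 4% of the total, the total is at least this many sacks.
total_sacks = sacks_done * 100 // percent_done

# Average manual throughput so far, and the time left at that pace.
sacks_per_year = sacks_done / years_elapsed
years_remaining = (total_sacks - sacks_done) / sacks_per_year

print(f"Total sacks (at least): {total_sacks}")
print(f"Manual rate: {sacks_per_year:.0f} sacks per year")
print(f"Years remaining at that rate: {years_remaining:.0f}")
```

Under these assumptions the result falls inside the 600 to 800 year range mentioned above.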
In 2007, four hundred sacks of material were moved to the Fraunhofer Institute facility in Berlin. The Fraunhofer team under Bertram Nickolay used machine vision techniques born in the 1990s to develop the processing engine. To maintain the context of the documents, physical handling of the materials was extremely rigorous: segments were carefully sorted to maintain the relationship in which they existed in the sacks. Staples and paperclip bindings are important relational clues, so their relationship to the attached media was also preserved.
The proposed scanning system had to deliver high resolution and high-fidelity colour rendering, along with fragment edge detection, contour identification and paper tone identification, as values for downstream interpretation. OCR was not used to recognise text, since many documents could exist in other languages. Instead, the pattern of the text blocks was extracted and held as an image layer separate from the underlying media and edge-detection layers. As part of the scanning workflow, each fragment was to be laser-etched with a barcode, invisible to the naked eye, that helped stitch the layers together. Finally, once each page had been reassembled, trained archivists were employed to build the interrelationship of assembled pages into their containing files.
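To make the idea of "values for downstream interpretation" concrete, here is a minimal sketch of how paper tone and edge shape might be used to shortlist matching fragments. It is not the Fraunhofer algorithm, whose details are not described here. The `Fragment` record, its field names, the tone threshold and the sample values are all hypothetical, and a torn edge is simplified to a list of depths sampled along its length.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    barcode: str                      # hypothetical ID standing in for the etched barcode
    sack_id: str                      # sack the fragment came from (context)
    paper_tone: tuple[int, int, int]  # average RGB of the paper
    edge_profile: list[int]           # depth of the torn edge at sampled points


def tone_distance(a: Fragment, b: Fragment) -> int:
    """Sum of absolute RGB differences between two paper tones."""
    return sum(abs(x - y) for x, y in zip(a.paper_tone, b.paper_tone))


def edge_mismatch(a: Fragment, b: Fragment) -> float:
    """Mean gap when b's edge is turned to face a's edge; 0.0 is a perfect fit."""
    if len(a.edge_profile) != len(b.edge_profile):
        return float("inf")
    facing = reversed(b.edge_profile)  # b is rotated to meet a
    gaps = [abs(x + y) for x, y in zip(a.edge_profile, facing)]
    return sum(gaps) / len(gaps)


def best_match(target: Fragment, candidates: list[Fragment], max_tone_gap: int = 30):
    """Discard candidates on clearly different paper, then pick the best-fitting edge."""
    plausible = [
        c for c in candidates
        if c.barcode != target.barcode and tone_distance(target, c) <= max_tone_gap
    ]
    return min(plausible, key=lambda c: edge_mismatch(target, c), default=None)


# Invented sample fragments, for illustration only.
a = Fragment("F001", "S17", (230, 225, 210), [0, 2, 5, 3, -1])
b = Fragment("F002", "S17", (228, 224, 212), [1, -3, -5, -2, 0])
c = Fragment("F003", "S17", (229, 226, 211), [0, 1, 1, 0, 0])
d = Fragment("F004", "S02", (120, 110, 90), [1, -3, -5, -2, 0])

match = best_match(a, [b, c, d])
print(match.barcode if match else "no plausible match")
```

Filtering on paper tone before comparing edges reflects the layered approach described above: cheap features narrow the candidate set, and the more expensive contour comparison runs only on what remains.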
A five-minute overview of the Fraunhofer reassembly project: https://www.dw.com/de/die-schnipselmaschine-wie-zerrissene-stasi-unterlagen-rekonstruiert-werden/video-6693167
Investment in the design of the scanning hardware, development of the operating procedures and coding of the machine vision algorithms was extensive, but at the time of development was considered justified due to the level of international interest in the process. One key specification challenge was the requirement that each reassembled page was guaranteed to be 100% accurate.
Virtual reconstruction meant that the physical document was not recreated, unlike the manual process. ‘Simple’ reconstructions were more cost-effectively carried out manually in any case, with the advantage of the physical document for reference and integrity.
Between 2007 and 2014, only 23 of the target 400 pilot sacks had been reconstituted through the two pilot phases. Whether this can be considered a success or failure is still a subject of active debate. What is certain is that the appetite to complete the work remains. In 2022, 29,064 citizen access requests were made, a number almost unchanged from the year before.
So what has changed between 2014 and now? What sort of organisation would be required to make a success of a reboot of the project? We have seen that there is no such thing as a fully-automated technical solution, but also that an entirely manual approach could not meet any realistic completion timeframes.
Technically, innovation should present opportunities for enhancement:
- Cloud-based computing and digital resources can be made available at unprecedented pace and scale, without the need for upfront investment. Pilots and proofs of concept can be delivered, assessed and re-engineered faster.
- Pure code can be run "serverlessly", with no infrastructure to provision, so that effort can be focused on the outcome rather than on supporting infrastructure.
- The ability to build cloud-based processing pipelines that include multilingual OCR support for machine-printed and handwritten text adds the possibility for the meaning and context of fragmented text blocks to play a role in reassembly. For me, this is the most exciting new dimension.
OCR supported by a trained Large Language Model (LLM) and other new ML techniques could identify themes, subjects, keywords and even syntactic writing styles to assist the trained archivists in assembling pages into files, and files into topic threads.
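As a rough illustration of how recognised text could support that grouping, the sketch below ranks already-reassembled pages by keyword overlap with a query page. Simple keyword overlap (Jaccard similarity) is a deliberately basic stand-in for the LLM-based analysis suggested above. The OCR step is assumed to have happened upstream, and the stopword list, function names and sample texts are invented for illustration.

```python
import re

# Illustrative stopword list only; a real one would be far longer.
STOPWORDS = {"der", "die", "das", "und", "von", "the", "and", "of"}


def keywords(text: str) -> set[str]:
    """Lower-cased words longer than two letters, minus stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared keywords divided by all keywords; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_pages(query_text: str, pages: dict[str, str]) -> list[tuple[float, str]]:
    """Return (score, page_id) pairs, most similar page first."""
    query = keywords(query_text)
    scored = [(jaccard(query, keywords(text)), page_id) for page_id, text in pages.items()]
    return sorted(scored, reverse=True)


# Invented placeholder texts standing in for OCR output.
pages = {
    "P1": "Bericht über das Treffen in Leipzig am Bahnhof",
    "P2": "Lieferung von Büromaterial und Papier",
    "P3": "Treffen Leipzig Bahnhof Beobachtung fortgesetzt",
}

for score, page_id in rank_pages("Beobachtung am Bahnhof Leipzig", pages):
    print(f"{page_id}: {score:.2f}")
```

A ranking like this would only ever be a suggestion: the archivist still decides which pages belong in the same file.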
Improvements in network bandwidth, hybrid architecture design and latency management mean that cloud-based graphical computing capabilities could be available at or just behind the scanner devices themselves. There is no such thing as an image that’s ‘too good’. Upscaling a poor-quality image is hard; downscaling is easy.
Additionally, since we've already learned that the proposed Fraunhofer bespoke scanner was intended to invisibly barcode every snippet where possible, it is not unfeasible that a downstream mechanical sort could reconcile the physical pieces in the time taken to scan and process them. This would provide further proof of record authenticity.
Successful completion of projects like this will need a number of factors to align, not least a realistic budget. One thing is certain though: archivist expertise and the technology that augments it will always be two of the many key ‘pieces of the puzzle’ when society needs a complete picture of historic events.
At Iron Mountain we welcome opportunities to support and collaborate with the deep archivist experience surfaced through organisations like the Digital Preservation Coalition, and stand completely behind the DPC mission. Personally, I welcome the opportunities for learning that DPC offer to everyone interested in the preservation of history, and look forward to our next opportunities to learn and participate.
References:
Expression of interest procedure for the new edition of the project “Virtual reconstruction of torn MfS documents” Reference number of the announcement: 05430#0001#0087 - 111 - 41 - https://ausschreibungen-deutschland.de/1073365_Interessenbekundungsverfahren_zur_Neuauflage_des_Projekts_Virtuelle_Rekonstruktion_2023_Berlin
German Bundesarchiv (Stasi Records Archive) - Reconstruction of torn Stasi documents https://www.stasi-unterlagen-archiv.de/archiv/rekonstruktion/#c41320
(Video, German Language) The Snippet Machine - how torn Stasi documents are reconstructed:
(Video, Auto-translation) How Stasi Files are reconstructed - Fraunhofer IPK, 9th Oct 2013
Stasi Media Library. Extensive collection of online stories from the period:
https://www.stasi-mediathek.de/
Fraunhofer Institute Stasi file project page:
https://www.ipk.fraunhofer.de/de/zusammenarbeit/referenzen/stasi-puzzle.html