Chris Loftus

Last updated on 23 August 2019

This blog has been written by Peter Vickers, a postgraduate student in Speech and Language Processing hired by the University Library, as part of the University of Sheffield’s OnCampus programme, to look into file identification and archiving.


Forgotten Scripts

Below is an inscription written in Linear A, a Minoan script which has been found on thousands of historical objects across Greece. Because the language bears no close similarity to any language we understand, and we have no Rosetta Stone with which to decipher it, linguists have had to use speculation and comparison to attempt to decode the script. Whilst over the past decades Linear A has been related to the proto-Greek Linear B, the Hittite Luwian script, Phoenician, and Indo-Iranian, none of these comparisons has either achieved widespread academic acceptance or allowed for the translation of much of the Linear A corpus. For now, at least, Linear A, and all of the debts, curses and tax returns it encodes, are indecipherable.

Given our cultural interest in lost languages and the knowledge they might encode, I wonder what researchers in 100 years will make of all the digital content we create. Linear A is 3,500 years old – old enough to be forgiven for having been forgotten. Meanwhile, last week I found myself unable to access the data on a five inch floppy disk, a format which was still in use twenty years ago. Of course, the loss is not the same – I could use the library’s archival system to read the disk. However, the data on the disk might itself be in an obsolete file format. To compare this with the Linear A problem: recovering the data from the disk might be compared to the legibility of a script, whilst opening the files might be compared to our ability to translate it.


Linear A cup.png


Filetypes

We are used to files being accessible: the common set of pdf, doc and jpg files in daily use is readable on almost any computer. Even if our computer does not have the capability to open a file, the operating system will usually prompt for any requisite software that needs to be installed when we attempt to open it. This is a perfectly acceptable system for everyday usage on files which only need to be used in the short to medium term. We can assume that software producers will not change the structure of the files because they know breaking compatibility would cause widespread complaints, and users can be sure that other people will be able to read their files because everyone uses the same basic set of file types. Even so, most of us have had the experience of being sent a file we cannot identify by its extension, and have tried opening it in Word only to be greeted with a page of random characters. In this situation, if we are unable to find the software which will open the file, it will be unreadable and we will probably ask the creator to resend the document in another format. This assumes that the sender is still present and willing to re-encode their data – an assumption which in the case of library hosting will almost never be true. If the library were to store all submitted files and offer them up to users without explanation, it would create justifiable frustration for researchers accessing the files only to find they couldn’t open them, and place demands on the library’s time in identifying the correct format.

 

In an ideal world we would include a ‘Rosetta Stone’ for each file, from which the user could be certain of being able to access it. In fact, this is what Sheffield Library and many other large research institutions are attempting to do with the appropriately named “Rosetta” software. As files are uploaded by researchers, they are analysed and identified before being put into a Submission Information Package (SIP) for archiving. When the data is later requested, it comes with a metadata package which details the file type so that the accessing user can determine which software to open it with. However, in order to provide this information the library needs a reliable, automated method for identifying files in the first place. To explain how file identification works, we’ll have to look at what a file is.

 

What is a File?

A file is a collection of data which may be read or executed by a computer. As the computers we use are binary, files are stored as a list of 1s and 0s. When a program opens a file, it must interpret this list of ones and zeros. To take a common example, text files with the ASCII encoding use a lookup table which assigns a character to each seven-bit number (ASCII was made a US government standard by Lyndon Johnson in 1968, hence the retro look).



Now that we have a standard, it is trivial for a computer to split the file, or ‘bitstream’, into groups of seven bits and replace each group with the corresponding character from the lookup table.
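The decoding step can be sketched in a few lines of Python. This is only an illustration of the idea: the function name is made up for this example, and real text files normally pad each character to eight bits rather than storing a raw stream of seven-bit groups.

```python
def decode_ascii_bitstream(bits: str) -> str:
    """Decode a string of '0'/'1' characters as 7-bit ASCII."""
    if len(bits) % 7 != 0:
        raise ValueError("bitstream length must be a multiple of 7")
    chars = []
    for start in range(0, len(bits), 7):
        code = int(bits[start:start + 7], 2)  # seven bits -> integer 0..127
        chars.append(chr(code))               # integer -> character in the ASCII table
    return "".join(chars)


# 1001000 is 72 ('H') and 1101001 is 105 ('i').
bitstream = "1001000" + "1101001"
print(decode_ascii_bitstream(bitstream))  # prints "Hi"
```

The same fourteen bits read under a different convention would give a completely different result, which is the problem the rest of this post is concerned with.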

 

However, we can’t always assume we are looking at ASCII text files! Without any identifying information, it’s very hard to open a file. Imagine being a linguist and being told that the script you must decode is a set of 32,000 1s and 0s. This would be a gargantuan task, closer to a philosophical exercise than a linguistic one. In fact, Jorge Luis Borges’ short story ‘The Library of Babel’ investigates this problem: for any random string there are “An n number of possible languages using the same vocabulary”. Without some ground assumption about what a ‘1’ or a ‘0’ means, you could assume that the code meant anything. To take an extreme example, in some fictional file format, 0 might mean “substitute for this 0 the entire works of William Shakespeare” and 1 might mean “email any previous text to all known contacts”. In fact, there is no fixed standard on how files should encode information about their format. In its place, three conventions have arisen.

 

The first is the use of extension conventions. The characters after the (final) dot in a filename are associated with a file standard and a program which can open it. This method relies on developers respecting a one-to-one correspondence between extension and file type. Given the volume of software and the limited number of short file extensions, this correspondence does not hold in practice. Additionally, even files created by the same program with the same extension may not be mutually compatible due to version changes. You would have to be very trusting to assume that the file extension is a reliable indicator of file structure, and this is why, of the major operating systems, only Windows handles files this way.
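To make the weakness concrete, here is a minimal sketch of extension-based lookup. The table and the function name are hypothetical and exist only for this example; a real registry would be far larger. The ‘.nb’ entry shows what happens when two programs claim the same extension.

```python
from pathlib import Path

# Hypothetical lookup table for illustration only.
EXTENSION_GUESSES = {
    ".pdf": ["Portable Document Format"],
    ".f": ["Fortran source code"],
    ".nb": ["Mathematica Notebook", "Nota Bene document"],
}


def guess_by_extension(filename: str) -> list[str]:
    """Return every format associated with the characters after the final dot."""
    extension = Path(filename).suffix.lower()
    return EXTENSION_GUESSES.get(extension, [])


print(guess_by_extension("thesis.final.PDF"))  # one candidate
print(guess_by_extension("results.nb"))        # two candidates: ambiguous
print(guess_by_extension("README"))            # no extension, no candidates
```

Nothing in this lookup ever reads the file itself, so renaming a file is enough to change the answer.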

 

The second method of identification is more reliable, but also requires more maintenance work. It relies on the developer of a file format specifying a ‘magic number’ which identifies a file. This ‘magic number’ is in practice a series of bytes, usually at the very start of the file. The jpeg image format has reserved the sequence FF D8 FF DB, whilst many other common formats have their own: https://en.wikipedia.org/wiki/List_of_file_signatures (These numbers are in hexadecimal, which is a convenient way of expressing binary. Convert between the two here: https://www.binaryhexconverter.com/hex-to-binary-converter ). This solution is more robust because the file contains within itself a reference to its structure, which can be looked up in a registry. However, whilst a magic number is almost always a good predictor of a file type, it is not infallible. The ‘magic number’ reservation has no binding force; I could create a new file type which uses FF D8 FF DB as a magic number, or I could create a text file which began “ÿØÿÛ”, which in the Latin-1 encoding would be stored as the bytes “FF D8 FF DB” - my random letters would appear to be an image file!

Problems like this, and the fact that many developers do not include magic numbers in their files, have led The National Archives’ PRONOM/DROID systems to create file definitions which rely on common data features which, whilst not explicitly provided for identification purposes, appear reliably enough at a certain position in the file to identify it. To give an analogous human example, even if I don’t give the title, you can probably tell what genre the following excerpt belongs to:
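The spoofing example can be reproduced directly. The sketch below checks only the one JPEG signature quoted above (the function name is invented for this illustration), and it assumes the text file is saved in the Latin-1 encoding, where each of those four characters occupies a single byte.

```python
JPEG_MAGIC = bytes.fromhex("FF D8 FF DB")


def looks_like_jpeg(data: bytes) -> bool:
    """Check whether the data starts with the JPEG signature quoted above."""
    return data.startswith(JPEG_MAGIC)


# A plain text "document" whose first four characters (y-diaeresis, O-stroke,
# y-diaeresis, U-circumflex) encode in Latin-1 to the same bytes as the signature.
spoof = "\u00ff\u00d8\u00ff\u00db just some random letters".encode("latin-1")

print(spoof[:4].hex(" ").upper())  # FF D8 FF DB
print(looks_like_jpeg(spoof))      # True, although this is not an image
```

A tool that trusts the first four bytes alone has no way to tell this text file from a genuine photograph.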

I had called upon my friend, Mr. Sherlock Holmes, one day in the autumn of last year, and found him in deep conversation with a very stout, florid-faced, elderly gentleman, with fiery red hair.

You do not need a title or ‘magic number’ to know this is a detective story: the mention of ‘Mr. Sherlock Holmes’ gives it away. Similarly, files often begin with certain giveaway features. These features were not intended to designate the file type, but they appear so reliably that in practice they do.
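A rough sketch of this kind of rule is a table of byte sequences and the positions at which they are expected. The three rules below are a hand-picked illustration and are not taken from PRONOM’s actual signature data; the names are likewise invented for this example.

```python
# Each rule is (format label, byte offset, expected bytes).
# Hypothetical table for illustration, not PRONOM's real signatures.
SIGNATURES = [
    ("XML document", 0, b"<?xml"),
    ("Script with interpreter line", 0, b"#!"),
    ("POSIX tar archive", 257, b"ustar"),
]


def match_signatures(data: bytes) -> list[str]:
    """Return the label of every rule whose bytes appear at its offset."""
    matches = []
    for label, offset, pattern in SIGNATURES:
        if data[offset:offset + len(pattern)] == pattern:
            matches.append(label)
    return matches


print(match_signatures(b'<?xml version="1.0"?><root/>'))
print(match_signatures(b"\x00" * 257 + b"ustar"))
print(match_signatures(b"plain words with no giveaway"))
```

Because the function returns a list, a file can end up with no match or with several, and both outcomes have to be handled by whoever maintains the rules.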

 

At this point we have three methods of identifying files: one explicit and fragile, one explicit but optional, and one implicit and high-maintenance. These three methods are all used by the most common file analysis tools, such as the unix command file ( https://linux.die.net/man/1/file ) or the archival analysis tool DROID, which relies on its sister project PRONOM’s database of file signatures. These file signatures document the extrinsic information of the file extension and the intrinsic information about the location of identifying byte sequences, which may be either magic numbers or invariant sequences of bytes within a file. PRONOM aims to resolve the problem of file identity by assigning a unique resource identifier to each file type.

 

Unidentified Files

Below is a list of file extensions which were either over-identified (matched to two or more formats) or not identified at all within the Sheffield Figshare repository, with the number of files affected. Rosetta identifies files with the Harvard JHOVE system, which both identifies files in a similar manner to DROID and performs various validity checks on the identified files:

2+ matches

Extension    Count
.f           38
.nb          10
.doc         8
.xlsx        4
.tif         2
.docx        2

0 matches

Extension    Count
.sys         80
.ctm         78
.m           65
.inp         17
.r           15
.xyz         13
.lur         11
.pdb         11
.dat         8

Looking at the files with multiple matches, most of the correct file types appear obvious. ‘.f’ is almost certainly FORTRAN code, and the various office files are probably safe to classify based on their extensions (internally they often use zip archives to store data, hence the confusion). The only problematic entry is ‘.nb’, which could be a Mathematica Notebook or a Nota Bene file. Currently, the only way to differentiate these files is to attempt to open them – a time-consuming process for staff. There are also files which are not identified at all by Rosetta. These are often plaintext files which contain code or delimiter separated values. There were hundreds of Unicode text files in the Sheffield repository with the extensions .sys, .ctm, and .m. JHOVE could not identify these files because it does not look at text encoding or text within files, merely byte sequences. It appeared that another solution would be necessary.

 

Being Greedy with Results

The best software for file identification at ingest ought to provide as much information as possible about the file and be as automated as possible in its processing of the data. As files submitted to Sheffield Figshare must be manually accepted, it is acceptable if the software takes a while to run. After discussion with the staff, the solution has been to create the Sheffield Library Information Metadata (SLIM) program, which calls several file identification tools and collates their results into an XML file. Currently, the tools called are JHOVE, DROID, the unix file command, ffprobe, an md5 hash, the Python csvreader module, and a machine learning classifier. The results of all of these are compared and, if there is agreement, the filetype is set. If there is disagreement, the most common file type is suggested and the options are flagged to the user. The software is written in Python, and tools may be added through new classes.
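The comparison step can be pictured as a simple vote. The sketch below is not SLIM’s code: the function name, the dictionary layout and the format labels are all assumptions made for illustration, and the XML output is left out.

```python
from collections import Counter
from typing import Optional


def collate_results(results: dict[str, Optional[str]]) -> dict:
    """Combine per-tool guesses into one suggestion plus a disagreement flag."""
    # Tools that could not identify the file report None and do not vote.
    votes = Counter(guess for guess in results.values() if guess is not None)
    if not votes:
        return {"filetype": None, "agreed": False, "options": []}
    filetype, _ = votes.most_common(1)[0]
    return {
        "filetype": filetype,          # the most common answer
        "agreed": len(votes) == 1,     # True only if every voting tool agrees
        "options": sorted(votes),      # everything suggested, for the user to review
    }


# Hypothetical tool outputs for a single file.
example = {
    "JHOVE": None,
    "DROID": "text/plain",
    "file": "text/x-matlab",
    "classifier": "text/x-matlab",
}
print(collate_results(example))
```

Keeping the full list of options alongside the winner means a member of staff can see at a glance why a file was flagged.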

 

Plaintext

Most of the plaintext files in the Figshare repository are delimiter separated value files. These are easily identified with the Python csvreader module. This module is fairly robust and will handle headers and missing delimiters. One rule the SLIM system has is that if the csvreader returns positive and DROID identifies a text file, the filetype is marked as a delimiter separated value file.
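As a rough idea of what such a check might look like, the sketch below uses `csv.Sniffer` from Python’s standard-library `csv` module. It assumes that ‘csvreader’ refers to that module; the function name and the list of candidate delimiters are choices made for this example, and the sniffer is a heuristic rather than a guarantee.

```python
import csv


def looks_delimited(text: str) -> bool:
    """Return True if csv.Sniffer can infer a consistent delimiter."""
    try:
        # Only consider a few common delimiters (an assumption for this sketch).
        csv.Sniffer().sniff(text, delimiters=",;\t|")
    except csv.Error:
        return False
    return True


table = "name,length,width\nsample1,2.5,1.0\nsample2,3.1,0.9\n"
prose = "hello world\nno separators here\n"

print(looks_delimited(table))
print(looks_delimited(prose))
```

In practice only the first few kilobytes of a file would need to be passed in, since the sniffer works from a sample.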

 

Code files are more difficult to identify. Whilst strings like ‘int main(int argc, char **argv)’ are indicative of specific languages, other languages do not have any invariant strings. Identifying these files is more like a linguist’s job than the pattern-matching we’ve been looking at so far. To identify these files, it is necessary to use a probabilistic framework. Digital forensic research into file recovery has found that bigrams are the most effective features in classification (https://www.cs.cmu.edu/~jgc/publication/Statistical%20Learning%20for%20File-Type%20Identification.pdf).

 

To learn file types, a Machine Learning Module for SLIM was created: SLIMML. It relies on a set of training files to learn the properties of files. The initial development uses 50 C++ files, 50 MATLAB files, 50 doc files, 50 jpg files and 50 pdf files. SLIMML takes 30 bigrams from each file: 10 from the beginning, 10 from the end, and the 10 most common. These bigrams are then used to train a classifier with support vector machines, which are supervised machine learning models which learn boundaries between groups of data. After training, SLIM can identify a file by extracting its features and finding which decision boundary it resides within. No expert knowledge or file investigation is required besides a set of example files for each classification.
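The feature extraction can be sketched as follows. This is an interpretation rather than SLIMML’s actual code: it assumes the bigrams are overlapping pairs of adjacent bytes, and the function names are invented for this example. Turning the bigrams into numeric vectors and training the support vector machine are left out.

```python
from collections import Counter


def byte_bigrams(data: bytes) -> list[bytes]:
    """Return every overlapping pair of adjacent bytes."""
    return [data[i:i + 2] for i in range(len(data) - 1)]


def bigram_features(data: bytes, n: int = 10) -> list[bytes]:
    """n bigrams from the start, n from the end, and the n most common."""
    bigrams = byte_bigrams(data)
    most_common = [bigram for bigram, _ in Counter(bigrams).most_common(n)]
    return bigrams[:n] + bigrams[-n:] + most_common


sample = b"#include <stdio.h>\nint main(void) { return 0; }\n"
features = bigram_features(sample)
print(len(features))   # 30 for any file with at least ten distinct bigrams
print(features[:3])    # the first three bigrams of the file
```

Very short files yield fewer than 30 features, so a real pipeline would need to pad or otherwise handle them before training.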

 

The inclusion of a machine-learning-based discriminator significantly improved the accuracy of SLIM with common but hard to categorise file types. Where before the model could only identify such files based on the extension and text encoding, now the software was able to discriminate between different programming languages with a high degree of accuracy. The initial model was 90% accurate over a test set of 200 mixed files.

 

Improving performance with NLTK techniques

These results are good, but the features only look at the beginning and the end of the file, making identification difficult if a long comment pushes code out of the search window. After some discussion in the office, we decided to attempt to treat the files in the training set as a set of languages, and then learn to discriminate between those different languages in code. To obtain ‘words’ from the files we search for Unicode strings within the files and extract these as ‘features’. These features are then converted into vectors through the word2vec embedding model. In short, we assume that the file is a list of strings and treat these strings like words which exist relationally to other words in the file, and represent these relations with vectors. These vectors are used to train a support vector machine classifier, and can then be extracted from unseen files in order to classify them.
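The ‘word’ extraction step might look something like the sketch below, which works in the spirit of the unix strings utility. It is a simplification and not the SLIMML implementation: it only finds runs of printable ASCII rather than general Unicode, the minimum run length of four is an assumption, and the function names are invented. The word2vec embedding and the classifier are omitted.

```python
import re

# Runs of at least four printable ASCII characters.
# The minimum length of four is an assumption for this sketch.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes) -> list[str]:
    """Find readable strings embedded anywhere in the file."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


def tokenise(data: bytes) -> list[str]:
    """Split the extracted strings on whitespace to get word-like tokens."""
    return [token for s in extract_strings(data) for token in s.split()]


# Readable text surrounded by unreadable bytes; the final "ab" is too short to count.
blob = b"\x00\x01" + b"function y = f(x)" + b"\x00\xff\xfe" + b"y = x + 1;" + b"\x00" + b"ab" + b"\x00"
print(tokenise(blob))
```

Unlike the header and footer bigrams, these tokens come from the whole file, so a long comment at the top no longer hides the code beneath it.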

With these features, SLIMML achieved 99.2% accuracy on the test set. Most interestingly, it could completely differentiate between MATLAB, R, and C programming language files.

 

Going Forward

The aim of SLIM is to include as much metadata and classification information as possible when archiving files. The system is flexible: new tools and classification rules can be added, and new files can be added to the Machine Learning classifier. While the word2vec embedding feature model is a hacky way of looking at files, essentially assuming that enough information exists in plain text to identify them, it does give strong results, especially when combined with the header and footer bigram features.

In a way, a machine learning approach to file identification is opposed to PRONOM’s aim of having rigorous, invariant file definitions, because it encourages a ‘just let the computer figure it out’ attitude. On the other hand, it is not the primary role of library staff to spend time writing rules for file identification, and the SLIM system provides a high volume of information and a judgement on the file type with better accuracy than existing tools.

To return to the opening analogy of Linear A, by recording the true file type we are not actually providing a key to decode the file for future generations - merely recording that we interpret it in a certain way today. This will definitely not hurt future users in attempting to read old files, but a more thorough documentation - of the composition of the file structure itself - would be preferable. Unfortunately, many file specifications are proprietary and so this information cannot be easily acquired.

Machine learning is often a buzzword with the nebulous promise of ‘improved performance’. Often such performance increases are more dependent on the quality of the input data than on the cleverness of the model. Improvements to the feature extraction model could help increase accuracy in the future – for instance by looking for maximal length repeated byte sequences and including these alongside the Unicode features. More testing against very similar file types, such as C and C++ or even Python 2 and Python 3, could push the capabilities of the tool.

 





Comments   

#1 Andrew Jackson 2019-11-14 08:56
This looks great! Is any of the code available? Or any of more technical details? I’d like to try to evaluate this approach on a different corpus.
#2 Chris Loftus 2019-11-18 09:16
Quoting Andrew Jackson:
This looks great! Is any of the code available? Or any of more technical details? I’d like to try to evaluate this approach on a different corpus.

Hello Andrew. Thanks! I'll drop the team responsible a message and get back to you
#3 Marco Guardigli 2019-11-18 21:57
Very interesting.
Different versions or dialects of specific programming languages could be detected thru source code analysis.
Similarly, executables and other binary files could be analyzed to detect target operating systems.
I am interest and could probably contribute: in the future i will have to analyze historical museum data and study catalographic systems. If possible count me in as interested.
In the past i was working on a text analyzer to perform statistical authorship avaluation.
