At last week’s iPRES2013 conference in Lisbon, a talk was given about an experiment by Tessella on the migration of WARC files, called Studies on the scalability of web preservation. One remark in the talk caused quite a stir: the presenter suggested adapting the WARC file and deviating from the standard. Why? Because, as we were told, the current version of the Wayback Machine software, which is used to render WARC files, does not render WARC files with conversion records well. But tweaking the format of the Archival Information Package and storing the result for the long term is not the way to go. We preserve information for the long term. Our future custodians will not understand such a deviation (unless they are told about it via metadata, and perhaps not even then) and will assume that a file presented as WARC follows all the rules of the standard. Deviating from the standard is wrong; in fact, it is almost a deadly sin.
After reading the corresponding publication (the conference papers are published by the Portuguese National Library as a free e-book), I saw that things were less straightforward. The approach Tessella chose was to create two WARC files: a correct WARC that follows the standard, and an adapted WARC for access. From the article:
This required the development of two different workflows for creating migrated WARC files: one, which is formally correct according to the WARC standard, and maintains the integrity of the WARC schema, and a second which is more pragmatic, and produces a file that can be displayed correctly by current WARC viewers. This pragmatic workflow can also be used for the migration of container formats which do not support conversion records, such as ARC files.
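To make the "formally correct" workflow a little more concrete, here is a minimal Python sketch of what a conversion record could look like when written by hand. It is not Tessella's implementation. The function name build_conversion_record and the example values are hypothetical, and the header set reflects my reading of WARC 1.0 (a record of type "conversion" that points back to the original record via WARC-Refers-To), so check it against the standard before relying on it.

```python
import uuid
from datetime import datetime, timezone


def build_conversion_record(payload: bytes, content_type: str,
                            target_uri: str, refers_to: str) -> bytes:
    """Serialise one WARC 1.0 'conversion' record (illustrative sketch).

    refers_to is the WARC-Record-ID of the original record that was migrated,
    e.g. "<urn:uuid:...>". All names here are assumptions for illustration.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {date}",
        f"WARC-Refers-To: {refers_to}",      # link back to the source record
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",   # length of the block in bytes
    ]
    # Header lines end with CRLF; a blank line separates header and block,
    # and the record is closed by two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"
```

The point of the sketch is that the migrated content lives in its own record next to the original one, so the original is never overwritten and the file stays a WARC in the sense of the standard.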
So what should one do when an ISO standard does not meet one’s requirements? The WARC standard is maintained by the BnF, as is easy to find out by looking up the standard itself; this is stated online precisely so that people can get in touch. Another approach is to contact the party behind the Wayback Machine software, which, as everyone involved in web archiving knows, is the Internet Archive. And there is the IIPC, the International Internet Preservation Consortium, which is currently setting up a developers’ working group to improve the Wayback Machine software. So if you have problems with the standard, think about the millions of precious digital objects that need to be preserved in that format and get in touch with the community. But don’t tweak the format itself!