A deadly sin

At last week’s iPRES2013 conference in Lisbon, a talk called Studies on the scalability of web preservation presented an experiment by Tessella on the migration of WARC files. One remark in the talk caused quite a stir: the presenter suggested adapting the WARC file and deviating from the standard. Why? Because, as we were told, the current version of the Wayback Machine software, which renders the WARC file format, is not optimal for rendering WARC files with conversion records. But tweaking the format of the Archival Information Package and storing the result for the long term is not the way we should go. We preserve information for the long term. Our future custodians will not understand the deviation (unless they are told about it via metadata, and perhaps not even then) and will assume that a file in WARC format follows all the rules of the standard. Deviating from this is wrong; in fact, it is almost a deadly sin.

After reading the corresponding publication (the conference papers are published by the Portuguese National Library as a free e-book), I saw that things were less straightforward. The approach Tessella chose was to create two WARC files: a correct WARC according to the standard and an adapted WARC for access. From the article:

 This required the development of two different workflows for creating migrated WARC files: one, which is formally correct according to the WARC standard, and maintains the integrity of the WARC schema, and a second which is more pragmatic, and produces a file that can be displayed correctly by current WARC viewers. This pragmatic workflow can also be used for the migration of container formats which do not support conversion records, such as ARC files.
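
For readers who have not met conversion records before, here is a minimal Python sketch of the kind of record the formally correct workflow adds to a WARC file. It follows my reading of the record layout in the WARC standard and is not Tessella's code; the function name `build_conversion_record` and the example values are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def build_conversion_record(original_record_id: str, target_uri: str,
                            payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC 'conversion' record (illustrative sketch only).

    original_record_id is the WARC-Record-ID of the record whose content
    was migrated; the new record points back to it via WARC-Refers-To.
    """
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: conversion",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Refers-To: {original_record_id}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates header and block,
    # and two CRLFs close the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"
```

The original record stays in the file untouched; the migrated content travels alongside it in a new record that refers back to it. A renderer has to know that it should follow that reference, which is the capability the talk found missing.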

So what should one do when an ISO standard does not meet one’s requirements? In this case the WARC standard is maintained by the BnF, as one can easily see by looking up the standard itself. The maintainer is named online precisely so that people can get in touch. Another approach is to contact the party behind the Wayback Machine software, which, as everyone involved in web archiving knows, is the Internet Archive. And there is the IIPC, the International Internet Preservation Consortium, which is currently setting up a developers working group to improve the Wayback Machine software. So if you have problems with the standard, think about the millions of precious digital objects that need to be preserved in that format and get in touch with the community. But don’t tweak the format itself!

6 responses to “A deadly sin”

  1. As the presenter of this work at iPres, I would like to respond to your comments.

    Firstly, thank you for taking an interest in this work; we started working on it because the problems of providing a scalable solution to migration within containers (bearing in mind lots of formats are really containers) seemed to us to be an urgent issue. Your comments relate to one of these containers: WARC (and this is in fact our primary example as it is quite a complex container). Hence, there are a few points I’d like to make in response to your comments:

    1. This was research work and as such we needed to work out how to migrate files inside WARC containers; if the output of migration was something that we couldn’t see working (e.g. via Wayback), then we couldn’t be confident that it had indeed worked and it would not be a very good example of verifiable migration inside WARC. Furthermore, such research stimulates and informs the discussion of what, if anything, needs to change. Hence, I think we are justified in exploring alternatives to the standard as part of this work. I hope the paper makes this clear.

    2. I would like to point out that our software (SDB) can indeed create a migrated WARC file that does conform to the WARC standard (by making use of the conversion record type), if the user chooses this option. However, no WARC file renderer, including the Wayback Machine software, can render WARC conversion records, so such a file, while conforming to the standard, cannot be accessed. Since the whole point of migrating objects in obsolete file formats is to provide continuing access to their contents, not being able to render conversion records negates the point of creating them in the first place.

    3. I agree that if an ISO standard does not meet your requirements, you should contact the maintainers. However, I think that the WARC format is actually well placed to cope with migration of individual records; it is the capability to render WARC files that needs revision.

    4. At the time the work was carried out, it appeared that the Wayback Machine (which was the best WARC file renderer available) was not being maintained actively; the last updates to it appeared to be in 2011. Even so, one of the Tessella developers involved with the work did post a question on the mailing list for Wayback Development about the rendering capabilities of WBM for conversion records (http://sourceforge.net/mailarchive/message.php?msg_id=31011222), but didn’t receive any response. It was that lack of feedback that drove us into having to consider alternative arrangements (i.e. our pragmatic solution).

    5. Our pragmatic solution does not tweak the WARC format; it just modifies the contents of a WARC file to ensure continued access to those contents. Since SDB (our digital archiving software) stores the original manifestation (aka rendition aka representation) of the WARC file as well as the migrated WARC file, no information is lost. In addition, the metadata stored in the WARC file and in SDB records the details of the preservation action, so it is clear what has been done to create the new WARC file from the original WARC file.

    6. If someone were to use this “in anger” today there is a choice: keep to the standard (and sacrifice access) OR get something where access is possible today. Because we have done the latter and can see it working, we are more confident that the former is working as well.

    7. It is also worth stating that I don’t necessarily expect anyone to use this “in anger” today but I do hope we’ve helped to pave the way to allow this to happen when needed in the future, probably by some variant of what we’ve outlined in this paper.

    8. Since this work was carried out, the IIPC has announced that they are restarting development on the Wayback Machine and Heritrix. This is great news and I hope they will consider modifying the Wayback Machine to handle conversion records. Indeed, after giving my talk I was able to discuss our work with Clement Oury (BnF and treasurer of IIPC and custodian of the WARC standard) who was interested that we had demonstrated that it was possible to migrate files within a WARC file while conforming to the WARC standard, albeit without being able to display them. He promised to discuss the use of conversion records (and the fact that we had demonstrated that they could indeed be used to migrate the contents of WARC files) at the next meeting of the IIPC’s preservation working group. However, he did inform me that the highest priority for updating the Wayback Machine is to handle revisit records (for de-duplicating crawls of a website).

    Finally, the main point of our paper was to demonstrate the migration of content within a container, which I think we have done. While the WARC format was our primary example (as it is quite a complex container), the work is equally applicable to other container formats which have simpler formats.

    • Dear Pauline, thanks for elaborating on this. It shows that there is a gap between theory and practice, and it is worthwhile that you exposed it, although I would still prefer an improved WARC renderer over the “pragmatic way” you have chosen. But the point has been raised, and hopefully the attention the blog and your comments got in IIPC circles (I forwarded it to the Preservation Working Group) will lead to community work and a solution for this problem.

    • Hi Pauline,

      Development of the open source wayback machine is going to be managed by the IIPC in the future. While this feature was not on our development roadmap for the next year, I think that we would be happy to incorporate it.

      Wayback is currently developed only by volunteers, and they tend to concentrate on features that are of interest to their institutions. I think that your organization may be the only one interested at this time in migration records, and so, if you are interested in this feature, I’m afraid that you’ll probably need to be the ones who implement it.

      The code is currently available at http://github.com/internetarchive/wayback, and will soon be moved to the http://github.com/iipc group. I think that Brad Tofel outlined a possible way to implement this feature here: http://sourceforge.net/mailarchive/message.php?msg_id=31011222

      best, Erik

