Recently Google Scholar pointed me to an interesting pre-print about web archiving. The authors, Xinyue Wang and Zhiwu Xie, raise the question of whether we as a web archiving community are using the “right” format when we use the WARC container format for our web archives. The WARC format is an ISO standard (ISO 28500:2017), and colleagues from the International Internet Preservation Consortium (IIPC) contributed to shaping the current version, WARC 1.1. WARC is the successor of the Internet Archive's ARC format, which is still in use at many web archiving organisations that harvest with an older version of Heritrix. As a container format, WARC is suitable not only for web archives but for other digital objects as well. When the WARC format was designed for web harvests, the intention was also to include metadata relevant to the preservation of the digital objects. So managing storage and preservation were important goals in the design of the format.
The authors of the pre-print, however, focus on another use of large web archives: access. This is not the kind of access that was foreseen when the WARC format was first designed, with one person looking for one preserved website. It is access to large collections of websites in order to perform large-scale data analysis, which is how many scholars nowadays want to use web archives. In other words: re-use of the collected websites.
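To make the container idea concrete, here is a minimal Python sketch of how a WARC record is laid out: a version line, a block of named header fields carrying the metadata, an empty line, and then a content block whose size is given by the Content-Length field. The function name `read_warc_records` and the sample record are my own illustration, not something taken from the pre-print. The sketch also assumes an uncompressed file, while real web archives are usually gzip-compressed and are better read with an existing library such as warcio.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only; it does no validation
    beyond checking the version line.
    """
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # skip the blank lines that separate records
        if not version.startswith(b"WARC/"):
            raise ValueError("not a WARC version line: %r" % version)

        # Named fields ("Name: value") run until the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# A made-up record, just to show the layout.
sample = (
    b"WARC/1.1\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2020-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

Note that the reader has to walk through the file record by record. That suits storage and the replay of a single page well enough, but it hints at why scanning millions of records for analysis can become slow.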
According to the authors “[…] the central theme of today’s web archives is still predominantly on web content collections and preservation. Reuse beyond the prescribed browsing pattern is rarely supported. As a result, the archival system architecture is centered around WARC files and not optimized for research-driven analytics workflows.”
According to the authors, it is not only the focus on collecting but the WARC format itself that stands in the way of this large-scale research, because its performance is too low: “Performance may therefore determine whether it is even feasible to explore certain research questions.” While domain experts have done their best to achieve the most with limited means, the authors suggest that “the domain status quo only reflects best practices extracted from the past experience and requirements that may be irrelevant in big data settings”. In other words: the chosen WARC format.
The authors support their argument with a set of performance tests on large collections of WARC files and report the results in the article. They conclude: “Our evaluations provide evidences that WARC carries significant performance penalties for batch data processing workloads, up to two orders of magnitude slower than that of more efficient formats. We therefore call for the web archiving community to consider adopting alternative archival formats.”
Well, that is easier said than done. Apart from the large web collections in WARC format, many collections are in the even older ARC format. But in one respect the authors might be right. As part of preserving web archives, we need to monitor the changing requirements of our Designated Community, and the research community is obviously part of that Designated Community. Since all web archives want their collections to be re-used, have we paid enough attention to the changing needs of our Designated Community? Should we not think about the consequences of new methods of re-using the material that was preserved with so much effort and at such cost? Unfortunately our IIPC conference will not take place this year due to Corona, but it is still an interesting topic to discuss.
Wang, Xinyue & Xie, Zhiwu. (2020). The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle.