A Dutch edition of the Keepers registry … but for web archiving!

Last week the Dutch Digital Heritage Network launched a new product: the national registry of web archives in the Netherlands. This collaborative work gives an overview of the websites that are harvested and preserved in the Netherlands by a variety of organisations. The KB, as national library, collects websites on the basis of its mandate (we see websites as “publications”), but many other Dutch organisations harvest websites too: from the Netherlands Institute for Sound and Vision to the National Archive, and many more. Together we want to save the Dutch web, and we want to inform each other about what each of us is doing.

The new registry is a kind of “Keepers Registry” for Dutch web archiving. Everyone can see which websites an institution is harvesting and since when, how often each website is crawled, with which software and for what reason (for example because of a legal mandate or from a collection point of view). For legal reasons most of the websites can only be viewed on site, but there are exceptions, and the access regime is part of the information given. Where possible there is also a link to the current live website.
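To make the kind of information in the registry concrete, here is a minimal sketch in Python of what a single entry could look like. The field names, the placeholder institutions and the helper `harvesters_of` are all my own assumptions for illustration; they do not describe the registry's actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RegistryEntry:
    """One harvested website as described in the text (hypothetical field names)."""
    website_url: str          # the website that is harvested
    institution: str          # who harvests it
    harvested_since: date     # since when
    crawl_frequency: str      # how often the site is crawled
    crawl_software: str       # with which software
    reason: str               # e.g. "legal mandate" or "collection building"
    access_regime: str        # e.g. "on site only" or "online"
    live_url: Optional[str] = None  # link to the current live website, if any


def harvesters_of(entries: List[RegistryEntry], website_url: str) -> List[str]:
    """Return the institutions that harvest the given website, sorted by name."""
    return sorted({e.institution for e in entries if e.website_url == website_url})


# Placeholder data, not taken from the real registry.
entries = [
    RegistryEntry("https://example.nl", "Institution A", date(2010, 1, 1),
                  "yearly", "crawler-x", "legal mandate", "on site only",
                  "https://example.nl"),
    RegistryEntry("https://example.nl", "Institution B", date(2015, 6, 1),
                  "monthly", "crawler-y", "collection building", "online"),
    RegistryEntry("https://example.org", "Institution A", date(2012, 3, 1),
                  "yearly", "crawler-x", "legal mandate", "on site only"),
]

print(harvesters_of(entries, "https://example.nl"))
```

A lookup like `harvesters_of` is the sort of question the registry answers: before starting a new crawl, an organisation can check whether somebody else already takes care of the site.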

One of the reasons to start this initiative was to avoid duplication of effort. It may still be the case that two organisations are harvesting the same website, but from now on this will be intentional, for example because they have a different perspective (legal mandate versus collection building) or harvest the site in a different way. We already know of smaller organisations that will not invest in harvesting websites that could potentially enrich their collection, because they are happy to learn from the registry that another Dutch organisation takes long-term responsibility for them.

We are currently contacting the Internet Archive to discuss whether the Dutch websites in their collection could be incorporated in this registry as well.

OAIS and the KB Web Collection history

The Open Archival Information System (OAIS) model places great value on the (future) user of the digital archive. This “Designated Community” of the archive needs to be given extra information about the archived material in order to turn the preserved bits in the content data object into “information”. The understandability of the content of the digital archive can be improved by contextual information about the archive. In a recent article in Alexandria, my colleague Kees Teszelszky and I describe our vision of how to present this contextual information about one special digital collection: the web collection.


The Nikhef website in 1992, the first in the Netherlands and the third worldwide.

Interactive born-digital artworks and authenticity

Digital art objects are often presented as a very difficult category of digital objects to preserve. A recent report by Cornell University Library documents its efforts to set up the “Preservation and Access Frameworks for Digital Art Objects (PAFDAO)”. Even if you preserve other kinds of digital objects, the report contains some interesting remarks, from which I picked two topics: “authenticity” and web archiving.


Cultural Authenticity

In order to find out what users expected of the Rose Goldsen Archive of New Media Art at Cornell, the project group ran a survey among the users of its interactive born-digital artworks. Cornell University Library had already chosen a preservation strategy for this material, namely emulation. It came as a surprise to find out that its potential users had other opinions about emulation. “Emulation was controversial for many, in large part for its propensity to mask the material historical context (for example, the hardware environments) in which and for which digital artworks had been created”. This historical context was seen as part of the artwork's authenticity, called “cultural authenticity” in the report, and it exists outside the digital object. It is perhaps not quite the same as the concept of “the original look and feel”, but it is at least related to it.

Harvesting web art

Another interesting aspect of the report is that the authors witnessed an “increasing prominence of video and web art.” But in their opinion the currently available technologies for web harvesting are not mature enough and too costly. I wonder whether they considered that the International Internet Preservation Consortium (IIPC) could play a role there. The IIPC has a lot of experience in web harvesting, including the harvesting of difficult material. The IIPC could at least help them with the Environments Database. To find the right emulation system, the requirements of the original environment are needed. But what if those are not available? Then “it is recommended to consider which operating systems and web browsers (and versions) were contemporary with the work, and configuring an emulator or virtual machine to closely match that environment” (p. 28). And that is exactly why the Preservation Working Group of the IIPC started its Environments Database, in which IIPC organisations regularly give an overview of the equipment in the reading rooms where the public can look at the web collection.

We preservationists have more in common than we sometimes think. Perhaps you’ll find other interesting topics in this document!

International collaboration in the IIPC Preservation Working Group

Web archiving is often about collecting the web. But part of the work is also related to preserving the web. One of the working groups in the International Internet Preservation Consortium (IIPC) focuses on this aspect. Recently we published an article in D-Lib Magazine called Facing the Challenge of Web Archives Preservation Collaboratively: The Role and Work of the IIPC Preservation Working Group. The article was written by Andrea Goethals, Clément Oury, David Pearson, Tobias Steinke and me. In this article we inform you about our goals, activities and results in the Preservation Working Group. We also report the findings of a survey we did in 2013 among the (around 50) members of the IIPC about their approaches to preserving the web. And we want to point you to a set of databases we maintain, with crucial information for web archiving, like the Environments Database and the Risks Database. Happy reading!

Oops! Article preserved, references gone


Journal des Scavans

The first scientific journal (Paris, 1665)

Recently an article was published by Klein, Van de Sompel et al. in PLOS ONE (see below), drawing attention to the problem of “reference rot”. Reference rot is a combination of link rot (following a link on the web results in a 404 error message) and content drift (the page can be found but its content has changed). References in academic publications have a purpose: they underpin the argument. References can point to other scholarly publications, but a growing number of references in scholarly publications refer to sources on the web. These web sources are especially prone to reference rot, whereas referenced “publications” have, at least theoretically, a bigger chance of being preserved by national libraries or organisations like CLOCKSS and Portico.
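The distinction between the two kinds of reference rot can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names are hypothetical, and comparing exact byte fingerprints is a deliberately crude stand-in for detecting content drift; it is not the method used in the study.

```python
import hashlib
from typing import Optional


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 fingerprint of a page's content."""
    return hashlib.sha256(content).hexdigest()


def classify_reference(status_code: int,
                       cited_content: bytes,
                       current_content: Optional[bytes]) -> str:
    """Classify a web reference as 'link rot', 'content drift' or 'intact'.

    status_code:     HTTP status returned when the link is followed today
    cited_content:   the page as it was when the author cited it
    current_content: the page as it is today, or None if nothing was returned
    """
    # Link rot: the link no longer leads to a page (the 404 case from the text).
    if status_code == 404 or current_content is None:
        return "link rot"
    # Content drift: the page is there, but it is no longer the page that was cited.
    # Assumption: any byte-level difference counts as drift.
    if fingerprint(cited_content) != fingerprint(current_content):
        return "content drift"
    return "intact"


print(classify_reference(404, b"original page", None))
print(classify_reference(200, b"original page", b"rewritten page"))
print(classify_reference(200, b"original page", b"original page"))
```

In practice pages change in trivial ways all the time (dates, banners, adverts), so a real check would need a more tolerant notion of “the same content” than an exact fingerprint.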

Based on a large set of data, the study shows the impact of reference rot and provides evidence that many web pages change frequently, often within a few days of first being published.

These outcomes will affect, among others, the collections of national libraries and institutional repositories. The authors are worried and conclude that “Our research found that reference rot in scholarly communication is a significant problem that begs for the introduction of a robust solution”. While national libraries preserve academic e-journals and e-books, there is a risk that the references in these publications are no longer there for users who want to investigate them. The (future) user might be unable to verify the arguments and conclusions in the publication.

This is a threat to the value of collections. Value consists of several elements. At the KB we used a method called Significance to value our collections. One of the elements is “information value”. The “information value” of a publication diminishes when verification of its sources is no longer possible. Why is this risk not higher on the agenda?

The above-mentioned article is restricted to scholarly publications, but with the growing range of features in e-publications we might expect this to happen on a larger scale. The problem is inherent in references to the web at large and will affect a variety of publications, although in some cases the loss of references will matter less.

A “robust solution” is not there yet, although this article prompted a creative approach to the link rot problem by my colleague Rene Voorburg (see robustify.js, a website add-on that redirects broken links to archived pages using Memento). David Rosenthal, in his excellent blog on this topic, doubts whether there will ever be one. He thinks the problem is inherent in the way publishing on the web works. “This is the root of the problem. In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies.” He implicitly suggests that lots of copies should keep the stuff safe!
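The idea of redirecting a broken link to an archived copy can be sketched as follows. This is not how robustify.js is implemented; it is a small Python illustration under my own assumptions. The function `resolve_link`, the `is_alive` callback and the TimeGate address `timegate.example.org` are hypothetical placeholders. The one thing taken from the Memento protocol (RFC 7089) is the `Accept-Datetime` request header, which tells a TimeGate which moment in time you are interested in.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Callable, Dict, Tuple

# Hypothetical TimeGate address; a real deployment would point to an actual
# Memento TimeGate, whose URL pattern may differ from this one.
TIMEGATE_BASE = "https://timegate.example.org/timegate/"


def resolve_link(url: str,
                 cited_on: datetime,
                 is_alive: Callable[[str], bool],
                 timegate_base: str = TIMEGATE_BASE) -> Tuple[str, Dict[str, str]]:
    """Return the URL to follow and the extra request headers to send.

    If the original link still works, follow it unchanged. Otherwise fall
    back to the TimeGate and ask for a copy close to the citation date.
    cited_on must be a timezone-aware datetime in UTC.
    """
    if is_alive(url):
        return url, {}
    headers = {"Accept-Datetime": format_datetime(cited_on, usegmt=True)}
    return timegate_base + url, headers


# Usage with a stand-in checker that pretends every link is broken.
cited = datetime(2014, 12, 26, tzinfo=timezone.utc)
target, headers = resolve_link("http://example.com/page", cited, lambda u: False)
print(target)
print(headers)
```

The link checker is passed in as a function so that the fallback logic can be tried out without any network access; in a real setting it would issue an HTTP request and look at the response.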

The reference rot problem is a kind of reality check and shows that even preserved material is incomplete without proper preservation of its context. Content holders should be more aware of this.

You can find the article at:

Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. http://dx.doi.org/10.1371/journal.pone.0115253

More on this problem at http://robustlinks.mementoweb.org