Thou shalt not delete …

In the Netherlands we recently had an unpleasant affair in the scientific world. It became clear that a famous professor in the social sciences had based his publications and conclusions on faked data. This is called ‘scientific misconduct’, which can take the form of fabrication, falsification or plagiarism. In this case he fabricated the data himself. A newspaper euphemistically called it “data massaging”. The fraud went on for the past 10 years and went unnoticed by his colleagues. His articles were published in various scientific journals, including journals from Springer and Elsevier. Undoubtedly these articles are now permanently stored in various long-term archives.

A special commission is investigating the complete list of this professor's articles to determine which of them can no longer be called “scientific” and were fraudulent. Recently this commission published its first findings, which immediately led to a lively discussion in the newspapers. Some suggested deleting the affected articles from the university library repositories immediately. Others wanted librarians to add metadata to the articles to indicate that the conclusions were based on faked data. Some said publishers should do this, but the publishers replied that they were waiting for the final report of the special commission and would respond then.

How are publishers handling these cases? When you read the versioning recommendations in the guidelines on the NISO site, it becomes clear that (in theory) publishers have a special procedure for published articles, or “Versions of Record”. As shown in Use Case #10, “Corrected Version”: “Corrections to the published version are posted as the equivalent of errata or corrigenda, or the article has these corrections inserted [CVoR], or the errors are so serious (technically or legally) that the article is retracted or removed from publication [VoRs may still exist on various sites but they are no longer formally recognized; the formal publication site identifies the article as having been retracted or removed].”

In a recent article in PNAS, as cited in the Dutch newspaper NRC Handelsblad of 3 October 2012, two researchers analysed 2,047 articles listed in PubMed that were retracted in the past 40 years, to see what the reason for retraction was. Although retraction is often stated to be based on erroneous data, in fact only 21% appear to be, while 43.3% were based on fraud and 9.8% on plagiarism. In 11.3% of the cases the reason for retraction was unclear. Quite often the researchers needed to base their conclusion on extra information, not from the publisher's site but from other contemporary sources. Long-term repositories would have added new versions of articles to the already existing records in the digital archive. But what do they do when the publisher has retracted the article?

Long-term archives that are based on OAIS all know one of the “responsibilities” of an OAIS archive: “There should be no ad-hoc deletions” without a related policy (section 3.1). So even if a long-term archive intended to delete these fraudulent articles, the deletion should be based on a policy. I'm not aware of any long-term archive with a policy to delete articles that were retracted by publishers.

The archive “simply” takes responsibility for the long-term accessibility of all articles that are ingested into the digital archive.

In my opinion it is not the role of the archive to judge. The public should be able to see the history of the article and use other sources to get fully informed. One could argue that future researchers who use an article from the long-term archive for their own research should investigate the validity of that article themselves.
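The position above can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory archive: the class names, the `deletion_policy` hook, the identifier and the retraction date are all made up for illustration and are not taken from OAIS or from any real archive system. A retraction is recorded as an event in the article's history, and a deletion is refused unless a policy explicitly permits it.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ArchivedArticle:
    identifier: str
    title: str
    # Append-only history of events; this sketch never removes entries.
    history: list = field(default_factory=list)

    def record_event(self, kind: str, note: str, when: date) -> None:
        self.history.append({"kind": kind, "note": note, "date": when.isoformat()})

    @property
    def retracted(self) -> bool:
        return any(event["kind"] == "retraction" for event in self.history)


class Archive:
    def __init__(self, deletion_policy=None):
        # deletion_policy: optional callable that decides whether a given
        # article may be deleted. Without one, no deletion is possible.
        self.deletion_policy = deletion_policy
        self.holdings = {}

    def ingest(self, article: ArchivedArticle) -> None:
        self.holdings[article.identifier] = article

    def delete(self, identifier: str) -> None:
        article = self.holdings[identifier]
        if self.deletion_policy is None or not self.deletion_policy(article):
            raise PermissionError("no policy permits deleting " + identifier)
        del self.holdings[identifier]


# Hypothetical usage: the article stays in the archive, the retraction
# becomes part of its visible history.
archive = Archive()  # no deletion policy defined
article = ArchivedArticle("example-001", "A study based on fabricated data")
archive.ingest(article)
article.record_event("retraction", "Retracted by publisher", date(2012, 12, 1))
```

In this sketch a call to `archive.delete("example-001")` raises `PermissionError`, because no deletion policy was defined; the retracted article remains accessible together with its history.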

Forgeries happen all the time; sometimes they are discovered, sometimes not. These kinds of affairs (and they happen more often, not only in the Netherlands) show the value of long-term archives: to really be a safe haven for scientific publications, regardless of their content. Perhaps, eventually, they will even serve as a source for investigating fraudulent articles.

Policies: necessary and beneficial

The enormous growth of digital data will require memory organizations to develop a clear vision of their role in taking care of a fair slice of this data cake, in line with their mission and goals. It is not enough for an organization to say that it will adhere to the ISO 14721 OAIS model. It needs to translate this model into policies that are relevant to the specific organization, its goals and mission, and its specific collections. So a variety of policies will be developed, for example a collection policy to make the right selection, and access policies to give the organization's public (also defined in its policies) the opportunity to use these data.

In my opinion there are at least three reasons for developing preservation policies:

  1. Organizational sustainability
  2. A professional dialogue with third parties
  3. Better preparation for new developments

Organizational sustainability

Written policies that become part of the institutional memory will reduce the risk that a change of staff or management alters the approach to digital preservation in an organization. In the coming years the workforce of many organizations, such as libraries, will change dramatically due to an ageing population. Often the people who are leaving were the first to be involved in digital preservation in their organization. Transfer of knowledge is important, but it is not enough to achieve a sustainable preservation approach.

Professional dialogue with third parties

Organizations nowadays are deliberating whether they need to outsource certain tasks because they lack the expertise to perform these tasks themselves. Think of outsourcing storage, sometimes to the “cloud”, but also of outsourcing the digitization of collections, web harvesting or the creation of access tools. But even then, you cannot outsource your responsibility. So it is important to have a clear idea of what the organization wants to achieve (policies!) and to derive from that what to expect from these third parties.

Digital preservation research developments

Since 2001 the European Commission has supported research in digital preservation with 94 million euros. Shaman, Planets and SCAPE are some examples of projects in which digital preservation policies played a role. One area of research follows from the fact that the sheer amount of digital material will force organizations to introduce automated ways of handling it, for example an automated ingest procedure or an automated migration action, including quality control. This is only possible if there are clear policies, not only at a high level but also at a very detailed level. This is a goal of the SCAPE project.
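To give an impression of what a policy at a very detailed level could look like once it has to drive an automated action, here is a small Python sketch. The policy contents (the formats and the threshold) and the function names are hypothetical and chosen only for illustration; they do not represent the SCAPE policy model or any institution's actual policy.

```python
# Hypothetical detailed policy: every name and value here is illustrative.
POLICY = {
    # Source format -> target format for an automated migration action.
    "formats_to_migrate": {"image/tiff": "image/jp2"},
    # Quality control: largest difference between original and migrated
    # file that is still accepted (0.0 means no difference is tolerated).
    "max_allowed_difference": 0.0,
}


def plan_action(mime_type, policy=POLICY):
    """Decide from the policy what should happen to a file of this format."""
    target = policy["formats_to_migrate"].get(mime_type)
    if target is None:
        return ("keep", mime_type)
    return ("migrate", target)


def passes_quality_control(measured_difference, policy=POLICY):
    """Check a measured difference against the policy threshold."""
    return measured_difference <= policy["max_allowed_difference"]
```

The point of the sketch is that the software makes no decisions of its own: if the policy does not state which formats to migrate and which quality threshold to apply, the automated action has nothing to act on.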

Policies: not just paperwork

To work well, organizational policies should be built into workflows and processes, so that they become part of the “organizational genes” and are more than a paperwork exercise. This ideal situation, however, will need some more work and research!

This blog is an abbreviated version of my presentation at the 2nd Liber preservation workshop in Florence on 6-7 May 2012 (see for the slides