H-Net: Preserving and Improving Access to Specialized Electronic Mailing List Archives
H-Net Digital Preservation Policies and Procedures
Digital Preservation Strategies for the H-Net E-Mail Lists
MATRIX and H-Net have developed the following long-term digital preservation strategies for the H-Net e-mail lists:
- Creation of archival copies
- Format preservation strategies
- Fixity measures for ensuring integrity and authenticity
Creation of Archival Copies
The H-Net e-mail list archive running on the MATRIX servers is considered the primary copy of the data--the "living archive" from which users may access messages and to which new records are added on a continual basis. As noted in Information Security for Digital Assets at MATRIX, MATRIX creates regular, redundant backups of this server-based data to ensure ongoing access. MATRIX also creates and maintains archival copies of the H-Net data separate from the rest of the server-based data. On an annual basis, MATRIX copies the following onto archival-quality LTO tapes using GNU Tar archiving software:
- H-Net notebook files containing messages posted during the previous calendar year
- Associated metadata, including the log browse cache and fixity database
- A text file containing provenance information for the archival copy
- Browse and search software developed by H-Net to provide web access to the message postings
- Supporting documentation for the browse and search software
Two hardcopy inserts are included with the tape. One is a description of how the tape was created, including the technology used, and the other is a printout of the provenance information contained in the text file on the tape. Archival tapes will be sampled annually to ensure readability and are on a five-year media refreshment schedule. For more detailed information on the archival copies of the H-Net e-mail lists, refer to Archival Copies of H-Net.
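As an illustration only, the following Python sketch shows how an annual archival copy of this kind could be assembled into a GNU-format tar archive together with a provenance text file. It is not the MATRIX procedure, which uses the GNU Tar software directly to write to LTO tape. The function name build_archival_copy, the member name PROVENANCE.txt, and the directory layout are assumptions made for the example.

```python
import io
import tarfile
from pathlib import Path
from typing import List


def build_archival_copy(source_dir: Path, archive_path: Path,
                        provenance_text: str) -> List[str]:
    """Write source_dir and a provenance file into a GNU-format tar archive.

    Returns the sorted names of the regular files found when the finished
    archive is re-opened, as a basic readability check.
    """
    with tarfile.open(archive_path, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Recursively add notebook files, metadata, software, and documentation.
        tar.add(source_dir, arcname=source_dir.name)

        # Add the provenance information as a plain text member.
        # "PROVENANCE.txt" is a hypothetical name chosen for this sketch.
        data = provenance_text.encode("utf-8")
        info = tarfile.TarInfo(name=f"{source_dir.name}/PROVENANCE.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

    # Re-open the archive and list its regular files.
    with tarfile.open(archive_path, mode="r") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Re-opening the archive after writing it mirrors, on a small scale, the readability sampling described above. A real workflow would also record the tool version and the tape identifier in the provenance text.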
Format Preservation Strategies
The H-Net e-mail lists include two types of information to preserve: the messages and attachments to those messages. Different approaches to preservation were considered for these different types of information, and it was determined that no file normalization, conversion, or migration strategies were required at this time.
H-Net E-Mail Messages
All e-mail messages submitted to the H-Net lists are required to be written in plain text formats such as ASCII and UTF-8. The content in the notebook files--which consists of seven-day compilations of messages that include headers containing metadata--is also in plain text. (Refer to H-Net Message Ingest, Storage, and Retrieval Processes for details on the composition of H-Net notebook files.) As ASCII and other plain text formats are well-documented, easily accessible, and considered to be archival formats, the messages and notebook files are likely to remain accessible over time, and no normalization or migration strategy is required for them.
Attachments to H-Net E-Mail Messages
Although attachments are not allowed on the public H-Net e-mail lists, the private lists face no such restriction. The majority of attachments on the private lists are in Microsoft Office, PDF, and JPEG formats. While these formats are all ubiquitous at present, it is recognized that best practices in digital preservation recommend normalizing proprietary Microsoft Office formats into open formats, such as the OpenDocument formats used by OpenOffice, and converting PDF files into the archival PDF/A format. Attachments occur in less than 0.01 percent of all messages on the H-Net lists, however, and Microsoft Office and PDF are documented, readily available formats. JPEG is also a publicly documented, widely used format. Therefore, H-Net and MATRIX will provide only bit-level preservation of the attachments at this time. Archives containing the private list messages, including attachments, will be made available to subscribers to those lists and others who have access privileges. If a user encounters any difficulty in opening an archived attachment, H-Net administrators will provide assistance.
Fixity Measures for Ensuring Integrity and Authenticity
As described in the International Research on Permanent Authentic Records in Electronic Systems (InterPARES) guidelines, custodians of electronic records must ensure that the records are kept free of tampering and corruption. MATRIX and H-Net are committed to ensuring the integrity and authenticity of messages on the H-Net e-mail lists through the active and ongoing use of cryptographic hash functions. Within 24 hours of a message's posting, the SHA-256 message digest algorithm is used to establish fixity for that message. The SHA-256 message hashes are stored in a database and used to perform fixity checks when a notebook file closes. If the hashes reconcile, the closed notebook file receives its own SHA-256 hash. All notebook file hashes are stored in the fixity database, and notebook files are validated on a weekly basis using message digest calculations. Refer to Ensuring the Integrity of the H-Net E-Mail Lists for a more detailed explanation of how H-Net message and notebook fixity is established and checked.
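The fixity workflow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the H-Net implementation: the fixity database is stood in for by a plain dictionary, and the function names (sha256_hex, find_mismatches, close_notebook, validate_notebook) are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def find_mismatches(messages: Dict[str, bytes],
                    stored_hashes: Dict[str, str]) -> List[str]:
    """Return the IDs of messages that fail to reconcile.

    A message fails if its current hash differs from the stored hash,
    or if it appears in only one of the two mappings.
    """
    all_ids = set(messages) | set(stored_hashes)
    return sorted(
        mid for mid in all_ids
        if mid not in messages
        or stored_hashes.get(mid) != sha256_hex(messages[mid])
    )


def close_notebook(notebook_bytes: bytes,
                   messages: Dict[str, bytes],
                   stored_hashes: Dict[str, str]) -> Optional[str]:
    """Hash a closed notebook file only if every message hash reconciles."""
    if find_mismatches(messages, stored_hashes):
        return None  # Do not certify a notebook containing altered messages.
    return sha256_hex(notebook_bytes)


def validate_notebook(notebook_bytes: bytes, stored_notebook_hash: str) -> bool:
    """Periodic check: recompute the notebook hash and compare to the stored one."""
    return sha256_hex(notebook_bytes) == stored_notebook_hash
```

In this sketch, a message hash would be written to stored_hashes shortly after posting, close_notebook would run when the seven-day notebook file closes, and validate_notebook would run on the weekly schedule. How a failed check is reported or repaired is outside the scope of the example.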
Last Revised July 2009