Data integrity and authenticity:
The repository guarantees the integrity and authenticity of the data
The dataset content hosted by ESPRI consists of one or several files containing geo-time-referenced tabular data, mainly (but not exclusively) in netCDF format. netCDF is self-describing: a file carries information about the data it contains through variable and/or global attributes.
At the request of the IPSL community and partners, ESPRI replicates data to build a large subset of climate modeling and observational data. This replication strategy supports data integrity because it relies only on well-identified sources (e.g., ESGF, Copernicus Climate Data Store) that have their own depositor selection and authentication mechanisms. ESPRI data also comes from recognized “authorities” such as institutional partners within common scientific projects. Each data depositor has to be a registered and validated user of the ESPRI system. Only validated users can deposit and/or access data. Unix groups allow write access to disk to be delegated in an organized manner.
In addition, the ingestion process computes a checksum (MD5 or SHA-256) for each incoming file or dataset. Checksums are verified only when data is transferred: data integrity is ensured by comparing the computed checksum with the one published alongside the metadata records, when available. The data replication software is developed by ESPRI engineers and includes checksum verification as described in the ESPRI Data Management Plan. For instance, checksums have been systematically computed for climate model data since 2011. Data from the ESGF is downloaded through an ESGF download client, which systematically computes checksums.
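The checksum comparison described above can be illustrated with a minimal sketch. It is not the ESPRI replication software: the function names `compute_checksum` and `verify_transfer` and the three status labels are hypothetical, and only the Python standard library is assumed.

```python
import hashlib


def compute_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)  # accepts "md5" or "sha256"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path, published_checksum=None, algorithm="sha256"):
    """Compare the computed checksum with the published one, when available.

    Returns (computed_checksum, status). The status labels are illustrative:
    "verified", "mismatch", or "unverified" when no checksum was published.
    """
    computed = compute_checksum(path, algorithm)
    if published_checksum is None:
        return computed, "unverified"
    if computed == published_checksum.strip().lower():
        return computed, "verified"
    return computed, "mismatch"
```

A transfer whose status is "mismatch" would typically be retried or reported, while the computed checksum can be stored in either case for later reference.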
Most of the data hosted by ESPRI is produced by the IPSL or comes from reference sources (e.g., ESGF). It is therefore not necessary to check the authenticity of this data. Other datasets produced outside IPSL are collected by ESPRI as part of a partnership, a research project or an agreement between ESPRI and the depositor. In this case, the source of the data is known and the collection or production protocol is established in collaboration with ESPRI. In this context, ESPRI has decided not to set up a separate protocol for verifying the authenticity of the data, which would be redundant with the work carried out upstream.
Unlike long-tail data, the ESPRI reference catalogs are managed by ESPRI staff only. All datasets in the ESPRI data archive are accessible to users through read-only mounts, so users cannot modify or alter the data. ESPRI staff are responsible for data conformance to standards (e.g., the Climate and Forecast Conventions), apply the appropriate data reference syntax (DRS) when one is available (e.g., CMIP, CORDEX), and prepare the datasets with all required data provenance information through:
- the directory structure,
- a metadata card,
- a netCDF attribute that redirects to a dedicated documentation page (e.g. the Earth System Documentation for climate modeling data).
Data provenance information from the filename syntax and the metadata is also checked to:
- guarantee the authenticity of the replicated data,
- report to the depositors any issue discovered by IPSL users.
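As an illustration of checking provenance information from the filename syntax against the metadata, the sketch below splits a filename into facets and reports any facet that disagrees with the metadata. It assumes the CMIP6 filename template (variable, table, source, experiment, member and grid label, followed by an optional time range); the function names are hypothetical, and the metadata is passed as a plain dictionary rather than read from a netCDF file.

```python
from pathlib import PurePath

# Facet order assumed from the CMIP6 filename template; other projects
# (e.g. CORDEX) use a different DRS and would need their own tuple.
CMIP6_FACETS = (
    "variable_id", "table_id", "source_id",
    "experiment_id", "member_id", "grid_label",
)


def parse_filename(filename, facets=CMIP6_FACETS):
    """Split a DRS-style filename into a facet dictionary."""
    parts = PurePath(filename).stem.split("_")
    if len(parts) not in (len(facets), len(facets) + 1):
        raise ValueError(f"unexpected number of facets in {filename!r}")
    parsed = dict(zip(facets, parts))
    if len(parts) == len(facets) + 1:
        parsed["time_range"] = parts[-1]  # optional trailing element
    return parsed


def find_mismatches(filename, metadata, facets=CMIP6_FACETS):
    """Return {facet: (value_in_filename, value_in_metadata)} for each disagreement."""
    parsed = parse_filename(filename, facets)
    return {
        facet: (parsed[facet], metadata.get(facet))
        for facet in facets
        if metadata.get(facet) != parsed[facet]
    }
```

An empty result means the filename and the metadata agree on every facet; a non-empty one lists what would be reported back to the depositor.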
The history of dataset versions for climate modeling data is recorded and centralized through an errata service developed by ESPRI and made publicly available. Data producers may request privileged access to register any issue raised by users. Each action (e.g., an issue status update) is thus dated and signed. Depending on the severity of the issue, the data producer can unpublish the deprecated dataset version; the errata information is stored and persisted in any case. Although CMIP and CORDEX producers are encouraged to use the errata service, it can easily accommodate data other than climate modeling data.
When a depositor publishes a new version of a dataset, ESPRI staff:
- replicates the new version,
- verifies that its checksum differs from that of the previous version,
- updates the directory structure accordingly.
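The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the function `ingest_new_version`, the one-directory-per-version layout and the version label format are hypothetical and are not taken from the ESPRI tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_new_version(new_file, dataset_root, version, previous_checksum):
    """Replicate a new file version into <dataset_root>/<version>/.

    Raises ValueError if the new file is byte-identical to the previous
    version, and FileExistsError if the version directory already exists.
    Returns the checksum of the new version.
    """
    new_file = Path(new_file)
    checksum = sha256_of(new_file)
    if checksum == previous_checksum:
        raise ValueError("new version has the same checksum as the previous one")
    target_dir = Path(dataset_root) / version  # hypothetical layout
    target_dir.mkdir(parents=True, exist_ok=False)  # never overwrite a version
    shutil.copy2(new_file, target_dir / new_file.name)
    return checksum
```

Refusing to reuse an existing version directory keeps earlier versions intact, which matches the read-only nature of the archive.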
Finally, data traceability can be guaranteed by attaching a Digital Object Identifier (DOI) or a Persistent IDentifier (PID) to a file or a dataset, at least when requested by the DRS. These identifiers are unique and immutable throughout the life cycle of the data. They are persisted in the appropriate database even if the data is retracted or completely unpublished (e.g., PIDs attached to climate simulations are registered at DKRZ in Germany during ESGF publication) and thus maintain a permanent link to the metadata. The PIDs used in the CMIP6 context also make it possible to track the lineage between the existing versions of a dataset.
ESPRI is registered with CNRS-INIST, which is DataCite’s French research correspondent. ESPRI is therefore authorized to issue DataCite DOIs.
The repository follows norms and standards for metadata and ensures that the metadata meets at least the mandatory editing standards.