The repository has appropriate expertise to address technical data and metadata quality and ensures that sufficient information is available for end users to make quality-related evaluations
Data archived and managed by ESPRI are checked to ensure compliance with accepted formats. Data should be in netCDF format whenever possible and follow the metadata conventions and criteria used by the environmental science user communities. As detailed in the ESPRI Data Management Plan, quality control procedures rely on dedicated tools, including:
- The CF-checker,
- The NASA Ames Checker,
- The CMIP and CORDEX Quality Assurance tool from the DKRZ,
- The ESGF PrePARE tool (part of the Climate Model Output Rewriter – CMOR).
All of these community-developed tools allow the ESPRI repository to check metadata and ensure the creation/acquisition of reusable, self-describing datasets, that is, datasets containing sufficient metadata for each variable in the file to have an associated description of what it represents, including physical units, spatial location, provenance, citation, etc.
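As an illustration of what "self-describing" means in practice, the following minimal sketch reports variables that lack descriptive attributes. It is not the logic of the CF-checker or of any other tool listed above: the function name, the choice of `units` and `long_name` as required attributes, and the example variables are assumptions made for this example, and the attributes are given as plain dictionaries rather than read from a netCDF file.

```python
# Hypothetical illustration only; not the actual logic of the CF-checker.
# Assumed minimal set of attributes every variable must carry.
REQUIRED_ATTRIBUTES = ("units", "long_name")


def find_undescribed_variables(variables):
    """Return {variable name: [missing attribute names]} for incomplete variables.

    `variables` maps a variable name to a dict of its metadata attributes.
    An attribute counts as missing when it is absent or blank.
    """
    report = {}
    for name, attributes in variables.items():
        missing = [
            key
            for key in REQUIRED_ATTRIBUTES
            if not str(attributes.get(key, "")).strip()
        ]
        if missing:
            report[name] = missing
    return report


# Made-up example: "pr" has a description but no physical units.
example_variables = {
    "tas": {"units": "K", "long_name": "Near-Surface Air Temperature"},
    "pr": {"long_name": "Precipitation"},
}
missing_report = find_undescribed_variables(example_variables)
print(missing_report)
```

In this example only `pr` is reported, because its `units` attribute is absent.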
Compliance with data standards and formats is not enough to ensure the completeness and understandability of the data. ESPRI does not necessarily have the capacity to verify the data itself. The scientific validation of the data is therefore left to the discretion of the producers by default. Nevertheless, ESPRI manages observations from instrument measurements that may have acquisition discontinuities. Climate simulations can also encounter writing issues due to simulation failures. The processing performed on the raw data may encounter anomalies or errors. These are all potential sources of erroneous or invalid data in the files. ESPRI implements manual or automatic procedures that consist of:
- Monitoring the completeness of the archives, by identifying and filling in missing files (i.e., gaps in a time series);
- Checking the physical integrity of the data, by flagging empty files or files with parameters outside their valid geophysical range.
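The two kinds of checks listed above can be sketched as follows. This is a simplified illustration, not ESPRI's actual procedure: it assumes one file per day, and the function names, dates, values and validity bounds are all hypothetical.

```python
from datetime import date, timedelta


def find_missing_days(available_days, first_day, last_day):
    """Return the days in [first_day, last_day] for which no file exists.

    Assumes one file is expected per day (an assumption of this sketch).
    """
    present = set(available_days)
    total_days = (last_day - first_day).days + 1
    expected = (first_day + timedelta(days=offset) for offset in range(total_days))
    return [day for day in expected if day not in present]


def find_out_of_range(values, valid_min, valid_max):
    """Return (index, value) pairs lying outside [valid_min, valid_max].

    NaN values are also reported, since every comparison with NaN is false.
    """
    return [
        (index, value)
        for index, value in enumerate(values)
        if not (valid_min <= value <= valid_max)
    ]


# Completeness check on made-up dates: 3 and 5 January have no file.
available = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4)]
gaps = find_missing_days(available, date(2020, 1, 1), date(2020, 1, 5))
print(gaps)

# Range check on made-up temperatures in kelvin; the bounds are illustrative.
temperatures_kelvin = [271.3, 285.0, -9999.0, 290.2]
suspect = find_out_of_range(temperatures_kelvin, 150.0, 350.0)
print(suspect)
```

The fill value `-9999.0` is reported here because it falls outside the assumed range; a real check would take the valid range from the dataset's own metadata or from community conventions.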
Depending on the data stream, these data quality checks are done manually or automatically. In either case, ESPRI implements tools for monitoring the data quality processes and systematically stores the quality control logs.
Preventing errors in close collaboration with the scientific community is therefore an important lever for guaranteeing data quality. Managing errors upstream of data generation is often complicated, as it is mostly the responsibility of the scientists in charge of the data. There is no automated notification at this time, but the depositor is contacted by email when an error is raised during quality control.
Conversely, ESPRI implements post-quality control procedures, working closely with the scientific managers/producers of the data. For example, multi-decadal and multi-variable series of measurements (satellite and in situ, for instance) exist to study and better understand climate variability. However, they remain difficult to use because they have not been harmonized, requalified and formatted in a consistent manner over time. Consequently, a scientific and technical framework for the processing and quality control of long time series has been developed (the Reobs product) and applied to in situ observatory data. The final product contains quality information inside the files, together with values corrected for errors or outliers.
Regarding climate simulation data, ESPRI developed and provides an Errata Service that centralizes timely information about known issues affecting ESGF data. Version changes should be documented and justified by explaining what was updated, retracted and/or removed. Consequently, the publication (or unpublication) of a new version of a dataset has to be motivated by an issue, and conversely every issue has to be tied to a version change. The Errata Service enables ESGF users:
- To query the modifications and/or corrections applied to their data through a user-friendly search interface, based on a dedicated API, that returns the version history of a (set of) file(s)/dataset(s);
- To give feedback on data and metadata in a “ticket” fashion, which automatically notifies the data producers, who are then responsible for validating (or not), documenting and correcting the issue.
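The two-way link between version changes and issues described above can be illustrated with a small consistency check. This sketch does not use or reproduce the Errata Service API: the `VersionChange` record, the `check_issue_links` function and all identifiers are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionChange:
    """Hypothetical record of a dataset version being published or unpublished."""

    dataset_id: str
    version: str
    issue_id: Optional[str]  # None means no issue documents this change


def check_issue_links(changes, issue_ids):
    """Return (changes without a known issue, issues without any change)."""
    known = set(issue_ids)
    # A change is undocumented if it names no issue or an unknown one.
    undocumented = [change for change in changes if change.issue_id not in known]
    referenced = {change.issue_id for change in changes}
    orphan_issues = sorted(known - referenced)
    return undocumented, orphan_issues


# Made-up example: one documented change, one undocumented, one orphan issue.
changes = [
    VersionChange("dataset-a", "v2", "issue-1"),
    VersionChange("dataset-b", "v3", None),
]
undocumented, orphan_issues = check_issue_links(changes, ["issue-1", "issue-2"])
print(undocumented)
print(orphan_issues)
```

Here the change to `dataset-b` is reported because no issue motivates it, and `issue-2` is reported because no version change refers to it.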
Data quality is also enhanced by associated documentation about the data or about the results of the quality controls themselves. ESPRI staff manually check whether documentation is associated with each observational dataset distributed through its meta-catalogs. Documentation of the climate simulations is mostly the responsibility of the data producers. ESPRI implements facilities to centralize and access such documentation. For instance, ESPRI coordinates the development of the Earth System Documentation (ES-DOC), which aims to nurture an ecosystem of tools & services in support of Earth System documentation creation, analysis and dissemination. Data documentation from ES-DOC is automatically linked to the datasets published on the ESGF.