The repository enables reuse of the data over time, ensuring that appropriate metadata are available to support the understanding and use of the data.
Data reuse is one of the pillars of Open Science, which ESPRI integrates into its IT strategy and the evolution of its services. To this end, ESPRI applies the FAIR principles as far as possible in the management of its data and infrastructure, in particular the reusability principle. Archiving strategies, which depend on the data stream and are detailed in the ESPRI Data Management Plan, support data reuse. In addition, the IT strategy for the ESPRI repository is set out in the Digital Strategic Plan (see relevant links).
In this context, to help users decide whether the data is useful for their application, ESPRI provides the metadata that allow discovery, as well as all the information needed to understand the data. The netCDF format is well suited to these requirements. ESPRI staff use netCDF to standardize climate simulation data and, increasingly, observational datasets, implementing standards such as the Climate and Forecast (CF) conventions or the Attribute Convention for Data Discovery (ACDD), depending on the type of data.
netCDF is a self-describing, machine-independent format whose attributes describe the metadata (axes, grid description, geographical projection, etc.), the data itself (measurement, variable name, units, time frequency, etc.) and the context in which the data was generated (instrument, model, experiment, interpolation method, etc.). Depending on this context, or at the request of the depositor, netCDF files contain different types of provenance information for users, such as the date of generation or collection of the data, the name and version of the software used, and whether the data is raw or processed. The CF convention requires each variable to be described by its short, long and standard names together with a self-explanatory description (i.e., one defined in the research field’s controlled vocabulary). If available, the netCDF attributes also include the dataset version, a persistent identifier, a URL for further information (e.g., known issues documented in an errata), the history of processing command lines, the DOI URL, how to cite the data and whom to acknowledge.
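As an illustration of the kind of metadata described above, the following minimal sketch checks whether a set of global and per-variable attributes is complete. It is not ESPRI's actual tooling: the function name `missing_attributes`, the two attribute lists and the example values are assumptions chosen for illustration, although the attribute names themselves are drawn from the CF and ACDD conventions. The attributes are represented as plain dictionaries so that the sketch does not depend on any particular netCDF library.

```python
# Hypothetical attribute lists, chosen for illustration only;
# they are not ESPRI's actual ingestion policy.
RECOMMENDED_GLOBAL_ATTRS = (
    "title", "summary", "Conventions", "history", "source", "id", "license",
)
RECOMMENDED_VARIABLE_ATTRS = ("long_name", "standard_name", "units")


def missing_attributes(global_attrs, variables):
    """Map 'global' or a variable name to the list of attributes it lacks.

    An attribute counts as missing if it is absent or empty.
    """
    report = {}
    missing = [n for n in RECOMMENDED_GLOBAL_ATTRS if not global_attrs.get(n)]
    if missing:
        report["global"] = missing
    for var_name, attrs in variables.items():
        missing = [n for n in RECOMMENDED_VARIABLE_ATTRS if not attrs.get(n)]
        if missing:
            report[var_name] = missing
    return report


# Example values are invented for the purpose of the sketch.
example_global = {
    "title": "Example surface air temperature dataset",
    "Conventions": "CF-1.8, ACDD-1.3",
    "history": "2024-01-01: created by a hypothetical processing script",
    "source": "hypothetical model output",
}
example_variables = {
    "tas": {
        "long_name": "Near-Surface Air Temperature",
        "standard_name": "air_temperature",
        "units": "K",
    },
    "pr": {"long_name": "Precipitation"},
}

report = missing_attributes(example_global, example_variables)
for scope, names in report.items():
    print(f"{scope}: missing {', '.join(names)}")
```

In practice the dictionaries could be filled from the attributes of an opened netCDF file, and the attribute lists would be adapted to the type of data and to the convention versions in use.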
Although the primary data format supported by ESPRI is netCDF, particular attention is paid to respecting the data formats and metadata standards of the other communities that use ESPRI facilities and submit data to the ESPRI repository. ESPRI manages and distributes multiple types of data used by a variety of science communities. To ensure data reusability, these formats must allow metadata to be stored along with the data itself, in the same way as netCDF, as is the case for HDF5, LIPD and NASA Ames, recently supported by ESPRI. Consequently, whatever the recommended format, the metadata is stored alongside the data. For instance, the datasets from climate simulations produced by the IPSL and held by ESPRI are periodically reused for the IPCC reports.
To take the evolution of formats into account, we rely only on international standards (such as netCDF) and follow the evolution of metadata conventions, enriching them if necessary. This was the case for the implementation of the recent ACDD standard and of the successive versions of the CF convention. The transition from netCDF3 to netCDF4 was followed and implemented for the new datasets managed by ESPRI. For future developments, ESPRI is investigating the ZARR format for the storage of chunked, compressed, N-dimensional arrays. ZARR is a new storage format which, thanks to its simple yet well-designed specification, makes large datasets easily accessible to distributed computing and allows metadata to be stored alongside the data arrays.