User Guide for ESPRI Data services

The primary purpose of the ESPRI data repository is to serve the scientific and academic community. ESPRI gives IPSL laboratories and their partners access to observational datasets and numerical simulations through a central data repository. The data can be analysed in place using the associated HPC facilities. ESPRI centralizes many data products of interest to the Earth science community, such as climate model simulations, satellite products, ground-based datasets, operational analyses and forecasts, and reanalyses.

This document is a “quick-start” guide that introduces users to ESPRI data and services. Additional details can be found in the ESPRI data management plan.

We offer different data services to the scientific community, as described below. For certain services, it’s necessary to create an account on the ESPRI mesocentre.

Data access

Data discovery and access facilities depend on the type of data and user profiles. Users with access to ESPRI computing center resources can benefit from advanced data access interfaces or direct access to the file system hosting the data trees.

SSH access

Once connected via SSH to the Spirit or SpiritX cluster, you have access to all the community data centralised under the /bdd root directory.
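As an illustration, the top-level directories under /bdd can be listed from Python once you are logged in. This is a minimal sketch: the function name list_datasets is ours, and the directory names returned depend on what is present on the cluster at the time.

```python
from pathlib import Path


def list_datasets(root="/bdd"):
    """Return the sorted names of the directories directly under `root`."""
    root = Path(root)
    if not root.is_dir():
        # Not on an ESPRI cluster, or the file system is not mounted.
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())
```

Calling print(list_datasets()) on a cluster login node prints the dataset directories visible to your account.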

Depending on the dataset, an embargo period may restrict access to a specific group of users. In that case, access is filtered through Unix permissions, with dedicated authentication and authorizations. The duration of the embargo is defined by the rules of the governing institutions. Typically, access is restricted to the principal investigators for the first 6 months, then extended to scientific partners from 6 months until up to two years after data acquisition, and opened to the public after that.
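Because restricted datasets are filtered through Unix permissions, you can check from Python whether your account can read a given directory before starting a long analysis. The sketch below uses only the standard library. The function name describe_access and the shape of the returned dictionary are our own choices, not an ESPRI tool.

```python
import os
import stat


def describe_access(path):
    """Report whether the current user can read `path`, with its mode and group id."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return {"exists": False, "readable": False, "mode": None, "gid": None}
    except PermissionError:
        # A parent directory blocks traversal, so we cannot tell whether the path exists.
        return {"exists": None, "readable": False, "mode": None, "gid": None}
    return {
        "exists": True,
        "readable": os.access(path, os.R_OK),   # read permission for the current user
        "mode": stat.filemode(info.st_mode),    # e.g. "drwxr-x---"
        "gid": info.st_gid,                     # numeric id of the owning Unix group
    }
```

If "readable" is False for a dataset you need, that is the situation in which an access request to the support team is appropriate.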

JupyterHub

ESPRI provides access to resources through a JupyterHub that is focused on research activities, including:

the Pangeo suite,
access to ESPRI data catalogues,
job submission on the SpiritX cluster.

“intake” catalogues

intake is a lightweight set of tools for loading and sharing data in data science projects through Python environments.

It helps you to:

Discover data by being agnostic to its storage location and archive organisation.
Load data from a variety of formats into containers such as Pandas dataframes, Python lists, NumPy arrays, etc.
Describe and store datasets in catalogue files for easy reuse and sharing between projects and with others.
Get native aggregations on different dimensions (time, members).

For the moment, this catalogue only serves climate simulation data and can be accessed via the following path: /bdd/intake-catalogs/cmip-cordex.yml
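A minimal sketch of opening this catalogue from Python is shown below. It assumes the intake package is installed in your environment (for example in the JupyterHub). The helper name open_espri_catalog is ours, and the names of the catalogue entries are not documented in this guide, so the sketch only lists them.

```python
from pathlib import Path

# Path given in this guide.
CATALOG_PATH = Path("/bdd/intake-catalogs/cmip-cordex.yml")


def open_espri_catalog(path=CATALOG_PATH):
    """Open the intake catalogue, failing early if the file is not visible."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; are you on an ESPRI cluster?")
    import intake  # imported lazily so the check above works without intake
    return intake.open_catalog(str(path))
```

On a cluster, cat = open_espri_catalog() followed by print(list(cat)) shows the entry names, which you can then explore one by one.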

Data and workspaces directory structure

User workspaces

By default, each user has three workspaces on each cluster (IPSL-X and IPSL-SU):

Home workspace for sources, executables, etc.
- $HOME, or /home/$USER directory
User data directory for your own small datasets, intermediate data, tests, etc.
- /homedata/$USER/ (on MESO IPSL-X)
- /data/$USER/ (on MESO IPSL-SU)
Scratch folder for temporary files, model checkpoints, data reorganisation, etc.
- /scratchu/$USER/ (on MESO IPSL-SU)
- /scratchx/$USER/ (on MESO IPSL-X)
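Scripts that must run on both mesocentres can look up the workspace paths rather than hard-coding them. The sketch below encodes the directory layout listed above. The dictionary keys ("home", "data", "scratch") and the function name workspace_paths are hypothetical names chosen for this example.

```python
import getpass

# Directory layout taken from the list above, keyed by mesocentre.
WORKSPACES = {
    "IPSL-X": {"home": "/home/{user}", "data": "/homedata/{user}", "scratch": "/scratchx/{user}"},
    "IPSL-SU": {"home": "/home/{user}", "data": "/data/{user}", "scratch": "/scratchu/{user}"},
}


def workspace_paths(cluster, user=None):
    """Return the three workspace paths of `user` on `cluster`."""
    if cluster not in WORKSPACES:
        raise ValueError(f"unknown cluster {cluster!r}; expected one of {sorted(WORKSPACES)}")
    user = user or getpass.getuser()  # default to the account running the script
    return {name: template.format(user=user) for name, template in WORKSPACES[cluster].items()}
```

For example, workspace_paths("IPSL-X")["scratch"] gives the scratch folder of the current user on IPSL-X.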

Mesocentre IPSL-X (SPIRITX)
Directory	User quota	Backup service
/home	32 GB (300,000 files)	Daily incremental
/homedata	1 TB (300,000 files)	N/A
/scratchx	2 TB (300,000 files)	N/A

Mesocentre IPSL-SU (SPIRIT)
Directory	User quota	Backup service
/home	32 GB (300,000 files)	Daily incremental
/data	1 TB (300,000 files)	N/A
/scratchu	2 TB (300,000 files)	N/A

To check how much of your disk space is in use, run the “quotas” command.
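To see which directory is filling a quota, you can also measure a directory tree yourself. The sketch below is not the “quotas” command: it sums apparent file sizes and counts files, so its figures may differ from the official accounting. The function names and the byte conversion used in the usage note (1 GB = 10**9 bytes) are assumptions.

```python
import os


def usage(path):
    """Return (total_bytes, n_files) for all files under `path`."""
    total_bytes = 0
    n_files = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            n_files += 1
    return total_bytes, n_files


def quota_report(path, max_bytes, max_files):
    """Return the percentage of a size quota and of a file-count quota used by `path`."""
    used_bytes, used_files = usage(path)
    return {
        "bytes_pct": 100 * used_bytes / max_bytes,
        "files_pct": 100 * used_files / max_files,
    }
```

For the /home quota in the tables above, this would be called as quota_report(os.path.expanduser("~"), 32 * 10**9, 300_000).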

Data directories on /bdd

The community data centralised in the root directory /bdd (short for base de données, French for “database”) are databases managed entirely by ESPRI, which is responsible for maintaining the storage space and the data.

You can access the data on /bdd directly from your programs without copying it to your user space. Always use the absolute /bdd path in your code rather than the path of an underlying file system, because we may move the data between the underlying file systems.
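One way to follow this advice in Python is to build every dataset path from a single /bdd constant and never store a resolved (symlink-expanded) path. The sketch below is an illustration under that assumption. The helper name bdd_path is ours, and the dataset and file names in the usage note are placeholders.

```python
from pathlib import PurePosixPath

BDD_ROOT = PurePosixPath("/bdd")


def bdd_path(*parts):
    """Build an absolute path under /bdd without resolving symbolic links."""
    path = BDD_ROOT.joinpath(*parts)
    # An absolute component such as "/etc" would silently replace the root, so reject it.
    if path != BDD_ROOT and BDD_ROOT not in path.parents:
        raise ValueError(f"{path} is not under {BDD_ROOT}")
    return str(path)
```

For example, bdd_path("MY_DATASET", "file.nc") returns "/bdd/MY_DATASET/file.nc". Avoid saving the output of os.path.realpath or Path.resolve() in scripts or configuration files, because that output may point to an underlying file system.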

Each cluster has remote access to all file systems on the other cluster, but only in read-only mode. For faster data access, work on the cluster (IPSL-X or IPSL-SU) closest to the dataset you want to use.

For datasets with restricted access, send an access request to our support: meso-support@ipsl.fr.

Some datasets (CMIP, etc.) are too big to be hosted in our data centre. They are accessed remotely from the French national research computing centres (IDRIS and TGCC). Access may be interrupted if the national computing centre is undergoing maintenance or experiencing infrastructure or network problems.
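Long-running scripts that read remotely hosted datasets can guard against short interruptions by retrying a failed open. The sketch below is one possible approach, not an ESPRI recommendation. The function name open_with_retry and its default values (3 attempts, 60 seconds between them) are arbitrary choices.

```python
import time


def open_with_retry(path, attempts=3, delay_s=60.0, opener=open):
    """Call opener(path), retrying on OSError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return opener(path)
        except OSError:
            if attempt == attempts:
                raise  # give up and report the last error
            time.sleep(delay_s)  # wait before the next attempt
```

The opener argument lets you pass another reader that raises OSError on failure in place of the built-in open. For a maintenance period lasting hours, retrying will not help, and the job should simply be resubmitted later.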

Data discovery

The ESPRI repository manages the entire data lifecycle for several types of data and catalogues them through different data portals:

IPSL data portal for multi-thematic data sets,
ESGF portal for the climate simulations carried out by the IPSL modeling center and the climate simulations replicated for the CLIMERI-France Research Infrastructure,
AERIS data portal for atmospheric observations carried out by the AERIS/DATA TERRA Research Infrastructure.

Data replication (on-demand)

The ESPRI teams offer their users the opportunity to enrich the community datasets upon request. ESPRI has powerful download tools (up to around 4-5 TB/day) and ensures that the data archives on /bdd are managed in a uniform and shared way with permanent links to the latest version of the datasets where possible.
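The quoted download rate gives a rough idea of how long a replication request may take. The sketch below is a back-of-the-envelope estimate that assumes the 4-5 TB/day range is sustained for the whole transfer, which may not hold in practice. The function name transfer_window is ours.

```python
def transfer_window(size_tb, low_rate=4.0, high_rate=5.0):
    """Return (fastest, slowest) transfer time in days for a dataset of `size_tb` terabytes."""
    if size_tb < 0 or low_rate <= 0 or high_rate < low_rate:
        raise ValueError("expected size_tb >= 0 and 0 < low_rate <= high_rate")
    return size_tb / high_rate, size_tb / low_rate
```

For a 20 TB dataset, transfer_window(20) returns (4.0, 5.0), that is, between four and five days of continuous download.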

As ESPRI does not have the capacity to host all data archives, each request will be examined on the basis of the following criteria:

relevance to the climate observation or simulation data needs of IPSL research and of its contribution to IPCC reports,
overlap with recurring requests from other users,
available disk space.

Conversely, if you have data in your user space that is likely to be of interest to other scientists, we can load it into /bdd and free up your disk space quota.

Requests for data replication/completion information should be sent to meso-support@ipsl.fr.

Data sharing for scientific projects

ESPRI can create “project spaces” on dedicated storage for its users’ research projects. These spaces are:

sized according to needs and available funding;
accessible in read and write mode to a group of users to be defined;
managed by users in accordance with ESPRI best practice.

In particular, users are required to:

Avoid duplication with datasets managed by ESPRI,
Use code management tools (e.g. Git),
Use /home spaces for scripts and algorithms,
Share only project data (not /home, /data or /scratch),
Think about migration of data produced in ESPRI community spaces.

At the request of users, these spaces can be made accessible via two THREDDS Data Servers (TDS). These servers provide real-time, dynamic data access (i.e. without creating catalogues) over various remote access protocols (such as OPeNDAP, OGC WMS and WCS, and HTTP). These THREDDS servers also provide access to the “shared spaces” of the TGCC and IDRIS HPC centres.
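The sketch below shows how access URLs for these protocols are typically formed on a THREDDS Data Server. It uses the default TDS service paths (dodsC, fileServer, wms, wcs), and a given server may be configured differently. The host name in the usage note is a placeholder, not the address of an ESPRI server, and the function name thredds_url is ours.

```python
# Default service paths of a THREDDS Data Server (an assumption for any given server).
SERVICE_PREFIX = {
    "opendap": "dodsC",
    "http": "fileServer",
    "wms": "wms",
    "wcs": "wcs",
}


def thredds_url(base, service, dataset_path):
    """Join a TDS base URL, a service name and a dataset path into an access URL."""
    if service not in SERVICE_PREFIX:
        raise ValueError(f"unknown service {service!r}; expected one of {sorted(SERVICE_PREFIX)}")
    return f"{base.rstrip('/')}/{SERVICE_PREFIX[service]}/{dataset_path.lstrip('/')}"
```

For example, thredds_url("https://thredds.example.org/thredds", "opendap", "project/file.nc") returns "https://thredds.example.org/thredds/dodsC/project/file.nc". An OPeNDAP URL of this form can be opened with OPeNDAP-aware tools such as xarray.open_dataset, if they are installed in your environment.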

Data registration with DOI (Digital Object Identifiers)

ESPRI is registered with CNRS-INIST, the French research correspondent of DataCite. ESPRI is therefore authorised to generate Digital Object Identifiers (DOIs) for data sets. To this end, ESPRI uses the IPSL catalogue functions for automatic DOI generation in conjunction with DataCite. Each dataset or numerical code with a DOI is stored in a persistent location accessible via a THREDDS data server. ESPRI hosts a landing page attached to each DOI, which leads to the terms of use and how to cite the data. More information on how to obtain a DOI can be found here.

If you need to obtain a specific DOI on an identified dataset for a scientific publication, please use the Easydata portal of the DATA TERRA infrastructure.

If you need help or more information on how to get a DOI or which portal to use, please contact us via espri-contact@ipsl.fr or meso-support@ipsl.fr.

Security and data preservation

Data stored on /bdd is read-only for all users.
The data in /bdd is protected against disk failure because /bdd is built on redundant hard disks, which makes the system highly robust. Note, however, that not all data is systematically backed up. If your data cannot be recovered elsewhere, it is up to you, as the data producer, to explicitly request archiving to guarantee its security and longevity. Different levels of backup are possible and are described in the ESPRI Data Management Plan. The appropriate level is chosen according to the type of data.
