User Guide for ESPRI Data services
The primary purpose of the ESPRI data repository is to serve the scientific and academic community. ESPRI provides IPSL laboratories and their partners with access to various observational datasets and numerical simulations through a central data repository infrastructure. Data analysis is facilitated by the associated HPC computing facilities. ESPRI centralizes many data products of interest to the Earth science community, such as climate model simulations, satellite products, ground-based datasets, operational analyses and forecasts, and reanalyses.
This document is a “quick-start” guide that introduces users to ESPRI data and services. Additional details can be found in the ESPRI data management plan.
We offer different data services to the scientific community, as described below. For certain services, it’s necessary to create an account on the ESPRI mesocentre.
Data discovery and access facilities depend on the type of data and user profiles. Users with access to ESPRI computing center resources can benefit from advanced data access interfaces or direct access to the file system hosting the data trees.
Once connected via SSH to the Spirit/X clusters, you have access to all the community data centralised in the /bdd root directory.
Depending on the dataset, an embargo period can be set up to restrict access to a specific group of users. In this case, specific authentication and authorizations filter access through Unix permissions. The duration of the embargo period is defined according to the rules of the governing institutions. Typically, access is restricted to principal investigators for the first 6 months. It is then extended to scientific partners from 6 months up to two years after data acquisition, and finally opened to the public.
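Because restricted datasets are filtered through Unix permissions, you can check from Python whether your account is currently allowed to read a given directory before launching a long job. The following is a minimal sketch using only the standard library; the dataset path is a hypothetical placeholder, not an actual ESPRI directory.

```python
import os


def can_read(path):
    """Return True if the current user can read `path`.

    For a directory, both read and execute (traverse) permissions are
    required. A missing path simply yields False.
    """
    if os.path.isdir(path):
        mode = os.R_OK | os.X_OK
    else:
        mode = os.R_OK
    return os.access(path, mode)


if __name__ == "__main__":
    # Hypothetical dataset directory, used only for illustration.
    dataset = "/bdd/SOME_DATASET"
    if can_read(dataset):
        print(f"{dataset} is readable with your current Unix groups")
    else:
        print(f"{dataset} is missing or restricted for your account")
```

A negative answer does not say why access is refused: the dataset may be under embargo, or the path may simply not exist.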
ESPRI provides access to resources through a JupyterHub that is focused on research activities, including:
- the Pangeo suite,
- access to ESPRI data catalogues,
- job submission on the SpiritX cluster.
intake is a lightweight set of tools for loading and sharing data in data science projects through Python environments.
It helps you to:
- Discover data by being agnostic to its storage location and archive organisation.
- Load data from a variety of formats into containers such as Pandas dataframes, Python lists, NumPy arrays, etc.
- Describe and store datasets in catalogue files for easy reuse and sharing between projects and with others.
- Get native aggregations on different dimensions (time, members).
For the moment, the ESPRI intake catalogue only serves climate simulation data. It can be accessed via the following path: /bdd/intake-catalogs/cmip-cordex.yml
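As an illustration, the catalogue file can be opened from a Python session on the cluster or on the JupyterHub. This is a minimal sketch, assuming that the `intake` package is installed in your environment and that the file is a standard intake YAML catalogue; the helper names `open_espri_catalog` and `entry_names` are ours and are not part of intake.

```python
import os

# Path given in this guide.
CATALOG_PATH = "/bdd/intake-catalogs/cmip-cordex.yml"


def open_espri_catalog(path=CATALOG_PATH):
    """Open the YAML catalogue with intake (imported lazily)."""
    import intake  # assumption: intake is installed in the environment

    return intake.open_catalog(path)


def entry_names(catalog):
    """Return the sorted names of the top-level entries of a catalogue.

    Works with any object that yields entry names when iterated over.
    """
    return sorted(catalog)


if __name__ == "__main__" and os.path.exists(CATALOG_PATH):
    cat = open_espri_catalog()
    for name in entry_names(cat):
        print(name)
```

Listing the entries does not load any data; what each entry returns when loaded depends on how the catalogue is written.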
Data and workspaces directory structure
Each user has three workspaces by default on each cluster (IPSL-X or IPSL-SU):
- Home workspace, for sources, executables, etc.
  - $HOME, i.e. the /home/$USER directory
- User data directory, for your own small datasets, intermediate data, tests, etc.
  - /homedata/$USER/ (on MESO IPSL-X)
  - /data/$USER/ (on MESO IPSL-SU)
- Scratch folder, for temporary files, model checkpoints and data reorganisation.
  - /scratchu/$USER/ (on MESO IPSL-SU)
  - /scratchx/$USER/ (on MESO IPSL-X)
|Mesocentre|Home|User data|Scratch|
|---|---|---|---|
|Mesocentre IPSL-X (SPIRITX)|32 GB (300 000 files)|1 TB (300 000 files)|2 TB (300 000 files)|
|Mesocentre IPSL-SU (SPIRIT)|32 GB (300 000 files)|1 TB (300 000 files)|2 TB (300 000 files)|
To check how much of your disk quota you are using, run the “quotas” command.
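To find out which of your directories weighs most against the quota, a rough per-directory estimate can be computed in Python. The sketch below only counts regular directory entries and their apparent sizes, so its figures may differ from what the file system actually accounts for; the “quotas” command remains the reference. The 300 000 file limit is the one listed in this guide, and the function name is ours.

```python
import os
import sys

FILE_LIMIT = 300_000  # file-count limit listed in this guide


def count_usage(root):
    """Return (number_of_files, total_bytes) for the files under `root`.

    Symbolic links are not followed, and unreadable entries are skipped.
    """
    n_files = 0
    n_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is not accessible
            n_files += 1
            n_bytes += st.st_size
    return n_files, n_bytes


if __name__ == "__main__" and len(sys.argv) > 1:
    files, size = count_usage(sys.argv[1])
    print(f"{files} files ({files / FILE_LIMIT:.1%} of the file limit)")
    print(f"{size / 1e9:.2f} GB")
```

Pass the directory to inspect as the first command-line argument, for example one sub-directory of your user data space at a time.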
Data directories on /bdd
The community data centralised in the root directory /bdd (from the French “base de données”, i.e. database) are managed entirely by ESPRI, which is responsible for maintaining both the storage space and the data.
You can access the data on /bdd directly from your programs without having to copy them to your user space. Note that it’s better to always use the absolute path /bdd in your code, because we might move the data to underlying file systems.
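One way to follow this advice in Python is to build every data path from the /bdd root and never to store a resolved path (as returned by `os.path.realpath` or `Path.resolve()`), since a resolved path could point at the underlying file system rather than at /bdd. A minimal sketch, in which the helper name and the example dataset components are hypothetical:

```python
from pathlib import PurePosixPath

BDD_ROOT = PurePosixPath("/bdd")


def bdd_path(*parts):
    """Build an absolute path under /bdd from relative components."""
    rel = PurePosixPath(*parts)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"expected a path relative to {BDD_ROOT}: {rel}")
    return BDD_ROOT / rel


if __name__ == "__main__":
    # Hypothetical dataset layout, used only for illustration.
    print(bdd_path("SOME_DATASET", "2020", "file.nc"))
```

The returned object can be passed to `open()` or to most data libraries, directly or after conversion with `str()`.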
Each cluster has remote access to all file systems of the other cluster, but only in read-only mode. To improve data access, it is advisable to work on the cluster (IPSL-X or IPSL-SU) closest to the dataset you want to use.
For datasets with restricted access, send an access request to our support: email@example.com.
Some datasets are too big to be hosted in our datacenter (CMIP, etc.). They are accessed remotely from the French national research computing centres (IDRIS and TGCC). Access may be interrupted if the national computing centre is undergoing maintenance, or experiencing infrastructure or network problems.
The ESPRI repository holds different data types, for which ESPRI manages the entire data lifecycle and which it catalogues through different data portals:
- IPSL data portal for multi-thematic data sets,
- ESGF portal for the climate simulations carried out by the IPSL modeling center and the climate simulations replicated for the CLIMERI-France Research Infrastructure,
- AERIS data portal for atmospheric observations carried out by the AERIS/DATA TERRA Research Infrastructure.
Data replication (on-demand)
The ESPRI teams offer their users the opportunity to enrich the community datasets upon request. ESPRI has powerful download tools (up to around 4-5 TB/day) and ensures that the data archives on /bdd are managed in a uniform and shared way with permanent links to the latest version of the datasets where possible.
As ESPRI does not have the capacity to host all data archives, each request will be examined on the basis of the following criteria:
- relevance to the climate observation or simulation data needs of IPSL research and of its contribution to IPCC reports,
- overlap with recurring requests from other users,
- available disk space.
Conversely, if you have data in your user space that is likely to be of interest to other scientists, we can load it into /bdd and free up your disk space quota.
Requests for data replication/completion information should be sent to firstname.lastname@example.org.
Data sharing for scientific projects
ESPRI offers the possibility to create “project spaces” for research projects of its users on dedicated storage spaces. These spaces are:
- sized according to needs and available funding;
- accessible in read and write mode to a group of users to be defined;
- managed by users in accordance with ESPRI best practice.
In particular, users are required to:
- Avoid duplication with datasets managed by ESPRI,
- Use code management tools (e.g. Git),
- Use /home spaces for scripts and algorithms,
- Share only project data (not /home, /data or /scratch),
- Plan ahead for the migration of the data produced to ESPRI community spaces.
At the request of users, these spaces can be made accessible via two THREDDS data servers (TDS). These servers provide real-time, dynamic data access (i.e. without creating catalogues) through various remote access protocols (such as OPeNDAP, OGC WMS and WCS, HTTP). These THREDDS servers also provide access to the “shared spaces” from the TGCC and IDRIS HPC centres.
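On a THREDDS server, each protocol is usually exposed under its own URL prefix. The sketch below builds such URLs, assuming the default THREDDS service prefixes (`dodsC` for OPeNDAP, `fileServer` for HTTP download, `wms` and `wcs`); a given server may be configured differently. The host name and dataset path are hypothetical placeholders, not actual ESPRI endpoints.

```python
from urllib.parse import quote

# Default THREDDS service prefixes (assumption: server uses the defaults).
TDS_SERVICES = {
    "opendap": "dodsC",
    "http": "fileServer",
    "wms": "wms",
    "wcs": "wcs",
}


def tds_url(base, service, dataset_path):
    """Build the URL of `dataset_path` for one TDS access service.

    `base` is the server root including the "/thredds" context.
    """
    prefix = TDS_SERVICES[service]
    path = quote(dataset_path.lstrip("/"))
    return f"{base.rstrip('/')}/{prefix}/{path}"


if __name__ == "__main__":
    # Hypothetical server and file, used only for illustration.
    url = tds_url("https://thredds.example.org/thredds", "opendap",
                  "my_project/output.nc")
    print(url)
    # With xarray installed, an OPeNDAP URL can be opened lazily:
    #   import xarray
    #   ds = xarray.open_dataset(url)
```

With OPeNDAP, only the slices of data that are actually read are transferred, which suits large files better than a full HTTP download.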
Data registration with DOI (Digital Object Identifiers)
If you need to obtain a specific DOI on an identified dataset for a scientific publication, please use the Easydata portal of the DATA TERRA infrastructure.
Security and data preservation
Data stored on /bdd is read-only for all users.
The data in /bdd is protected against disk failures because /bdd is built on redundant hard disks, which makes the system highly robust. However, not all data is systematically backed up. If the data cannot be recovered elsewhere, it is up to the data producer to explicitly request archiving in order to guarantee its security and longevity. Different levels of backup are possible and are described in the ESPRI Data Management Plan. The appropriate level is chosen according to the type of data.