
Cheatsheet for EuroHPC machines

Starring: Slurm, Lmod, and friends

Learn Slurm in Y minutes

Inspired by https://learnxinyminutes.com

Slurm is the de facto job scheduler used on most supercomputers across the world. It does two important things:

  • managing the workload to maximize utilization of the whole supercomputer, and
  • keeping an internal tally of usage per user and per project, which in turn determines your “priority”, so that fair sharing of the compute resources is achieved.

Let’s dive into how to use Slurm as a user. Assume here that the project “account” is named project_042 and the compute “partition” is called dev-g.

####################################################
# The basics
####################################################

# Query status of all your current jobs
squeue --me
# If you have submitted a job already (which we will come to soon),
# the output of this command will look something like the following,
# with each row describing the job id (JOBID), state (ST) etc. of
# every job.
#
#   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
#  424242     dev-g interact waynebru  R       0:03      1 nid005151

# Query status of the whole cluster
sinfo

####################################################
# Interactive jobs
####################################################

# One-liner to drop into a compute node
srun --account project_042 --partition dev-g --time 15:00 --gpus=1 --pty bash
# ... or the same using the short-form of the command-line arguments
srun -A project_042 -p dev-g -t 15:00 -G 1 --pty bash
# Note that some machines may also require you to pass the `--qos`
# argument.
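
# Aside: the value of `--time` follows Slurm's duration syntax, and it
# can surprise newcomers: `15:00` means 15 *minutes*, because a value
# without a day part is read as M, M:S or H:M:S, while `1-12` means one
# day and twelve hours. The helper below is a hypothetical sketch (not a
# Slurm tool) that mirrors the documented formats, useful for sanity
# checking a time spec in your own scripts.

slurm_time_to_seconds() {
  # Accepted forms: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S
  local spec=$1 days=0 rest=$1 seconds=0
  if [[ $spec == *-* ]]; then       # a day count precedes the dash
    days=${spec%%-*}
    rest=${spec#*-}
  fi
  local IFS=':'
  read -r -a f <<< "$rest"
  # 10# forces base-10 so fields like "08" are not read as octal
  if [[ $spec == *-* ]]; then       # after days: H, H:M or H:M:S
    case ${#f[@]} in
      1) seconds=$(( 10#${f[0]} * 3600 ));;
      2) seconds=$(( 10#${f[0]} * 3600 + 10#${f[1]} * 60 ));;
      3) seconds=$(( 10#${f[0]} * 3600 + 10#${f[1]} * 60 + 10#${f[2]} ));;
    esac
  else                              # no days: M, M:S or H:M:S
    case ${#f[@]} in
      1) seconds=$(( 10#${f[0]} * 60 ));;
      2) seconds=$(( 10#${f[0]} * 60 + 10#${f[1]} ));;
      3) seconds=$(( 10#${f[0]} * 3600 + 10#${f[1]} * 60 + 10#${f[2]} ));;
    esac
  fi
  echo $(( days * 86400 + seconds ))
}

slurm_time_to_seconds 15:00       # 900    (15 minutes)
slurm_time_to_seconds 02:00:00    # 7200   (2 hours)
slurm_time_to_seconds 1-12:00:00  # 129600 (1.5 days)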

# The salloc + srun combo:
# Allocate a compute job
salloc -A project_042 -p dev-g -t 15:00 --nodes 1 --gpus 1
# ... and then send commands to it remotely from the login node, e.g.
# rocm-smi to check GPU usage
srun rocm-smi
# By default the commands will be relayed to and executed on the compute
# node. Alternatively, you can also use `squeue --me`, check the name
# of the compute node and then connect to it directly via ssh.
ssh nid005151
# ... then within the compute node, run as usual
rocm-smi
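
# To script the "find my node, then ssh" step: the node list is the last
# column of squeue's default output. A small sketch parsing a captured
# sample row; on a real cluster you would pipe `squeue --me --noheader`
# into awk instead of echoing a stored line.
sample='424242     dev-g interact waynebru  R       0:03      1 nid005151'
node=$(echo "$sample" | awk '{print $NF}')  # NODELIST(REASON) is the last field
echo "$node"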

####################################################
# Non-interactive or batch jobs
####################################################
sbatch jobscript.sh
# sbatch also accepts most of the same flags as srun and salloc.
# More on the contents of a job script later.
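
# As a quick preview: the flags passed to srun/salloc above can instead
# live inside the script as #SBATCH comments. Bash ignores them; sbatch
# parses them. A minimal sketch, reusing the example project_042 / dev-g
# names from above (adapt to your machine):
#!/bin/bash
#SBATCH --job-name hello
#SBATCH --account project_042   # <-- replace with your project
#SBATCH --partition dev-g
#SBATCH --time 15:00
#SBATCH --gpus 1
#SBATCH --ntasks 1

# The body is a normal shell script, executed on the allocated node.
msg="Hello from $(hostname)"
echo "$msg"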

####################################################
# Getting help
####################################################
man sbatch  # <-- or any command name

####################################################
# Cancelling jobs
####################################################

# Cancel all jobs launched by you, whatever their state ("pending", "running", etc.)
scancel --me

# Cancel a job with a specific jobid 424242
scancel 424242  # <-- replace with JOBID

# Bonus: Inspecting a job, even past ones
scontrol show job 424242
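
# scontrol prints the job record as KEY=VALUE pairs, which makes single
# fields easy to grab in scripts. A sketch using a shortened, captured
# sample; on the cluster, pipe the real `scontrol show job <id>` output
# in instead of echoing a stored line.
sample='JobId=424242 JobName=interact JobState=COMPLETED ExitCode=0:0'
state=$(echo "$sample" | tr ' ' '\n' | awk -F= '$1 == "JobState" {print $2}')
echo "$state"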

Learn Lmod in Y minutes

a.k.a. the module system / command

The ml command loads and unloads packages, compilers, libraries, environment variables and more: everything you need to bootstrap your application in an HPC environment.

####################################################
# Module system
####################################################

# List currently loaded modules
module list
# ... or simply
ml

# Search for software modules matching the wildcard "python"
module spider python
# ... or the shorter form
ml spider python
# ... or for a specific module and a specific version
ml spider cray-python/3.11.7
# This command also prints a short help text describing the module,
# and lists any dependent modules the user should load first,
# prior to loading the module itself.

# Loading a single module
module load cray-python/3.11.7
# ... or the shorter form
ml cray-python/3.11.7
# ... or the shortest form
ml cray-python
# Note, however, that loading specific versions is useful
# for reproducibility.

# Loading multiple modules
ml LUMI systools/24.03 cray-python/3.11.7

# Unloading / Removing a single module
ml rm cray-python

# Purge / Removing all modules
ml purge

# Save a list of currently loaded modules into ~/.lmod.d
ml save my_modules

# Restoring a list of modules
ml restore my_modules
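
# Bonus: Lmod exports the currently loaded modules in the LOADEDMODULES
# environment variable as a colon-separated list, which is handy for
# sanity checks inside job scripts. A sketch with a hypothetical helper
# name and a simulated environment:

# module_loaded NAME: succeed if NAME (any version) is in the list.
module_loaded() {
  case ":${LOADEDMODULES:-}:" in
    *":$1:"* | *":$1/"*) return 0 ;;  # bare name or name/version
    *) return 1 ;;
  esac
}

# Simulated value for the demo; on a cluster Lmod sets this itself.
LOADEDMODULES="LUMI/24.03:systools/24.03:cray-python/3.11.7"
module_loaded cray-python && echo "cray-python is loaded"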

Job script starter kit

Now that you understand the sbatch command, it is time to see what a basic job script looks like. For the purposes of this demo, we assume the app is deployed in a containerised fashion as follows:

singularity exec my_container.sif python3 app.py

Here my_container.sif is a singularity image containing a Python interpreter and all the necessary dependencies, such as deep-learning frameworks, CUDA/ROCm runtimes etc. Keep in mind that containers need to be tailored to the specific machine they run on. This is especially true for LUMI, where a special plugin for collective communication is needed to take advantage of the Slingshot interconnect (see e.g. here). Let’s launch a job named my_first_job using a single GPU on a EuroHPC JU machine 🇪🇺.

MareNostrum5 🇪🇸

#!/bin/bash
#SBATCH --job-name my_first_job
#SBATCH --account ehpc042  # <-- replace with your project number
#SBATCH --time 02:00:00
#SBATCH --partition acc
#SBATCH --ntasks 1
#SBATCH --gres=gpu:1 # GPUs per node
#SBATCH --qos=acc_debug  # <-- use `acc_ehpc` for production runs
#SBATCH --cpus-per-task=20
#SBATCH --mail-type=ALL # If you want to get an email when the job starts/ends

ml singularity
singularity exec --nv my_container.sif python3 app.py

See also: https://www.bsc.es/supportkc/docs/MareNostrum5/slurm

MeluXina 🇱🇺

On MeluXina, modules are available only on the compute nodes, and you are always charged for the full compute node even if you request a single GPU.

#!/bin/bash -l
#SBATCH --job-name my_first_job
#SBATCH --account p000042  # <-- replace with your project number
#SBATCH --time 06:00:00
#SBATCH --partition gpu
#SBATCH --ntasks 1
#SBATCH --gpus 4 # Total number of GPUs
#SBATCH --qos=dev  # <-- use `default` for production runs
#SBATCH --cpus-per-task=64
#SBATCH --mail-type=ALL # If you want to get an email when the job starts/ends

ml env/release/latest Singularity-CE
singularity exec --nv my_container.sif python3 app.py

See also: https://docs.lxp.lu/first-steps/quick_start/

Leonardo 🇮🇹

#!/bin/bash
#SBATCH --job-name my_first_job # Name of the job
#SBATCH --account EUHPC_DXX_042 # Project number
#SBATCH --time 00:05:00 # Wall time
#SBATCH --partition boost_usr_prod # Partition
#SBATCH --ntasks 1 # Number of tasks
#SBATCH --gres=gpu:1 # GPUs per node
#SBATCH --qos boost_qos_dbg # <-- use `normal` for production runs
#SBATCH --cpus-per-task 8
#SBATCH --mail-type=ALL # If you want to get an email when the job starts/ends


singularity exec --nv my_container.sif python3 app.py

LUMI 🇫🇮

#!/bin/bash
#SBATCH --job-name my_first_job
#SBATCH --account project_000000042
#SBATCH --time 00:45:00
#SBATCH --partition dev-g
#SBATCH --gpus 1 # Total number of GPUs
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --mail-type=ALL # If you want to get an email when the job starts/ends

module use /appl/local/containers/ai-modules
module load singularity-AI-bindings

singularity exec my_container.sif python3 app.py

EuroHPC JU machine Rosetta

Every JU machine has some machinery that allows users to monitor usage of their allocation, both in terms of storage and computational resources. A guide to the different types of storage can be found here.

                          MareNostrum5 🇪🇸    MeluXina 🇱🇺    Leonardo 🇮🇹    LUMI 🇫🇮
Computational resources   bsc_acct            myquota         saldo -b        lumi-allocations*
Storage                   bsc_quota           myquota         cinQuota        lumi-quota*

* lumi-workspaces shows both
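
The quota tools above print human-readable size pairs like 2.2G/22G. For your own monitoring scripts, here is a small hypothetical helper (not part of any of these tools) that turns such a pair into a usage percentage, assuming binary units and uppercase K/M/G/T suffixes:

usage_percent() {
  echo "$1" | awk -F/ '
    # bytes("2.2G"): numeric prefix times the binary unit of the suffix
    function bytes(s,    n, u) {
      n = s + 0
      u = substr(s, length(s), 1)
      if (u == "K")      n *= 1024
      else if (u == "M") n *= 1024 ^ 2
      else if (u == "G") n *= 1024 ^ 3
      else if (u == "T") n *= 1024 ^ 4
      return n
    }
    { printf "%.0f%%\n", 100 * bytes($1) / bytes($2) }'
}

usage_percent 2.2G/22G    # 10%
usage_percent 3.2T/55T    # 6%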

MareNostrum5 🇪🇸

bsc_acct # Computational resources used
Output
---------------------------------------------------------------
CPU GROUP BUDGET:
Project: my_project (ehpcXXX)
Project's Expiration Date:     2026-04-25

Machine:                    Total (classA,classB)[khours]:          Used [khours]:

Marenostrum5 GPP                                 undefined                    0.00
Marenostrum5 ACC                        280.00 (0.00,0.00)              0.78 (00%)
---------------------------------------------------------------

USER CONSUMED CPU:
User:                                             Machine:          Used [khours]:

user1 (Name Surname)                            Marenostrum5 GPP                0.00(0%)
user2 (Name Surname)                            Marenostrum5 ACC               0.26(33%)
---------------------------------------------------------------



Accounting updated on 2025-11-05 11:30:12

$ bsc_quota # Storage usage

 Printing quota for group ehpcXXX:

    Filesystem   Type          Usage          Quota          Limit     In doubt     Grace  |       Files  In doubt
     gpfs_home    USR        5.60 GB       80.00 GB       84.00 GB      0.00 KB      None  |          91         0
 gpfs_projects    GRP       13.59 GB     1000.00 GB        1.03 TB      0.00 KB      None  |       43128         0
  gpfs_scratch    GRP      130.01 GB     1000.00 GB        1.03 TB      0.00 KB      None  |         187         0

 For information regarding /gpfs/tapes, run this command from a Storage 5 node (transfer[1..4].bsc.es).

 For an in-depth explanation of the output, check the User's Guide.

MeluXina 🇱🇺

myquota
Output
COMPUTE ALLOCATIONS FOR CURRENT MONTH
COMPUTE USAGE FROM  2025-11-01 to NOW

Project                              CPU node-hours             GPU node-hours            FPGA node-hours           LargeMem node-hours
                                   Used   Max   %used         Used   Max   %used         Used   Max   %used         Used   Max   %used
pXXXXXX                             0                          0      65     0.0%         0                          0

                                                         DATA ALLOCATIONS

 Datapath                                      Data GiB                           No. Files
                                       Used |   Max   | Use%             Used |     Max    | Use%
/home/users/uXXXXXX                      48 |     100 |  48%             7665 |     100000 |   7%
/project/home/pXXXXXX                   267 |     500 |  53%           169896 |    1000000 |  16%

See also: https://docs.lxp.lu/first-steps/quick_start/

Leonardo 🇮🇹

saldo -b # Computational resources
Output
-----------------------------------------------------------------------------------------------------------------------------------------

account                start         end         total        localCluster   totConsumed     totConsumed     monthTotal     monthConsumed
                                             (local h)   Consumed(local h)     (local h)               %      (local h)         (local h)
-----------------------------------------------------------------------------------------------------------------------------------------
EUHPC_DXX_XXX       20240927    20250927        144000                 206           206             0.1             0                  0
EUHPC_DXX_XXX       20250826    20260826        144000                  48            48             0.0         11835                  0
$ cinQuota # Storage
--------------------------------------------------------------------------------------------------------------

 	Filesystem				 used		 quota		 grace		 files

--------------------------------------------------------------------------------------------------------------
 /leonardo/home/userexternal/username          	8.203G		 50G		 -		 5579
 /leonardo_scratch/large/userexternal/username 	4k		 0k		 -		 1
 /leonardo_work/EUHPC_DXX_XXX                  	261.6G		 1T		 -		 142937
 /leonardo_scratch/fast/EUHPC_DXX_XXX          	6.926M		 1T		 -		 341
 /leonardo_work/EUHPC_DXX_XXX                  	9.367G		 1000G		 -		 57598
 /leonardo_scratch/fast/EUHPC_DXX_XXX          	4k		 1T		 -		 1
 /leonardo/pub/userexternal/username           	4k		 50G		 -		 1

--------------------------------------------------------------------------------------------------------------

Leonardo uses a concept of “equivalent hours” for the different resources, as explained in their documentation.

See also: https://docs.hpc.cineca.it/hpc/hpc_scheduler.html

LUMI 🇫🇮

lumi-allocations # Compute
Output
Data updated: 2025-11-06 10:25:35
Project             |                    CPU (used/allocated)|               GPU (used/allocated)|           Storage (used/allocated)
--------------------------------------------------------------------------------------------------------------------------------------
project_XXXXXXXXX   |              8579/0    (N/A) core/hours|          1690/0    (N/A) gpu/hours|          23406/0    (N/A) TB/hours
project_YYYYYYYYY   |            174/1000  (17.4%) core/hours|        81/18000   (0.4%) gpu/hours|         45/90000   (0.1%) TB/hours
$ lumi-quota # Storage 

Disk area                          Capacity(used/max)  Files(used/max)
----------------------------------------------------------------------
Personal home folder
Home folder is hosted on lustrep3

/users/username                              2.2G/22G         32K/100K
----------------------------------------------------------------------
Project: project_XXXXXXXXX
Project is hosted on lustrep3

/projappl/project_XXXXXXXXX                   52G/54G        397K/500K
/scratch/project_XXXXXXXXX                   3.2T/55T        871K/2.0M
/flash/project_XXXXXXXXX                      54G/2.2T       168K/1.0M
----------------------------------------------------------------------
Project: project_YYYYYYYYY
Project is hosted on lustrep4

/projappl/project_YYYYYYYYY                   26G/54G         14K/100K
/scratch/project_YYYYYYYYY                   247G/55T         83K/2.0M
/flash/project_YYYYYYYYY                     4.1K/2.2T          1/1.0M
----------------------------------------------------------------------
$ lumi-workspaces # Shows both

See also: https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/
