A Beginner’s Guide to Using a Supercomputer

The following article is a transcript of the video tutorial above.

So you have decided to use a supercomputer to run your data analysis or simulation faster and more efficiently. In this post we discuss one of the most frequent questions we get: what does accessing a supercomputer actually look like?

By the end, you will be able to run a simple program on LUMI, one of the fastest supercomputers in the world and also one of the greenest.

The EuroHPC Joint Undertaking has deployed a number of supercomputers across multiple European countries, and companies and public organizations in Europe are eligible to access them for free. If you don't know how, check our video.

Disclaimer


Different supercomputers have different procedures, so remember to consult the documentation and the system admin team once you get access. You will also need a basic understanding of Unix shell commands.


1. Authentication

The first thing we have to do is authenticate our computer. This is done by generating SSH keys in your terminal, regardless of your operating system. There are a few ways we can do this; for the sake of this article we'll use the following command.

ssh-keygen -t rsa -b 4096
Bash

This generates a 4096-bit RSA key. There is the option to customise the name of the key files, but in this article we'll stick with the defaults. We enter a passphrase of our choice and move on.
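The prompts look roughly like this (the exact path depends on your system; pressing Enter accepts the default file location):

Enter file in which to save the key (/home/YOU/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Bash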

We can see that a hidden folder named .ssh has been created in our home folder with a pair of SSH key files. When we access the folder we see that one is called id_rsa and the other id_rsa.pub, where "pub" stands for "public". The public file is the one that you're going to send to the supercomputer to identify yourself.

You can see the contents of the file using

cat id_rsa.pub
Bash

You can see that it is a text file containing a string of letters and digits.

In the case of LUMI you copy the text and paste it into MyAccessID, which is the login system LUMI is using. Remember to share your ".pub" file and not the other one, which must be kept private and always on your PC. Within a couple of hours your computer will be authorized to access LUMI. The LUMI team will send you the username with which you will be able to log in.

The computer that you generated the SSH keys with must be the one you use for accessing the supercomputer. To use another computer you will have to generate new SSH keys on that computer using the same procedure and upload the new ".pub" file.


2. Logging in

To log into the supercomputer we type

ssh -i ~/.ssh/id_rsa USERNAME@lumi.csc.fi
Bash

We are now logged into LUMI. We can also see relevant information regarding documentation and user support along with important announcements.

By typing pwd we see that we are located in our user folder. We can also type ls to list all the files. If this is your first time accessing the supercomputer, this folder will be empty.
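For example (the path contains your own username):

pwd
/users/USERNAME
Bash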


3. Copying files

We have a hello.py file on our desktop which prints a greeting and the current date and time.
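The transcript doesn't show the file itself, but a minimal hello.py matching that description could look like this:

# hello.py - prints a greeting and the current date and time
from datetime import datetime

print("Hello from LUMI!")
print("Current date and time:", datetime.now())
Python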

To copy it to the supercomputer we open a new local terminal instance and type the following.

scp FILE_PATH/file USERNAME@lumi.csc.fi:/users/USERNAME
Bash

scp stands for "secure copy"; it copies our file, currently on our desktop, to our user folder on LUMI. Typing ls in the LUMI terminal, we see that our file has now been transferred to LUMI.
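If you ever need to transfer a whole folder instead of a single file, scp takes the -r (recursive) flag in the same way:

scp -r FOLDER_PATH USERNAME@lumi.csc.fi:/users/USERNAME
Bash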


4. Nodes and partitions

Nodes

The moment you log into a supercomputer you are accessing the login node. This is the first point of access on a supercomputer. The login nodes are for administering your jobs, not for running computations.

The compute nodes do all the hard computational work, and you need to access them in order to run your jobs. You can imagine each node as a separate computer, like a powerful workstation but without the screen and keyboard. Each node is connected to the other nodes with a high-speed network inside the cluster. You reach the full potential of a supercomputer by accessing multiple nodes and running your jobs in parallel; that is why it is often called parallel computing.

Partitions

Partitions are groups of nodes that are reserved for specific uses. This helps you and the admin team with node availability and with structuring your work, because multiple users access the supercomputer simultaneously.

To get an overview of the partitions we write

sinfo -s
Bash
PARTITION   AVAIL  TIMELIMIT   NODES(A/I/O/T) NODELIST
standard       up 2-00:00:00  914/66/191/1171 nid[001002-001524,001526-001723,001725-001847,001849-002023,002028-002152,002154-002180]
small          up 3-00:00:00    249/19/69/337 nid[002181-002214,002216-002218,002220-002243,002248-002523]
interactive    up    8:00:00          0/4/0/4 nid[002524-002527]
debug          up      30:00          0/1/7/8 nid[002528-002535]
lumid          up 1-00:00:00          0/8/0/8 nid[000016-000023]
largemem       up 1-00:00:00          0/6/2/8 nid[000101-000108]
standard-g     up 2-00:00:00 1287/1343/103/2733 nid[005110-006253,006256-006410,006412-006518,006520-007340,007342-007493,007496-007682,007684-007806,007808-007851]
small-g        up 3-00:00:00      92/91/5/188 nid[005024-005109,007852-007953]
dev-g          up    3:00:00        2/46/0/48 nid[005000-005023,007954-007977]
q_nordiq       up      15:00          0/1/0/1 nid001001
q_fiqci        up      15:00          0/1/0/1 nid002153
Bash

We see all the different partitions with their node counts and other info; the NODES(A/I/O/T) column shows allocated, idle, other, and total nodes. Depending on your work, you may choose different partitions accordingly.

You can access the compute nodes in two ways: interactively via the salloc command, and non-interactively through batch scripts.


5. Running jobs with salloc

salloc gives you direct access to the number of compute nodes that you specify. It is usually used to quickly reserve a few nodes when you want to test, verify, or debug your code.

For instance let's say we want to access one compute node with salloc to debug our code. We write the following

salloc --nodes=1 --time=00:30:00 --account=project_465000485 --partition=debug
Bash

This command will book one node for 30 minutes on the "debug" partition, using the project number we have on LUMI.

With the command below you can check which allocations you currently have.

squeue -u USERNAME
Bash
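The output is a table along these lines (the job ID, elapsed time, and node name here are only illustrative):

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  123     debug interact USERNAME  R       0:42      1 nid002528
Bash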

Now that we have allocated a node for our project let's run our hello.py on LUMI.

First, we need to load a Python module to run the file, because at the moment LUMI doesn't know which programming language or software we're going to use. To find out which Python modules are available we type

module avail python
Bash

After seeing the list, we choose the Python module of our choice and load it using

module load NAME-OF-MODULE
Bash
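On LUMI, for example, the list typically includes a cray-python module, so the command could look like this (the exact module name and version vary by system, so go by the module avail output):

module load cray-python
Bash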

Now that Python is loaded, we can run the file by typing

srun python hello.py
Bash

With srun, the command runs on the compute node that we allocated with salloc. We immediately see the output of our program in the terminal.
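When you are done testing, it is good practice to release the allocation rather than let it run out. Typing exit leaves the salloc session, and scancel cancels a job by the ID that squeue reports:

exit
scancel JOBID
Bash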


6. Running jobs with batch scripts

Slurm is a job scheduler: it queues your job and starts it once the number of compute nodes you specify becomes available. We create the batch file by typing the following

nano batch_job.sh
Bash

We can use any name of our choice for the batch job; in this case we name it "batch_job.sh". We can also use any text editor we want, but here we use Nano so we don't have to leave the terminal.

A simple batch script can look like this

#!/bin/bash -l
#SBATCH --job-name=hello_batch          # Job name
#SBATCH --output=test_job.o%j           # Name of stdout output file
#SBATCH --error=test_job.e%j            # Name of stderr error file
#SBATCH --partition=standard            # Partition (queue) name
#SBATCH --nodes=4                       # Total number of nodes
#SBATCH --time=2:00:00                  # Run time (d-hh:mm:ss)
#SBATCH --account=project_465000485     # Project for billing

module load NAME-OF-PYTHON-MODULE
srun python hello.py 
ShellScript

At the top we have the name of the batch job, the output file where we're going to save our results, and an error file that contains possible run errors. We also see the partition that we want to use, in this case "standard", the number of nodes we need, the duration of the allocation, and the project number. The script will run the same file as before once the four nodes that we specified become available for the time that we entered.

Now we are set to run the batch script. We type

sbatch batch_job.sh
Bash

Whenever LUMI has four free CPU nodes our batch job will start and save the results to the .o output file in our user folder.
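You can follow the job with squeue and, once it has finished, read the output file. The %j in the script expands to the job ID, so the file name will differ from run to run (the 123 here is illustrative):

squeue -u USERNAME
cat test_job.o123
Bash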

There you go: we have now run a simple Python program on LUMI, one of the world's fastest supercomputers, for free.
