The following article is a transcript of the video tutorial above.
So you have decided to use a supercomputer to make your data analysis or simulation faster and more efficient. In this post we discuss one of the most frequent questions we get: what does accessing a supercomputer look like?
By the end, you will be able to run a simple program on LUMI, one of the fastest supercomputers in the world and also one of the greenest.
The EuroHPC Joint Undertaking has deployed a number of supercomputers across several European countries, and companies and public organizations in Europe are eligible to access them for free. If you don’t know how, check our video.
Disclaimer
Different supercomputers have different procedures so remember to consult the documentation and the system admin team once you get access. You will also need some basic understanding of Unix shell commands.
1. Authentication
The first thing we have to do is authenticate our computer. This is done by generating SSH keys in your terminal, regardless of your operating system. There are a few ways to do this; for the sake of this article we’ll use the following command.
ssh-keygen -t rsa -b 4096
This generates a 4096-bit RSA key pair. There is the option to customise the names of the key files, but in this article we’ll stick with the defaults. We choose a passphrase and move on.
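If you do want a custom file name (the name lumi_key below is just an example, and the newer Ed25519 key type shown here is an alternative to RSA; check the LUMI documentation for which types are accepted), a sketch:

```shell
# Generate an Ed25519 key pair under a custom name ("lumi_key" is a made-up example)
ssh-keygen -t ed25519 -f ~/.ssh/lumi_key
# This creates ~/.ssh/lumi_key (private) and ~/.ssh/lumi_key.pub (public);
# a custom key is passed to ssh later with: ssh -i ~/.ssh/lumi_key ...
```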
We can see that a hidden folder named .ssh has been created in our home folder, containing a pair of SSH key files. When we access the folder we see that one is called id_rsa and the other id_rsa.pub, where “.pub” stands for “public”. The public key is the one you send to the supercomputer to identify yourself.
You can see the contents of the file using
cat id_rsa.pub
You can see that it is a text file containing a long string of letters and digits.
In the case of LUMI you copy the text and paste it into MyAccessID, the login system LUMI uses. Remember to share your “.pub” file and not the other one, which must be kept private and always stay on your PC. Within a couple of hours your computer will be authorized to access LUMI, and the LUMI team will send you the username with which you will be able to log in.
The computer you generated the SSH keys on must be the one you use for accessing the supercomputer. To use another computer you will have to generate new SSH keys on that computer using the same procedure and upload the new “.pub” file.
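To avoid typing the full ssh command every time, you can also add a host entry to your local SSH configuration file. The alias lumi below is just an example; USERNAME stands for the username the LUMI team sends you:

```
# ~/.ssh/config — "lumi" is an example alias
Host lumi
    HostName lumi.csc.fi
    User USERNAME
    IdentityFile ~/.ssh/id_rsa
```

With this in place, typing ssh lumi is enough to log in.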
2. Logging in
To log into the supercomputer we type
ssh -i ~/.ssh/id_rsa USERNAME@lumi.csc.fi
We are now logged into LUMI. We can also see relevant information regarding documentation and user support along with important announcements.
By typing pwd we see that we are located in our user folder. We also type ls to list all the files. If this is your first time accessing the supercomputer, this folder will be empty.
3. Copying files
We have a hello.py file on our desktop which prints a greeting and the current date and time.
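The exact contents of hello.py are not shown in the video; a minimal version matching the description (a greeting plus the current date and time) could look like this, created here straight from the shell:

```shell
# Create a minimal hello.py that prints a greeting and the current date and time
cat > hello.py <<'EOF'
from datetime import datetime

print(f"Hello from LUMI! The time is {datetime.now():%Y-%m-%d %H:%M:%S}")
EOF
# Quick local check before copying it over
python3 hello.py
```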
To copy it to the supercomputer we open a new local terminal instance and type the following.
scp FILE_PATH/file USERNAME@lumi.csc.fi:/users/USERNAME
scp stands for “secure copy”; it copies our file, currently on our desktop, to our user folder on LUMI. Typing ls on the LUMI terminal, we see that our file has now been transferred to LUMI.
4. Nodes and partitions
Nodes
The moment you log into a supercomputer you are accessing the login node. This is the first point of access on a supercomputer. The login nodes are for managing your jobs, not for running computations.
The compute nodes are the ones that do all the hard computational work, and you need to access them in order to run your jobs. You can imagine each node as a separate computer, like a powerful workstation but without the screen and keyboard. Each node is connected to the other nodes by a high-speed network inside the cluster. You reach the full potential of a supercomputer by accessing multiple nodes and running your jobs in parallel. That is why it’s often called parallel computing.
Partitions
Partitions are groups of nodes reserved for specific uses. Because many users access the supercomputer simultaneously, partitions help you and the admin team with node availability and with structuring your work.
To get an overview of the partitions we write
sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
standard up 2-00:00:00 914/66/191/1171 nid[001002-001524,001526-001723,001725-001847,001849-002023,002028-002152,002154-002180]
small up 3-00:00:00 249/19/69/337 nid[002181-002214,002216-002218,002220-002243,002248-002523]
interactive up 8:00:00 0/4/0/4 nid[002524-002527]
debug up 30:00 0/1/7/8 nid[002528-002535]
lumid up 1-00:00:00 0/8/0/8 nid[000016-000023]
largemem up 1-00:00:00 0/6/2/8 nid[000101-000108]
standard-g up 2-00:00:00 1287/1343/103/2733 nid[005110-006253,006256-006410,006412-006518,006520-007340,007342-007493,007496-007682,007684-007806,007808-007851]
small-g up 3-00:00:00 92/91/5/188 nid[005024-005109,007852-007953]
dev-g up 3:00:00 2/46/0/48 nid[005000-005023,007954-007977]
q_nordiq up 15:00 0/1/0/1 nid001001
q_fiqci up 15:00 0/1/0/1 nid002153
We see all the different partitions with their availability numbers and other info; the NODES column lists the Allocated/Idle/Other/Total counts. Depending on your work you may choose different partitions accordingly.
You can access the compute nodes in two ways: interactively via the salloc command, and non-interactively through batch scripts.
5. Running jobs with salloc
salloc gives you direct access to the number of compute nodes you specify. It is usually used to quickly reserve a few nodes when you want to test, verify, or debug your code.
For instance, let’s say we want to access one compute node with salloc to debug our code. We write the following
salloc --nodes=1 --time=00:30:00 --account=project_465000485 --partition=debug
This command will book one node for 30 minutes on the “debug” partition, using the project number we have on LUMI.
With the command below you can check which allocations you currently have.
squeue -u USERNAME
Now that we have allocated a node for our project, let’s run our hello.py on LUMI.
First, we need to load a Python module to run the file, because at the moment LUMI doesn’t know which programming language or software we’re going to use. To find out which Python modules are available we type
module avail python
After seeing the list, we choose the Python module of our choice and load it using
module load NAME-OF-MODULE
Now that Python is loaded, we can run the file by typing
srun python hello.py
With srun we use the compute node that we allocated with salloc. We immediately see the output of our program in the terminal.
6. Running jobs with batch scripts
Slurm is a job scheduler: it queues your job and starts it once the number of compute nodes you specify becomes available. We create the batch file by typing the following
nano batch_job.sh
We can use any name we like for the batch job; in this case we name it “batch_job.sh”. We can also use any text editor we want, but here we use Nano so we don’t have to leave the terminal.
A simple batch script can look like this
#!/bin/bash -l
#SBATCH --job-name=hello_batch # Job name
#SBATCH --output=test_job.o%j # Name of stdout output file
#SBATCH --error=test_job.e%j # Name of stderr error file
#SBATCH --partition=standard # Partition (queue) name
#SBATCH --nodes=4 # Total number of nodes
#SBATCH --time=2:00:00 # Run time (d-hh:mm:ss)
#SBATCH --account=project_465000485 # Project for billing
module load NAME-OF-PYTHON-MODULE
srun python hello.py
At the top we have the name of the batch job, the output file where the results will be saved, and an error file that captures possible run errors. We also specify the partition we want to use, in this case “standard”, the number of nodes we need, the duration of the allocation, and the project number for billing. The script will run the same file as before once the four nodes we requested become available for the time we entered.
Now we are set to run the batch script. We type
sbatch batch_job.sh
Whenever LUMI has four free CPU nodes, our batch job will start and the results will be saved to the .o file in your user folder.
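While waiting, you can check the job’s place in the queue, and once it finishes, read the output file. The %j in the batch script is replaced by the job ID; 123456 below is a made-up example:

```shell
# Is the job still pending (PD) or running (R)?
squeue -u USERNAME
# After completion, stdout lands in test_job.o<jobid> and errors in test_job.e<jobid>
cat test_job.o123456
```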
There you go: we have now run a simple Python program on LUMI, one of the world’s fastest supercomputers, for free.