Slurm in Ubuntu Clusters - Part 2

High Performance Computing 2018-09-24 MonRead Time: 7 min.

Overview

Overview of the slurm infrastructure

In this article we will not go through the installation and configuration of the shared filesystem described above. We assume that the HOME directory is shared between the controller and the compute nodes, in the future I'll write about how to achieve this using NFS (for simplicity) and BeeGFS, for now let us follow with Slurm usage.

Summary of commands

scontrol - used to view and modify Slurm configuration and state.
sacct - displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database.
sinfo - show information about the compute nodes status.
squeue - show information about the scheduler's job queue.
smap - show information about slurm jobs, partitions, and set configurations parameters.
sview - graphical user interface to view and modify the slurm state.
salloc - obtain a Slurm job allocation execute a command, and then release the allocation when the command is finished.
srun - run parallel jobs.
scancel - cancel a submitted job.
sbatch - submit a batch script to Slurm.

Changing the state of a Node

With the scontrol command we can control, among other things, the state of each compute node. If, for example, we need to put an idle node down for maintenance we would do the following:

root@controller:~# scontrol update NodeName=compute-1 State=Down Reason='Maintenance'
root@controller:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 3-00:00:00      1  down* compute-1
base*        up 3-00:00:00      3   idle compute-[2-4]
root@controller:~# scontrol update NodeName=compute-1 State=Idle
root@controller:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 3-00:00:00      4   idle compute-[1-4]
root@controller:~#

If, on the other hand, the node we want to put down is executing jobs (its state is allocated or mix), we need to put it in the draining state. Which means that the node will no longer accept new jobs and when all the running jobs complete, the node's state will be drained, and we can then put it in the down state.

root@controller:~# scontrol update NodeName=compute-1 State=Drain Reason='Maintenance'
root@controller:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 3-00:00:00      1   drng compute-1
base*        up 3-00:00:00      3   idle compute-[2-4]
root@controller:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 3-00:00:00      1  drain compute-1
base*        up 3-00:00:00      3   idle compute-[2-4]
root@controller:~# scontrol update NodeName=compute-1 State=Idle
root@controller:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 3-00:00:00      4   idle compute-[1-4]
root@controller:~#

Executing jobs interactively

To execute jobs interactively, there are usually two steps:

Allocate resources
Run parallel job

As an example, we want to run a BASH shell in a compute cluster. We will ask for just 100MB of RAM and 1 cpu.

username@controller:~$ salloc --cpus-per-task=1 --mem=100MB
salloc: Pending job allocation 27564
salloc: job 27564 queued and waiting for resources
salloc: job 27564 has been allocated resources
salloc: Granted job allocation 27564
username@controller:~$ srun --pty /bin/bash
username@node4:~$ hostname
node4
username@node4:~$ exit
username@node4:~$ exit
salloc: Relinquishing job allocation 27564
username@controller:~$

Batch file

If instead of an interactive job we want a job running on its own for a long amount of time, we can create a batch script and submit it to the slurm scheduler queue.

Ley us take the example code for a simple Hello World! program.

helloworld.c

1
2
3
4
#include <stdio.h>
int main(int argc, char** argv){
  printf('Hello World!\n');
}

Let us compile it.

username@controller:~$ gcc -o helloworld helloworld.c

And now we can create a batch file to submit the job.

simple_hello.batch

1
2
3
4
5
6
#!/bin/bash

WORKDIR=/home/username/helloworld

cd $WORKDIR
./helloword

Let us submit the job with the help of the sbatch command and see its output.

username@controller:~/helloworld$ sbatch simple_hello.batch
Submitted job 12345
username@controller:~/helloworld$ ls
simple_hello.batch   simple_hello-12345.out   simple_hello-12345.err
username@controller:~/helloworld$ cat simple_hello-12345.err
username@controller:~/helloworld$ cat simple_hello-12345.out
Hello World!
username@controller:~/helloworld$

It works!

Now let us try a more complex example, making a C program which will run on multiple nodes, via MPI, and outputs the name of the node it is running on and the rank attributed to it [MPITUTO].

helloworld_mpi.c

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char** argv){
  // Init the MPI environment
  MPI_Init(NULL, NULL);

  // Get the number of processes
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Get the rank of the process
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Get the name of the processor
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int name_len;
  MPI_Get_processor_name(processor_name, &name_len);

  // Print a hello world message
  printf("Hello world from processor %s, rank %d out of %d processors\n",
          processor_name,
          world_rank,
          world_size);

  // sleep for 10 seconds, just enough to see the job executing
  sleep(10);

  // Finalize the MPI environment
  MPI_Finalize();
}

Compile the code with MPI capabilities.

username@controller:~/helloworld$ mpicc -o helloworld_mpi helloworld_mpi.c
username@controller:~/helloworld$ ls
helloworld_mpi helloworld_mpi.c
username@controller:~/helloworld$

helloworld_mpi.batch

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
#!/bin/bash
#SBATCH --job-name=hello-world-mpi
#SBATCH --output=/home/username/helloworld/helloworld_mpi-%j.out
#SBATCH --error=/home/username/helloworld/helloworld_mpi-%j.err
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --time=72:00:00
#SBATCH --mem=1G

PATH=/path/to/mpidir/bin:$PATH
WORKDIR=/home/username/helloworld

cd $WORKDIR
mpiexec -np 8 ./helloworld_mpi

Now we can submit the batch script as a job to queue.

username@controller:~/helloworld$ sbatch helloworld_mpi.sbatch
Submitted batch job 27560
username@controller:~/helloworld$  squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
27560      base hello-w_ username PD       0:00      2 (Resources)
username@controller:~/helloworld$  squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
27560      base hello-w_ username  R       0:05      2 node6,node7

On this sequence we can see that the job was submitted to the queue, and it was waiting for resources to be available and run the job. Then, the job started running and was allocated to node6 and node7.

As the job is finished, let us check the job's output:

username@controller:~/helloworld$ cat helloworld_mpi-27560.err
username@controller:~/helloworld$ cat helloworld_mpi-27560.out
Hello world from processor node7, rank 7 out of 8 processors
Hello world from processor node6, rank 5 out of 8 processors
Hello world from processor node6, rank 4 out of 8 processors
Hello world from processor node6, rank 0 out of 8 processors
Hello world from processor node6, rank 2 out of 8 processors
Hello world from processor node6, rank 3 out of 8 processors
Hello world from processor node6, rank 6 out of 8 processors
Hello world from processor node6, rank 1 out of 8 processors
username@controller:~/helloworld$

As we can see, the job was allocated to 7 of node6's CPUs and 1 of node7's CPUs.

As we can also notice, the order of execution of each process is not deterministic, it has to do with the time the process reaches the processor, and also which other processes may be running and yielding in that same processor.

MPI programming is not really in the scope of this article, however if you are interested you can check these tutorials.

Getting job and node information

In part 1 of Slurm in Ubuntu Clusters we saw how to see the general node state with the sinfo command.

username@controller:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
base*        up 3-00:00:00      1   idle compute-[1-4]
long         up   infinite      1   idle compute-[1-4]

In the previous section we also saw how to monitor the job submission queue with the squeue command.

There are three more commands that are also very useful to analyse the state of the cluster, and also the job history for a certain user or a set of the users.

The smap command is similar to the squeue, in a way that it also shows the slurm scheduling queue. However it also shows information about Slurm jobs, partitions, and set configurations parameters.

username@controller:~$ smap -c
JOBID PARTITION     USER   NAME ST       TIME NODES NODELIST
27560      base username hellow  R   00:00:05     1 node6,node7

The sacct command displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database. With it you can see information about past submitted jobs.

username@controller:~$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
27560.batch       batch                                7  COMPLETED      0:0
27560.0      hydra_pmi+                                2  COMPLETED      0:0

The sview command shows a graphical user interface to view and modify the Slurm state. Of course that you can only modify slurm objects to which you have access, i.e: you can cancel your submitted jobs but not one of another user.

To use this command you require X11 forwarding activated on your SSH session, if you are connecting from a windows machine you can use XMing, a X Window System Server for Windows.

If you are using Bitvise SSH Client, you can activate X11 forwarding on the client side by going to the Terminal tab and enable the X11 Forwarding option.

Conclusion

We have now seen how to set up a slurm cluster and how to manage its jobs and nodes.

There are a lot of details that I have skipped for simplicity, but the information present in this article should get you up and running on the way to master slurm clusters.

References

[MPITUTO]

MPI Tutorials [link]

Tags: linux slurm python scheduler hpc cluster MPI

Manuel Torrinha is an information systems engineer, with more than 10 years of experience in managing GNU/Linux environments. Has an MSc in Information Systems and Computer Engineering. Work interests include High Performance Computing, Data Analysis, and IT management and Administration. Knows diverse programming, scripting and markup languages. Speaks Portuguese and English.