Submitting Slurm Jobs on Pan

What is Slurm?

Slurm, originally the Simple Linux Utility for Resource Management, is the job scheduler used on the Pan cluster. We chose Slurm from among several candidates because of its widespread use on high-performance computing systems around the world and its excellent scalability and performance on large clusters.

The Slurm home page can be found at https://slurm.schedmd.com/.

When should I use Slurm?

When using the Pan cluster, you should use Slurm for any computing work that is resource-intensive or likely to take a long time to complete, and that does not need further manual input from you or another person once started.

Important Slurm commands

The following table lists some of the most important Slurm commands and what they are most commonly used for.

Command | Purpose | Arguments | Web page
sbatch | Submits a job script | Command-line options (if any) followed by a script file and any arguments to the script itself | https://slurm.schedmd.com/sbatch.html
srun | Runs an executable command, using already allocated resources if possible | Command-line options (if any) followed by an executable and any arguments to the executable | https://slurm.schedmd.com/srun.html
squeue | Shows the current state of the job queue or, depending on options, of a subset of all jobs in the queue | Command-line options (if any) | https://slurm.schedmd.com/squeue.html
scancel | Cancels a submitted or running job | Command-line options (if any) followed by the ID or IDs of the job or jobs to cancel | https://slurm.schedmd.com/scancel.html
interactive | Requests an allocation of resources for interactive work | Command-line options | (No web page)
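
For illustration, a few typical invocations of these commands are shown below (the script name and job ID are placeholders):

sbatch my_job.sl        # submit a job script
squeue -u "$USER"       # show your own queued and running jobs
scancel 123456          # cancel the job with ID 123456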

The srun command

srun is the Slurm command to actually run an executable using allocated resources (CPU cores, memory, and so forth). You must use srun to carry out most work on the cluster. If you intend to submit your job as a script using sbatch (see below), you will need to preface each of your major commands (and especially those that need to be run in parallel) with srun except where we specifically advise otherwise.

Job profiling

srun allows you to capture periodic information about your job's resource use, such as CPU utilisation and memory consumption. To do this, you may use srun in the following form:

srun --profile=task --acctg-freq=<num> <executable> [ executable_options ... ]

The argument to --acctg-freq is the interval, in seconds, between data-gathering samples. Thus, the higher the value of --acctg-freq, the less often data will be gathered.

If you are doing this, we recommend inserting the following line at the end of your job submission script (before submitting the script, of course):

sh5util -j "${SLURM_JOB_ID}" -o profile.h5

You can then view the profile.h5 file using HDF5 viewing utilities.
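
Putting these pieces together, a minimal profiled job script might look like the following sketch (the project code, resource values, sampling interval and executable name are placeholders):

#!/bin/bash -e
#SBATCH -A nesi12345
#SBATCH -J ProfiledJob
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=4G

# Gather profiling data every 30 seconds while the executable runs.
srun --profile=task --acctg-freq=30 my_binary

# Merge the per-task profiling data into a single HDF5 file.
sh5util -j "${SLURM_JOB_ID}" -o profile.h5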

For more information about job profiling, please see http://slurm.schedmd.com/hdf5_profile_user_guide.html.

Please be aware that job profiling may require large amounts of disk space and is likely to make your job slower and less efficient. If you have concerns about your job's performance, please feel free to contact the NeSI support team.

The sbatch command

It is likely that you will use sbatch more frequently than any other Slurm command except srun. sbatch accomplishes two things:

  • It requests an allocation of resources, and
  • It describes the work to be done using those resources.

The general form of the sbatch command is as follows:

sbatch [OPTIONS] <job_script> [ARGUMENTS]

In this form, the OPTIONS are command-line flags to sbatch, while the ARGUMENTS are arguments and command-line flags to your job submission script.

However, you do not have to put the OPTIONS on the sbatch command line, and indeed we recommend against doing so. Instead, we recommend that you write them into your job script. Since it is common for job scripts not to require any arguments, the typical sbatch invocation simplifies to:

sbatch <job_script>

The sbatch command itself can be called from within a shell script or other program. More advanced workflows often take advantage of this approach.
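
As a minimal sketch of that approach, the following wrapper script submits two jobs, capturing the first job's ID from sbatch's standard "Submitted batch job <ID>" message so that the second job can depend on it (the script names are placeholders):

#!/bin/bash -e

# Submit the first job and capture its job ID.
# (Newer Slurm versions also offer "sbatch --parsable", which prints only the ID.)
first_job_id=$(sbatch first_step.sl | awk '{print $4}')

# Submit a second job that may only start once the first finishes successfully.
sbatch --dependency=afterok:"${first_job_id}" second_step.sl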

Slurm job submission scripts

A Slurm job submission script is much the same as an ordinary shell script, but it has some quirks and additional features. When writing a job submission script, it is important to bear these characteristics in mind.

Job flags in the script

We recommend that, instead of trying to specify job attributes on the command line as flags to sbatch, you write them into the job submission script itself. In this format there are one or more lines at the start of the script just after the shebang (#!) line, each starting with #SBATCH.

Please put each flag on its own #SBATCH line except where otherwise instructed.

Many flags have both short forms (a single dash followed by a single letter or number) and long forms (two dashes followed by words, hyphens, etc.). Slurm on Pan is configured so that after a long-form option you can put either an equals sign or white space before the value.

A table containing some of the more common options follows.

Short form | Long form | Meaning | Notes
-A | --account | Your project code (e.g., nesi12345) | This information is compulsory.
-J | --job-name | A short name for this job (without spaces) | The default is the name of the script file.
-n | --ntasks | The total number of tasks this job is expected to spawn | Default is 1. Note that each multithreaded program execution counts as one task, even if it uses multiple cores. This option will override any --ntasks-per-node specification.
-c | --cpus-per-task | The total number of CPU cores to be allocated to each task | Default is 1. You will need to adjust this value if, and only if, you are running a multithreaded executable.
-t | --time | The time limit for the job | Default is 2 hours. Acceptable formats: mmm, mm:ss, hhh:mm:ss, ddd-hh, ddd-hh:mm, ddd-hh:mm:ss. Note that, due to our maintenance schedule, we have a de facto upper limit of about six weeks; a job with a time limit of more than six weeks is unlikely to ever be allowed to start.
(none) | --mem | The memory required per node | Units are in MB by default, but different units can be specified by using K, M, G or T as a suffix.
(none) | --mem-per-cpu | The memory required per CPU core | Units are in MB by default, but different units can be specified by using K, M, G or T as a suffix.
-C | --constraint | Limit acceptable nodes to those with particular features | We use this option to divide the cluster up based on architecture. Available constraints are wm, sb, ib, avx, kepler and fermi. There is no requirement to specify a constraint.
(none) | --gres | A list of generic resources to request | This is used to request GPUs (--gres=gpu:1 or --gres=gpu:2) or Intel Phi coprocessors (--gres=mic:1 or --gres=mic:2) and to manage the amount of heavy disk usage (--gres=io). If you wish to use multiple types of generic resources, you may separate them with commas, thus: --gres=gpu:1,mic:1,io.
-d | --dependency | A list of one or more job dependencies to be satisfied before this job can run | The most common dependency type is an afterok dependency, which means the previous job must have finished successfully. Workflows making use of dependencies frequently use shell scripts to submit their jobs in order to collect the job ID at submission time and automatically insert it into the next job's submission script.
(none) | --exclusive | Request exclusive use of each allocated node | This will not be done unless explicitly requested (i.e., the default behaviour is node-sharing between jobs).
-D | --workdir | The directory where the job is to begin execution | Default is the directory from which sbatch was invoked. This is not necessarily the same directory as where the job submission script is saved.
-o | --output | The name of the file to which standard output (STDOUT) will be written | Default is slurm-%j.out for normal jobs, or slurm-%A_%a.out for array jobs (where %j is replaced by the job ID, %A by the job ID of the first array job, and %a by the array index). You may use these options (%j, %A, %a and a handful of others) in any alternative name you select for the output file.
-e | --error | The name of the file to which standard error (STDERR) will be written | Default is slurm-%j.err for normal jobs, or slurm-%A_%a.err for array jobs (where %j is replaced by the job ID, %A by the job ID of the first array job, and %a by the array index). You may use these options (%j, %A, %a and a handful of others) in any alternative name you select for the error file.
(none) | --mail-type | The circumstances in which email notifications are to be sent | By default, no notifications will be sent. If notifications are desired, a typical value is ALL, though this name is now somewhat misleading as it does not include all notification types. ALL will notify at the start and end of the job, or if the job should fail or be requeued. This option is meaningless unless used in conjunction with --mail-user.
(none) | --mail-user | An email address to which job notifications are to be sent | You may use multiple --mail-user options, each on its own #SBATCH line. This option is meaningless unless used in conjunction with --mail-type.
-N | --nodes | The number of nodes to use when running this job | Default is 1. May be specified as a range, in which case the lower number (minimum number of nodes) should come first and the higher number (maximum number of nodes) last. If only one number is provided, this is used as both minimum and maximum. If the number of nodes is not specified, Slurm will attempt to allocate enough nodes to satisfy the requirements implied by -n (total number of tasks) and -c (number of cores per task). We recommend explicitly setting the number of nodes only in cases where it is important for application performance.
(none) | --ntasks-per-node | The number of tasks to run on each node | This option is meant to be used with the --nodes option. It is subject to --ntasks, so we recommend not using both in the same job submission. As this option limits the flexibility of the scheduler, we recommend using it only in cases where it is important for application performance, such as where the application expects that every node will be contributing the same number of CPU cores to the overall job.
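
By way of illustration, the start of a job submission script using several of these options might look like the following sketch (the project code, email address, resource values and constraint are placeholders; choose values appropriate to your own work):

#!/bin/bash -e
#SBATCH -A nesi12345                  # project code (compulsory)
#SBATCH -J example_job                # job name
#SBATCH --time=02:00:00               # wall time limit
#SBATCH --ntasks=4                    # total number of tasks
#SBATCH --mem-per-cpu=2G              # memory per CPU core
#SBATCH --constraint=sb               # limit the job to nodes with the sb feature
#SBATCH --mail-type=ALL
#SBATCH --mail-user=you@example.com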

No resource file support

When executing a Slurm script, the shell will, by default, not load resource files. (A resource file is a file, such as .bashrc or .cshrc, that contains your default user environment settings.) If you have a setting in one of these files that is crucial to your workflow, we recommend that you put that setting in your job submission script explicitly, or else write a custom module and then load that module in your job submission script.
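
For example, rather than relying on .bashrc, a job submission script can set up its environment explicitly; in the following sketch, the exported variable, module name and executable are hypothetical placeholders:

#!/bin/bash -e
#SBATCH -A nesi12345
#SBATCH --time=00:30:00

# Environment settings that would otherwise come from .bashrc go here explicitly.
export MY_TOOL_HOME="${HOME}/tools/mytool"
module load UsefulApplication/1.0

srun my_tool_binary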

Slurm output environment variables

Slurm sets a number of variables which you may use in the body of your script (everything that comes after the end of the #SBATCH lines). These are called output environment variables. Some of the more useful output environment variables are listed in the following table.

Note that these variables cannot be used within the #SBATCH lines themselves. Note also that they will only be set automatically if the script is actually submitted through sbatch, so if you make use of any of them (or any other output environment variable) and then run the script on the command line for testing purposes, you will need to temporarily assign an appropriate value to each output environment variable you are using. We recommend that you do this on the command line (just before running the script) rather than in the script itself.

Do not attempt to assign a value to any output environment variable in the script itself, as such an assignment may overwrite the value set by Slurm.

Variable name Description
SLURM_ARRAY_TASK_COUNT The total number of tasks in the current job array. See the array job template for more information.
SLURM_ARRAY_TASK_ID The numeric ID of the current task in the job array. See the array job template for more information.
SLURM_CPUS_PER_TASK The number of CPUs requested per task. This variable may be useful in scripts running multithreaded applications.
SLURM_JOB_ID The Slurm job ID of this particular job.
SLURM_JOB_NAME The name of this particular job.
SLURM_MEM_PER_CPU The amount of memory available for this job per CPU core.
SLURM_MEM_PER_NODE The amount of memory available for this job per node.
SLURM_NTASKS The total number of tasks requested for this job.
SLURM_NTASKS_PER_NODE The number of tasks requested per node for this job. This variable will only be set if the --ntasks-per-node flag is used. It is not to be confused with SLURM_TASKS_PER_NODE.
SLURM_SUBMIT_DIR The directory from which sbatch was invoked.
SLURM_TASKS_PER_NODE The number of tasks to be run (or being run) on each node in the list of allocated nodes. Unlike SLURM_NTASKS_PER_NODE, this variable is always set, but its value is a moderately complex text string rather than a simple integer. It is not to be confused with SLURM_NTASKS_PER_NODE.
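
For example, if your script uses SLURM_CPUS_PER_TASK and SLURM_JOB_ID, you could test it outside Slurm by assigning placeholder values on the command line just before running it (the values and script name below are purely illustrative):

SLURM_CPUS_PER_TASK=4 SLURM_JOB_ID=test123 bash my_job.sl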

Other useful tips

  • We recommend starting your job submission script with #!/bin/bash -e instead of merely #!/bin/bash. The -e argument will cause the entire job to fail as soon as any so-called simple statement in the script fails, thereby preventing unnecessary wastage of resources. Use of -e will require you to explicitly handle any harmless errors in your script, as in the sketch below.
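
A minimal sketch of handling such a harmless error follows; the grep command and file name are illustrative, and rely on the fact that grep exits with a non-zero status when it finds no matches:

#!/bin/bash -e

# Without the "|| true", a grep that finds no matches would end the whole job.
grep 'WARNING' logfile.txt || true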

Managing I/O intensive jobs

Jobs which read and/or write large amounts of data can be limited by the GPFS file servers rather than by the CPUs they are running on. This wastes resources and degrades file access speeds for the entire cluster. In some cases it can greatly help to make use of one of the temporary directories we set up for each job. See Directories Available on Pan for more information.

Alternatively, if the nature of the task is such that it cannot avoid using the shared filesystem (e.g., file format conversions on many large files), we ask that you limit the number of simultaneous jobs.

There are two ways to do that:

  • If your job is an array job, you can restrict the number of simultaneous array tasks running by using the % separator. See the array job template for more details.
  • You can request a heavy I/O token by specifying #SBATCH --gres=io. At most 80 heavy I/O tokens can be in use at any given time.
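
The two approaches can also be combined, as in the following minimal sketch (the array range, concurrency limit, resource values and executable are placeholders):

#!/bin/bash -e
#SBATCH -A nesi12345
#SBATCH --time=04:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --array=1-200%10    # no more than 10 array tasks running at any one time
#SBATCH --gres=io           # request a heavy I/O token for each array task

srun convert_file "${SLURM_ARRAY_TASK_ID}"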

If your job, whether it is a single job or an array job, is using too much I/O and is not being managed properly, it may cause performance problems for the whole cluster. In that case, we may be obliged to kill the job without notice.

Using topology-aware scheduling

Some jobs benefit significantly from high network performance, in particular low latency between nodes. Slurm allows you to request an assignment of CPU cores and nodes such that the total number of network switches involved in your job is no more than a number you specify.

Because Slurm may not be able to free up an appropriate group of CPU cores in a timely manner, the job will wait for such an allocation for a maximum of 24 hours before being released into the ordinary queue. You have the option of requesting a maximum wait time of less than 24 hours.

The option to use topology-aware scheduling is as follows:

#SBATCH --switches=<max-num-switches>[@<max-wait-time>]

The maximum wait time may be expressed in any of the same time formats as the job wall time (see above). However, any attempt to set the maximum wait time to more than 24 hours is invalid and will be ignored.
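
For example, to request an allocation spanning at most one switch, waiting up to twelve hours for such an allocation before falling back to the ordinary queue (both values are illustrative):

#SBATCH --switches=1@12:00:00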

Interactive job sessions

We have prepared a custom command, interactive, which will start an interactive session on a compute node. Such interactive sessions are effectively command line shells that are queued and time-limited. To run an interactive session, please run the following command:

interactive -A <project_code> [OPTIONS]

Interactive job options

The more common options for interactive jobs, specified as command-line flags, are given in the following table.

Flag Description
-a Desired architecture. Options are wm, sb, ib, kepler, and fermi. Default is wm.
-n Number of tasks to run (default 1)
-c Number of CPU cores to reserve for each task (default 1)
-m Amount of memory (in GB) to reserve for each core (default 1)
-e Email address to which to send the notification when the session is ready for use
-x Executable file or program to run (default /bin/bash)

For more information, you may type the following at a command prompt:

interactive --help
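
For example, the following request (with an illustrative project code) asks for a single task with four CPU cores and 2 GB of memory per core on a node of the sb architecture:

interactive -A nesi12345 -c 4 -m 2 -a sb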

Example job submission scripts

A serial job

#!/bin/bash -e

#SBATCH -J SerialJob
#SBATCH -A nesi12345        # Project Account
#SBATCH --time=01:00:00     # Walltime
#SBATCH --mem-per-cpu=8G

module load UsefulApplication/1.0

srun serial_binary

A multithreading (OpenMP) job

#!/bin/bash -e

#SBATCH -J OpenMPJob
#SBATCH -A nesi12345        # Project Account
#SBATCH --time=01:00:00     # Walltime
#SBATCH --mem-per-cpu=8G
#SBATCH --cpus-per-task=8   # 8 OpenMP Threads

module load UsefulApplication/1.0
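
# Many OpenMP programs take their thread count from OMP_NUM_THREADS. Setting it
# from SLURM_CPUS_PER_TASK keeps the thread count in step with the allocation
# (this assumes openmp_binary honours OMP_NUM_THREADS).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"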

srun openmp_binary

An MPI job

#!/bin/bash -e

#SBATCH -J MPIJob
#SBATCH -A nesi12345        # Project Account
#SBATCH --time=01:00:00     # Walltime
#SBATCH --ntasks=2          # number of tasks
#SBATCH --mem-per-cpu=8G    # memory/cpu

module load UsefulApplication/1.0

srun mpi_binary

A hybrid (OpenMP and MPI) job

#!/bin/bash -e

#SBATCH -J HybridJob
#SBATCH -A nesi12345        # Project Account
#SBATCH --time=01:00:00     # Walltime
#SBATCH --ntasks=4          # number of tasks
#SBATCH --mem-per-cpu=8G 
#SBATCH --cpus-per-task=8   # 8 OpenMP Threads

module load UsefulApplication/1.0
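
# Many hybrid MPI/OpenMP programs take their per-task thread count from
# OMP_NUM_THREADS. Setting it from SLURM_CPUS_PER_TASK keeps it in step with
# the allocation (this assumes hybrid_binary honours OMP_NUM_THREADS).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"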

srun hybrid_binary

An array job

#!/bin/bash -e

#SBATCH -J ArrayJob
#SBATCH --time=01:00:00     # Walltime
#SBATCH -A nesi12345        # Project Account
#SBATCH --mem-per-cpu=8G
#SBATCH --array=1-1000%50   # Array: 1, 2, ..., 999, 1000.
                            # No more than 50 array tasks may be running at
                            # any time.
                            # On Pan, we have capped the number of array
                            # indices per array job to 1,000.

module load UsefulApplication/1.0

# In this example, the array index is being passed into array_binary
# as an argument.

srun array_binary "${SLURM_ARRAY_TASK_ID}"
