
Job Prioritisation on Pan

This article was last updated on Wednesday 6 December 2017.

Introduction to Job Priorities

In Slurm, every pending job receives a priority score. This priority score is used to determine the order in which jobs will start. A job is only eligible to start if its execution will not delay a higher-priority job.
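
If you want a rough idea of when the scheduler currently expects your pending jobs to start (the estimates shift as other jobs finish or are submitted), Slurm's standard squeue option for estimated start times can be used, for example limited to your own jobs:

squeue --start -u $USER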

Job Size

In addition to the job's priority, the actual order in which jobs start depends on the resources those jobs request. Jobs which request more resources (CPUs, memory, GPUs etc.) may have to wait longer for those resources to become available.

We have a default maximum allowable job size of 512 cores. You can request permission to run jobs larger than that by emailing support@nesi.org.nz.
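
As an illustration, the size of a job is set by the resource directives in its submission script. The figures and names below are placeholders rather than recommendations; a job asking for 64 cores, for example, stays comfortably under the 512-core default maximum:

#!/bin/bash
#SBATCH --job-name=my_job     # placeholder job name
#SBATCH --ntasks=64           # number of cores requested (512 or fewer without special permission)
#SBATCH --mem-per-cpu=2G      # memory per core
srun ./my_program             # my_program is a placeholder for your own executable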

Job Duration

Jobs with shorter wall time limits have a natural advantage in the scheduler (see What is Backfill? below for more information). In addition, non-GPU jobs of 24 hours or less benefit a little from being eligible to run on the GPU nodes when those nodes are not needed for GPU work. This arrangement allows genuine GPU jobs to gain access to the GPUs in a reasonable time, and is implemented via Slurm "partitions":

Partition   20 GPU nodes   All other nodes   Constraint
gpu         YES            NO                Require a GPU
short       YES            YES               Time limit of 24 hours
long        NO             YES               (none)
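
For example, a job that genuinely needs a GPU might be sent to the gpu partition with directives along the following lines. The --gres syntax shown is the generic Slurm way of requesting a GPU; the exact resource specification to use on Pan may differ, and the program name is a placeholder:

#!/bin/bash
#SBATCH --partition=gpu       # run only on the 20 GPU nodes
#SBATCH --gres=gpu:1          # request one GPU (exact gres specification may vary by site)
#SBATCH --time=12:00:00       # 12 hours
srun ./my_gpu_program         # placeholder for your GPU-enabled executable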

We also have a default maximum allowable time limit of three weeks. You can request permission to run jobs longer than that by emailing support@nesi.org.nz. However, three weeks is already half of the typical six weeks between planned Pan outages, so we would rather help you reduce the time limit with more parallelism if at all possible. If a job must run for longer than three weeks, you should ensure that the job's output data is appropriately checkpointed.
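
Time limits are set with the standard --time option, which accepts minutes, hours:minutes:seconds, or days-hours:minutes:seconds. For example, a one-week limit, well inside the three-week default maximum, could be written as:

#SBATCH --time=7-00:00:00     # 7 days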

Multi-Factor Prioritisation

Slurm offers several prioritisation schemes; we currently use multi-factor prioritisation, in which the priority is made up of weighted contributions from various aspects of the job. While the Slurm developers define the available factors, the NeSI team chooses how much each factor is worth, and we have chosen not to use some of the factors Slurm offers.

How are Priorities Calculated on Pan?

We calculate priorities from the following factors.

Factor: QoS
How calculated: All jobs have a "Quality of Service" value of zero.
Minimum achievable score: 0
Maximum achievable score: 0

Factor: Age
How calculated: Time spent in the queue, in hours. A job reaches its maximum possible age score after three weeks in the queue.
Minimum achievable score: 0
Maximum achievable score: 504

Factor: Job Size
How calculated: The number of cores requested (approximately). We prioritise large jobs because small jobs naturally have more opportunities to start, and pushing the expected start time of a large job further into the future makes the scheduler's task that much harder.
Minimum achievable score: 0
Maximum achievable score: 500

Factor: Partition
How calculated: GPU jobs receive a substantial priority boost so that they take precedence over other jobs that are allowed on the GPU nodes. However, because the number of GPU nodes is very small, claiming that your jobs need GPUs when they don't is unlikely to help them run faster on the whole.
Minimum achievable score: 1000
Maximum achievable score: 3000

Factor: Fair Share
How calculated: Based on the degree to which your share of recent (exponentially weighted average) cluster usage is over or under what is expected for your project, as determined by your allocation of CPU hours. For more information on fair share, please see the Slurm fair share documentation.
Minimum achievable score: 0
Maximum achievable score: 1000

Factor: Nice
How calculated: When submitting a job you can choose to lower its priority by a "nice" value of up to 1,000 points, e.g. sbatch --nice=100 ...
Minimum achievable score: -1000
Maximum achievable score: 0

Total
Minimum achievable score: 0
Maximum achievable score: 5004
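
Two of these factors can be inspected or influenced directly from the command line. Slurm's sshare command reports fair-share usage for your associations, and scontrol can raise the nice value of a job you have already submitted (only administrators can give a job a negative nice value). The job ID below is a placeholder:

sshare                                   # show fair-share information for your associations
scontrol update JobId=12345 Nice=500     # lower the priority of job 12345 by a nice value of 500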

How Can I See Priority Weighting Factors?

Run the following command at the command prompt on the login node or a build node:

sprio -w

How Can I See the Priorities of Currently Pending Jobs?

The following command, run at a command line on the login node or a build node, will return a list of currently pending jobs, sorted in order of priority (highest to lowest), with a breakdown of priority factors rounded down to the nearest integer:

sprio_header="$(sprio -l | head -n 1)"; sprio_body="$(sprio -l | tail -n +2 | sort -k3,3nr -k2,2)"; printf '%s\n%s\n' "$sprio_header" "$sprio_body"

What is Backfill?

Backfill is a scheduling strategy that allows small, short jobs to jump the queue if by doing so they can use otherwise idle machinery. If not for backfill, jobs would start in strict order of priority, and the cluster as a whole would be less productive. You can take advantage of backfill by breaking your work up into small, short jobs.

Although the kinds of jobs that can be backfilled will also receive a low job size score, in our experience the ability to be backfilled is, on the whole, more useful when it comes to getting work done on the cluster.

A job will not be backfilled if by doing so it would delay the start of another higher-priority job.
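
As a sketch of how to make work backfill-friendly, assuming your workload can be split into independent pieces, a job array of short single-core tasks is usually much easier to backfill than one long, large job. The script and program names here are placeholders:

#!/bin/bash
#SBATCH --array=1-100         # 100 independent tasks
#SBATCH --ntasks=1            # each task uses a single core
#SBATCH --time=01:00:00       # one hour per task: short enough to backfill readily
srun ./process_chunk ${SLURM_ARRAY_TASK_ID}   # process_chunk is a placeholder for your program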

More Information

More information is available in the Slurm documentation on multi-factor prioritisation and on backfill scheduling.
