Why is my job taking a long time to start?

If you think your job is taking unexpectedly long to start running, there are various possible explanations.

First, check our status page to look for scheduled maintenance periods. If your job would otherwise run on a cluster during maintenance work affecting that cluster, your job may be delayed until after the maintenance work is completed and the cluster returns to service, depending on the nature of the work to be done.

If your job is not supposed to be affected by a maintenance period, you can get more information by running this command:

nn_my_queued_jobs

This command will, for each of your queued jobs, produce an output looking something like this:

$ ./nn_my_queued_jobs 
ACCOUNT JOBID NAME SUBMIT_TIME QOS NODE CPUS MIN_MEMORY PRIORITY START_TIME REASON
nesi99999 12345678 SomeRandomJob 2019-01-01T12:00:00 collab 1 8 2G 1553 N/A QOSMaxCpuPerJobLimit

One of the most useful columns to look at, far over on the right hand side, is the "Reason" column. This column tells you why the job is delayed. Common answers include "Priority", "Resources", "Dependency", "ReqNodeNotAvail", and others.

  • Priority means that the job just isn't in the front of the queue yet.
  • Resources means that there are not currently enough free resources to run the job.
  • Dependency means the job is in some way dependent on another, and the other job (the dependency) has not yet reached the required state.
  • ReqNodeNotAvail means that the job has requested some specific node that is busy working on other jobs, is out of service, or does not exist.

A more comprehensive list of job reason codes is available here (offsite). As noted on that page, if a job is waiting for execution for several reasons, only one reason will be displayed, and there is not a documented importance of reasons. For example, one job could say Priority and another could say Resources, when in fact both are waiting due to Priority and Resources at the same time.

You can check the job's priority relative to other waiting jobs by means of the following command on a Mahuika or Māui login node (as appropriate):

nn_job_priorities

This command is intended to produce the same results as the native Slurm command sprio, but with jobs sorted in order of priority from highest to lowest.

The output should look something like this:

          JOBID PARTITION   PRIORITY        AGE   FAIRSHARE    JOBSIZE        QOS
793492 gpu 1553 504 1000 20 30
2008465 long 1107 336 723 18 30
2039471 long 1083 312 723 18 30
2039456 long 1083 312 723 18 30
2039452 long 1083 312 723 18 30
2039435 long 1083 312 723 18 30
2039399 long 1083 312 723 18 30
2039391 long 1083 312 723 18 30
2039376 long 1083 312 723 18 30
2039371 long 1083 312 723 18 30
...

If, compared to other jobs in the queue, your job's priority (third column) and fair share score (fifth column) are both low, this usually means that your project team has recently been using through CPU core hours faster than expected. See this page for more information on Fair Share, how you can check your project's fair share score, and what you can do about a low project fair share score.

In the unlikely scenario that your job's priority is low but your job's fair share score is high (i.e. nearly 1,000), you will just have to wait, as this scenario supposes that other jobs like yours are ahead in the queue because they have similar resource needs but have been waiting for even longer than your job.

If your job's priority is high compared to other jobs but the job still won't start, make sure that your resource requests (in your Slurm script) are appropriate. If the nature of your work allows, you could try:

  • Being more flexible about where in the cluster CPU cores come from
  • Asking for less memory (RAM)
  • Asking for a shorter wall time

You can use the scontrol command to reduce the job's requested wall time limit, for example the following command will set the wall time limit of job 12345678 to one hour:

$ scontrol update jobid=12345678 TimeLimit=01:00:00

The scontrol update command does not print out any message to say that it has succeeded, so you can check its effect using scontrol show:

$ scontrol show job 12345678 | grep TimeLimit
RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A

Note that you can not yourself use scontrol to increase a job's requested wall time.

You can not adjust the amount of memory (RAM) using scontrol, and altering the number of tasks, nodes or cores requested is unwise without making corresponding changes in the job submission script, and in some cases the program's input file as well. For this reason, if you wish to change the amount of memory (RAM) or the number or arrangement of cores, you should cancel your queued job using the scancel command and then resubmit it. If your project's fair share score is high, your newly submitted job should progress quickly through the queue.

If your job priority is high, your resource requests are low and your job still won't start, please send a request to our support team and we will look into the problem.

Was this article helpful?
0 out of 0 found this helpful