Mahuika Slurm Partitions

Definitions

CPU: A logical core, also known as a hardware thread. Referred to as a "CPU" in the Slurm documentation. Since hyperthreading is enabled, there are two CPUs per physical core, and every task (and therefore every job) is allocated an even number of CPUs.

Fairshare Weight: CPU hours are multiplied by this factor to determine usage for the purpose of calculating a project's fair-share score.

Job: A running batch script and any other processes which it might launch with srun.

Node: A single computer within the cluster with its own CPUs and RAM (memory), and sometimes also GPUs. A node is analogous to a workstation (desktop PC) or laptop.

Task: An instance of a running computer program, consisting of one or more threads. All of a task's threads must run within the same node.

Thread: A sequence of instructions executed by a CPU.

Walltime: Real-world (elapsed) time, as opposed to CPU time (walltime × CPUs).

General Limits

  • No individual job can request more than 20,000 CPU hours. As a consequence, a shorter job can request more CPUs than a longer one (short-and-wide vs long-and-skinny); see the worked example below.
  • No individual job can request more than 576 CPUs (8 full nodes), since larger MPI jobs are scheduled less efficiently and are probably better suited to running on Māui.
  • No user can have more than 1,000 jobs in the queue at a time.

These limits are defaults and can be altered on a per-account basis if there is a good reason. For example, we will increase the limit on queued jobs for those who need to submit large numbers of jobs, provided that they undertake to do so with job arrays.
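As a rough illustration of the CPU-hour limit, the sketch below shows the arithmetic for a hypothetical "short-and-wide" request; the job name and program are placeholders, not part of the site configuration.

#!/bin/bash -e
# Hypothetical "short-and-wide" request: 576 CPUs (the per-job maximum) for 1 day.
# 576 CPUs x 24 hours = 13,824 CPU hours, well within the 20,000 CPU-hour limit.
#SBATCH --job-name=wide_example    # placeholder name
#SBATCH --ntasks=288
#SBATCH --cpus-per-task=2          # 288 tasks x 2 CPUs = 576 CPUs
#SBATCH --time=1-00:00:00          # 1 day
srun ./my_mpi_program              # placeholder program

# By contrast, the same 576 CPUs for 3 days would be 576 x 72 = 41,472 CPU hours,
# so a 3-day job could request at most about 277 CPUs (20,000 / 72).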

Partitions

A partition can be specified via the appropriate sbatch option, e.g.:

#SBATCH --partition=long

However, on Mahuika there is generally no need to do so, since by default your job will be assigned to the most suitable partition(s) automatically, based on the resources it requests, in particular its memory/CPU ratio and time limit. If you do specify a partition and your job is not a good fit for it, you may receive a warning; please do not ignore this. E.g.:

sbatch: "bigmem" is not the most appropriate partition for this job, which would otherwise default to "large". If you believe this is incorrect then please contact support@nesi.org.nz and quote the Job ID number.

 

Name      Max Walltime   Nodes        CPUs/Node        Available Mem/CPU   Available Mem/Node         Description
--------  -------------  -----------  ---------------  ------------------  -------------------------  -----------
long      3 weeks        69           72               1500 MB             105 GB                     For jobs that need to run for longer than 3 days.
large     3 days         long + 157   72               1500 MB             105 GB                     Default partition.
bigmem    7 days         9            72               6300 MB             460 GB                     Jobs requiring large amounts of memory.
hugemem   7 days         4            80 / 128 / 176   18 / 30 / 35 GB     1,500 / 4,000 / 6,000 GB   Jobs requiring very large amounts of memory.
gpu       3 days         4            72               6300 MB             460 GB                     See below for more info.

The hugemem partition contains nodes of three different sizes; the three values in its CPU and memory columns correspond to those node types.

 

Debug QoS

Orthogonal to the partitions, each job has a "Quality of Service" (QoS), with the default QoS for a job being determined by the allocation class of its project. Specifying --qos=debug overrides that and gives the job very high priority, but debug jobs are subject to strict limits: 15 minutes per job, only one job at a time per user, and no more than two nodes per job.
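A minimal debug job might look like the following sketch; the program name is a placeholder, and the requested time must fit within the 15-minute limit.

#!/bin/bash -e
#SBATCH --qos=debug            # very high priority, strict limits apply
#SBATCH --time=00:10:00        # must be 15 minutes or less
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
srun ./my_test_program         # placeholder for the program being debugged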

 

Requesting GPUs

  • Nodes in the gpu partition have 2 GPUs each, so you can request 1 or 2 GPUs per node:
    #SBATCH --gres gpu:1
  • In addition to GPUs, you can request up to four CPUs and up to 54 GB of RAM; a sketch of a complete GPU job script follows this list.
  • There are only 2 GPUs per node, so non-MPI jobs are not able to use more than two GPUs.
  • No more than 4 GPUs (50% of them) can be allocated to any one account at a time.
  • There is also a limit of 100 GPU-hours of allocated GPU time per account. This allows you to use more GPUs if your jobs are shorter, and so guarantees that all users can at least get short debugging jobs onto a GPU in a reasonably timely manner. For example, you can have: one 3-day 1-GPU job, one 2-day 2-GPU job, or 6 GPUs used by jobs of 15 hours or less.
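Putting these limits together, a single-GPU job could be requested along the lines of the sketch below; the program name is a placeholder, and anything else your program needs (such as environment modules) is omitted.

#!/bin/bash -e
#SBATCH --partition=gpu
#SBATCH --gres gpu:1           # 1 of the node's 2 GPUs
#SBATCH --cpus-per-task=4      # at most four CPUs alongside the GPUs
#SBATCH --mem=54G              # at most 54 GB of RAM
#SBATCH --time=1-00:00:00      # the partition allows up to 3 days
srun ./my_gpu_program          # placeholder for a GPU-enabled program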

 

Mahuika Infiniband Islands

Mahuika is divided into “islands” of 26 nodes (or 1,872 CPUs). Communication between two nodes on the same island is faster than between two nodes on different islands. MPI jobs placed entirely within one island will often perform better than those split among multiple islands.

You can request that a job runs within a single InfiniBand island by adding:

#SBATCH --switches=1

Slurm will then run the job within one island provided that this does not delay starting the job by more than the maximum switch waiting time, currently configured to be 5 minutes. That waiting time limit can be reduced by adding @<time> after the number of switches, e.g.:

#SBATCH --switches=1@00:30:00