All NeSI systems use the Slurm batch system for the submission, control and management of user jobs.
Slurm has three key functions:
- It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work;
- It provides a framework for starting, executing, and monitoring work on the set of allocated nodes;
- It arbitrates contention for resources by managing a queue of pending work.
Slurm "partitions" can be thought of as job queues, each with an assortment of constraints such as job size limits, job time limits, which users are permitted to use it, etc.
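For example, the partitions visible to you, together with their time limits, node counts and states, can be listed with `sinfo` (the output columns shown in the format string are standard Slurm fields; the actual partition names on Mahuika and Māui will differ from system to system):

```shell
# List each partition with its wall-clock limit, node count and state.
sinfo --format="%.12P %.11l %.6D %.10T"
```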
The Slurm implementation on Māui and Mahuika has been designed to support:
- Application development and debugging
- Capacity workloads on Mahuika
- Capability workloads on Māui
- Pre- and post-processing on Mahuika and Māui Ancillary nodes.
- The NeSI Access Policy
- The NeSI Project Allocation and Aging Policy.
Features common to Māui and Mahuika
- Fast turnaround for small debug jobs
- Fast turnaround for pre- and post-processing jobs
- Projects that have exhausted their allocations may continue to submit jobs, but at the lowest priority and with no expectation of service.
- Compute resources are fairly shared across Collaborator (60%), Merit (includes Proposal Development and Postgraduate classes, 20%) and Subscription (includes Commercial) (20%)
- Simple “queue” structure, with different partitions according to job size, hardware, and wall-clock time limits.
- Prevent jobs from starting if a Project’s /nesi/nobackup inode quota has been exceeded.
- Use Backfill to improve utilisation.
- At job termination, provide useful information on job resource usage, including:
- Account (i.e. Project ID)
- Allocated Cores
- Maximum core count used
- Memory allocated
- Maximum memory used (on any node)
- Total core-hours consumed by the job
- Elapsed wall-clock time (job run time)
- Elapsed wall-clock time (from job submission to job completion)
- Job priority
- Maximum number of page faults
- Job Exit Code
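Much of this information can also be retrieved after the fact with `sacct`. A sketch (the job ID `12345` is a placeholder; the listed fields are standard `sacct` format fields):

```shell
# Query resource usage for a completed job (12345 is a placeholder job ID).
# CPUTime approximates total core-hours; MaxRSS is the peak memory on any node.
sacct -j 12345 \
  --format=JobID,Account,AllocCPUS,ReqMem,MaxRSS,Elapsed,CPUTime,Submit,End,ExitCode
```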
Features that differ between Māui and Mahuika
On Mahuika:
- Optimised for Capacity (High Throughput) workloads, including:
    - jobs using a single core,
    - small, tightly coupled jobs that fit within an InfiniBand island (up to 26 nodes),
    - embarrassingly parallel jobs of up to 504 cores, and
    - jobs that make high demands on metadata services (e.g. that write and/or read large numbers of small files, or "chatter" to log files);
- Topology-aware scheduling;
- Nodes are shared.
On Māui:
- Optimised for Capability workloads (i.e. jobs using one to many nodes);
- Makes optimal use of nodes with different memory sizes, enabling opportunities for in-memory pre-emption;
- Jobs should not make high demands on metadata services (e.g. by writing and/or reading large numbers of small files, or chattering to log files);
- Nodes are not shared.
On the Ancillary nodes:
- Optimised for jobs that make high demands on metadata services;
- Nodes are shared.
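A minimal batch script that requests a specific partition, task count and wall-clock limit might look like the following sketch (the partition name `large`, account `nesi12345` and executable `my_program` are placeholders, not real NeSI identifiers):

```shell
#!/bin/bash -e
#SBATCH --job-name=example     # Job name shown in the queue
#SBATCH --account=nesi12345    # Project (Account) to charge; placeholder
#SBATCH --partition=large      # Partition ("queue"); placeholder name
#SBATCH --ntasks=504           # e.g. an embarrassingly parallel job of 504 cores
#SBATCH --time=01:00:00        # Wall-clock time limit

srun my_program                # Placeholder executable
```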
NeSI Scheduling Policies Enforced at Job Submission
- Is the Project (Account) allowed to use the specified partition?
- Is the Project (Account) allowed to use the specified QoS?
- Does the job exceed the maximum core (or node) count for that partition?
- Has the Project exceeded the project’s /nesi/nobackup inode quota?
- Has the Project expired?
- Are there sufficient core-hour resources available to run the job?
- Is the user batch disabled?
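If a submission fails one of these checks, `sbatch` rejects the job with an error message. Jobs that are accepted but not yet running can be inspected with `squeue`, whose reason column reports why a job is still pending (the `%r` field is standard; the specific reason codes you see will vary):

```shell
# Show your queued jobs with their state and pending reason
# (e.g. Priority, Resources, or an association/quota limit code).
squeue -u $USER --format="%.10i %.12P %.10T %.20r"
```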