- Submit shorter jobs. There are more opportunities for the scheduler to find a time slot to run shorter jobs. If your application has the capability to checkpoint and restart, consider submitting your job for shorter time periods.
- Don't request more time than you will need. Leave some headroom for safety and run-to-run variability on the system, but try to be as accurate as possible: requesting more time than you need will make you wait longer in the partition's queue. Use the information provided at the completion of your job to better define resource requirements.
- Don't request more resources (e.g. cores, memory, GPGPUs) than you will need. As with time, over-requesting resources will make you wait longer in the partition's queue. Use the information provided at the completion of your job to better define resource requirements.
- You may improve your throughput for long-running jobs by allowing the job to be pre-empted (i.e. suspended in memory) while a higher-priority job runs on the core / node.
- Consider putting related work into a single Slurm job with multiple job steps, both for performance reasons and for ease of management. Each Slurm job can contain many job steps, and the overhead in Slurm of managing job steps is much lower than that of managing individual jobs.
- Job arrays are an efficient mechanism for managing a collection of batch jobs with identical resource requirements. Most Slurm commands can manage job arrays either as individual elements (tasks) or as a single entity (e.g. delete an entire job array with a single command).
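The tips above can be combined in a single batch script. The sketch below is illustrative only: the job name, resource amounts, array range, and program names (`./preprocess`, `./analyze`) are placeholders, not site-specific values.

```shell
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --time=01:00:00        # request only as much time as you expect to need
#SBATCH --ntasks=4             # request only the cores you will actually use
#SBATCH --mem-per-cpu=2G       # likewise for memory
#SBATCH --array=1-10           # ten array tasks with identical resource requirements

# Each srun invocation is a separate job step inside this one Slurm job,
# which is cheaper for the scheduler than ten-plus independent jobs.
srun ./preprocess input.${SLURM_ARRAY_TASK_ID}
srun ./analyze   input.${SLURM_ARRAY_TASK_ID}
```

Submit the script with `sbatch`; the whole array can later be cancelled in one command with `scancel <jobid>`.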
A primer for users new to Slurm is here, and a "cheat sheet" of Slurm commands is here.
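To act on the "use the information provided at the completion of your job" advice above, `sacct` can report a finished job's actual elapsed time and memory use for comparison against what was requested. `<jobid>` is a placeholder for your own job's ID; the field names are standard `sacct` format columns.

```shell
# Compare actual usage against the requested limits for a completed job,
# then tighten --time and memory requests in future submissions accordingly.
sacct -j <jobid> --format=JobID,JobName,Elapsed,TimeLimit,MaxRSS,ReqMem,AllocCPUS,State

# Many sites also install the seff utility (a Slurm contrib tool), which
# summarizes CPU and memory efficiency for a job directly:
seff <jobid>
```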