What does "oom-kill" mean?

OOM stands for "Out Of Memory". An error such as this:

slurmstepd: error: Detected 1 oom-kill event(s) in step 370626.batch cgroup

indicates that your job used more memory than it requested, so one or more of its processes were killed.
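
If you need to find such messages in many job output files, a small script can pull out the job ID and step name. The following is only a sketch: the pattern is inferred from the single example message above, the function name parse_oom_line is hypothetical, and other Slurm versions or job types (for example array jobs) may format the step differently.

```python
import re

# Pattern inferred from the one example message in this article.
# Assumption: the step is written as "<numeric job id>.<step name>".
OOM_PATTERN = re.compile(
    r"Detected (?P<count>\d+) oom-kill event\(s\) "
    r"in step (?P<job>\d+)\.(?P<step>\S+) cgroup"
)


def parse_oom_line(line):
    """Return count, job ID and step name from an oom-kill message, or None."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    return {
        "count": int(match.group("count")),
        "job_id": match.group("job"),
        "step": match.group("step"),
    }


example = "slurmstepd: error: Detected 1 oom-kill event(s) in step 370626.batch cgroup"
print(parse_oom_line(example))
```

For the example message this yields job ID 370626 and step "batch", which tells you which job to give a larger memory request.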

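OOM events can happen even when Slurm's sacct command does not report memory usage that high, for two reasons. First, unlike the enforcement via cgroup, which applies continuously, Slurm's accounting system only records usage every 30 seconds, so a short spike between two samples can go unrecorded. Second, the accounting does not include any temporary files the job may have put in the memory-based /tmp or $TMPDIR filesystems, even though those files count against the job's memory limit.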
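
The sampling effect can be illustrated with a toy model. This sketch does not reproduce how Slurm actually gathers accounting data. It only assumes that usage is read once every 30 seconds (the interval mentioned above), and the memory figures are invented for illustration.

```python
def sampled_peak(usage_by_second, interval=30):
    """Largest value seen when usage is read only every `interval` seconds."""
    return max(usage_by_second[::interval])


# Hypothetical job: 120 seconds at 2 GB, with a 10-second spike to 9 GB.
usage_gb = [2.0] * 120
for second in range(40, 50):
    usage_gb[second] = 9.0

true_peak = max(usage_gb)             # what a continuously enforced limit reacts to
polled_peak = sampled_peak(usage_gb)  # what 30-second sampling would record

print(f"true peak: {true_peak} GB, polled peak: {polled_peak} GB")
```

In this model the samples fall at seconds 0, 30, 60 and 90, all outside the spike, so the recorded peak stays at 2 GB although the job briefly used 9 GB. A job with a memory request between those two values would be killed while its accounting record still looks well within limits.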
