It has come to our attention that a number of jobs failed at about 1:30 p.m. on Tuesday 30 October, and again at about 10:30 a.m. on Wednesday 31 October.
Our initial investigations suggest that the root cause of the problem was one or more jobs that tried to keep hundreds of thousands of files open simultaneously. This behaviour prompted our file system management devices to stop certain nodes from accessing the shared file system until those nodes could be rebooted. As a result, all jobs running on those nodes were terminated.
Our system engineers and support team are continuing to investigate the problem in the hope that we can find ways to mitigate the risk of scientific codes triggering these kinds of events in future.
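In the meantime, researchers whose codes work through very large numbers of files may wish to check how many are held open at once. The sketch below is an illustration only and does not describe the jobs involved in these events. It assumes a workload where each file can be processed independently, and the names count_lines and demo are hypothetical. It shows a loop in Python that closes each file before opening the next, so that only one file is open at any time regardless of how many are processed.

```python
import os
import tempfile


def count_lines(paths):
    """Count lines across many files while holding only one file open at a time."""
    total = 0
    for path in paths:
        # The with-block closes this file before the loop opens the next one,
        # so the number of simultaneously open files stays at one.
        with open(path, "r") as handle:
            total += sum(1 for _ in handle)
    return total


def demo(num_files=1000):
    """Create num_files small two-line files in a temporary directory and count their lines."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(num_files):
            path = os.path.join(tmp, f"part_{i}.txt")
            # Each file is written and closed immediately as well.
            with open(path, "w") as handle:
                handle.write("a\nb\n")
            paths.append(path)
        return count_lines(paths)
```

Whether this pattern suits a given code depends on its access pattern. Codes that genuinely need many files open together may instead need to consolidate their data into fewer, larger files.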
We regret any inconvenience these events caused to researchers. If you have any concerns about these events or your running jobs, please feel free to raise a support ticket and one of our team will get back to you.
The NeSI team