Outcomes of maintenance conducted 12/02/2019 - 13/02/2019

From 9am on 12/02/2019 to 10am on 13/02/2019, NeSI held a planned outage to perform maintenance on multiple components of both Mahuika and Māui, including their shared storage. The following is a brief list of the activities performed during this outage and some of the resulting improvements to service:

  • At the beginning and end of the outage window, we conducted performance degradation testing, using a suite of microbenchmarks and real science codes, to check for unexplained performance variations.
  • Vendor-recommended updates were made to critical systems firmware and BIOS.
  • Operating system upgrades were made to critical servers.
  • Upgrades to our HPC filesystems (Spectrum Scale) provided fixes for several ongoing issues, including client system deadlocks that were triggered by particular application workloads.
  • More memory is now available for user applications, because the HPC filesystem (Spectrum Scale) has been tuned to reserve fewer system resources on compute nodes.
  • The Mahuika login node environment now automatically loads the top-level NeSI module into each user’s shell environment.
  • The Slurm job profiling plugin has been enabled on Mahuika. Generic documentation can be found on the Slurm website; NeSI-specific documentation is being drafted but has not yet been published.

We appreciate your patience and understanding during these activities. As always, if you have any related questions, please don’t hesitate to contact NeSI Support.

Additional Background
NeSI regularly performs scheduled maintenance on various components of its high performance computers (HPCs) to ensure robust and performant operations. Whenever possible, such work is carried out during regular operations to minimise user impact. However, some more intrusive maintenance activities cannot be done without taking critical shared services (e.g., high performance filesystems) offline, or must be applied consistently across all systems as quickly as possible. We must therefore have occasional planned outages where no user activity is possible on the HPCs.

Whenever scheduled maintenance is required, particularly if it requires critical shared services to be taken offline, we will notify users in advance by email and post a status update to status.nesi.org.nz. From that status page, users can subscribe to receive system-specific or all status updates.

If you have questions about these maintenance and outage processes, please contact NeSI Support.
