Primer for Kupe Users Moving to Maui

Logging in

External users should connect to lander02.nesi.org.nz. Do not use lander.nesi.org.nz: it currently provides access to Kupe only, although in future it will become the lander for the HPCF in Wellington.

Once on the lander node, or if you are on the NIWA network, ssh to login.maui.niwa.co.nz.
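
The two login routes above can be summarised as a small shell helper that prints the ssh command to use. This is a sketch only: `maui_login_cmd` and the username `jbloggs` are hypothetical, and the single-command jump assumes OpenSSH 7.3 or later (for the `-J` option). Logging in to lander02 first and then to login.maui.niwa.co.nz works equally well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ssh command for reaching the Maui login node.
# Usage: maui_login_cmd <username> [internal]
#   internal = "yes" if you are on the NIWA network or already on the lander.
maui_login_cmd() {
    local user="$1" internal="${2:-no}"
    if [ "$internal" = "yes" ]; then
        # Inside NIWA (or on lander02): connect to the login node directly.
        echo "ssh ${user}@login.maui.niwa.co.nz"
    else
        # External users: hop via lander02, since lander.nesi.org.nz only reaches Kupe.
        # The -J (ProxyJump) option assumes OpenSSH 7.3 or later.
        echo "ssh -J ${user}@lander02.nesi.org.nz ${user}@login.maui.niwa.co.nz"
    fi
}

maui_login_cmd jbloggs        # external route
maui_login_cmd jbloggs yes    # from the NIWA network or the lander
```

Replace `jbloggs` with your own username and run the printed command.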

Programming Environment

The Programming Environment on Maui has been upgraded relative to Kupe. On the Maui_Ancil (CS500) nodes it now includes the Cray Programming Environment as well as the Intel and GNU toolchains. On the Cray XC50, the CCE, GNU and Intel programming environments have been upgraded: the Cray and GNU compilers are more recent versions, while the Intel compiler version is unchanged from Kupe.

As noted here, the CS500 Programming Environment does not currently support swapping programming environments as on the XC50 (e.g.  ). This capability will be included in the next release of the Cray CS programming environment, in October.
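
On the XC50, switching compiler suites is normally done by swapping the Cray `PrgEnv-*` modules, for example `module swap PrgEnv-cray PrgEnv-intel`. The sketch below wraps this in a hypothetical helper, `switch_prgenv`, which first detects whichever `PrgEnv-*` module is loaded. It assumes the Environment Modules `module` command used on Cray XC systems, which writes `module list -t` output to stderr; it is not expected to work on the CS500 nodes before the release mentioned above.

```shell
#!/usr/bin/env bash
# Sketch for the maui (XC50) login nodes only; assumes Cray's PrgEnv-* modules.

# Read terse "module list -t" output on stdin and print the loaded PrgEnv module
# name without its version (e.g. "PrgEnv-cray/6.0.4" -> "PrgEnv-cray").
current_prgenv() {
    grep '^PrgEnv-' | head -n 1 | cut -d/ -f1
}

# Hypothetical helper. Usage: switch_prgenv intel|gnu|cray
switch_prgenv() {
    local current
    # Environment Modules prints "module list" output on stderr, hence 2>&1.
    current=$(module list -t 2>&1 | current_prgenv)
    if [ -z "$current" ]; then
        echo "No PrgEnv-* module is loaded" >&2
        return 1
    fi
    module swap "$current" "PrgEnv-$1"
}
```

After `switch_prgenv intel`, running `module list` should show `PrgEnv-intel` in place of the previous environment.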

Software Stack

The software stack on maui, maui_ancil and the Virtual Labs has been rebuilt from scratch. While all relevant software has been migrated from kupe, kupe_mp and the Virtual Labs, a few important changes have been made:

  • There is now a clearer separation between the maui_ancil and Virtual Labs (CS500) software stack and the maui (XC50) software stack: the maui software stack contains only software needed for large jobs (e.g., XIOS and the grib_api library), while general tools and libraries (e.g., Anaconda, NCO and Mule) are now available only on the maui_ancil and Virtual Lab machines. This makes best use of system capabilities, such as InfiniBand file access on maui_ancil and the Virtual Labs, and avoids single-core jobs running on maui (i.e. the XC50).
  • The maui_ancil and Virtual Labs software stack now uses a more recent GNU toolchain based on GCC v7.1.0, to take advantage of Intel Skylake architecture features. As a result, module names differ slightly from those on Kupe.
  • To avoid clutter, older versions of the same software have not been migrated, with the exception of NCL.

Rose and Cylc users

Maui requires users to work in their project directories on the /nesi/project and /nesi/nobackup file systems; personal directories are no longer available. Rose and Cylc are configured to place the "cylc-run" directory in

/nesi/nobackup/$PROJECT/$USER

So please add

export PROJECT=<your project number>

to your .profile on Maui. You may need to reset this variable for particular sessions if you work on multiple projects.
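
A minimal ~/.profile sketch, assuming the placeholder project code `nesi99999`. The helper `cylc_run_parent` is hypothetical: it prints the directory in which Rose and Cylc will place "cylc-run" for the current session, and fails with a hint if `PROJECT` is unset.

```shell
# ~/.profile on Maui. "nesi99999" is a placeholder: use your own project number.
export PROJECT=nesi99999

# Hypothetical helper: print the directory that will hold "cylc-run" for the
# current value of PROJECT, or fail with a hint if PROJECT is not set.
cylc_run_parent() {
    if [ -z "${PROJECT:-}" ]; then
        echo "PROJECT is not set; add 'export PROJECT=<your project number>' to ~/.profile" >&2
        return 1
    fi
    echo "/nesi/nobackup/${PROJECT}/${USER}"
}
```

If you work on several projects, re-export `PROJECT` for that session before running Rose or Cylc, then run `cylc_run_parent` to confirm the location.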

Other Key Differences between Kupe and Maui

For NIWA (and NeSI) users, the services available on Maui are similar to those on Kupe. However, based on our experience with Kupe, we have made some changes that will require you to adjust some of your work habits, job scripts and workflows.

The following tables outline the differences between:

  • the Kupe and Maui Cray XC50 hardware;
  • the Kupe and Maui Cray CS500 ancillary nodes (termed "multi-purpose" nodes on Kupe and "ancillary" nodes on Maui);
  • the Kupe and Maui storage subsystems;
  • the Slurm partitions on kupe and maui;
  • the Slurm partitions on kupe_mp and maui_ancil; and
  • the Virtual Lab and Cylc/Rose services available on the two systems.

Each of these differences is described in the following sections.

Kupe and Maui Cray XC50s

Table 1: Essential differences (and similarities) between the Kupe and Maui Cray XC50 supercomputers

Component | kupe (XC50) | maui (XC50) | Comment
Login nodes (also known as eLogin nodes) | 80 cores in 2 × Skylake nodes (Gold 6148, 2.4 GHz, dual socket, 20 cores per socket) | 80 cores in 2 × Skylake nodes (Gold 6148, 2.4 GHz, dual socket, 20 cores per socket) | Identical
Compute nodes | 4,160 Skylake cores (Gold 6148, 2.4 GHz, dual socket, 20 cores per socket) | 18,560 Skylake cores (Gold 6148, 2.4 GHz, dual socket, 20 cores per socket) | Kupe is air cooled; Maui is liquid cooled
Hyperthreading | Enabled | Enabled | Identical
Theoretical peak performance | 0.32 PF | 1.43 PF |
Memory capacity per login (build) node | 768 GB | 768 GB | Identical
Compute node memory | All nodes have 96 GB | 232 nodes have 96 GB; 232 nodes have 192 GB | 50% of Maui’s nodes have “large” memory
Total system memory | 10.0 TB | 66.8 TB |
Interconnect | Cray Aries, Dragonfly topology | Cray Aries, Dragonfly topology | Identical
Workload manager | Slurm (multi-cluster) | Slurm (multi-cluster) | In Wellington, jobs can be submitted to all Slurm clusters (maui, maui_ancil and mahuika)
Operating system | Cray Linux Environment: SLES 12 SP2, CLE 6.0 UP04 | Cray Linux Environment: SLES 12 SP2, CLE 6.0 UP06 | Maui runs a later version of the Cray OS
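
The node counts, total memory and peak figures in Table 1 are mutually consistent, which a little shell arithmetic can confirm. This is a sketch: the figure of 32 double-precision FLOPs per cycle per core (two AVX-512 FMA units at the 2.4 GHz base clock) is our assumption about how the peak numbers were derived, and memory is counted in decimal units (1 TB = 1,000 GB).

```shell
#!/usr/bin/env bash
# Cross-check the Table 1 figures from per-node specifications.
CORES_PER_NODE=40   # 2 sockets x 20 cores per socket (Gold 6148)

# Assumption: 32 double-precision FLOPs per cycle per core (two AVX-512 FMA units)
# at the 2.4 GHz base clock, i.e. 76.8 GFLOP/s per core, computed as 768/10 to
# stay in integer arithmetic.
peak_gflops() { echo $(( $1 * 768 / 10 )); }

kupe_cores=4160
maui_cores=18560
kupe_nodes=$(( kupe_cores / CORES_PER_NODE ))
maui_nodes=$(( maui_cores / CORES_PER_NODE ))

kupe_mem_gb=$(( kupe_nodes * 96 ))           # every Kupe compute node has 96 GB
maui_mem_gb=$(( 232 * 96 + 232 * 192 ))      # 232 nodes at 96 GB, 232 at 192 GB

echo "kupe: ${kupe_nodes} nodes, ${kupe_mem_gb} GB, $(peak_gflops "$kupe_cores") GFLOP/s"
echo "maui: ${maui_nodes} nodes, ${maui_mem_gb} GB, $(peak_gflops "$maui_cores") GFLOP/s"
```

Running it gives 104 and 464 compute nodes (464 = 232 + 232), 9,984 GB and 66,816 GB of memory (about 10.0 TB and 66.8 TB), and 319,488 and 1,425,408 GFLOP/s (about 0.32 PF and 1.43 PF), matching the table.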

 

Kupe and Maui Cray CS500s

Table 2: Essential differences (and similarities) between the Kupe and Maui Cray CS500 ancillary nodes

Component | kupe_mp | maui_ancil | Comment
Ancillary nodes | 440 cores in 11 × Skylake nodes (Gold 6148, 2.4 GHz, dual socket, 20 cores per socket) | 1,120 cores in 28 × Skylake nodes (Gold 6148, 2.4 GHz, dual socket, 20 cores per socket) | Some maui_ancil nodes are available to non-NIWA NeSI users
Hyperthreading | Enabled | Enabled | Identical
Local disk | 1.2 TB SSD | 1.2 TB SSD | Identical
Operating system | CentOS 7 | CentOS 7 | Identical
GPGPUs | 2 × Tesla P100 | 8 × Tesla P100 |
Remote visualisation | NICE DCV | NICE DCV | Identical
Memory per node | 768 GB | 768 GB | Identical
Interconnect | EDR (100 Gb/s) InfiniBand | EDR (100 Gb/s) InfiniBand | Identical
Workload manager | Slurm (multi-cluster) | Slurm (multi-cluster) | Identical
OpenStack | Used to provide virtual machines to users | Used to provide virtual machines to users | Identical

 

Kupe and Maui Storage Systems (shared with Mahuika)

Table 3: Essential differences (and similarities) between the Kupe and Maui storage subsystems

Storage | On kupe & kupe_mp | On maui, maui_ancil & mahuika | Comment
File system hardware | IBM ESS 1×GS4S, 2×GL6S, ~70 GB/s bandwidth | IBM ESS 1×GS4S, 4×GL6S, ~140 GB/s bandwidth; IBM ESS 1×GS4, 1×GL6S | The GS4 and one GL6S are provided for NIWA forecast operations
File system software | IBM Spectrum Scale v4.2.3 | IBM Spectrum Scale v5.0.1 | Scale v5 is faster than Scale v4
/home and /project | 626 TB | 2,505 TB | Includes NIWA-only storage
/nobackup | 2,505 TB | 6,263 TB | Includes NIWA-only storage
/devoper | 470 TB | 626 TB | Available to NIWA only
/oper | 626 TB | 939 TB | Available to NIWA only
/nearline | 470 TB | 626 TB | Accessed via the Librarian
Offline storage | >100 PB | >100 PB | Maui uses LTO8 drives; Kupe uses LTO7 drives

More detailed information about the NeSI filesystems is available here.

Kupe and Maui Slurm Partitions

More detailed information about the NeSI Slurm partitions on the maui and maui_ancil Slurm clusters is available here.

Cray XC50 Slurm Clusters

Table 4: Main differences between the Slurm configurations on Kupe and Maui (XC50s)

Users | Kupe partitions | Maui partitions | Comment
NeSI | Debug, NeSI | nesi_research | Single partition on maui. The QoS nesi_debug provides high-priority access to the system for debug jobs.
NIWA | NIWA_Research, Operations | niwa_research, niwa_operations | Two partitions on maui. The niwa_operations partition comprises 128 large-memory nodes; NIWA research jobs may run on niwa_operations if they can be pre-empted and use less than ~85 GB of memory on each node. The QoS niwa_debug provides high-priority access to the system for debug jobs.
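
As an illustration of the maui partitions and QoS in Table 4, the sketch below writes a minimal debug job script. The partition and QoS names come from Table 4; the file name, the account code `nesi99999`, the node count and the time limit are placeholders.

```shell
#!/usr/bin/env bash
# Write a minimal maui (XC50) debug job script. Partition and QoS names are from
# Table 4; the account, node count, time limit and file name are placeholders.
cat > maui_debug_job.sl <<'EOF'
#!/bin/bash
#SBATCH --clusters=maui
#SBATCH --partition=nesi_research
#SBATCH --qos=nesi_debug
#SBATCH --account=nesi99999
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# Report the compute node the job ran on.
srun hostname
EOF
```

Submit it from a Maui login node with `sbatch maui_debug_job.sl`; `squeue --clusters=maui -u $USER` then shows its state. NIWA users would use `niwa_research` and `niwa_debug` instead, and a research job allowed onto `niwa_operations` would, under the same assumptions, keep its per-node memory request (`--mem`) below ~85 GB.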

 

Cray CS500 Slurm Clusters

Table 5: Main differences between the Slurm configurations on the Kupe and Maui ancillary (CS500) nodes

Users | kupe_mp partitions | maui_ancil partitions | Comment (note the name change from _mp to _ancil)
NeSI | N/A | nesi_gpu, nesi_igpu, nesi_prepost | nesi_gpu provides access to the 5 × P100 GPGPUs (each is a VM with 8 Slurm CPUs and 1 × P100 GPGPU). nesi_igpu provides interactive access to a VM with 8 Slurm CPUs and 1 × P100 GPGPU during business hours (7 AM to 8 PM, Monday to Friday).
NIWA | ec_cs, sat, general | ec_cs, ec_sat, niwa_work | Used for operational serial jobs (ec_cs and ec_sat) and to support serial processing required by the NIWA Virtual Labs.
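
Because Slurm is configured as a multi-cluster, jobs are directed to the ancillary nodes with Slurm's `--clusters` option. The sketch below writes a single-core job for the nesi_prepost partition listed in Table 5; the account code, time limit, file name and the `hostname` placeholder command are assumptions to be replaced with your own.

```shell
#!/usr/bin/env bash
# Write a single-core maui_ancil (CS500) job script for the nesi_prepost partition
# (Table 5). The account, time limit and file name are placeholders.
cat > prepost_job.sl <<'EOF'
#!/bin/bash
#SBATCH --clusters=maui_ancil
#SBATCH --partition=nesi_prepost
#SBATCH --account=nesi99999
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

# Placeholder for your serial pre- or post-processing command.
hostname
EOF
```

The script can be submitted with `sbatch prepost_job.sl`, and `squeue --clusters=maui_ancil -u $USER` lists jobs on the ancillary nodes. NIWA users would pick one of the NIWA partitions in Table 5 instead.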

 

Kupe and Maui Virtual Lab and Cylc Services

Table 6: Comparison of Virtual Lab and cylc service node names. The fully qualified node name on Maui_Ancil is <VL-name>.maui.niwa.co.nz.

kupe (VL-name) | maui (VL-name) | Comment
hafs01 | w-nwp01 | Virtual Lab for NWP-focused research
clim01 | w-clim01 | Virtual Lab for climate-modelling-focused research
(none) | w-haz01 | Virtual Lab for hazards-focused research
(none) | w-mar01 | Virtual Lab for marine-focused research (available in late September)
(none) | w-cliops01 | Virtual Lab for CLIDB (available in late September)
sat-login01 | w-sat-login01 | Virtual Lab for satellite-focused research (available in late September)
cylc01 | w-cylc01 | Provides (research) cylc services
(none) | w-cylc02 | Provides (research) cylc service resilience
(none) | w-cylc03 | Provides (research) cylc service resilience
rose01 | w-rose01 | Provides Rose services
ec-login01 | w-ec-login01 | Virtual Lab for EcoConnect development (available in late September)
ec-satingest01 | w-ec-satingest01 | Satellite data ingest service (available in late September)
ec-cylc01 | w-ec-cylc01 | Provides (operational) cylc services (available in late September)
ec-cylc02 | w-ec-cylc02 | Provides (operational) cylc service resilience
ec-cylc03 | w-ec-cylc03 | Provides (operational) cylc service resilience
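
When updating scripts or ssh settings that refer to Kupe node names, the renames in Table 6 can be captured in a small lookup. The function `maui_vl_fqdn` is a hypothetical sketch: it encodes only the kupe/maui pairs listed in the table and appends the `.maui.niwa.co.nz` domain given in the table caption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a kupe Virtual Lab or service node name to its
# fully qualified maui_ancil name, using only the pairs listed in Table 6.
maui_vl_fqdn() {
    local vl
    case "$1" in
        hafs01) vl="w-nwp01" ;;   # the only rename beyond adding a "w-" prefix
        clim01|sat-login01|cylc01|rose01) vl="w-$1" ;;
        ec-login01|ec-satingest01|ec-cylc01|ec-cylc02|ec-cylc03) vl="w-$1" ;;
        *)
            echo "No kupe equivalent listed in Table 6 for '$1'" >&2
            return 1 ;;
    esac
    echo "${vl}.maui.niwa.co.nz"
}
```

For example, `maui_vl_fqdn hafs01` prints `w-nwp01.maui.niwa.co.nz`, while new Maui-only nodes such as w-haz01 are rejected because they have no kupe counterpart.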

 

 

 
