Nearline Storage

                                                                                                                                                                      

Service Status

The Nearline Storage service is in an Early Access Programme (EAP) phase and not fully in production for all NeSI users. Selected researchers are being given access. The functionality of the tool and syntax of the commands may change in future. Before deleting any data from your project or nobackup directory that has been uploaded to nearline, please consider whether you require verification of the transfer. We recommend that you do at least a basic verification of all transfers.

Please send feedback about your user experience at https://support.nesi.org.nz/hc/requests/new, which may include functionality issues, intuitive or counter-intuitive behaviours, behaviours or features that you like, suggestions for improvements, transfers taking too long, etc.

Nearline Storage service

NeSI's Nearline Storage service allows you to store your data on our hierarchical system, which consists of a staging area (disk) connected to a tape library. Users of this service gain access to more persistent storage space for their research data, in return for slower access to those files that are stored on tape. We recommend that you use this service for larger datasets that you will only need to access occasionally. The retrieval of data may be delayed, due to tape handling.

Nearline is intended for use with relatively large files and should not be used for a large number of small files. Files smaller than ~100 MB should be combined into archive files using nn_archive_files, tar or a similar tool.
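As a rough illustration of the advice above, a directory of small files can be bundled into a single tarball before upload. This is only a sketch using standard tar; the function name bundle_dir and any paths passed to it are hypothetical and not part of the nearline tooling.

```shell
# Hypothetical helper: bundle a directory of small files into one compressed
# tarball, so that nearline stores a single larger file.
# Usage: bundle_dir <source_dir> <archive.tar.gz>
bundle_dir() {
    src_dir=$1
    archive=$2
    # -C switches to the parent directory so the archive holds relative paths.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" &&
    # Listing the archive is a cheap check that it is readable before upload.
    tar -tzf "$archive" > /dev/null
}
```

The resulting archive would then be the file to place in a directory or file list for upload, rather than the individual small files.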

Note

The existing directory structure starting after /nesi/project/<projectID>/ or /nesi/nobackup/<projectID>/ will be mapped onto /nesi/nearline/<projectID>/. When retrieving data, the whole directory structure after /nesi/nearline/ (including the project ID) will be recreated in the target directory. See below for details.

A Nearline project gets locked when writing to or deleting from it. Until this process is finished no other write or delete operation can be performed on the same project and the user will see a status message "project locked by none".

What you can do

The client allows you to carry out the following operations:

  • View files: View list of files stored in nearline.
  • Put: Copy files from your project or nobackup folder into nearline.
  • Get: Retrieve files from nearline into your project or nobackup folder, without deleting them from nearline.
  • Purge: Delete files stored in nearline.
  • View job status: View a list of jobs (put/get/purge) you have run, along with their status.
  • View quota: View your nearline quota and usage.

Getting started

Nearline has a common tool for access, with a set of nl commands, which are accessible by loading the following module:

module load nearline/1.0.0.9

Help us troubleshoot!

Tip

We highly recommend running the below commands, and especially nlget and nlpurge, from within a tmux or screen session.

The nearline user interface is still being refined. In particular, if you discover a problem (for example, nljobstatus reports an error - see below), it will be difficult for you to find out which command gave rise to the problem. We will also have the same difficulty when we investigate.

To assist both yourself and us in the event of a problem, please run nlput commands in the following manner:

{ { echo "----------" ; echo "Date and time: $(date)"; echo "Working directory: $(pwd)"; set -x; nlput <nlput_arguments> --nowait; set +x ; }  2>&1 ; } | tee -a ~/nearline.log

On the other hand, please run nlget and nlpurge commands in the following manner, so that if necessary you can cancel them from the command line with Ctrl-C:

{ { echo "----------" ; echo "Date and time: $(date)"; echo "Working directory: $(pwd)"; set -x; nearline_command <nearline_cmd_arguments> ; set +x ; }  2>&1 ; } | tee -a ~/nearline.log

The semicolons and curly braces in the above commands are important. In the second command, nearline_command should be replaced with nlget or nlpurge as desired, and nearline_cmd_arguments with the appropriate compulsory and optional arguments.

The effect of all this syntax is to capture the following information about each wrapped nearline command:

  • Date and time of execution / submission
  • Your working directory when you issued the command
  • The text of the command itself, including arguments
  • The nearline job ID

These will be recorded in a file called nearline.log in your home directory. With this information in nearline.log, you will be able to match the job ID (shown by nljobstatus) to a specific command.
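If typing the full wrapper each time is error-prone, the same pattern could be kept in a small shell function. The sketch below is one possible way to do that; the function name nl_logged and the NEARLINE_LOG variable are hypothetical conveniences, not part of the nearline module.

```shell
# Hypothetical helper: run any command with the logging pattern shown above.
# The log file defaults to ~/nearline.log; NEARLINE_LOG (an assumed variable)
# can point it elsewhere.
nl_logged() {
    {
        {
            echo "----------"
            echo "Date and time: $(date)"
            echo "Working directory: $(pwd)"
            set -x      # echo the command and its arguments into the log
            "$@"
            set +x
        } 2>&1
    } | tee -a "${NEARLINE_LOG:-$HOME/nearline.log}"
}
```

It would be called as, for example, nl_logged nlput <nlput_arguments> --nowait. Note that the function's exit status is that of tee, not of the wrapped command.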

View files

With the following command, you can print the list of files and directories within the specified nearline directory:

nlls /nesi/nearline/<projectID>

or, for example:

nlls /nesi/nearline/<projectID>/path/to/results/

Furthermore, you can use the additional option -l to get the detailed list including mode, owner, group, filesize, and timestamp. The option -ls, an alternative to -l, will additionally show each file's migration status.

$ nlls -ls /nesi/nearline/<projectID>/results/
mode        s  owner               group      filesize    timestamp    filename
___________________________________________________________________________________________________________________________
-rw-rw----+ r  userName        nesi12345      33.93 MB       Jun 17    file1.tar.gz
-rw-rw----+ r  userName        nesi12345      33.93 MB       Jun 17    file2.tar.gz
-rw-rw----+ r  userName        nesi12345      34.03 MB       Jun 17    file3.tar.gz

Status ("s" column of the -ls output) legend:

  • migrated (m) - data of a specific nearline file is on tape (does not necessarily mean that the file is replicated across sites)
  • pre-migrated (p) - data of a specific nearline file is on both the staging filesystem and the tape.
  • resident (r) - data of a specific nearline file is only on the staging filesystem.

Warning

The option -ls shows only files, no directories.
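The status column can be filtered with standard text tools, for instance to see which files are already on tape. The sketch below assumes the column layout of the sample output above (mode first, status second, file name last) and file names without spaces; the function name files_with_status is hypothetical.

```shell
# Hypothetical helper: read "nlls -ls" output on standard input and print the
# names of files whose status letter (m, p or r) matches the first argument.
files_with_status() {
    # File rows start with a mode string such as -rw-rw----+, which skips the
    # header line and the underscore rule.
    awk -v want="$1" '$1 ~ /^-/ && $2 == want { print $NF }'
}
```

It would be used as nlls -ls /nesi/nearline/<projectID>/results/ | files_with_status m to list the files whose data is on tape.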

Put

Data can be copied to nearline using the nlput command. The syntax is:

nlput <projectID> { <src_dir> | <file_list> }

The source directory or file list needs to be located under /nesi/project/ or /nesi/nobackup/ and specified as such.

Note

The following will not work:

cd /nesi/project/nesi12345
nlput nesi12345 some_directory

It is necessary to do this instead:

nlput nesi12345 /nesi/project/nesi12345/some_directory

The data will be mapped into the same directory structure under /nesi/nearline/ (see below).

The recommended file size to archive is between 1 GB and 1 TB.

Warning

nlput takes only a directory or a file list. A single file is treated as a file list and read line by line, searching for valid file names. Single files can only be migrated using a file list containing the full path of the file to be transferred.

Files and directories are checked for existence and only new files are transferred to nearline. Files already on nearline will not be updated to reflect newer source files. Thus, files that already exist on nearline (either tape or staging disk) will be skipped in the migration process without notification.

Put - directory

All files and subdirectories within a specified directory will be transferred into nearline. The target location mirrors the source location. As an example:

nlput nesi12345 /nesi/nobackup/nesi12345/To/Archive/Results/

will copy all data within the Results directory into /nesi/nearline/nesi12345/To/Archive/Results/.

Warning

If you put /nesi/project/nesi12345/To/Archive/Results/ on nearline as well as /nesi/nobackup/nesi12345/To/Archive/Results/, the contents of both source locations (project and nobackup) will be merged into /nesi/nearline/nesi12345/To/Archive/Results/. Within /nesi/nearline/nesi12345/, files with the same name and path will be skipped.
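The mapping described above can be written down as a small path calculation, which makes it easy to spot when two sources would land in the same place. This is a sketch of the rule as stated in this article, not a call into the nearline client; the function name nearline_target is hypothetical.

```shell
# Hypothetical helper: print the nearline path that a source path maps to.
# Only absolute paths under /nesi/project/ or /nesi/nobackup/ are accepted.
nearline_target() {
    case $1 in
        /nesi/project/*)  printf '/nesi/nearline/%s\n' "${1#/nesi/project/}" ;;
        /nesi/nobackup/*) printf '/nesi/nearline/%s\n' "${1#/nesi/nobackup/}" ;;
        *) echo "error: path must start with /nesi/project/ or /nesi/nobackup/" >&2
           return 1 ;;
    esac
}
```

For both /nesi/project/nesi12345/To/Archive/Results/ and /nesi/nobackup/nesi12345/To/Archive/Results/ the function prints /nesi/nearline/nesi12345/To/Archive/Results/, which is exactly the merge case the warning describes.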

Put - file list

Warning

The file list must be located within /nesi/project or /nesi/nobackup. Any other location will cause obscure errors and failures.

The file_list is a file containing a list of files to be transferred. It can specify only one file per line and directories are ignored.

The target location again mirrors the source location (see above).
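A file list in this format can be generated with find. The sketch below is one way to do it, assuming the source directory is given as an absolute path; the function name build_file_list is hypothetical.

```shell
# Hypothetical helper: write the full path of every regular file under
# <src_dir> to <list_file>, one per line.
# Usage: build_file_list <src_dir> <list_file>
build_file_list() {
    src_dir=${1%/}
    list_file=$2
    case $src_dir in
        /*) ;;
        *) echo "error: give the source directory as an absolute path" >&2
           return 1 ;;
    esac
    # -type f keeps regular files only, since directories in a list are ignored.
    # The list itself is excluded in case it is stored inside <src_dir>.
    find "$src_dir" -type f ! -path "$list_file" | sort > "$list_file"
}
```

Remember that the list file itself has to be saved somewhere within /nesi/project or /nesi/nobackup before it is passed to nlput.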

Update

As a good practice:

  • Migrate only large files (SquashFS archives, tarballs, or files that are individually large), or directories containing exclusively large files.
  • Do not try to modify a file in the source (nobackup or project) directory once there is a copy of it on nearline.

If you need to update data on the nearline file system with a newer version of data from nobackup or project:

  1. Compare the contents of the source directory (on /nesi/project or /nesi/nobackup) and the target directory (on /nesi/nearline). To look at one directory on /nesi/nearline at a time, use nlls; if you need to compare a large number of files across a range of directories, or for more thorough verification (e.g. checksums), read this article or contact our support team.
  2. Once you know which files you need to update (i.e. only files whose nearline version is out of date), remove the old files on nearline using nlpurge.
  3. Copy the updated files to the nearline file system using nlput.
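For the checksum-based verification mentioned in step 1, one possible approach is to record a checksum manifest of the source directory before uploading and to check it against a retrieved copy later. The sketch below assumes GNU coreutils (sha256sum) is available; the function names make_manifest and check_manifest are hypothetical, and this is not the procedure of the article linked above.

```shell
# Hypothetical helper: record SHA-256 sums of all files under <dir>, using
# paths relative to <dir>. Store the manifest outside <dir>.
# Usage: make_manifest <dir> <manifest_file>
make_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 ) > "$2"
}

# Hypothetical helper: succeed only if every file listed in the manifest
# exists under <dir> and still has the recorded checksum.
# Usage: check_manifest <dir> <manifest_file>
check_manifest() {
    # Make the manifest path absolute, because the check runs inside <dir>.
    manifest=$(cd "$(dirname "$2")" && pwd)/$(basename "$2")
    ( cd "$1" && sha256sum -c --quiet "$manifest" )
}
```

Running check_manifest on a directory retrieved from nearline reports any file that is missing or differs from the version recorded before upload.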

Get

Data can be retrieved from nearline using the nlget command. The syntax is:

nlget <projectID> { <src_dir> | <file_list> } <dest_dir> [ --nowait ]

Similar to nlput (see above), nlget accepts either a directory src_dir (single files on nearline are not accepted) or a file list file_list, defining the source of the data to be retrieved from nearline.

Warning

The file list must be located within /nesi/project or /nesi/nobackup. Any other location will cause obscure errors and failures.

The destination dest_dir needs to be defined. The whole directory structure after /nesi/nearline/ will be created at the destination and the specified data written into it. For example,

nlget nesi00000 /nesi/nearline/nesi00000/dir/to/results/ /nesi/nobackup/

will create the directory structure /nesi/nobackup/nesi00000/dir/to/results/ if that directory structure does not already exist, and copy the data within the results directory into it.

Files already existing in the destination directory will not be overwritten. A copy of the file will, however, remain on nearline until purged.
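Because the full path after /nesi/nearline/ is recreated under the destination, it can help to work out where the data will land before running nlget, for example to check for existing files that would not be overwritten. The sketch below only applies the rule stated above to path strings; the function name nlget_landing_dir is hypothetical.

```shell
# Hypothetical helper: print the directory that nlget would populate for a
# given nearline source directory and destination directory.
# Usage: nlget_landing_dir <nearline_src_dir> <dest_dir>
nlget_landing_dir() {
    case $1 in
        /nesi/nearline/*) ;;
        *) echo "error: source must start with /nesi/nearline/" >&2
           return 1 ;;
    esac
    # Drop a trailing slash from the destination, then append everything
    # after /nesi/nearline/ (project ID included).
    printf '%s/%s\n' "${2%/}" "${1#/nesi/nearline/}"
}
```

For the example above, nlget_landing_dir /nesi/nearline/nesi00000/dir/to/results/ /nesi/nobackup/ prints /nesi/nobackup/nesi00000/dir/to/results/.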

Warning

nlget takes only one directory or one file list. Single files are treated as a file list and read line by line, searching for valid file names. A single file can only be retrieved using a file list specifying the full path of the file to be retrieved.

Purge

The nlpurge command permanently deletes the specified data from the nearline file system. The syntax is:

nlpurge <projectID> { <src_dir> | <file_list> }

A directory src_dir (no single files accepted) or a file list file_list needs to be specified (see nlput above).

Warning

The file list must be located within /nesi/project or /nesi/nobackup. Any other location will cause obscure errors and failures.

View job status

The tool nljobstatus provides current status of submitted (queued, running and completed) tasks. The syntax is:

nljobstatus [ -j <jobid> ]

If no job ID is specified the full list of submitted jobs is returned. In this list, each job looks like the following:


$ nljobstatus
+----------+------------+----------------------------+-----------+-------------+
|  Jobid   | Project ID |         Job Status         | Job Host  |  Job User   |
+----------+------------+----------------------------+-----------+-------------+
| 4e23f517 |     13     |   job done successfully    | librarian | userName    |
| -dfef-40 |            |                            |           |             |
| e9-a83c- |            |                            |           |             |
| 3da78b06 |            |                            |           |             |
|   0310   |            |                            |           |             |
+----------+------------+----------------------------+-----------+-------------+
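In this listing the job ID is wrapped over several rows of the first column, so it cannot be copied directly. The sketch below reassembles the IDs with awk, assuming the table layout shown above (a row with a non-empty Project ID column starts a new job); the function name join_job_ids is hypothetical.

```shell
# Hypothetical helper: read the nljobstatus table on standard input and print
# one line per job: the full job ID, a tab, and the job status.
join_job_ids() {
    awk -F'|' '
        function emit() { if (id != "") print id "\t" status; id = ""; status = "" }
        /^\+/ { emit(); next }                      # border line ends a job
        {
            frag = $2; gsub(/[ \t]/, "", frag)      # piece of the job ID
            proj = $3; gsub(/[ \t]/, "", proj)      # project ID, first row only
            st = $4;   gsub(/^[ \t]+|[ \t]+$/, "", st)
            if (frag == "Jobid") next               # header row
            if (proj != "") { emit(); status = st } # first row of a new job
            id = id frag
        }
        END { emit() }
    '
}
```

Used as nljobstatus | join_job_ids, this would print the complete ID in a form that can be copied.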

With the -j flag and a job identifier jobid, information for a specific job can be listed:

$ nljobstatus -j 4e23f517-dfef-40e9-a83c-3da78b060310
+--------------------------------------+
|                Jobid                 |
+--------------------------------------+
| 4e23f517-dfef-40e9-a83c-3da78b060310 |
+--------------------------------------+
+------------+-----------------------+-----------+-------------+
| Project ID |      Job Status       | Job Host  |  Job User   |
+------------+-----------------------+-----------+-------------+
|     13     | job done successfully | librarian | userName    |
+------------+-----------------------+-----------+-------------+
+---------------------+---------------------+---------------------+
|   Job Start Time    |   Job Update Time   |    Job End Time     |
+---------------------+---------------------+---------------------+
| 2019-09-13T03:11:22 | 2019-09-13T03:11:44 | 2019-09-13T03:11:45 |
+---------------------+---------------------+---------------------+

If an nlput or nlpurge is running in that project, the project is locked until the task is finished.

If a job stays in one state for an unexpectedly long time, please contact NeSI Support.
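To avoid checking by hand, a job can be polled until it reports success. The sketch below relies only on the "job done successfully" status text shown above; other status strings are not documented in this article, so the loop simply gives up after a fixed number of attempts rather than trying to detect failure. The function name wait_for_job and its defaults are hypothetical.

```shell
# Hypothetical helper: poll nljobstatus until the job reports success.
# Usage: wait_for_job <jobid> [interval_seconds] [max_attempts]
# Returns 0 on success, 1 if the job has not succeeded after max_attempts.
wait_for_job() {
    jobid=$1
    interval=${2:-60}
    attempts=${3:-60}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if nljobstatus -j "$jobid" | grep -q "job done successfully"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```

A non-zero return only means that success was not seen in time; the job status should then be inspected manually with nljobstatus -j <jobid>.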

View quota

With the command nlquotalist, the usage and limits of a nearline project quota can be listed:

nlquotalist <projectID>

The output looks like:

$ nlquotalist nesi12345
Projectname                                       Available           Used                Inodes         IUsed
___________________________________________________________________________________________________________________________
nesi12345                                         30.00 TB            27.16 TB            1000000        412

This quota is different from the project quota on GPFS (/nesi/project/<projectID>).
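The quota output can be turned into percentages with awk. The sketch below assumes the column layout of the sample above and that the "Available" column is the quota limit rather than the remaining space, which this article does not state explicitly; the function name quota_percent is hypothetical.

```shell
# Hypothetical helper: read nlquotalist output on standard input and print
# space and inode usage as percentages of the quota.
quota_percent() {
    awk '
        # Skip the header, the underscore rule, and anything that is not a
        # full data row.
        $1 == "Projectname" || /^_+$/ || NF < 7 { next }
        # Avoid dividing by zero if a limit is reported as 0.
        $2 + 0 == 0 || $6 + 0 == 0 { next }
        {
            if ($3 != $5) {
                # Units differ (e.g. GB vs TB); do not guess a conversion.
                print $1 ": units differ (" $3 " vs " $5 ")"
                next
            }
            printf "%s: %.1f%% of space, %.2f%% of inodes\n", $1, 100 * $4 / $2, 100 * $7 / $6
        }
    '
}
```

For the sample output above, nlquotalist nesi12345 | quota_percent would print nesi12345: 90.5% of space, 0.04% of inodes.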

Data management

If you have the same directory structure in both your project and nobackup directories, be careful when archiving data from both: they will be merged in the nearline file system. Further, when retrieving data from nearline, keep in mind that the whole directory structure starting from your project ID will be recreated in the destination:

[Figure: librarian_get_put.jpeg]

Underlying mechanism

The nearline file system consists of two parts: disk, mainly for buffering data, and the tape library. The service itself consists of a client running on the login or compute node and a backend running on the nearline file system. Note that even if you cancel a client process, the corresponding backend process remains scheduled or keeps running until it has finished.

Which data is written to tape, and when, is decided automatically and is not something you can control. The service is designed to optimise interaction with the nearline filesystem and avoid problem workloads for the benefit of all users.

If your files are on tape, it will take time to retrieve them. Access to tape readers is on a first come first served basis, and the amount of wait time will vary dramatically depending on overall usage. We cannot guarantee access to your files within any particular timeframe, and indeed wait times could be hours or even in some cases more than a day.

Support contact

Please contact our support team with any queries or concerns you may have regarding this service. We welcome feedback from our users.
