Preparing small files for migration to nearline

Migration of files from your project or nobackup directory to your nearline directory is a two-step process. In the first step, the data is copied from project or nobackup to a staging file system with a maximum capacity of 500 TB. In the second step, the data on the staging file system is moved to tape.

To reduce the burden on our tape drives and file catalogue, project teams are strongly encouraged to store only large files on nearline. Your project or nobackup directory, or any subdirectory of it, will almost certainly contain some small files and may contain a great many. This article therefore explains how to find all those small files and combine them into a few large archive files, perhaps even a single one.

Can't I just tar up the whole project (or nobackup) directory, or at least all its contents?

Yes, you certainly can do that. However, this approach is unlikely to suit you, for several reasons:

  • Without special options, creating a tarball effectively copies the contents of every file in the directory. Unless your project or nobackup directory starts out less than half full, you may well not have the disk space to create the full tarball.
  • There are options to the tar program that will cause it to delete files as it goes. It is likely, however, that you will want at least some files to remain in your online storage.
  • A few projects hold more than 500 TB of data, and a tarball of that size would be too big to copy to the staging file system. Even when the tarball fits, copying one very large file takes a long time, as does retrieving it, and any interruption to either process means starting again from scratch. The longer the transfer, the more likely an interruption and the more time lost when one occurs.

What is the recommended option, then?

We recommend that you find all the small files within a directory, then group those small files into a tarball, leaving large files to be copied to nearline individually.
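
If you would like to see what such a search would catch before archiving anything, you can preview the matching files and their combined size. Here is a minimal sketch, assuming GNU find and du and an illustrative directory path; the find options are explained in the notes later in this article:

# Preview the small files and report their combined size (last line of output)
find /nesi/project/nesi12345/my_directory -type f -and -size -100M -print0 | \
    du -ch --files0-from=- | tail -1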

You do not have to create one single tarball for all small files in /nesi/project/<project_code> or /nesi/nobackup/<project_code>, and in fact you may prefer to create tarballs pertaining to particular subdirectories. There is no harm in either approach.
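
For instance, here is a minimal sketch of the per-subdirectory approach, assuming an illustrative project code of nesi12345; the find, xargs and tar options are the same as in the main example below:

# Create one tarball of small files for each top-level subdirectory
cd /nesi/project/nesi12345 && \
for dir in */; do
    name=$(basename "${dir}")
    # Write each tarball to the parent directory, keeping archives
    # separate from the files being archived
    ( cd "${dir}" && \
      find . -type f -and -size -100M -print0 | \
      xargs -0 -I {} tar --remove-files -rvf "../${name}_small_files.tar" {} )
done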

Tip

The tarball creation process can take quite a long time. So that you can freely log out of the cluster, and to protect the process in case you're accidentally disconnected, you should create the tarball by means of a Slurm job, or else in a tmux or screen session.
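
For example, a minimal Slurm batch script might look like the following. The account code and time limit are placeholders to adjust for your project, and the job body is the command sequence given in the next section:

#!/bin/bash -e
#SBATCH --job-name=make_tarball
#SBATCH --account=nesi12345   # your project code
#SBATCH --time=2-00:00:00     # tarring can take days; request plenty of time
#SBATCH --mem=1G              # find and tar need very little memory
#SBATCH --cpus-per-task=1

# Put the tarball-creation commands from the next section here

Save the script as, say, make_tarball.sl and submit it with sbatch make_tarball.sl; the job will keep running after you log out.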

Tarball creation is very simple, and can be achieved as follows:

# Record the starting directory and the archive name, then change into
# the target directory; each && runs the next command only if the
# previous one succeeded. The find command then appends every file
# smaller than 100 MB to the archive, deleting each file once stored.
startdir=$(pwd -P) && \
tarball="archive.tar" && \
cd /nesi/project/nesi12345/my_directory && \
find . -type f -and -size -100M -print0 | xargs -0 -I {} tar --remove-files -rvf "${tarball}" {}
# Optionally, compress the archive
# (see below for notes on compression options)
bzip2 -9 "${tarball}"
# Return to where you started
cd "${startdir}"

Some notes on the above script:

  • The name of the archive is saved as a variable, $tarball, so that it is kept consistent whenever it is used.
  • While we have suggested creating the archive in situ (tarball="archive.tar") as an example, there is no reason not to use a relative or even absolute path (e.g. tarball="/path/to/archive.tar"). You can also put it where you started running the sequence of commands from: tarball="${startdir}/archive.tar".
  • We recommend changing into the directory (cd <dir>) before running the find command, so that the archive stores files as relative paths, not absolute paths. This choice will make a big difference when you come to extract the tarball. The && between commands means, "Only run the next command if this command is successful, i.e. it completes with an exit code of 0." In particular, if the cd fails, find will not run in the wrong directory.
  • The -type f option restricts the search to look for files only. Directories, symbolic links and other items will not be found. However, files within subdirectories will be found.
  • The -size -100M option restricts the search to items that are less than 100 MB. This size criterion is not the only valid option, but it likely represents a good balance between creating an overly large tar archive on the one hand, and leaving many small files to be individually copied on the other. 
  • The conjunction -and does exactly what you expect: it limits search results to items satisfying both criteria. (find also recognises the option -or, not relevant here.)
  • The option -print0 separates results with the null character, so that spaces and other special characters in file names don't get misinterpreted as record separators.
  • Piping to xargs -0 gracefully handles a list of arguments separated by null characters, however long that list is, passing them to the command given as an argument to xargs. In this case, that command is tar with its flags and arguments.
  • The option -I {} instructs xargs to replace every later instance of {} with the name of an actual result, in this case a relative path to a found file. Note that with -I, xargs runs the command once per found file rather than in larger batches.
  • We use -r instead of -c as an argument to tar. -c would create a new archive on every invocation, overwriting all previously found and tarred files, while -r appends each new file to the existing archive, which is what we want here.
  • We aren't using tar's -z or -j flags: tar cannot append to a compressed archive, so any compression has to wait until the archive is complete.
  • --remove-files will delete each found file once that file has been added to the ever-growing tarball.
  • Go and get a cup of tea or coffee: this command will take a while. In fact, it may take several days to complete.
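
Once the tarball is complete, you may wish to check its contents before sending it to nearline, and after retrieval you will need to extract it. A short sketch, with illustrative paths:

# List the archive's contents to confirm what was stored
tar -tf archive.tar | less

# After retrieving the archive, extract it into a destination directory;
# because the stored paths are relative, the files are recreated under
# the directory given to -C
mkdir -p /nesi/nobackup/nesi12345/restored
tar -xvf archive.tar -C /nesi/nobackup/nesi12345/restored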

Can I compress my archive?

Certainly, although in this case it's best to leave compression until the tar process has completed. When deciding whether to compress, and if so which compression program to use, you should consider the following:

  • Compression and decompression take time. Generally, the more effective the compression, the longer both compression and decompression will take.
  • However, an uncompressed file will take more space on tape, and both uploading to tape and retrieval from tape will take longer.

This page (off site) offers benchmarks of the popular gzip and bzip2 compression programs, present on our systems, at various levels of compression from -1 (fastest compression time) to -9 (most compressed output). We do not vouch for its accuracy or its applicability to your particular data, but you may find it useful. The general trends should be the same in any case.
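
If you want to gauge the trade-off on your own data, you can time the candidate programs on a copy of one of your files before committing to a choice. A minimal sketch, where sample.dat stands in for a representative file of your own:

# Time each candidate and record the compressed size; -c writes to
# stdout so the original sample file is left untouched
time gzip -1 -c sample.dat > sample.gz && ls -l sample.gz
time gzip -9 -c sample.dat > sample.gz && ls -l sample.gz
time bzip2 -9 -c sample.dat > sample.bz2 && ls -l sample.bz2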
