Applying a checksum function to a file will return its checksum. Checksum functions are a type of hash function, and will always return the same hash for any particular file contents, making them useful for file validation. There are many different checksum functions. The most commonly used are listed in the table below.
Checksums can be used to check for minor errors that may have been introduced into a dataset. For example:
- After downloading a file (compare your generated checksum with the checksum provided by the vendor).
- When copying a file onto the cluster (generate a checksum on your local machine and another on the cluster).
- Verifying your results/workflow. (making a checksum of a results file can be a quick way to confirm nothing has changed).
- Corroborate files when working in a team.
While not necessary to do in every case, every time, file integrity should be one of the first things you check when troubleshooting.
The file '
corrupt.bin' has had 1 byte changed, yet on inspection would appear identical.
-rw-rw-r-- 1 393315 copy.bin -rw-rw-r-- 1 393315 corrupt.bin -rw-rw-r-- 1 393315 original.bin
By using a MD5 checksum (
md5sum *) we can see that '
corrupt.bin' has diverged from the original, while '
copy.bin' has not.
002c33835b3921d92d8074f3b392ef65 copy.bin ef749eb4110c2a3b3c747390095d0b76 corrupt.bin 002c33835b3921d92d8074f3b392ef65 original.bin
Note that filename, path, permissions or any other metadata does not affect the checksum.
Checksum functions are designed so that similar files will not produce similar hashes.
You will only need to compare a few characters of the string to confirm validity.
The checksum for file 'filename.txt' can be found with the following commands.