BLAST Databases

We download the standard NCBI databases quarterly and create a corresponding environment module for each download, named BLASTDB/<yyyy-mm>, which sets the BLASTDB environment variable accordingly. To use one of these databases, find our most recent version (module avail BLASTDB) and then load it in your batch script.

module load BLASTDB

Because we only keep a few recent versions of the databases, you may occasionally need to update the BLASTDB module version when you reuse old job submission scripts as templates for new ones.
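
One way to catch a retired database version early is to check the BLASTDB variable at the top of a job script. The following is only a minimal sketch: check_blastdb is a hypothetical helper name, not something provided by the module, and it assumes only that the module sets BLASTDB to a directory path as described above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the BLASTDB module): returns non-zero
# with a hint if BLASTDB is unset or no longer points at a directory.
check_blastdb() {
    # An empty BLASTDB means no BLASTDB module was loaded.
    if [ -z "${BLASTDB:-}" ]; then
        echo "BLASTDB is not set; see 'module avail BLASTDB'" >&2
        return 1
    fi
    # A missing directory suggests that this database version was removed.
    if [ ! -d "$BLASTDB" ]; then
        echo "BLASTDB directory '$BLASTDB' does not exist" >&2
        return 1
    fi
    return 0
}
```

In a batch script this could be called straight after the module load line, for example as check_blastdb || exit 1.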


Database I/O

When given a large amount of query sequence, the BLAST search programs process it in batches, running through the whole database with each batch and then starting over with the next batch. This can cause the database to be read from disk repeatedly, which limits the speed of your search.

To avoid that potential I/O problem, the example script below includes an optional extra step which copies the BLAST database into $TMPDIR, a per-job temporary directory. Since compute nodes do not have local disks, $TMPDIR is held in memory, so the size of the database must be allowed for in the memory requested by the job.
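
To judge how much extra memory to request, it can help to add up the sizes of the files that would be copied. The sketch below is an illustration only: db_size_gb is a hypothetical helper name, and it assumes GNU du (for the -b option) and that the database files match the patterns <DB>* and taxdb* inside the BLASTDB directory, as in the copy step of the example script.

```shell
#!/bin/bash
# Hypothetical helper: prints the combined size, in GB rounded up, of the
# files matching <dir>/<db>* and <dir>/taxdb*.
# Assumes GNU du: -b reports apparent size in bytes, -c appends a grand
# total line, -L follows symbolic links.
db_size_gb() {
    local dir=$1 db=$2
    local bytes
    # The last line of "du -c" output is "<bytes><TAB>total".
    bytes=$(du -cbL "$dir"/"$db"* "$dir"/taxdb* 2>/dev/null | tail -n 1 | cut -f1)
    # Round up to whole GB (1 GB taken as 1024^3 bytes).
    echo $(( (bytes + 1073741823) / 1073741824 ))
}
```

For example, after loading a BLASTDB module, db_size_gb "$BLASTDB" nt prints a number of GB to add to the job's base memory request.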

If you are searching a very small database (less than 10 GB) or searching with a small amount of query sequence (so that only one or two passes through the database will be required) then this extra complexity will not speed up your job and is not recommended. 

If you have many CPU-days of BLAST searches to complete then it is worth considering and experimenting with this use of TMPDIR.
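
When experimenting, it may be convenient to switch the copy step on and off without editing the script each time. The sketch below is one possible way to do that, not part of the example script: stage_db and the STAGE_DB variable are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical helper: if STAGE_DB=yes, copy the database files for <db>
# (plus taxdb) from <src> into <dest> and print <dest>; otherwise leave
# the files where they are and print <src>.
stage_db() {
    local src=$1 db=$2 dest=$3
    if [ "${STAGE_DB:-no}" = yes ]; then
        # Same files as the copy step in the example script.
        cp "$src"/"$db"* "$src"/taxdb* "$dest"/ || return 1
        echo "$dest"
    else
        echo "$src"
    fi
}

# Possible use inside a job script (left commented out in this sketch):
#   BLASTDB=$(stage_db "$BLASTDB" "$DB" "$TMPDIR")
#   export BLASTDB
```

Submitting the same job once with STAGE_DB=yes and once without it then gives a direct comparison of run times. Remember that the memory request must cover the database whenever it is copied.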

Example script

#!/bin/bash -e

#SBATCH --job-name      BLAST
#SBATCH --time          02:30:00  # ~100 CPU hrs / GB blastn query vs nt
#SBATCH --ntasks        1
#SBATCH --mem           30G  # plus enough for the DB if using TMPDIR below
#SBATCH --cpus-per-task 36   # or 72 if using all of the node's memory

module load BLAST/2.10.0-GCC-9.2.0
module load BLASTDB/2020-01

# This script takes one argument, the FASTA file of query sequences.
QUERIES=$1
FORMAT="6 qseqid qstart qend qseq sseqid sgi sacc sstart send staxids sscinames stitle length evalue bitscore"
BLASTOPTS="-evalue 0.05 -max_target_seqs 10"
BLASTAPP=blastn
DB=nt
#BLASTAPP=blastx
#DB=nr

# Keep the database in RAM if searching a large amount of
# query sequence against a database of between 10GB and 75GB.
# Otherwise comment out these two lines:
cp $BLASTDB/{$DB,taxdb}* $TMPDIR/
export BLASTDB=$TMPDIR

# Single node multithreaded BLAST.
srun $BLASTAPP $BLASTOPTS -db $DB -query $QUERIES -outfmt "$FORMAT" \
    -out $QUERIES.$DB.$BLASTAPP -num_threads $SLURM_CPUS_PER_TASK


Labels: mahuika tier1 biology app