BLAST Databases

We download the standard NCBI databases quarterly and create a corresponding environment module named BLASTDB/<yyyy-mm>, which sets the BLASTDB environment variable accordingly. To use one of these databases, find out what our most recent version is (module avail BLASTDB) and then load it in your batch script.

module load BLASTDB

Because we keep only a few recent versions of the databases, you may occasionally need to change the BLASTDB module version when you reuse old job submission scripts as templates for new ones.
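
As a rough sketch of how a script could pick up the newest version without editing, the function below selects the latest BLASTDB/<yyyy-mm> name from a list of module names. The name latest_blastdb is hypothetical. The usage shown in the comment assumes that your module command supports terse output (module -t avail) and writes it to stderr, which is common but should be checked on your system.

```shell
# Hypothetical helper: read module names on stdin, one per line, and print
# the most recent BLASTDB/<yyyy-mm> entry. Because the version is year-month,
# a plain lexical sort is also chronological order.
latest_blastdb() {
    grep -oE '^BLASTDB/[0-9]{4}-[0-9]{2}' | sort | tail -n 1
}

# Possible use in a batch script (assumes terse output goes to stderr):
#   module load "$(module -t avail BLASTDB 2>&1 | latest_blastdb)"
```

Loading an explicit version remains the safer choice when results need to be reproducible.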


Example script

This script starts by copying the BLAST database being searched into $TMPDIR, a per-job temporary directory. Because compute nodes do not have local disks, $TMPDIR is held in memory, so the memory requested by the job must allow for the size of the database. This approach is thought to be the best one when you have a large amount of query sequence to get through, as it avoids re-reading the database from disk over and over. If you are searching a small database (less than 10 GB), or searching with so little query sequence that only one pass through the database is required, this extra complexity is unnecessary.
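
To judge how much memory to allow for the in-memory copy, you can total the size of the files that would be copied. The following is a minimal sketch rather than part of the example script: blastdb_size_gib is a hypothetical helper, and it assumes GNU du (for --apparent-size), as found on most Linux clusters.

```shell
# Hypothetical helper: print the combined size, in GiB rounded up, of the
# files for database $2 (plus the taxdb files) found in directory $1.
# Assumes GNU du; patterns that match nothing are silently ignored.
blastdb_size_gib() {
    local dir=$1 db=$2
    local bytes
    bytes=$(du -c --apparent-size -B1 "$dir"/"$db"* "$dir"/taxdb* 2>/dev/null \
            | tail -n 1 | cut -f1)
    # Round up to a whole number of GiB (1 GiB = 1073741824 bytes).
    echo $(( (bytes + 1073741823) / 1073741824 ))
}

# Possible use after loading a BLASTDB module:
#   blastdb_size_gib "$BLASTDB" nt
```

The figure printed is only the space taken by the copy in $TMPDIR. The memory BLAST itself needs comes on top of it.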

#!/bin/bash -e

#SBATCH --job-name       BLAST
#SBATCH --time           02:30:00      # Allow 100 CPU hrs / GB of blastn query seq
#SBATCH --hint           nomultithread # Unless trying 72 threads
#SBATCH --ntasks         1
#SBATCH --cpus-per-task  36            # 1 whole node.
#SBATCH --mem            105G          # 1 whole node. Allow for whole database.
##SBATCH --mem           200G          # For large databases (nr, refseq_genomic)

module load BLAST/2.9.0-gimkl-2018b
module load BLASTDB/2020-01

# This script takes one argument, the FASTA file of query sequences.
QUERIES=$1
FORMAT="6 qseqid qstart qend qseq sseqid sgi sacc sstart send staxids sscinames stitle length evalue bitscore"
BLASTOPTS="-evalue 0.05 -max_target_seqs 10"
BLASTAPP=blastn
DB=nt
#BLASTAPP=blastx
#DB=nr

# Keep the database in RAM if searching multiple
# query sequences against a database of over 10GB.
cp $BLASTDB/{$DB,taxdb}* $TMPDIR/
export BLASTDB=$TMPDIR

# Single node multithreaded BLAST.
srun $BLASTAPP $BLASTOPTS -db $DB -query $QUERIES -outfmt "$FORMAT" \
    -out $QUERIES.$DB.$BLASTAPP -num_threads $SLURM_CPUS_PER_TASK
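
The comment on the --time line gives a rule of thumb of about 100 CPU hours per GB of blastn query sequence. As an illustration only, the hypothetical function below turns that rule into a wall-clock estimate. It assumes the work divides evenly across the threads, which real runs will not quite achieve, so treat the result as a lower bound.

```shell
# Hypothetical helper: estimate wall-clock minutes for a blastn job.
#   $1 = size of the query sequence in bytes
#   $2 = number of CPUs (threads) the job will use
# Assumes perfect scaling across threads, so this is a lower bound.
blastn_minutes() {
    local query_bytes=$1 cpus=$2
    # 100 CPU hours per GB = 6000 CPU minutes per 10^9 bytes, rounded up.
    local cpu_minutes=$(( (query_bytes * 6000 + 999999999) / 1000000000 ))
    # Share the CPU minutes across the threads, rounding up.
    echo $(( (cpu_minutes + cpus - 1) / cpus ))
}

# Possible use, taking the FASTA file size as an approximation of the
# amount of query sequence:
#   blastn_minutes "$(wc -c < queries.fasta)" 36
```

A time limit given on the sbatch command line overrides the one in the script, so an estimate with some added margin could be passed as, for example, sbatch --time=200 blast.sl queries.fasta, where blast.sl is a hypothetical file name for the script above and 200 is a number of minutes.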


Labels: mahuika tier1 biology app