Introduction
Quite often it is necessary to extract unmapped read pairs from a bam file. The samtools framework allows us to do this quite easily if the alignments are in SAM/BAM format.
Note that it doesn’t work for Colour Space reads as the fastq is generated from the nucleotide sequence and qualities and the unmapped reads will not have these attributes. A bam2fastq for Colour Space reads is required.
The process is done in two steps:
- Extracting the unmapped reads into a readname sorted BAM file
- Converting the BAM file to fastq read files.
It is also possible to skip the production of fastq files and realign the reads directly from the BAM file.
To convert BAM to FASTQ we use bam2fastq that comes in the Hydra package. There is also a bam2fastq from Hudson Alpha that could be used but again it doesn’t do Colour Space reads and if you use it’s options to extract unmapped reads it will only extract pairs where both reads are unmapped, on plus side it doesn’t require any special sorting of the Bam file.
Align your reads with Novoalign (Illumina)
# Illumina paired-end: mean=500, sd=100 novoalign -d database.nix -f reads1.fastq -reads2.fastq -oSAM -i 500 100 | samtools view -bS - alignments
Filter Alignments for unmapped pairs
Next up we create three filter categories for:
- An unmapped read whose mate is mapped.
- A mapped read who’s mate is unmapped
- Both reads of the pair are unmapped
These categories translate to the following filtering commands:
samtools view -u -f 4 -F264 alignments.bam > tmps1.bam samtools view -u -f 8 -F 260 alignments.bam > tmps2.bam samtools view -u -f 12 -F 256 alignments.bam > tmps3.bam
The above step will work on sorted or unsorted BAM files.
We then merge these temporary bam files and sort into read name order. The sort is required to get the mates into the correct order.
samtools merge -u - tmps[123].bam | samtools sort -n - unmapped
Realign Reads directly from BAM file
novoalign -F BAMPE -f unmapped.bam -d ...
or,
Extract the reads in FASTQ format
Next up is to extract the unmapped read pairs into their respective FASTQ files. We use the bamToFastq program available in the Hydra-SV package:
bamToFastq -bam unmapped.bam -fq1 unmapped_reads1.fastq -fq2 unmapped_reads2.fastq
Single End For single-end/fragment reads the process is much easier and can be done in one-line command:
samtools view -uf 4 alignments.bam | bamToFastq -bam stdin -fq1 unmapped.fastq -fq2 empty.fastq