This post is actually inspired by the discussion on the Qualimap google groups.
Assume we need to calculate mean coverage inside of the exons or any other arbitrary regions defined by an annotation file. In this post I will present 3 approaches to perform this task. The impatient can check my favourite (and hopefully the easiest) method.
After googling a little bit on the given problem it's easy to find that one can do coverage calculation using Samtools mpileup or coverageBed. The third option that I will show here will be the Qualimap tool.
So, here we go.
Using coverageBed
Given file test.bam and regions defined in file test.bed one can calculate coverage using the following command:
coverageBed -abam test.bam -b test.bed -d | awk '{c+=$8;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'
What's going on here? coverageBed outputs coverage at each position of each region intersecting with the BAM file; output is piped to a simple awk script which will accumulate the total length of regions and the number of mapped bases (column 8); finally awk will output mean coverage.
Unfortunately, there is a minor problem in this script. Since some of the regions could intersect, we count intersecting bases twice. It does not matter if you are interested in absolute coverage of your regions, but this could bias the mean coverage.
Using samtools mpileup
This option is quite popular. Here is the command:
samtools mpileup -l test.bed -A -Q 0 test.bam | awk 'BEGIN{regionLen=317355}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'
Mpileup outputs coverage at each position of the reference intersected with region which has non-zero coverage. To compute the mean coverage one has to calculate the "efficient" length of the regions: the length including non covered bases and counting the intersections only once. In this script the efficient length is the magic number 317355.
More things to keep in mind: since mpileup is originally designed to discover SNPs it may discard a lot of reads based on the minimum coverage, read quality and other parameters. So one has to turn off the turnings by using appropriate command line arguments: -A, -Q0, -d etc. Refer to mpileup documentation for more details.
Using Qualimap:
Qualimap calculates coverage similar to mpileup, which means it calculates coverage at each position of the reference intersected with a region. Moreover it calculates the "efficient length" of the regions (taking into account non covered bases and intersection of regions) when computing mean coverage. Some other advantages of using Qualimap: it can work both with BED and GFF files and output coverage of "outside" regions. Finally, Qualimap also reports per chromosome coverage. Ok, enough of this advertising :) Here's your command:
qualimap bamqc -bam test.bam -gff test.gff
So, what's your preferable way of calculating coverage?
See ya later, folks :)