Tuesday, April 9, 2013

Exome coverage and per chromosome coverage

This post is actually inspired by the discussion on the Qualimap google groups.

Assume we need to calculate mean coverage inside of the exons or any other arbitrary regions defined by an annotation file. In this post I will present 3 approaches to perform this task. The impatient can check my favourite (and hopefully the easiest) method.

After googling a little bit on the given problem it's easy to find that one can do coverage calculation using Samtools mpileup or coverageBed. The third option that I will show here will be the Qualimap tool.

So, here we go.

Using coverageBed

Given file test.bam and regions defined in file test.bed one can calculate coverage using the following command:

coverageBed -abam test.bam -b test.bed -d | awk '{c+=$8;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

What's going on here? coverageBed outputs coverage at each position of each region intersecting with the BAM file; output is piped to a simple awk script which will accumulate the total length of regions and the number of mapped bases (column 8); finally awk will output mean coverage.

Unfortunately, there is a minor problem in this script. Since some of the regions could intersect, we count intersecting bases twice. It does not matter if you are interested in absolute coverage of your regions, but this could bias the mean coverage.

Using samtools mpileup

This option is quite popular. Here is the command:

samtools mpileup -l test.bed -A -Q 0 test.bam | awk 'BEGIN{regionLen=317355}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'

Mpileup outputs coverage at each position of the reference intersected with region which has non-zero coverage. To compute the mean coverage one has to calculate the "efficient" length of the regions: the length including non covered bases and counting the intersections only once. In this script the efficient length is the magic number 317355.

More things to keep in mind: since mpileup is originally designed to discover SNPs it may discard a lot of reads based on the minimum coverage, read quality and other parameters. So one has to turn off the turnings by using appropriate command line arguments: -A, -Q0, -d etc. Refer to mpileup documentation for more details.

Using Qualimap:

Qualimap calculates coverage similar to mpileup, which means it calculates coverage at each position of the reference intersected with a region. Moreover it calculates the "efficient length" of the regions (taking into account non covered bases and intersection of regions) when computing mean coverage. Some other advantages of using Qualimap: it can work both with BED and GFF files and output coverage of "outside" regions. Finally, Qualimap also reports per chromosome coverage. Ok, enough of this advertising :) Here's your command:

qualimap bamqc -bam test.bam -gff test.gff

So, what's your preferable way of calculating coverage?

See ya later, folks :)


medhat helmy said...

How did you computed this number 317355? Can you give more details :)

Konstantin Okonechnikov said...

Check the header of the BAM file ( samtools view -H file.bam). Length can be computed as sum of lengths provided in @SQ header lines. The attribute is called LN. Unfortunately samtools mpileup doesn't go through each position. Try using coverageBed , since option -g goes through each position of genome and length can be computed easily. I also suggest qualimap, it's much easier ;)