Monday, March 11, 2013

Random subsample from a BAM file

If you want to extract a random subsample of reads from a BAM file, you can use the samtools view command with the -s parameter.

The tricky part is setting the random seed: it is taken from the integer part of the parameter value, while the fractional part gives the fraction of reads to keep. So, let's say you would like to keep 1% of the reads and use 42 as the seed. Then your command should look like this:

samtools view -s 42.01 -b accepted_hits.bam > sample.bam
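
Since the seed and the fraction are packed into a single number, it can help to assemble the -s value from two separate variables. Below is a minimal sketch; the function name subsample_arg is made up for illustration and is not part of samtools, and it assumes the fraction is written with a decimal point and lies strictly between 0 and 1.

```shell
# Hypothetical helper (not part of samtools): build the value for
# `samtools view -s` from an integer seed and a fraction of reads.
# Assumes the fraction contains a decimal point, e.g. 0.01 or .5
subsample_arg() {
    seed=$1
    fraction=$2
    # Drop everything up to and including the decimal point,
    # so seed 42 and fraction 0.01 are joined as 42.01
    printf '%s.%s\n' "$seed" "${fraction#*.}"
}

# Intended use (requires samtools and an input BAM file):
#   samtools view -s "$(subsample_arg 42 0.01)" -b accepted_hits.bam > sample.bam
```

Keeping the seed and the fraction in separate variables makes it easier to change one without accidentally altering the other.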

This syntax is a little obscure, but there is an alternative: the DownsampleSam program from the Picard package. Here you can set the random seed explicitly using the R option:

java -jar ~/tools/picard-tools-1.70/DownsampleSam.jar I=accepted_hits.bam P=0.01 R=42 O=sample.bam


Have fun! :)

7 comments:

Sheetal said...

I tried using -s option and got an error saying -s invalid option. Is it specific to a version of samtools?
Thanks

Sheetal said...

Yes, version is important.
I just found out -s option is not available on version 0.1.16
Also option is cryptic:
so I had to write it as
-s -1.5 ( to get 50% of reads)
-s -1.7 ( to get 70% of reads)
1 is the seed and .5/.7 the percent of reads for sub sampling

Konstantin Okonechnikov said...

Yep, very cryptic indeed. In my examples I used 0.18

sj said...

I know this is an old thread, but I was wondering if you have multiple samples should you use the same seed or different seeds? Does it matter?

Thanks!

Konstantin Okonechnikov said...

If there are multiple BAM files and only a single sample is generated from each one, then the same seed can be used. However, if there is a single BAM file and multiple samples are generated from it, then different seeds should be used, otherwise the samples will be the same.

sj said...

That's what I thought, thanks!