To extract a random subsample of reads from a BAM file, you can use the samtools view command with the -s parameter.
The tricky part is setting the random seed: the integer part of the parameter value is the seed, and the fractional part is the fraction of reads to keep. Say you want 1% of the reads in your sample and a seed of 42. Then your command should look like this:
samtools view -s 42.01 -b accepted_hits.bam > sample.bam
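To make the SEED.FRACTION encoding explicit, here is a minimal sketch of a shell helper that builds the -s value from its two parts. The function name make_s_arg is hypothetical (it is not part of samtools), and it assumes the fraction is given as a decimal below 1.

```shell
#!/bin/sh
# Hypothetical helper, not part of samtools: join an integer seed and a
# fraction (0 < fraction < 1) into the SEED.FRACTION form used by -s.
make_s_arg() {
    seed=$1
    fraction=$2
    # Strip one leading "0" so that 0.01 becomes .01, then append to the seed.
    printf '%s%s\n' "$seed" "${fraction#0}"
}

make_s_arg 42 0.01   # prints 42.01

# Assumed usage, with the same file names as in the post:
# samtools view -s "$(make_s_arg 42 0.01)" -b accepted_hits.bam > sample.bam
```

Keeping the seed and the fraction as separate arguments makes it harder to confuse, say, 42.1 (10% of reads) with 42.01 (1% of reads).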
This syntax is a bit obscure, but there is an alternative: the DownsampleSam program from the Picard package. There you set the random seed explicitly with the R option, and the probability of keeping a read with P:
java -jar ~/tools/picard-tools-1.70/DownsampleSam.jar I=accepted_hits.bam P=0.01 R=42 O=sample.bam
Have fun! :)
7 comments:
I tried using the -s option and got an error saying that -s is an invalid option. Is it specific to a particular version of samtools?
Thanks
Yes, version is important.
I just found out that the -s option is not available in version 0.1.16.
Also, the option is cryptic,
so I had to write it as
-s -1.5 (to get 50% of reads)
-s -1.7 (to get 70% of reads)
1 is the seed, and .5/.7 is the fraction of reads to keep when subsampling.
Yep, very cryptic indeed. In my examples I used 0.18
I know this is an old thread, but I was wondering if you have multiple samples should you use the same seed or different seeds? Does it matter?
Thanks!
If there are multiple BAM files and only a single sample is generated from each one, then the same seed can be used. However, if multiple samples are generated from a single BAM file, then a different seed should be used for each one; otherwise the samples will be identical.
That's what I thought, thanks!
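For the case discussed above, several subsamples drawn from one BAM file, a rough sketch of a loop over seeds might look like this. It only prints the commands (a dry run), so it runs without samtools installed; the function name print_subsample_cmds, the output file names and the seed list are all placeholders.

```shell
#!/bin/sh
# Dry run: print one samtools command per seed instead of executing it.
# print_subsample_cmds is a hypothetical helper; file names are placeholders.
print_subsample_cmds() {
    bam=$1
    fraction=$2      # written without the leading zero, e.g. .01 for 1%
    shift 2
    # Remaining arguments are the seeds, one output file per seed.
    for seed in "$@"; do
        printf 'samtools view -s %s%s -b %s > sample_seed%s.bam\n' \
            "$seed" "$fraction" "$bam" "$seed"
    done
}

print_subsample_cmds accepted_hits.bam .01 1 2 3
```

Each printed line uses a different integer part, so each subsample is drawn with its own seed. Piping the output to sh would execute the commands, assuming samtools is on the PATH.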