thoughts about ...: 2013

Tuesday, August 6, 2013

Trimming FastQ reads with FastX : Sanger quality

Well, this is something simple and stupid.

We just want to trim 20 nucleotides from the right end of the read. There is a FastX toolkit designed for this kind of tasks. Running a very simple command:



fastx_trimmer  -t 20 -i reads.fq -o reads.trimmed.fq

But suddenly this error occurs:



fastx_trimmer: Invalid quality score value (char ',' ord 44 quality value -20) on line 20

Whatta?? Ok, the problem is with the quality encoding. There is a magic parameter -Q which allows to set the offset for quality values and by some reason it is not described in the documentation of FastX.

Here is a working command:



fastx_trimmer -Q33 -t 20 -i reads.fq -o reads.trimmed.fq

Thanks to my supervisor and this thread for help.

Saturday, July 27, 2013

BOSC-2013: my personal summary

ISMB/ECCB 2013 is over, and now I have some time to wrap up my thoughts and impressions.

Actually I would like to tell about one of the SIGs that was preceding ISMB: the Bioinformatics Open Source Conference or simply BOSC 2013. This conference is organized by the Open Bionformatics Foundation (guys responsible for Biopython, BioJava and other awesome projects) and devoted to all aspects of developing open source software for bioinformatics research. Starting from this year the scope of the conference was extended from open software to open science in general.

This was the second time I attended BOSC and once again I enjoyed greatly the people and the talks. Although BOSC is not the biggest SIG, it has this awesome atmosphere of a community of enthusiasts. For me it is an event where I can meet and talk to people who share my views on how the bioinformatics research should be conducted. A lot of things are being discussed: new open source tools and pipelines, visualization methods along with best practices to perform bioinformatics analysis and balance between being a software engineer and a researcher producing scientific results.

All talks were great, I highly recommend to check the presentation slides. Soon there will be video also. I would like to highlight several talks that I found especially interesting for myself.

Jug: Python-library that allows using decorators to easily launch multiple instances of your pipeline.

Ten simple rules of the open development of scientific software: based on a manuscript with the same name, which I missed somehow. This paper is a part of bigger PLOS "Ten simple rules of ..." collection.

DGE-Vis: an interesting interactive way to visualize RNA-seq expression studies

Unipro UGENE: a sequence analysis toolkit with sexy GUI. This is the project I've been involvled with for a long time, so I was very happy that it was presented at BOSC.

bcbbio: a community developed pipeline for SNP calling. Even if you are not interested in SNP discovery, this post by Brad Chapman is still worth looking at, because it contains many nice ideas on how to develop a reliable bioiformatics pipeline.

This time I was also giving a talk at BOSC. It was about Qualimap, a tool for next generation sequencing alignment data quality assesment. I have been working on it since I started my PhD at MPIIB in October 2011. We published the paper almost a year ago and since then Qualimap has become more mature and gathered a small community of users. We continue to maintain the tool and improve it based on the feedback from users. As I mentioned during the conference we are soon moving to a bitbucket repo. Subscribe to the Google group for updates.

Btw, one of the speakers was Sean Eddy. During his talk I felt somewhat similar to being at the concert of a favorite rock band \m/ I wish I had my copy of Sequence Analysis book with me to get an autograph :) One cool idea from his talk: sometimes reinventing the wheel and developing your own home-brewed set of tools for common tasks is actually normal. This activity should be considered as an exercise which allows to better understand things that you work on. After having enough time spent playing with the concept, it is still recommended to switch to the best instruments available ;)

The conference also included several panel discussions. An important discussion was devoted to the funding of open source scientific software development. It is commonly considered that main result of scientific research should be a manuscript. However for a computational biologist most often the software is true result of the research, while the paper is only "an advertisiment". Unfortunately the software usually is not considered by grant committees or lab bosses as a relevant result. The discussion demonstrated that bioinformatics community slowly moves away from this misconception, however a lot of work has to be done yet.

So, to sum up: BOSC is awesome, don't miss it next year in Boston.

Tuesday, July 2, 2013

Stupid penguin strikes back: Ubuntu does not boot after installing proprietary ATI video driver

Well, in my opinion Linux will never be a user-friendly operating system suitable for home usage. There are a lot issues but there is one particular problem I hate at most: drivers for new hardware. You can only be safe when installing Linux on at least 1 year old computer. But if you have a brand new laptop, get ready.

There will be blood.

Ok, maybe I'm a bit angry, but I just lost 2 hours by fixing some stupid stuff.

Recently I got a new laptop. There was a bunch of problems after installing Ubuntu 12.04 along with Windows 8 (btw, the installation itself was a bit of a quest): overheating, reduced battery time, system suspension failures. What else? Ah, yes, after installing native ATI video drivers the system did not boot. There were no error message, it just hanged.

Here's how I fixed it. Maybe it will be of help to other "hackers" out there.

***

The goal of the rescue operation is to remove the new driver and restore the default video-driver.

First, load in the recovery mode.
Select item "root" console. Btw, you can try loading in fail safe mode, but in my case it did not work.
Remount the file system in read/write mode:



mount -o remount,rw /

Then remove the video-driver:



apt-get purge fglrx*

Remove X11 configuration file:



mv /etc/X11/xorg.conf /etc/X11/xorg.conf.broken

Reboot and pray that it worked ;)

Help was found here and here.

Dismissed!

Tuesday, June 11, 2013

Awk: conditions

Just a small working example to remember:



awk '{ if ($3 == "20" || $3 == "17") print $0}' Homo_sapiens.GRCh37.68.simple_names.refGene | less

Friday, May 10, 2013

RNA-seq: protocol strand specificity

Wanted to write a detailed post about the topic, but found something nice on the web:

http://onetipperday.blogspot.de/2012/07/how-to-tell-which-library-type-to-use.html

Sometimes I feel like everything what we think of or do, was already thought through or done :)

Tuesday, April 9, 2013

Exome coverage and per chromosome coverage

This post is actually inspired by the discussion on the Qualimap google groups.

Assume we need to calculate mean coverage inside of the exons or any other arbitrary regions defined by an annotation file. In this post I will present 3 approaches to perform this task. The impatient can check my favourite (and hopefully the easiest) method.

After googling a little bit on the given problem it's easy to find that one can do coverage calculation using Samtools mpileup or coverageBed. The third option that I will show here will be the Qualimap tool.

So, here we go.

Using coverageBed

Given file test.bam and regions defined in file test.bed one can calculate coverage using the following command:



coverageBed -abam test.bam -b test.bed -d | awk '{c+=$8;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

What's going on here? coverageBed outputs coverage at each position of each region intersecting with the BAM file; output is piped to a simple awk script which will accumulate the total length of regions and the number of mapped bases (column 8); finally awk will output mean coverage.

Unfortunately, there is a minor problem in this script. Since some of the regions could intersect, we count intersecting bases twice. It does not matter if you are interested in absolute coverage of your regions, but this could bias the mean coverage.

Using samtools mpileup

This option is quite popular. Here is the command:



samtools mpileup -l test.bed -A -Q 0 test.bam | awk 'BEGIN{regionLen=317355}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'

Mpileup outputs coverage at each position of the reference intersected with region which has non-zero coverage. To compute the mean coverage one has to calculate the "efficient" length of the regions: the length including non covered bases and counting the intersections only once. In this script the efficient length is the magic number 317355.

More things to keep in mind: since mpileup is originally designed to discover SNPs it may discard a lot of reads based on the minimum coverage, read quality and other parameters. So one has to turn off the turnings by using appropriate command line arguments: -A, -Q0, -d etc. Refer to mpileup documentation for more details.

Using Qualimap:

Qualimap calculates coverage similar to mpileup, which means it calculates coverage at each position of the reference intersected with a region. Moreover it calculates the "efficient length" of the regions (taking into account non covered bases and intersection of regions) when computing mean coverage. Some other advantages of using Qualimap: it can work both with BED and GFF files and output coverage of "outside" regions. Finally, Qualimap also reports per chromosome coverage. Ok, enough of this advertising :) Here's your command:



qualimap bamqc -bam test.bam -gff test.gff

So, what's your preferable way of calculating coverage?

See ya later, folks :)

Monday, March 11, 2013

Random subsample from a BAM file

If you want to extract a random subsample of reads from a BAM file it is possible to use samtools view command with parameter -s.

The tricky part is to set the random seed: it is supposed to be the integer part of the provided parameter value. So, let's say you would like to have 1% of reads in your sample and the seed number must be equal to 42. Then your command should look like this:



samtools view -s 42.01 -b accepted_hits.bam > sample.bam

This syntax is a little bit obscure, but there is also an alternative: DownsampleSam program from Picard package. Here one can set the random seed explicitly using -R option:



java -jar ~/tools/picard-tools-1.70/DownsampleSam.jar I=accepted_hits.bam P=0.01 R=42 O=sample.bam

Have fun! :)

Thursday, January 24, 2013

Chromosome names: with "chr" prefix or without

There is one irritating problem related to lack of standardization in genome annotations: "chr" prefix inconsistency in chromosome names. For example, UCSC RefSeq annotations include this prefix, while the Ensemble annotations don't.

It's a good rule to take into account this problem when developing bioinformatics software. Unfortunately not everyone follows this guidance. In result one might need to remove the "chr" prefix or insert it in FASTA file or an alignment file to prepare correct input data.

Here is a nice post from Pierre Lindenbaum, which provides solution to the problem using sed.

And here is my small example command, which inserts "chr" prefix into FASTA file:



sed -i -e 's/^>/>chr/' hg19.fa

P.S. Thanks to my supervisor for improvement, it's better to set "^" anchor to indicate the start of the line.

Friday, January 11, 2013

Output subset of columns from file

Nice tip to remember: Unix command line tool cut allows to output not only selected columns from file, but also a subset of columns.

Example:



kokonech@ultor:~/playgrnd/scythe$ cut -f 1,2,3,4,5,6 scythe_output/fusions.txt 

#ref1 break_pos1 strand1 ref2 break_pos2 strand2

chr20 49446917 + chr17 58761341 +

kokonech@ultor:~/playgrnd/scythe$ cut -f 1-6 scythe_output/fusions.txt 

#ref1 break_pos1 strand1 ref2 break_pos2 strand2

chr20 49446917 + chr17 58761341 +

thoughts about ...