thoughts about ...: qualimap

Showing posts with label qualimap. Show all posts

Wednesday, December 9, 2015

Multi-sample quality control of RNA-seq data: interpreting the results

I am happy to report that Qualimap2 manuscript is published. Therefore, I would like to provide a small tutorial on how to interpret the quality control of RNA-seq alignment data results generated by Qualimap. There are also other frequently used applications such as RSeQC and RNA-SeQC, but they were not updated rather long. Also, there are not too many manuals, explaining how to interpret the results of the performed QC analysis. Of course, I'll be happy to have any comments, questions or suggestions about this small tutorial.

Ok, let's assume we designed an experiment to analyze and compare gene expression between two conditions and there is a number of samples for each condition. RNA-sequencing was performed for each sample. The details, about how to perform the main analysis can be found in many places (for example here). We will focus only on quality control.

The initial quality control performed with FastQC is well-known. However, a number of problematic issues can be detected only after the alignment of reads is performed and counts are computed. Most of technological biases including transcript length, coverage uneven distribution or contamination by non-required transcripts can be detected only from a processed dataset, therefore data investigation performed by Qualimap is required to confirm sufficient quality of the processed data.

The quality control process should start from the examination of RNA-seq QC report. Here's an instruction to launch the mode from GUI and example command line. The report will look like this. How to interpret the results?

The Summary statistics provides status of the sequencing process in general. The low proportion of aligned reads reported in the summary can demonstrate ligation and amplification biases. Frequently, in RNA-seq analysis the typical majority of aligned reads should lie in known exonic regions. For example, comparing the proportion of aligned reads falling to exon region in case of well-known organisms such as Mus Musculus or Homo Sapiens should be up to 90%. Smaller proportion can indicate sequencing biases or some mistakes in mapping process. The same conclusions can be derived from Junction analysis plot: known junctions should dominate novel.

Some specific issues occurring during rRNA depletion and polyA selection can lead to biases in 5' region and 3' region. For example, it is quite well-known that polyA selection can lead to high expression in 3' region. The 5'-3' bias allows to detect this event. In correct experiments it should be close to one.

The plots focused on transcriptome coverage analysis such as Coverage Histogram and Coverage profile along genes give an overview of available level of expressed genes. Importantly, low level coverage in RNA-seq data influences significantly on differential expression analysis, therefore demonstration of high and low coverage level is quite useful to detect biases and apply correct gene expression normalization.

After the quality control of the BAM file is performed, the computed counts will allow to investigate further experiment properties. Most importantly, Counts QC analysis can be performed applying the experiment design, such as biological conditions and number of samples. Here are instruction about the application in general and command line. Here's an example output.

The analysis of expression contamination allows to verify if the total number of reads in the experiment is enough to detect all expressed genes. The plot Saturation demonstrates the influence of number of sequencing reads on expression distribution. Basically, the angle of the line plot shows how the increase in number of reads controls the proportion of novel detected genes. If more reads do not lead to the growth of a number of detected genes (line angle close to zero), additional sequencing is not required. Notably, there is also a specific customizable limitation of minimum number of read counts required for a transcript to be added to the plot.

The biotype analysis of counts performed for each sample allows to detect the types of expressed features. This is specially important if RNA-seq experiment focused on long RNA or other type of RNA. Abnormal contamination can be detected from plots Biotype detection and Counts per biotype. Typically in mRNA-seq experiments protein coding proportion should dominate in counts. Additionally, the results detect suitability level of a selected sequencing protocol.

For correct normalization of counts values in gene expression analysis it is important to take into account the length and GC content of expressed transcripts. The requirements for such normalization can be checked from Length bias and GC bias plots. Cubic spline regression model is applied to detect if length and GC proportion fits gene expression. Generally, the computed coefficient of determination more than 70% along with p-value lower than 0.05 indicate effect on expression level and importance of normalization.

The computed global plots demonstrating all samples together such as Scatterplot matrix, Counts density and Counts distribution allow to detect outliers. For example, despite different biological conditions the global expression levels among analyzed samples should match sufficiently. Importantly, different biological conditions should influence the gene expression. Therefore all plots related to expression analysis normalized to a specific condition also allow to detect outliers.

So, these are some suggestions how to interpret the detailed quality control. Unfortunately there might be some other issues occurring in experiments, here I only mentioned the most frequent problems. I'll be glad to have more suggestions to update the tutorial.

Thursday, July 16, 2015

BOSC 2015: amazing as always

I just returned from ISMB/ECCB 2015 conference that was in Dublin. However, I would like to describe my favorite pre-conference event. Of course, this is BOSC. I consider BOSC as one of the best conferences due to the importance of "open" activity in any science. As always, all the presentations were awesome. Hopefully the videos of the talks will be available soon, I look forward to it. Here I'll just mention some talks that were specially interesting for me.

Of course, both keynotes were amazing.

First one was from Holly Bik. As a biologist she turned to bioinformatics and described in detail this process and current results. Already from the name of the talk it's clear to understand the main point: "Bioinformatics: Still a scary world for biologists". As a bioinformatician in frequent contact with biologists I understand this status and totally agree with it. Biologists have a lot of work to do, and getting experience in bioinformatics is not so easy. However, it is great that there are already more and more biologists who are going deeper in this direction.

Second keynote was from Ewan Birney, a Joint Associate Director of EMBL-EBI and co-founder of Open Bioinformatics Foundation. He is a perfect example of bioinformatics super-specialist. Several topics were described in detail during this talk. The most interesting subject in my opinion - importance of open source code. Three basic reasons: transparency, efficiency and community. In the end of the talk there was a surprising block about storing of text information in DNA (more details here). In this case one room will be enough to keep encoded in DNA "all" written books. So, if you read "Fahrenheit 451" by Ray Bradbury, this looks like a perfect solution for the main problem.

Among other talks I was especially interested in RNA-seq analysis. "GO express" project presented by Kevin Rue-Albrecht is a nice example of a detailed clustering and group detection investigation from gene expression analysis.

One other project I would like to mention is a novel pipeline design toolkit called Nextflow (presented by Paolo Di Tommaso). Of course, there is a number of workflow managers. However the toolkit that can be applied only by creating small scripts to make runs parallel and with detailed reports is very useful for practical bioinformaticians.

The most surprising talk for me that was from Bastian Greshake about openSNP database. Basically this database allows to submit personal SNP data created by companies such as 23andme. In result it provides rather interesting statistics comparison from supplied datasets. This comparison might be useful for for future research projects and even just for users to find people rather similar to them. One cool point: disclaimer, totally recommend to read it!

I was also giving a small talk during the conference. It was about second version of Qualimap. The talk was rather quick (only 5 mintues, no questions afterwards) therefore very useful as practice. Additionally I got several interesting questions in the poster session. Moreover, I met people who are using Qualimap and additionally during the ISMB/ECCB conference there was a cool poster about Eager project from Alexander Peltzer that mentions Qualimap as a part of a pipeline.

Ok, that is enough about BOSC event, totally recommend to visit the next one during ISMB 2016!

Monday, March 30, 2015

Computing coverage out of exon regions

After performing alignment of RNA-seq or exome sequencing data it is quite important to compute coverage. There are several ways how to perform this task. However, there is one additional interesting issue: check if there is any coverage outside of exon regions. For example, in case of RNA-seq this allows to show if there are some previously unknown transcripts expressed. In this post I will describe how to compute out-of-region coverage from BAM file.

To perform out-of-region coverage check two files should be available:
- alignment data in SAM/BAM format. Sample: kidney.bam.
- gene annotations in BED format. Sample: transcripts.human.64.bed.

I will describe two methods to perform this task. Of course, I will recommend second method, using my favorite tool ;)

Using BEDtools

First of all, a file listing the coordinates of regions outside of genes from annotation file is required. This can be done using command complmentBed:



$~/tools/BEDTools/bin/complementBed -i transcripts.human.64.bed  -g hg19.fa.fai > transcripts.human.64.outside.bed

After, the creating out-of-region annotation file, coverage can be computed:

$~/tools/BEDTools/bin/coverageBed -abam kidney.bam -b transcripts.human.64.outside.bed -d | awk '{c+=$5;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

More details about BEDtools can be found here.

Using Qualimap

Qualimap BAM QC mode supports performing analysis of a BAM file within regions from annotation file in BED/GFF/GTF formats. Additionally, there is an option to perform additional analysis in "out of regions" block. The options is called "Analyze outside regions" , more details in documentation. So, here's an example:



$ qualimap bamqc -bam kidney.bam -gff annotations/transcripts.human.64.bed -os --java-mem-size=4G

Qualimap can be downloaded from here.

Well, that's enough for now. Have fun ;)

P.S. Thanks a lot to Tristan Carland for reporting a bug in QualiMap when computing coverage out of regions. In version 2.1 the bug is fixed.

Tuesday, August 5, 2014

biss2014: Quality control of high throughput sequencing data

Yesterday I returned to Berlin from Saint Petersburg, where I took part in the work of BioInformatics Summer School 2014.

I suppose this is a biggest event of its kind in Russia. The school was arranged by Bionformatics Institute with intense support from Algorithmic Biology Lab of St. Petersburg Academic University. If you haven't heard about this lab - these are the guys who published Spades assembler and created Rosalind.

The school gathered 100 Russian-speaking students from a number of countries: 50 biologists and 50 computer scientists. The list of tutors included distinguished scientists and representatives of biotech industry. Several promising PhD students in the area of bioinformatics also made their teaching contribution :) For example, I participated as a tutor and was giving one talk and two practical workshops. For me it was already second time I participated in the work of the school and again it was an awesome experience!

The official language of the school was Russian, however I prepared all supporting materials including presentations and tutorial notes in English. In very close future I am going to wrap-up and publish the teaching materials I created for the school.

Today I will start with the slides of my talk: "Quality control of high throughput sequencing data"

The talk describes common erros and biases specific to HTS data; how to detect those biases and how to eliminate their impact on the analysis.

Slides for the talk in PDF format can be found here.

Wednesday, July 16, 2014

Learning statistics: Principal Component Analysis

We decided to apply Principal Component Analysis for upcoming multisample BAM QC analysis in Qualimap.

The idea is very simple: we compare 2 and more BAM files based on a number of various metrics including mean coverage, GC content, insert size, mapping quality and others. Since there are multiple comparison features avaialable one can apply PCA to analyze how different are the samples from each other by using biplot.

Here is a small collection of tutorials on PCA that I went through to get more confident with the topic.

1) An intuitive explanation of PCA
Nice overview of the topic, focusing on understanding the application of PCA.

2) A tutorial on principal component analysis
Practical tutorial with very low entry level which allows to do try PCA on a small example

3) Using R for Multivariate Analysis
Great tutorial on multivariate analysis using R. Totally recommended.

Saturday, July 27, 2013

BOSC-2013: my personal summary

ISMB/ECCB 2013 is over, and now I have some time to wrap up my thoughts and impressions.

Actually I would like to tell about one of the SIGs that was preceding ISMB: the Bioinformatics Open Source Conference or simply BOSC 2013. This conference is organized by the Open Bionformatics Foundation (guys responsible for Biopython, BioJava and other awesome projects) and devoted to all aspects of developing open source software for bioinformatics research. Starting from this year the scope of the conference was extended from open software to open science in general.

This was the second time I attended BOSC and once again I enjoyed greatly the people and the talks. Although BOSC is not the biggest SIG, it has this awesome atmosphere of a community of enthusiasts. For me it is an event where I can meet and talk to people who share my views on how the bioinformatics research should be conducted. A lot of things are being discussed: new open source tools and pipelines, visualization methods along with best practices to perform bioinformatics analysis and balance between being a software engineer and a researcher producing scientific results.

All talks were great, I highly recommend to check the presentation slides. Soon there will be video also. I would like to highlight several talks that I found especially interesting for myself.

Jug: Python-library that allows using decorators to easily launch multiple instances of your pipeline.

Ten simple rules of the open development of scientific software: based on a manuscript with the same name, which I missed somehow. This paper is a part of bigger PLOS "Ten simple rules of ..." collection.

DGE-Vis: an interesting interactive way to visualize RNA-seq expression studies

Unipro UGENE: a sequence analysis toolkit with sexy GUI. This is the project I've been involvled with for a long time, so I was very happy that it was presented at BOSC.

bcbbio: a community developed pipeline for SNP calling. Even if you are not interested in SNP discovery, this post by Brad Chapman is still worth looking at, because it contains many nice ideas on how to develop a reliable bioiformatics pipeline.

This time I was also giving a talk at BOSC. It was about Qualimap, a tool for next generation sequencing alignment data quality assesment. I have been working on it since I started my PhD at MPIIB in October 2011. We published the paper almost a year ago and since then Qualimap has become more mature and gathered a small community of users. We continue to maintain the tool and improve it based on the feedback from users. As I mentioned during the conference we are soon moving to a bitbucket repo. Subscribe to the Google group for updates.

Btw, one of the speakers was Sean Eddy. During his talk I felt somewhat similar to being at the concert of a favorite rock band \m/ I wish I had my copy of Sequence Analysis book with me to get an autograph :) One cool idea from his talk: sometimes reinventing the wheel and developing your own home-brewed set of tools for common tasks is actually normal. This activity should be considered as an exercise which allows to better understand things that you work on. After having enough time spent playing with the concept, it is still recommended to switch to the best instruments available ;)

The conference also included several panel discussions. An important discussion was devoted to the funding of open source scientific software development. It is commonly considered that main result of scientific research should be a manuscript. However for a computational biologist most often the software is true result of the research, while the paper is only "an advertisiment". Unfortunately the software usually is not considered by grant committees or lab bosses as a relevant result. The discussion demonstrated that bioinformatics community slowly moves away from this misconception, however a lot of work has to be done yet.

So, to sum up: BOSC is awesome, don't miss it next year in Boston.

Tuesday, April 9, 2013

Exome coverage and per chromosome coverage

This post is actually inspired by the discussion on the Qualimap google groups.

Assume we need to calculate mean coverage inside of the exons or any other arbitrary regions defined by an annotation file. In this post I will present 3 approaches to perform this task. The impatient can check my favourite (and hopefully the easiest) method.

After googling a little bit on the given problem it's easy to find that one can do coverage calculation using Samtools mpileup or coverageBed. The third option that I will show here will be the Qualimap tool.

So, here we go.

Using coverageBed

Given file test.bam and regions defined in file test.bed one can calculate coverage using the following command:



coverageBed -abam test.bam -b test.bed -d | awk '{c+=$8;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

What's going on here? coverageBed outputs coverage at each position of each region intersecting with the BAM file; output is piped to a simple awk script which will accumulate the total length of regions and the number of mapped bases (column 8); finally awk will output mean coverage.

Unfortunately, there is a minor problem in this script. Since some of the regions could intersect, we count intersecting bases twice. It does not matter if you are interested in absolute coverage of your regions, but this could bias the mean coverage.

Using samtools mpileup

This option is quite popular. Here is the command:



samtools mpileup -l test.bed -A -Q 0 test.bam | awk 'BEGIN{regionLen=317355}{c+=$4}END{print "mappedBases=" c,"meanCoverage=" c/regionLen}'

Mpileup outputs coverage at each position of the reference intersected with region which has non-zero coverage. To compute the mean coverage one has to calculate the "efficient" length of the regions: the length including non covered bases and counting the intersections only once. In this script the efficient length is the magic number 317355.

More things to keep in mind: since mpileup is originally designed to discover SNPs it may discard a lot of reads based on the minimum coverage, read quality and other parameters. So one has to turn off the turnings by using appropriate command line arguments: -A, -Q0, -d etc. Refer to mpileup documentation for more details.

Using Qualimap:

Qualimap calculates coverage similar to mpileup, which means it calculates coverage at each position of the reference intersected with a region. Moreover it calculates the "efficient length" of the regions (taking into account non covered bases and intersection of regions) when computing mean coverage. Some other advantages of using Qualimap: it can work both with BED and GFF files and output coverage of "outside" regions. Finally, Qualimap also reports per chromosome coverage. Ok, enough of this advertising :) Here's your command:



qualimap bamqc -bam test.bam -gff test.gff

So, what's your preferable way of calculating coverage?

See ya later, folks :)

thoughts about ...