Wednesday, December 9, 2015

Multi-sample quality control of RNA-seq data: interpreting the results

I am happy to report that Qualimap2 manuscript is published. Therefore, I would like to provide a small tutorial on how to interpret the quality control of RNA-seq alignment data results generated by Qualimap. There are also other frequently used applications such as RSeQC and RNA-SeQC, but they were not updated rather long. Also, there are not too many manuals, explaining how to interpret the results of the performed QC analysis. Of course, I'll be happy to have any comments, questions or suggestions about this small tutorial.

Ok, let's assume we designed an experiment to analyze and compare gene expression between two conditions and there is a number of samples for each condition. RNA-sequencing was performed for each sample. The details, about how to perform the main analysis can be found in many places (for example here). We will focus only on quality control.

The initial quality control performed with FastQC is well-known. However, a number of problematic issues can be detected only after the alignment of reads is performed and counts are computed. Most of technological biases including transcript length, coverage uneven distribution or contamination by non-required transcripts can be detected only from a processed dataset, therefore data investigation performed by Qualimap is required to confirm sufficient quality of the processed data.

The quality control process should start from the examination of RNA-seq QC report. Here's an instruction to launch the mode from GUI and example command line. The report will look like this. How to interpret the results?

The Summary statistics provides status of the sequencing process in general. The low proportion of aligned reads reported in the summary can demonstrate ligation and amplification biases. Frequently, in RNA-seq analysis the typical majority of aligned reads should lie in known exonic regions. For example, comparing the proportion of aligned reads falling to exon region in case of well-known organisms such as Mus Musculus or Homo Sapiens should be up to 90%. Smaller proportion can indicate sequencing biases or some mistakes in mapping process. The same conclusions can be derived from Junction analysis plot: known junctions should dominate novel.

Some specific issues occurring during rRNA depletion and polyA selection can lead to biases in 5' region and 3' region. For example, it is quite well-known that polyA selection can lead to high expression in 3' region. The 5'-3' bias allows to detect this event. In correct experiments it should be close to one.

The plots focused on transcriptome coverage analysis such as Coverage Histogram and Coverage profile along genes give an overview of available level of expressed genes. Importantly, low level coverage in RNA-seq data influences significantly on differential expression analysis, therefore demonstration of high and low coverage level is quite useful to detect biases and apply correct gene expression normalization.

After the quality control of the BAM file is performed, the computed counts will allow to investigate further experiment properties. Most importantly, Counts QC analysis can be performed applying the experiment design, such as biological conditions and number of samples. Here are instruction about the application in general and command line. Here's an example output.

The analysis of expression contamination allows to verify if the total number of reads in the experiment is enough to detect all expressed genes. The plot Saturation demonstrates the influence of number of sequencing reads on expression distribution. Basically, the angle of the line plot shows how the increase in number of reads controls the proportion of novel detected genes. If more reads do not lead to the growth of a number of detected genes (line angle close to zero), additional sequencing is not required. Notably, there is also a specific customizable limitation of minimum number of read counts required for a transcript to be added to the plot.

The biotype analysis of counts performed for each sample allows to detect the types of expressed features. This is specially important if RNA-seq experiment focused on long RNA or other type of RNA. Abnormal contamination can be detected from plots Biotype detection and Counts per biotype. Typically in mRNA-seq experiments protein coding proportion should dominate in counts. Additionally, the results detect suitability level of a selected sequencing protocol.

For correct normalization of counts values in gene expression analysis it is important to take into account the length and GC content of expressed transcripts. The requirements for such normalization can be checked from Length bias and GC bias plots. Cubic spline regression model is applied to detect if length and GC proportion fits gene expression. Generally, the computed coefficient of determination more than 70% along with p-value lower than 0.05 indicate effect on expression level and importance of normalization.

The computed global plots demonstrating all samples together such as Scatterplot matrix, Counts density and Counts distribution allow to detect outliers. For example, despite different biological conditions the global expression levels among analyzed samples should match sufficiently. Importantly, different biological conditions should influence the gene expression. Therefore all plots related to expression analysis normalized to a specific condition also allow to detect outliers.

So, these are some suggestions how to interpret the detailed quality control. Unfortunately there might be some other issues occurring in experiments, here I only mentioned the most frequent problems. I'll be glad to have more suggestions to update the tutorial.

Thursday, July 16, 2015

BOSC 2015: amazing as always

I just returned from ISMB/ECCB 2015 conference that was in Dublin. However, I would like to describe my favorite pre-conference event. Of course, this is BOSC. I consider BOSC as one of the best conferences due to the importance of "open" activity in any science. As always, all the presentations were awesome. Hopefully the videos of the talks will be available soon, I look forward to it. Here I'll just mention some talks that were specially interesting for me.

Of course, both keynotes were amazing.

First one was from Holly Bik. As a biologist she turned to bioinformatics and described in detail this process and current results. Already from the name of the talk it's clear to understand the main point: "Bioinformatics: Still a scary world for biologists". As a bioinformatician in frequent contact with biologists I understand this status and totally agree with it. Biologists have a lot of work to do, and getting experience in bioinformatics is not so easy. However, it is great that there are already more and more biologists who are going deeper in this direction.

Second keynote was from Ewan Birney, a Joint Associate Director of EMBL-EBI and co-founder of Open Bioinformatics Foundation. He is a perfect example of bioinformatics super-specialist. Several topics were described in detail during this talk. The most interesting subject in my opinion - importance of open source code. Three basic reasons: transparency, efficiency and community. In the end of the talk there was a surprising block about storing of text information in DNA (more details here). In this case one room will be enough to keep encoded in DNA "all" written books. So, if you read "Fahrenheit 451" by Ray Bradbury, this looks like a perfect solution for the main problem.

Among other talks I was especially interested in RNA-seq analysis. "GO express" project presented by Kevin Rue-Albrecht is a nice example of a detailed clustering and group detection investigation from gene expression analysis.

One other project I would like to mention is a novel pipeline design toolkit called Nextflow (presented by Paolo Di Tommaso). Of course, there is a number of workflow managers. However the toolkit that can be applied only by creating small scripts to make runs parallel and with detailed reports is very useful for practical bioinformaticians.

The most surprising talk for me that was from Bastian Greshake about openSNP database. Basically this database allows to submit personal SNP data created by companies such as 23andme. In result it provides rather interesting statistics comparison from supplied datasets. This comparison might be useful for for future research projects and even just for users to find people rather similar to them. One cool point: disclaimer, totally recommend to read it!

I was also giving a small talk during the conference. It was about second version of Qualimap. The talk was rather quick (only 5 mintues, no questions afterwards) therefore very useful as practice. Additionally I got several interesting questions in the poster session. Moreover, I met people who are using Qualimap and additionally during the ISMB/ECCB conference there was a cool poster about Eager project from Alexander Peltzer that mentions Qualimap as a part of a pipeline.

Ok, that is enough about BOSC event, totally recommend to visit the next one during ISMB 2016!

Wednesday, May 13, 2015

Linux: how to print header line of tab-delimeted file in column with numbers

Let's say there is a header of certain atab-delimited file that has a lot of parameters. There are many examples: annotation files, results of certain algorithms, etc...

Sometimes it's interesting to show only select certain parameters. This can be performed easily using cut command. For example, this is how to report only chromosome name and start-end positions of gene exon from GTF file:

cut -f 1,4,5 Homo_sapiens.GRCh38.78.gtf | less

To perform such task using cut command it's important to report exact indexes of required columns. But, what if there are too many columns (more than 30 for example)? How to detect exact indexes?

Here's an example of a command to detect index numbers of a first line of a file:

head -n 1 fusions.detailed.txt | tr "\t" "\n" | cat -n

This example creates a list of all parameters of a detailed report from InFusion tool. Result of this example can be found here (numbers and words in bold font).

Monday, March 30, 2015

Computing coverage out of exon regions

After performing alignment of RNA-seq or exome sequencing data it is quite important to compute coverage. There are several ways how to perform this task. However, there is one additional interesting issue: check if there is any coverage outside of exon regions. For example, in case of RNA-seq this allows to show if there are some previously unknown transcripts expressed. In this post I will describe how to compute out-of-region coverage from BAM file.

To perform out-of-region coverage check two files should be available:
- alignment data in SAM/BAM format. Sample: kidney.bam.
- gene annotations in BED format. Sample: transcripts.human.64.bed.

I will describe two methods to perform this task. Of course, I will recommend second method, using my favorite tool ;)

Using BEDtools

First of all, a file listing the coordinates of regions outside of genes from annotation file is required. This can be done using command complmentBed:

$~/tools/BEDTools/bin/complementBed -i transcripts.human.64.bed -g hg19.fa.fai > transcripts.human.64.outside.bed

After, the creating out-of-region annotation file, coverage can be computed:

$~/tools/BEDTools/bin/coverageBed -abam kidney.bam -b transcripts.human.64.outside.bed -d | awk '{c+=$5;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

More details about BEDtools can be found here.

Using Qualimap

Qualimap BAM QC mode supports performing analysis of a BAM file within regions from annotation file in BED/GFF/GTF formats. Additionally, there is an option to perform additional analysis in "out of regions" block. The options is called "Analyze outside regions" , more details in documentation. So, here's an example:

$ qualimap bamqc -bam kidney.bam -gff annotations/transcripts.human.64.bed -os --java-mem-size=4G

Qualimap can be downloaded from here.

Well, that's enough for now. Have fun ;)

P.S. Thanks a lot to Tristan Carland for reporting a bug in QualiMap when computing coverage out of regions. In version 2.1 the bug is fixed.

Tuesday, February 24, 2015

Neuroscience is coming

Well, I was out of this world for a couple months, but now I am back :) And my current report will be not about bioinformatics, but about neuroscience.

Not so long ago I started reading neuropsychology books. The first book I read was called "The Man Who Mistook His Wife" by Oliver Sacks. I enjoyed this book from the beginning till the end. It is really amazing. The book describes some negative events which might happen to the brain because of a certain disease. There are four parts in this book: "Losses", "Excesses", "Transports" and "The World of the Simple". Each part focuses on a certain type of a brain damaging event.

I am not going to explain this in detail, but will just give some examples:

- A man loses an ability to memorize anything. When he is 40 years old, he still thinks that he is 20. He can not memorize anyone or anything, but still happy about his life.

- A young woman loses an ability to use her body, even to move or to walk (so called proprioception). But she manages to restore herself using other parts of the brain.

- Old ladies hear in their brains music from their young years. When the music plays, they can not hear anything else.

These are just three examples but there are much more amazing stories. I really recommend to read this book. Moreover, Oliver Sacks wrote more than ten books. Some of them were even used to create a movie. For example, a film called "Awakenings" with Robert De Niro and Robin Williams is really incredible in my opinion.

Also Oliver Sacks in his books always refers to other great neuroscientists. I didn't even know previously that one of the famous neuropsychologists was from Soviet Union. His name is Alexander Luria. I also read some of his books, and I totally recommend one that is called "The Man with a Shattered World: The History of a Brain Wound". This is a story about a Russian who was wounded during WWII. A bullet stuck in his brain, but he managed to survive and improve himself. I really recommend this book too.

There are other neuroscience authors, perhaps I will write again on this topic after reading their books.

And the last notice: one of the developing sciences is called neuroinformatics. Maybe soon it will become even more crucial than bioinformatics :)

P.S.: Thanks a lot to neuropsychologists whom I met and to Arakdy Urkop for recommending me the book.