Wednesday, December 9, 2015

Multi-sample quality control of RNA-seq data: interpreting the results


I am happy to report that Qualimap2 manuscript is published. Therefore, I would like to provide a small tutorial on how to interpret the quality control of RNA-seq alignment data results generated by Qualimap. There are also other frequently used applications such as RSeQC and RNA-SeQC, but they were not updated rather long. Also, there are not too many manuals, explaining how to interpret the results of the performed QC analysis. Of course, I'll be happy to have any comments, questions or suggestions about this small tutorial.

Ok, let's assume we designed an experiment to analyze and compare gene expression between two conditions and there is a number of samples for each condition. RNA-sequencing was performed for each sample. The details, about how to perform the main analysis can be found in many places (for example here). We will focus only on quality control.

The initial quality control performed with FastQC is well-known. However, a number of problematic issues can be detected only after the alignment of reads is performed and counts are computed. Most of technological biases including transcript length, coverage uneven distribution or contamination by non-required transcripts can be detected only from a processed dataset, therefore data investigation performed by Qualimap is required to confirm sufficient quality of the processed data.

The quality control process should start from the examination of RNA-seq QC report. Here's an instruction to launch the mode from GUI and example command line. The report will look like this. How to interpret the results?

The Summary statistics provides status of the sequencing process in general. The low proportion of aligned reads reported in the summary can demonstrate ligation and amplification biases. Frequently, in RNA-seq analysis the typical majority of aligned reads should lie in known exonic regions. For example, comparing the proportion of aligned reads falling to exon region in case of well-known organisms such as Mus Musculus or Homo Sapiens should be up to 90%. Smaller proportion can indicate sequencing biases or some mistakes in mapping process. The same conclusions can be derived from Junction analysis plot: known junctions should dominate novel.

Some specific issues occurring during rRNA depletion and polyA selection can lead to biases in 5' region and 3' region. For example, it is quite well-known that polyA selection can lead to high expression in 3' region. The 5'-3' bias allows to detect this event. In correct experiments it should be close to one.

The plots focused on transcriptome coverage analysis such as Coverage Histogram and Coverage profile along genes give an overview of available level of expressed genes. Importantly, low level coverage in RNA-seq data influences significantly on differential expression analysis, therefore demonstration of high and low coverage level is quite useful to detect biases and apply correct gene expression normalization.

After the quality control of the BAM file is performed, the computed counts will allow to investigate further experiment properties. Most importantly, Counts QC analysis can be performed applying the experiment design, such as biological conditions and number of samples. Here are instruction about the application in general and command line. Here's an example output.

The analysis of expression contamination allows to verify if the total number of reads in the experiment is enough to detect all expressed genes. The plot Saturation demonstrates the influence of number of sequencing reads on expression distribution. Basically, the angle of the line plot shows how the increase in number of reads controls the proportion of novel detected genes. If more reads do not lead to the growth of a number of detected genes (line angle close to zero), additional sequencing is not required. Notably, there is also a specific customizable limitation of minimum number of read counts required for a transcript to be added to the plot.

The biotype analysis of counts performed for each sample allows to detect the types of expressed features. This is specially important if RNA-seq experiment focused on long RNA or other type of RNA. Abnormal contamination can be detected from plots Biotype detection and Counts per biotype. Typically in mRNA-seq experiments protein coding proportion should dominate in counts. Additionally, the results detect suitability level of a selected sequencing protocol.

For correct normalization of counts values in gene expression analysis it is important to take into account the length and GC content of expressed transcripts. The requirements for such normalization can be checked from Length bias and GC bias plots. Cubic spline regression model is applied to detect if length and GC proportion fits gene expression. Generally, the computed coefficient of determination more than 70% along with p-value lower than 0.05 indicate effect on expression level and importance of normalization.

The computed global plots demonstrating all samples together such as Scatterplot matrix, Counts density and Counts distribution allow to detect outliers. For example, despite different biological conditions the global expression levels among analyzed samples should match sufficiently. Importantly, different biological conditions should influence the gene expression. Therefore all plots related to expression analysis normalized to a specific condition also allow to detect outliers.

So, these are some suggestions how to interpret the detailed quality control. Unfortunately there might be some other issues occurring in experiments, here I only mentioned the most frequent problems. I'll be glad to have more suggestions to update the tutorial.