thoughts about ...: RNA-seq

Showing posts with label RNA-seq. Show all posts

Sunday, December 18, 2016

This finally happend: novel tool for fusion discovery from RNA-seq data

There was a long and a bit crazy story, however the toolkit that I was working on has now a publication in PLOS OnE. In short, tool has some unique features including support of strand-specificity to detect anti-sense chimeras and discovery of intergenic fusions. It's open-source and available in bitbucket repo.

It took us more than 3 years to finish the publication and this was amazing experience, which allowed me to learn what really can happen in science. Main lesson: there can be serious competition in research area that results in multiple unexpected issues. The journals with high impact factor are complex systems. Imagine submitting a manuscript, waiting for review 4 months(!) and receiving a statement that the "manuscript should be rejected while tool ability X is not important", with some strange citations. And then, exactly one month after rejection of your manuscript, there is a publication in the same journal with main statement of "the importance of ability X" in their tool. In our case the "ability X" was fusion isoform discovery confirmed with experimental validation.

This was probably main project for me during PhD and a perfect lesson. Of course, I am super grateful to my supervisor Dr. Fernando Garcia-Alcalde, to the whole team from MPIIB and to Lexogen company for amazing support. Moreover, additional useful comments I received from Prof. Steven Salzberg and he doesn't need an introduction for those who are working in sequencing data analysis area (btw, there is also his awesome blog).

Despite the complexity and multiple factors, science remains to me the most exciting and interesting area, where critical revision and communication have high impact. I will be glad to any comments, fixes or suggestions about InFusion. And of course, wish everyone not to experience any strange "non-scientific" problems, but have honest and correct reviews. Stay in science ;)

Wednesday, December 9, 2015

Multi-sample quality control of RNA-seq data: interpreting the results

I am happy to report that Qualimap2 manuscript is published. Therefore, I would like to provide a small tutorial on how to interpret the quality control of RNA-seq alignment data results generated by Qualimap. There are also other frequently used applications such as RSeQC and RNA-SeQC, but they were not updated rather long. Also, there are not too many manuals, explaining how to interpret the results of the performed QC analysis. Of course, I'll be happy to have any comments, questions or suggestions about this small tutorial.

Ok, let's assume we designed an experiment to analyze and compare gene expression between two conditions and there is a number of samples for each condition. RNA-sequencing was performed for each sample. The details, about how to perform the main analysis can be found in many places (for example here). We will focus only on quality control.

The initial quality control performed with FastQC is well-known. However, a number of problematic issues can be detected only after the alignment of reads is performed and counts are computed. Most of technological biases including transcript length, coverage uneven distribution or contamination by non-required transcripts can be detected only from a processed dataset, therefore data investigation performed by Qualimap is required to confirm sufficient quality of the processed data.

The quality control process should start from the examination of RNA-seq QC report. Here's an instruction to launch the mode from GUI and example command line. The report will look like this. How to interpret the results?

The Summary statistics provides status of the sequencing process in general. The low proportion of aligned reads reported in the summary can demonstrate ligation and amplification biases. Frequently, in RNA-seq analysis the typical majority of aligned reads should lie in known exonic regions. For example, comparing the proportion of aligned reads falling to exon region in case of well-known organisms such as Mus Musculus or Homo Sapiens should be up to 90%. Smaller proportion can indicate sequencing biases or some mistakes in mapping process. The same conclusions can be derived from Junction analysis plot: known junctions should dominate novel.

Some specific issues occurring during rRNA depletion and polyA selection can lead to biases in 5' region and 3' region. For example, it is quite well-known that polyA selection can lead to high expression in 3' region. The 5'-3' bias allows to detect this event. In correct experiments it should be close to one.

The plots focused on transcriptome coverage analysis such as Coverage Histogram and Coverage profile along genes give an overview of available level of expressed genes. Importantly, low level coverage in RNA-seq data influences significantly on differential expression analysis, therefore demonstration of high and low coverage level is quite useful to detect biases and apply correct gene expression normalization.

After the quality control of the BAM file is performed, the computed counts will allow to investigate further experiment properties. Most importantly, Counts QC analysis can be performed applying the experiment design, such as biological conditions and number of samples. Here are instruction about the application in general and command line. Here's an example output.

The analysis of expression contamination allows to verify if the total number of reads in the experiment is enough to detect all expressed genes. The plot Saturation demonstrates the influence of number of sequencing reads on expression distribution. Basically, the angle of the line plot shows how the increase in number of reads controls the proportion of novel detected genes. If more reads do not lead to the growth of a number of detected genes (line angle close to zero), additional sequencing is not required. Notably, there is also a specific customizable limitation of minimum number of read counts required for a transcript to be added to the plot.

The biotype analysis of counts performed for each sample allows to detect the types of expressed features. This is specially important if RNA-seq experiment focused on long RNA or other type of RNA. Abnormal contamination can be detected from plots Biotype detection and Counts per biotype. Typically in mRNA-seq experiments protein coding proportion should dominate in counts. Additionally, the results detect suitability level of a selected sequencing protocol.

For correct normalization of counts values in gene expression analysis it is important to take into account the length and GC content of expressed transcripts. The requirements for such normalization can be checked from Length bias and GC bias plots. Cubic spline regression model is applied to detect if length and GC proportion fits gene expression. Generally, the computed coefficient of determination more than 70% along with p-value lower than 0.05 indicate effect on expression level and importance of normalization.

The computed global plots demonstrating all samples together such as Scatterplot matrix, Counts density and Counts distribution allow to detect outliers. For example, despite different biological conditions the global expression levels among analyzed samples should match sufficiently. Importantly, different biological conditions should influence the gene expression. Therefore all plots related to expression analysis normalized to a specific condition also allow to detect outliers.

So, these are some suggestions how to interpret the detailed quality control. Unfortunately there might be some other issues occurring in experiments, here I only mentioned the most frequent problems. I'll be glad to have more suggestions to update the tutorial.

Monday, March 30, 2015

Computing coverage out of exon regions

After performing alignment of RNA-seq or exome sequencing data it is quite important to compute coverage. There are several ways how to perform this task. However, there is one additional interesting issue: check if there is any coverage outside of exon regions. For example, in case of RNA-seq this allows to show if there are some previously unknown transcripts expressed. In this post I will describe how to compute out-of-region coverage from BAM file.

To perform out-of-region coverage check two files should be available:
- alignment data in SAM/BAM format. Sample: kidney.bam.
- gene annotations in BED format. Sample: transcripts.human.64.bed.

I will describe two methods to perform this task. Of course, I will recommend second method, using my favorite tool ;)

Using BEDtools

First of all, a file listing the coordinates of regions outside of genes from annotation file is required. This can be done using command complmentBed:



$~/tools/BEDTools/bin/complementBed -i transcripts.human.64.bed  -g hg19.fa.fai > transcripts.human.64.outside.bed

After, the creating out-of-region annotation file, coverage can be computed:

$~/tools/BEDTools/bin/coverageBed -abam kidney.bam -b transcripts.human.64.outside.bed -d | awk '{c+=$5;len+=1}END{print "mappedBases=" c ,"RegionsSize=" len, "meanCoverage=" c/len}'

More details about BEDtools can be found here.

Using Qualimap

Qualimap BAM QC mode supports performing analysis of a BAM file within regions from annotation file in BED/GFF/GTF formats. Additionally, there is an option to perform additional analysis in "out of regions" block. The options is called "Analyze outside regions" , more details in documentation. So, here's an example:



$ qualimap bamqc -bam kidney.bam -gff annotations/transcripts.human.64.bed -os --java-mem-size=4G

Qualimap can be downloaded from here.

Well, that's enough for now. Have fun ;)

P.S. Thanks a lot to Tristan Carland for reporting a bug in QualiMap when computing coverage out of regions. In version 2.1 the bug is fixed.

Tuesday, October 7, 2014

Visualizing RNA-seq alignments with fusions: is there anything better than IGV?

I was recently working on improving the module for generating alignments with annotated fusion-supporting reads in the context of InFusion pipeline for fusion gene and chimeric transcript discovery from RNA-seq data. Once you have such alignment in BAM format it is quite useful to visually inspect the reads supporting fusion breakpoint (junction between fused genes or transcripts). First of all it is a good sanity check to see if the prediction makes sense. Additionally, inspecting the coverage around the breakpoint might give some hints if the prediction is false positive.

So far I was using IGV for this purpose and it was working nicely for me. One cool thing about IGV is that it allows to color reads by any SAM record tag. This makes possible to color-code reads belonging to the same fusion based on a custom tag which is assigned by InFusion. IGV also allows splitting the screen to visualize both parts of the fusion (sadly without zooming). You get to see something like this:

When checking several fusion predictions I noticed that IGV was not showing some fusion-supporting alignments. Later I figured out that this was due to a configuration setting of a maximum number of reads displayed. But at first I thought - why not try using something else? There are tons of genome browsers out there.

I tested several popular of them, but none worked as good as IGV. See my comments below.

Important note: all software was tested on Ubuntu 12.04 x64.

In my tests I was trying to visualize a certain region of a BAM file that according to prediction has a fusion breakpoint. This how the region looks in IGV v2.2.2:

As you can see reads belonging to fusions are shown in different colors based on a custom SAM record tag. This allows to easily distinguish between the fusion-supporting and "normal" alignments. Another very nice feature is the coverage track. It makes possible to see the overall coverage level in the region and visually detect any suspicious peaks.

Now let's see how other browsers display the same region.

IGB v8.1.11. I loaded the BAM file and successfully zoomed to the region. At first I was not able to see any reads, however after clicking on "Optimize track height" the reads appeared. Unfortunately it was not possible to color reads by arbitrary tag and I could not find an option to search for a read by name. So it was not easy to find and inspect fusion-supporting alignments:

GenomeView 2450. As previously I loaded the BAM file and zoomed to the region of interest. Only certain number of reads was shown. Also as I understand this browser has a fixed color-scheme for reads. Therefore it was not possible to visualize fusion supporting reads:

Tablet v1.14.04. I loaded the BAM file and tried to jump to the desired base in the chromosome, however the software crashed complaining that the BAM file is not sorted or index is missing (which was not true). I also tried to search by name for a certain read supporting the fusion breakpoint in order to quickly zoom to the location. The read was found successfully, but once again when I tried to go the location of the read the software crashed.

Update 08.10.2014: The developers replied very quickly to my bug-report. The issue was due to one alignment record in the BAM file having wrongly formatted CIGAR. After fixing this record the region was shown successfully. Searching read by name is a very handy feature of Tablet. However the coloring by tag is not supported. One possible solution to mark fusion-supporting reads is to assign read groups to them, but this would require some additional work.

Savant v2.0.5. I loaded the BAM file, but unfortunately when I zoomed to the region of interest the data track was flickering and it was not possible to see anything. I reported the issue to the developers. Here how it looks on a screenshot:

Conclusion. Although the bugs in the software might be fixed in future versions, at the current moment IGV remains the best browser to visualize RNA-seq alignments with fusion genes. I suppose the same true (and some people agree) for visualization of structural variations in WGS-data.

Friday, January 17, 2014

Getting information about ENCODE dataset

The ENCODE project has vast amounts of interesting data to offer. However for me it was not very clear how to get detailed information about a particular sample. Here is a quick short-cut.

I was searching for raw RNA-seq data from several cancer cell-lines.
There is a a web-interface that allows to search inside ENCODE datasets by applying particular filters. In my case I used filters "Cell, tissue or DNA sample" and "Experiment type".

The result of the query is a list of files describing samples, which pass the search criteria. Most likely one might be willing to get more information about a particular sample. However, in the table describing the results I could not find a link to a corresponding publication or at least to a page with the description of the dataset.

One way to find this information is to copy the GEO sample accession and then search for it in the GEO Accession viewer.

From GEO-sample there is usually a link to the GEO-series which has also the link to the publication.

Note, that if you are searching for tracks, it is easy to obtain summary of the track just by double clicking on it after adding in the UCSC Genome browser.

Hope that was useful!

Friday, May 10, 2013

RNA-seq: protocol strand specificity

Wanted to write a detailed post about the topic, but found something nice on the web:

http://onetipperday.blogspot.de/2012/07/how-to-tell-which-library-type-to-use.html

Sometimes I feel like everything what we think of or do, was already thought through or done :)

thoughts about ...