thoughts about ...: 2014

Tuesday, October 7, 2014

Visualizing RNA-seq alignments with fusions: is there anything better than IGV?

I was recently working on improving the module for generating alignments with annotated fusion-supporting reads in the context of InFusion pipeline for fusion gene and chimeric transcript discovery from RNA-seq data. Once you have such alignment in BAM format it is quite useful to visually inspect the reads supporting fusion breakpoint (junction between fused genes or transcripts). First of all it is a good sanity check to see if the prediction makes sense. Additionally, inspecting the coverage around the breakpoint might give some hints if the prediction is false positive.

So far I was using IGV for this purpose and it was working nicely for me. One cool thing about IGV is that it allows to color reads by any SAM record tag. This makes possible to color-code reads belonging to the same fusion based on a custom tag which is assigned by InFusion. IGV also allows splitting the screen to visualize both parts of the fusion (sadly without zooming). You get to see something like this:

When checking several fusion predictions I noticed that IGV was not showing some fusion-supporting alignments. Later I figured out that this was due to a configuration setting of a maximum number of reads displayed. But at first I thought - why not try using something else? There are tons of genome browsers out there.

I tested several popular of them, but none worked as good as IGV. See my comments below.

Important note: all software was tested on Ubuntu 12.04 x64.

In my tests I was trying to visualize a certain region of a BAM file that according to prediction has a fusion breakpoint. This how the region looks in IGV v2.2.2:

As you can see reads belonging to fusions are shown in different colors based on a custom SAM record tag. This allows to easily distinguish between the fusion-supporting and "normal" alignments. Another very nice feature is the coverage track. It makes possible to see the overall coverage level in the region and visually detect any suspicious peaks.

Now let's see how other browsers display the same region.

IGB v8.1.11. I loaded the BAM file and successfully zoomed to the region. At first I was not able to see any reads, however after clicking on "Optimize track height" the reads appeared. Unfortunately it was not possible to color reads by arbitrary tag and I could not find an option to search for a read by name. So it was not easy to find and inspect fusion-supporting alignments:

GenomeView 2450. As previously I loaded the BAM file and zoomed to the region of interest. Only certain number of reads was shown. Also as I understand this browser has a fixed color-scheme for reads. Therefore it was not possible to visualize fusion supporting reads:

Tablet v1.14.04. I loaded the BAM file and tried to jump to the desired base in the chromosome, however the software crashed complaining that the BAM file is not sorted or index is missing (which was not true). I also tried to search by name for a certain read supporting the fusion breakpoint in order to quickly zoom to the location. The read was found successfully, but once again when I tried to go the location of the read the software crashed.

Update 08.10.2014: The developers replied very quickly to my bug-report. The issue was due to one alignment record in the BAM file having wrongly formatted CIGAR. After fixing this record the region was shown successfully. Searching read by name is a very handy feature of Tablet. However the coloring by tag is not supported. One possible solution to mark fusion-supporting reads is to assign read groups to them, but this would require some additional work.

Savant v2.0.5. I loaded the BAM file, but unfortunately when I zoomed to the region of interest the data track was flickering and it was not possible to see anything. I reported the issue to the developers. Here how it looks on a screenshot:

Conclusion. Although the bugs in the software might be fixed in future versions, at the current moment IGV remains the best browser to visualize RNA-seq alignments with fusion genes. I suppose the same true (and some people agree) for visualization of structural variations in WGS-data.

Tuesday, August 5, 2014

biss2014: Quality control of high throughput sequencing data

Yesterday I returned to Berlin from Saint Petersburg, where I took part in the work of BioInformatics Summer School 2014.

I suppose this is a biggest event of its kind in Russia. The school was arranged by Bionformatics Institute with intense support from Algorithmic Biology Lab of St. Petersburg Academic University. If you haven't heard about this lab - these are the guys who published Spades assembler and created Rosalind.

The school gathered 100 Russian-speaking students from a number of countries: 50 biologists and 50 computer scientists. The list of tutors included distinguished scientists and representatives of biotech industry. Several promising PhD students in the area of bioinformatics also made their teaching contribution :) For example, I participated as a tutor and was giving one talk and two practical workshops. For me it was already second time I participated in the work of the school and again it was an awesome experience!

The official language of the school was Russian, however I prepared all supporting materials including presentations and tutorial notes in English. In very close future I am going to wrap-up and publish the teaching materials I created for the school.

Today I will start with the slides of my talk: "Quality control of high throughput sequencing data"

The talk describes common erros and biases specific to HTS data; how to detect those biases and how to eliminate their impact on the analysis.

Slides for the talk in PDF format can be found here.

Wednesday, July 16, 2014

Learning statistics: Principal Component Analysis

We decided to apply Principal Component Analysis for upcoming multisample BAM QC analysis in Qualimap.

The idea is very simple: we compare 2 and more BAM files based on a number of various metrics including mean coverage, GC content, insert size, mapping quality and others. Since there are multiple comparison features avaialable one can apply PCA to analyze how different are the samples from each other by using biplot.

Here is a small collection of tutorials on PCA that I went through to get more confident with the topic.

1) An intuitive explanation of PCA
Nice overview of the topic, focusing on understanding the application of PCA.

2) A tutorial on principal component analysis
Practical tutorial with very low entry level which allows to do try PCA on a small example

3) Using R for Multivariate Analysis
Great tutorial on multivariate analysis using R. Totally recommended.

Thursday, May 15, 2014

In situ, in vitro and other confusing expressions: explained with examples

Ok, biologists love using these Latin expressions, but somehow I keep forgetting what they mean and wikipedia entries only add more confusion. Additionally they can also mean different things depending on the context. In my humble opinion, scientific language should be explicit and clear, but not ambiguous. But who am I to judge, right? ;)

Here are the explanations with examples in the area of genomics:

in vivo
In a living thing. Usually refers to an experiment which is performed in an animal with analysis of the observed phenotypes. For example, survival outcome of a gene knockout in a genetically modified mouse.

ex vivo
Out of a living thing. We assume extracting some tissues or cells from a living organism. Example: tumor samples extracted from a cancer patient.

in situ
In position, in a system. Analysis of a biological system without disrupting it. For example, we can analyse gene expression in a cell without extracting the RNA by using hybridization probes and mircoscopy.

in vitro
In a lab tube. An artificial extraction and analysis of biological material. The most common example: DNA extracted from cell cultures.

in silico
In a computer simulation. At least this one is clear :)

Luckily, in my lab I have some nice biologists to help me understand these things. Anyway, corrections and suggestions are welcome!

Update 1:

Hey, there is one I missed - ex situ. It means "out of natural system". An example from ecology/conservation biology: an animal moved from the natural habitat to the zoo/nature reserve. Never encountered this in molecular biology, but one can guess as example some cells transplanted from one organism to another. Thanks to Arkady Ukrop for this one.

Tuesday, May 6, 2014

Learning statistics: Simpsons paradox

Although I had a couple of nice courses in the university, I still feel myself not very confident when going through the statistics analysis part in a genomics paper. Apparently, it is not enough to know algorithms and biology to become a bionformatician :) After realizing (finally!) that statistics is super important for science and especially for data analysis, I started taking some MOOCs (this,this and this for example ) to refresh and improve my knowledge. As a result I am learning a lot of cool new things now! :)

Today I came across something interesting in the Exploratory Data Analysis course: Simpson's paradox.

Somehow I never heard about it before or just did not keep attention to it. The idea is the following: main trend or correlation is different for distinct groups in the dataset compared to the dataset as a whole. Such effect is usually introduced by some unrecognized confounding factor in the data.

Nice visualization and explanation can be found here.

Monday, February 17, 2014

IGV: how to activate split screen for arbitrary regions

IGV allows to use so called "split-screen" view in order to show 2 mate pairs from different genomic regions simultaneously. This mode can be called using feature "View mate region in split screen".

However it is also possible to activate this mode for arbitrary regions. This can be useful for instance when looking at fusion genes. To turn on the split-mode type the coordinates of the regions separated by space in the Location field. For example, "chr15:75094339-75094402 chr11:7655726-7655936".

Tested in IGV 2.2.2

Friday, January 17, 2014

Getting information about ENCODE dataset

The ENCODE project has vast amounts of interesting data to offer. However for me it was not very clear how to get detailed information about a particular sample. Here is a quick short-cut.

I was searching for raw RNA-seq data from several cancer cell-lines.
There is a a web-interface that allows to search inside ENCODE datasets by applying particular filters. In my case I used filters "Cell, tissue or DNA sample" and "Experiment type".

The result of the query is a list of files describing samples, which pass the search criteria. Most likely one might be willing to get more information about a particular sample. However, in the table describing the results I could not find a link to a corresponding publication or at least to a page with the description of the dataset.

One way to find this information is to copy the GEO sample accession and then search for it in the GEO Accession viewer.

From GEO-sample there is usually a link to the GEO-series which has also the link to the publication.

Note, that if you are searching for tracks, it is easy to obtain summary of the track just by double clicking on it after adding in the UCSC Genome browser.

Hope that was useful!

thoughts about ...