Thursday, December 6, 2012

Use egrep to filter GTF file

Just an example using small regular expression to filter file by chromosome name:


cat Homo_sapiens.GRCh37.68.gtf | egrep "^(17|20)\b" > Homo_sapiens.GRCh37.68.17_20.gtf

Monday, October 22, 2012

Counting size of transcript from GTF file

Here I use grep to get records associated with query transcript and awk to perform the calculations:

cat test.gtf | grep "transcript_id \"ENSMUST00000105216\"" | egrep -v "CDS|start_codon|stop_codon" | awk '{l += $5 -$4 + 1}END{print l}'

Note that egrep is used to filter out all records containing CDS, start_codon or stop_codon. egrep is applied because it supports regular expression "logical or" statement.

Thursday, September 27, 2012

Tuesday, September 25, 2012

Finding a Debian package with with a missing include file

Have you ever had this problem: you build some application or library from source and have this stupid missing include file, which indicates that you don't have a dependency. Now you wonder which package you should install.

Luckily in Debian one can do the following:

apt-file search readproc.h


Source is here.

See ya folks!

Monday, August 20, 2012

Sorting dictionaries in python


Suppose we want to sort a dictionary based on its value, or a list of tuples based on some arbituary key.

Easy as a pie, just use itemgetter() function from operator:


>>> student_tuples = [
('john', 'A', 15),
('jane', 'B', 12),
('dave', 'B', 10),
]
>>> from operator import itemgetter
>>> sorted(student_tuples, key=itemgetter(2))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]


More examples are here:
http://wiki.python.org/moin/HowTo/Sorting

Wednesday, June 20, 2012

Another crazy one-liner: list of genes in GTF file

Well, here is another UNIX one-liner (thanks to my supervisor), which is really crazy:  

grep codon hg19.genes.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,"\"");print a[4]}' | sort | uniq | grep -f - hg19.genes.gtf > hg19.genes.withCDS.gtf 

Ok, what's going on here?

Well, I want to sort the GTF file in the following way: only leave records of genes, which have the CDS regions defined. The lines defining CDS have the type "start_codon" or "stop_codon". One can write a script in Python which will parse the GTF file, collect all lines for single gene record and output them only if CDS is present. However, it is also possible to do this using awk and grep.

Have fun deciphering this one :)

A couple of hints:

Tuesday, April 24, 2012

BED to GFF in one line

Here you go:

awk 'BEGIN{OFS="\t"}{ print $1,"qualimap",$4,$2,$3,$5,$6,".","bed" }' CpGIslandsByTakai.human.bed > file.gff

NOTE:
Here I use on 6 fields of BED file, but you can use this script as a basis and allow more fields to be used.

Thursday, April 12, 2012

Using regular expressions to analyze CIGAR

Suppose we want to know what is the value of a specific CIGAR operator in a SAM record. This problem can be easily solved by using regular expressions. I will use Python, but one can adapt code to any other programming language supporting RegExp.

For example, let's search for all possible alignments matches (M operator) in CIGAR:

In [55]: import re

In [56]: match = re.findall(r'(\d+)M', '40M25N5M')

In [57]: print match
-------> print(match)
['40', '5']


Expression \d+M represents all strings having pattern "nM", where n is a number that can consist from one or multiple digits. Round brackets create a group from the number, so it can be accessed later.

Similarly one can iterate over a CIGAR string:


In [60]: match = re.findall(r'(\d+)(\w)', '40M25N5M')

In [61]: match
Out[61]: [('40', 'M'), ('25', 'N'), ('5', 'M')]


Here we use \w meta-symbol to represent any letter and round brackets for grouping.

Have fun!

Thursday, March 8, 2012

Sum up a column in a tab delimited file

Suppose we have a tab-delimited file:

kokonech@ultor:~/playgrnd/densityAnalysis/density_repeats.64.LTRs.bed$ head 24h-i-input.bam.bed
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 1 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 2 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 3 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 4 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 5 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 6 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 7 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 8 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 9 0
1 83886031 83886750 ERVL-E-int 980 -1 N 187 LTRs 10 0


We want to sum up a column, let's say the 11th.
A piece of cake using Awk:

awk '{a+=$11}END{printf "%i\n",a}' 24h-i-input.bam.bed

Friday, February 3, 2012

counting the sizes of cpg-islands

Small task: we need to calculate the histogram of CpG island sizes.

The CpG islands are stored in a BED file:

kokonech@ultor:~$ head CpGIslandsByTakai.wihtNames.bed
track name="CpG islands by Takai et al." description="CpG islands by Takai et al." color=0,200,100 CpG_0 0 +
chr1 10231 11413 CpG_1 0 +
chr1 26760 27059 CpG_2 0 +
chr1 28597 29942 CpG_3 0 +
chr1 51388 52085 CpG_4 0 +
chr1 88289 88610 CpG_5 0 +
chr1 121148 121635 CpG_6 0 +
chr1 134035 134389 CpG_7 0 +
chr1 134896 135424 CpG_8 0 +
chr1 175186 175545 CpG_9 0 +


First we get the sizes using simple awk command:

awk {'print $3 - $2 + 1'} CpGIslandsByTakai.wihtNames.bed > CpGislands.sizes


Then I use a small python script to draw histogram.
The listing of the script can be found below.

 import numpy as np  
 import matplotlib.pyplot as plt  
 import matplotlib.mlab as mlab  
 import sys  
   
 x = []  
   
 file = open(sys.argv[1])  
   
 for line in file:  
   val = int(line)  
   if (val > 3000):   
     val = 3000  
   x.append ( val )  
   
 fig = plt.figure()  
 ax = fig.add_subplot(111)  
   
 # the histogram of the data  
 n, bins, patches = ax.hist(x, 50, normed=1, facecolor='green', alpha=0.75)  
   
 ax.set_xlabel('CpG island sizes')  
 ax.set_ylabel('Count')  
 ax.grid(True)  
   
 plt.show()  

Tuesday, January 10, 2012

Extract single chromosome reads from BAM/SAM

Sometimes it is required to extract subset of reads for only one specific chromosome.
It's rather easy to accomplish this task with SAMtools.

First we create the index for a BAM file. It is required for random region positioning.

samtools index accepted_hits.bam

Then we extract the data for specific region, for example chromosome 20.

samtools view -h accepted_hits.bam chr20 > accepted_hits.20.sam