Tuesday, April 24, 2012

BED to GFF in one line

Here you go:

awk 'BEGIN{OFS="\t"}{ print $1,"qualimap",$4,$2,$3,$5,$6,".","bed" }' CpGIslandsByTakai.human.bed > file.gff

NOTE:
Here I use on 6 fields of BED file, but you can use this script as a basis and allow more fields to be used.

Thursday, April 12, 2012

Using regular expressions to analyze CIGAR

Suppose we want to know what is the value of a specific CIGAR operator in a SAM record. This problem can be easily solved by using regular expressions. I will use Python, but one can adapt code to any other programming language supporting RegExp.

For example, let's search for all possible alignments matches (M operator) in CIGAR:

In [55]: import re

In [56]: match = re.findall(r'(\d+)M', '40M25N5M')

In [57]: print match
-------> print(match)
['40', '5']


Expression \d+M represents all strings having pattern "nM", where n is a number that can consist from one or multiple digits. Round brackets create a group from the number, so it can be accessed later.

Similarly one can iterate over a CIGAR string:


In [60]: match = re.findall(r'(\d+)(\w)', '40M25N5M')

In [61]: match
Out[61]: [('40', 'M'), ('25', 'N'), ('5', 'M')]


Here we use \w meta-symbol to represent any letter and round brackets for grouping.

Have fun!