Thursday, April 12, 2012

Using regular expressions to analyze CIGAR

Suppose we want to know the length associated with a specific CIGAR operator in a SAM record. This is easy to solve with regular expressions. I will use Python, but the code can be adapted to any other programming language that supports regular expressions.

For example, let's search for all alignment matches (the M operator) in a CIGAR string:

In [55]: import re

In [56]: match = re.findall(r'(\d+)M', '40M25N5M')

In [57]: print match
-------> print(match)
['40', '5']


The expression (\d+)M matches every substring of the form "nM", where n is a number made up of one or more digits. The parentheses create a group around the number, so findall returns only the digits rather than the whole match.
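As a small follow-up, the captured strings can be converted to integers and summed to get the total length covered by M operations. This is only a sketch; the helper name total_match_length is made up for illustration and is not part of any library.

```python
import re

def total_match_length(cigar):
    """Sum the lengths of all M operations in a CIGAR string.

    Hypothetical helper, written only for this example.
    """
    # findall returns the captured digit strings, e.g. ['40', '5']
    return sum(int(n) for n in re.findall(r'(\d+)M', cigar))

print(total_match_length('40M25N5M'))  # 45
```

A CIGAR string without any M operation simply gives 0, because findall returns an empty list.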

Similarly, one can iterate over all operations in a CIGAR string:


In [60]: match = re.findall(r'(\d+)(\w)', '40M25N5M')

In [61]: match
Out[61]: [('40', 'M'), ('25', 'N'), ('5', 'M')]


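Here \w matches a single word character (a letter, digit, or underscore), which in this string is always the operator letter that follows the number. The parentheses group the number and the operator separately, so findall returns one tuple per operation.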
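Building on that, here is a minimal sketch that converts the pairs to integers and totals the length per operator. The function names parse_cigar and length_per_operator are hypothetical. Instead of \w, it assumes the explicit operator set M, I, D, N, S, H, P, = and X from the SAM specification, so that = is recognised as well.

```python
import re
from collections import defaultdict

# Assumption: operator set taken from the SAM specification.
CIGAR_OP = re.compile(r'(\d+)([MIDNSHPX=])')

def parse_cigar(cigar):
    """Return a list of (length, operator) tuples (hypothetical helper)."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def length_per_operator(cigar):
    """Return a dict mapping each operator to its total length."""
    totals = defaultdict(int)
    for length, op in parse_cigar(cigar):
        totals[op] += length
    return dict(totals)

print(parse_cigar('40M25N5M'))          # [(40, 'M'), (25, 'N'), (5, 'M')]
print(length_per_operator('40M25N5M'))  # M totals 45, N totals 25
```

Note that findall silently skips any part of the string that does not match, so this sketch does not validate the CIGAR string.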

Have fun!

1 comment:

Christian Orellana said...

Nice trick! Only thing is, considering = is a valid operation on a cigar string according to the specification, I use:

r'([0-9]+)([MIDNSHPX=])'