## Thursday, April 12, 2012

### Using regular expressions to analyze CIGAR

Suppose we want to know the length associated with a specific CIGAR operator in a SAM record. Regular expressions solve this problem easily. I will use Python, but the code can be adapted to any other programming language that supports regular expressions.

For example, let's find all alignment matches (the M operator) in a CIGAR string:

```
import re

# Capture the length of every M (alignment match) operation
match = re.findall(r'(\d+)M', '40M25N5M')
print(match)  # ['40', '5']
```

The expression \d+M matches every substring of the form "nM", where n is a number consisting of one or more digits. The parentheses create a group around the number, so re.findall returns only the numbers (as strings) rather than the whole match.
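Because the captured values are strings, they have to be converted to integers before doing any arithmetic on them. A minimal sketch of that step, where `total_matched` is a hypothetical helper name used only for illustration:

```python
import re

def total_matched(cigar):
    """Sum the lengths of all M operations in a CIGAR string.

    Hypothetical helper for illustration; returns 0 if there is no M operation.
    """
    return sum(int(n) for n in re.findall(r'(\d+)M', cigar))

print(total_matched('40M25N5M'))  # 45
```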

Similarly, one can split a CIGAR string into all of its operations:

```
# Capture every (length, operator) pair, in order
match = re.findall(r'(\d+)(\w)', '40M25N5M')
print(match)  # [('40', 'M'), ('25', 'N'), ('5', 'M')]
```

Here we use the \w metacharacter, which matches any word character (a letter, a digit, or an underscore), and parentheses for grouping. Note that \w does not match the = operator.
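The (length, operator) pairs can then be folded into a per-operator total. The following is only a sketch: `operator_totals` and `CIGAR_OP` are hypothetical names, and the pattern uses an explicit operator class `[MIDNSHPX=]` instead of \w so that the = operator is also picked up.

```python
import re
from collections import defaultdict

# Assumption: the operator set is M, I, D, N, S, H, P, X and =
CIGAR_OP = re.compile(r'(\d+)([MIDNSHPX=])')

def operator_totals(cigar):
    """Return a dict mapping each operator to the sum of its lengths."""
    totals = defaultdict(int)
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)  # lengths are captured as strings
    return dict(totals)

print(operator_totals('40M25N5M'))  # {'M': 45, 'N': 25}
```

Operators that do not occur in the string are simply absent from the result, so `totals.get('I', 0)` is a convenient way to query them.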

Have fun!

#### 1 comment:

Christian Orellana said...

Nice trick! Only thing is, considering = is a valid operation on a cigar string according to the specification, I use:

r'([0-9]+)([MIDNSHPX=])'