Friday, February 3, 2012

counting the sizes of cpg-islands

Small task: we need to calculate the histogram of CpG island sizes.

The CpG islands are stored in a BED file:

kokonech@ultor:~$ head CpGIslandsByTakai.wihtNames.bed
track name="CpG islands by Takai et al." description="CpG islands by Takai et al." color=0,200,100 CpG_0 0 +
chr1 10231 11413 CpG_1 0 +
chr1 26760 27059 CpG_2 0 +
chr1 28597 29942 CpG_3 0 +
chr1 51388 52085 CpG_4 0 +
chr1 88289 88610 CpG_5 0 +
chr1 121148 121635 CpG_6 0 +
chr1 134035 134389 CpG_7 0 +
chr1 134896 135424 CpG_8 0 +
chr1 175186 175545 CpG_9 0 +


First we get the sizes using simple awk command:

awk {'print $3 - $2 + 1'} CpGIslandsByTakai.wihtNames.bed > CpGislands.sizes


Then I use a small python script to draw histogram.
The listing of the script can be found below.

 import numpy as np  
 import matplotlib.pyplot as plt  
 import matplotlib.mlab as mlab  
 import sys  
   
 x = []  
   
 file = open(sys.argv[1])  
   
 for line in file:  
   val = int(line)  
   if (val > 3000):   
     val = 3000  
   x.append ( val )  
   
 fig = plt.figure()  
 ax = fig.add_subplot(111)  
   
 # the histogram of the data  
 n, bins, patches = ax.hist(x, 50, normed=1, facecolor='green', alpha=0.75)  
   
 ax.set_xlabel('CpG island sizes')  
 ax.set_ylabel('Count')  
 ax.grid(True)  
   
 plt.show()