Small task: we need to calculate the histogram of CpG island sizes.
The CpG islands are stored in a BED file:
kokonech@ultor:~$ head CpGIslandsByTakai.wihtNames.bed
track name="CpG islands by Takai et al." description="CpG islands by Takai et al." color=0,200,100 CpG_0 0 +
chr1 10231 11413 CpG_1 0 +
chr1 26760 27059 CpG_2 0 +
chr1 28597 29942 CpG_3 0 +
chr1 51388 52085 CpG_4 0 +
chr1 88289 88610 CpG_5 0 +
chr1 121148 121635 CpG_6 0 +
chr1 134035 134389 CpG_7 0 +
chr1 134896 135424 CpG_8 0 +
chr1 175186 175545 CpG_9 0 +
First we get the sizes using simple awk command:
awk {'print $3 - $2 + 1'} CpGIslandsByTakai.wihtNames.bed > CpGislands.sizes
Then I use a small python script to draw histogram.
The listing of the script can be found below.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import sys
x = []
file = open(sys.argv[1])
for line in file:
val = int(line)
if (val > 3000):
val = 3000
x.append ( val )
fig = plt.figure()
ax = fig.add_subplot(111)
# the histogram of the data
n, bins, patches = ax.hist(x, 50, normed=1, facecolor='green', alpha=0.75)
ax.set_xlabel('CpG island sizes')
ax.set_ylabel('Count')
ax.grid(True)
plt.show()