Wednesday, June 20, 2012

Another crazy one-liner: list of genes in GTF file

Well, here is another UNIX one-liner (thanks to my supervisor), which is really crazy:  

grep codon hg19.genes.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,"\"");print a[4]}' | sort | uniq | grep -f - hg19.genes.gtf > hg19.genes.withCDS.gtf 

Ok, what's going on here?

Well, I want to sort the GTF file in the following way: only leave records of genes, which have the CDS regions defined. The lines defining CDS have the type "start_codon" or "stop_codon". One can write a script in Python which will parse the GTF file, collect all lines for single gene record and output them only if CDS is present. However, it is also possible to do this using awk and grep.

Have fun deciphering this one :)

A couple of hints: