thoughts about ...: June 2012

Wednesday, June 20, 2012

Another crazy one-liner: list of genes in GTF file

Well, here is another UNIX one-liner (thanks to my supervisor), which is really crazy:

grep codon hg19.genes.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,"\"");print a[4]}' | sort | uniq | grep -f - hg19.genes.gtf > hg19.genes.withCDS.gtf

Ok, what's going on here?

Well, I want to sort the GTF file in the following way: only leave records of genes, which have the CDS regions defined. The lines defining CDS have the type "start_codon" or "stop_codon". One can write a script in Python which will parse the GTF file, collect all lines for single gene record and output them only if CDS is present. However, it is also possible to do this using awk and grep.

Have fun deciphering this one :)

A couple of hints:

split() is an awk function to split strings
the "-f" param for grep takes a dash, which is a pointer for standard input

thoughts about ...

Wednesday, June 20, 2012

Another crazy one-liner: list of genes in GTF file

About Me

Blog Archive