Well, here is another UNIX one-liner (thanks to my supervisor), which is really crazy:
grep codon hg19.genes.gtf | awk 'BEGIN{FS="\t";OFS="\t"}{split($9,a,"\"");print a[4]}' | sort | uniq | grep -f - hg19.genes.gtf > hg19.genes.withCDS.gtf
Ok, what's going on here? Well, I want to filter the GTF file in the following way: keep only the records of genes that have CDS regions defined. The lines that mark a CDS have the type "start_codon" or "stop_codon". One could write a Python script that parses the GTF file, collects all lines of a single gene record, and outputs them only if a CDS is present. However, it is also possible to do this using awk and grep.
Have fun deciphering this one :)
A couple of hints:
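- split(s, a, sep) is an awk function that splits the string s on the separator sep into the array a; here the separator is a double quote, so a[4] is the second quoted value in column 9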
- the "-f" option for grep reads its patterns from a file, one per line; a dash ("-") in place of the file name means standard input (at least with GNU grep)
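
For comparison, here is a rough sketch of the Python route mentioned above. It is not a drop-in replacement for the one-liner, just one way it could look. The function names record_key and filter_with_cds are made up for this example. It assumes every non-comment line has nine tab-separated columns, and that the second quoted value in column 9 (the same thing awk's a[4] picks out) identifies the record. Unlike grep -f, it compares that value exactly instead of searching for it anywhere in the line, so the output can differ slightly.

```python
# Feature types that signal a defined CDS, as described in the post.
CODON_TYPES = {"start_codon", "stop_codon"}


def record_key(line):
    """Return the second quoted value of column 9 (awk's a[4] in the one-liner)."""
    attributes = line.rstrip("\n").split("\t")[8]
    # Splitting on the double quote puts quoted values at odd indices:
    # index 1 is the first quoted value, index 3 the second.
    return attributes.split('"')[3]


def filter_with_cds(lines):
    """Keep only lines whose record also has a start_codon or stop_codon line."""
    # Skip blank lines and '#' comment lines before parsing columns.
    lines = [l for l in lines if l.strip() and not l.startswith("#")]
    # First pass: collect the keys of all records that have a codon line.
    keys_with_cds = {
        record_key(l) for l in lines if l.split("\t")[2] in CODON_TYPES
    }
    # Second pass: keep every line that belongs to one of those records.
    return [l for l in lines if record_key(l) in keys_with_cds]
```

To use it on a real file, one would read the lines of hg19.genes.gtf, pass them to filter_with_cds, and write the returned lines to hg19.genes.withCDS.gtf. The two passes mirror the pipeline: the first plays the role of grep codon plus awk, sort and uniq, the second the role of grep -f.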