The structure is the same as GFF, so the fields are:
<seqname> <source> <feature> <start> <end> <score>
<strand> <frame> [attributes] [comments]
Here is a simple example with 3 translated exons. Order of rows is not important.
AB000381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
AB000381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";
The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.
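As a rough illustration of the tab-separated layout, the sketch below splits one GTF line into its fields. The helper name parse_gtf_line and the dictionary keys are assumptions made for this example, not part of the format, and trailing comments are not handled.

```python
GTF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame"]

def parse_gtf_line(line):
    """Split one tab-separated GTF line into a dict (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    record = dict(zip(GTF_FIELDS, cols[:8]))
    # Coordinates are integers; the other columns are kept as text.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Everything after the eighth TAB is kept verbatim as the attribute text.
    record["attributes"] = "\t".join(cols[8:])
    return record

line = 'AB000381\tTwinscan\tCDS\t380\t401\t.\t+\t0\tgene_id "001"; transcript_id "001.1";'
rec = parse_gtf_line(line)
print(rec["feature"], rec["start"], rec["end"], rec["frame"])
```

Splitting on TAB only, rather than on any whitespace, keeps the single spaces inside the attribute column intact.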
<seqname>
The FPC contig ID from the Golden Path.
<source>
The source column should be a unique label indicating where the annotations
came from --- typically the name of either a prediction program or a public
database.
<feature>
The following feature types are required: "CDS", "start_codon", "stop_codon".
The feature "exon" is optional, since this project will not evaluate predicted
splice sites outside of protein coding regions. All other features will
be ignored.
CDS represents the coding sequence starting with the first translated codon and proceeding to the last translated codon. Unlike GenBank annotation, the stop codon is not included in the CDS for the terminal exon.
<start> <end>
Integer start and end coordinates of the feature relative to the beginning
of the sequence named in <seqname>. <start> must be less than
or equal to <end>. Sequence numbering starts at 1. Values of <start>
and <end> that extend outside the reference sequence are technically
acceptable, but they are discouraged for purposes of this project.
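The coordinate rules can be sketched as a small check. This is only an illustration: the function names and the seq_len parameter (the length of the reference sequence) are hypothetical, and the inclusive-end convention is inferred from the 3 bp start_codon rows in the examples (380 to 382).

```python
def feature_length(start, end):
    """Length in bases of a feature with 1-based, inclusive coordinates."""
    if start > end:
        raise ValueError("<start> must be less than or equal to <end>")
    return end - start + 1

def outside_reference(start, end, seq_len):
    """True if the feature extends past the reference sequence.

    Such coordinates are technically acceptable but discouraged,
    so this reports them instead of raising an error.
    """
    return start < 1 or end > seq_len

print(feature_length(380, 382))
print(outside_reference(380, 382, seq_len=1000))
```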
<score>
The score field will not be used for this project, so you can either
provide a meaningful float or replace it by a dot.
<frame>
0 indicates that the first whole codon of the reading frame is located
at 5'-most base. 1 means that there is one extra base before the first
codon and 2 means that there are two extra bases before the first codon.
Note that the frame is not the length of the CDS mod 3.
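To make the last point concrete, here is a minimal sketch of how the frame carries over from one CDS segment to the next. It assumes the first segment of the transcript begins with a complete codon (frame 0); the function names are hypothetical. For the '-' strand the segments are walked from the largest coordinates down, since the coding region then runs from <end> to <start>.

```python
def next_frame(frame, length):
    """Frame of the following CDS segment, given this one's frame and length."""
    return (frame - length) % 3

def cds_frames(segments, strand):
    """Map each (start, end) CDS segment of one transcript to its frame."""
    # Walk the segments 5' to 3': ascending on '+', descending on '-'.
    ordered = sorted(segments, reverse=(strand == "-"))
    frames = {}
    frame = 0  # assumption: the first segment starts with a whole codon
    for start, end in ordered:
        frames[(start, end)] = frame
        frame = next_frame(frame, end - start + 1)
    return frames

# CDS segments of the simple three-exon example on the '+' strand.
plus = cds_frames([(380, 401), (501, 650), (700, 707)], "+")
print(plus)
```

The second segment (501 to 650) has length 150, which is 0 mod 3, yet its frame is 2: the frame depends on the segments that precede it, not on its own length.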
Here are the details excised from the GFF spec. Important: Note comment on reverse strand.
'0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is value of <end>, because the corresponding coding region will run from <end> to <start> on the reverse strand.
[attributes]
Attributes must end in a semicolon which must then be separated from the start of any subsequent attribute by exactly one space character (NOT a tab character).
Textual attributes should be surrounded by doublequotes.
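The attribute syntax can be read and written with a few lines of code. The sketch below is one possible approach, not a reference parser: the names parse_attributes and format_attributes are hypothetical, and the writer assumes every value is textual and therefore quotes it.

```python
import re

# One attribute: a key, one space, a quoted or bare value, then a semicolon.
ATTRIBUTE = re.compile(r'(\S+) (?:"([^"]*)"|([^";\s]+));')

def parse_attributes(text):
    """Return the attributes of one GTF line as a dict of strings."""
    attrs = {}
    for match in ATTRIBUTE.finditer(text):
        key, quoted, bare = match.groups()
        attrs[key] = quoted if quoted is not None else bare
    return attrs

def format_attributes(attrs):
    """Write attributes back, each ending in ';' and joined by one space."""
    return " ".join(f'{key} "{value}";' for key, value in attrs.items())

attrs = parse_attributes('gene_id "001"; transcript_id "001.1";')
print(attrs)
print(format_attributes(attrs))
```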
Here is an example of a gene on the negative strand. Larger coordinates are 5' of smaller coordinates. Thus, the start codon is the 3 bp with the largest coordinates among all those bp that fall within the CDS regions. Similarly, the stop codon is the 3 bp with coordinates just less than the smallest coordinates within the CDS regions.
AB000123 Twinscan CDS 193817 194022 . - 2 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan CDS 199645 199752 . - 2 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan CDS 200369 200508 . - 1 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan CDS 215991 216028 . - 0 gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan start_codon 216026 216028 . - . gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123 Twinscan stop_codon 193814 193816 . - . gene_id "AB000123.1"; transcript_id "AB00123.1.2";
Note the frames of the coding exons. The following example, on the positive strand, also includes the optional "exon" feature:
AB000381 Twinscan exon 150 200 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 300 401 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS 380 401 . + 0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 501 650 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS 501 650 . + 2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 700 800 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS 700 707 . + 2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon 900 1000 . + . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan start_codon 380 382 . + 0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan stop_codon 708 710 . + 0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
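In the examples, the start codon overlaps the first 3 bp of the CDS and the stop codon lies immediately outside it, on whichever side the strand dictates. The sketch below derives those positions from the CDS segments of one transcript. It assumes that neither codon is split across an intron and that the transcript is complete; the function name expected_codons is hypothetical.

```python
def expected_codons(cds_segments, strand):
    """Expected (start, end) of the start and stop codons for one transcript."""
    lo = min(start for start, _ in cds_segments)
    hi = max(end for _, end in cds_segments)
    if strand == "+":
        # Start codon opens the CDS; stop codon follows the last CDS base.
        return {"start_codon": (lo, lo + 2), "stop_codon": (hi + 1, hi + 3)}
    # On '-', larger coordinates are 5' of smaller ones, so the roles flip.
    return {"start_codon": (hi - 2, hi), "stop_codon": (lo - 3, lo - 1)}

plus = expected_codons([(380, 401), (501, 650), (700, 707)], "+")
minus = expected_codons(
    [(193817, 194022), (199645, 199752), (200369, 200508), (215991, 216028)], "-")
print(plus)
print(minus)
```

Comparing these values with the start_codon and stop_codon rows of a file is one quick way to catch a CDS that accidentally includes the stop codon.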