TBL2ASN AUTOMATED BULK SUBMISSION PROGRAM

tbl2asn is a program that automates the submission of sequence records to
GenBank.  It uses many of the same functions as Sequin, but is driven
entirely by data files, and records need no additional manual editing before
submission.  Entire genomes, consisting of many chromosomes with feature
annotation, can be processed in seconds using this method.

For a submission, tbl2asn expects a template file containing a text ASN.1
Submit-block object.  These can be generated in Sequin and saved for use by
tbl2asn.  The Submit-block contains contact information (to whom questions
on the submission can be addressed) and a submission citation (which lists
the authors who get scientific credit for the sequencing).

The template file can also contain one or more text ASN.1 Seq-descr objects
(such as Title or BioSource) appended after the Submit-block.  These can
also be generated in Sequin and saved to a file, and then appended to the
template with a text editor.  They will become descriptors packaged at the
top of each submission file.

tbl2asn reads six other kinds of data files.  Nucleotide sequence data is
expected in FASTA format, and these files are identified by having a .fsa
suffix.  Feature table files, in the five-column format described later,
have a .tbl suffix.  These can be easily generated by most genome centers
that maintain feature locations in a spreadsheet or database.  For sets of
data records, a source qualifier table can be placed in a .src file.  The
protein translations of CDS features can be supplied as FASTA sequences in
files with a .pep suffix.  These will replace the tbl2asn-generated
conceptual translations, and can be used to verify correct CDS intervals.
The nucleotide sequence products of mRNA features can be provided as FASTA
files with a .rna suffix.  Sequence quality scores can be supplied in files
with a .qvl suffix.

tbl2asn generates a .sqn file for submission to the database from these
input files.


COMMAND-LINE ARGUMENTS

To process a set of chromosomes, sets of .fsa and .tbl files (along with
optional .src, .pep, .rna, and .qvl files) are placed into a source
directory.  The path to this directory is specified in the -p command-line
argument.  The path for the resulting .sqn submission files is given in the
-r argument.  If the -r argument is not given, the .sqn files are saved in
the source directory.

For example, if an organism has fifteen chromosomes, one would expect at
least the following files in the source directory:

  chr01.fsa
  chr01.tbl
  chr02.fsa
  ...
  chr14.tbl
  chr15.fsa
  chr15.tbl

The exact names of the files are not important, but when a file with a
suffix of .fsa is found, tbl2asn will look for a file with the same prefix
that has a .tbl suffix, and then generate a .sqn file.
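
The pairing rule above can be sketched in Python.  This is only an
illustration of the file-naming convention, not tbl2asn's own code; the
function name plan_outputs and the dictionary layout are hypothetical.

```python
from pathlib import Path


def plan_outputs(source_dir, results_dir=None):
    """Map each .fsa file to its optional .tbl partner and its .sqn path."""
    source = Path(source_dir)
    # Without a results directory, output goes next to the input files.
    results = Path(results_dir) if results_dir is not None else source
    plan = []
    for fsa in sorted(source.glob("*.fsa")):
        tbl = fsa.with_suffix(".tbl")  # same prefix, .tbl suffix
        plan.append({
            "fsa": fsa,
            "tbl": tbl if tbl.exists() else None,
            "sqn": results / (fsa.stem + ".sqn"),
        })
    return plan
```

Calling plan_outputs on the fifteen-chromosome directory above would yield
one entry per .fsa file, each paired with the .tbl file of the same prefix.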

The -t command-line argument specifies the template file.

Normally a single FASTA sequence per .fsa file is expected.  If there are
multiple sequences, only the first is processed, unless one of two -a flag
variants is given.  These are discussed below.

The -a s flag tells tbl2asn to package the multiple FASTA components as a set
of unrelated sequences.  This accommodates users who create a single file
instead of one file per sequence.  A single FASTA component can have gap
indications (e.g., >?unk100), each on a separate line, as long as the gap
line is followed immediately by more sequence lines, with no intervening '>'
line carrying a mock identifier.

The -a d flag tells the program to make a delta sequence out of the multiple
components.  This can be used for HTGS submissions where the sequence of
the BAC/PAC clone has not been completely determined.  By convention, gaps
of 100 base pairs should be inserted in between the actual sequence
segments with lines containing an angle bracket '>', a question mark '?',
the letters "unk", and the length of the gap.

>?unk100
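
A gapped FASTA record of this kind could be assembled with a short Python
helper.  This is a sketch only: the function name join_with_gaps is
hypothetical, and it assumes the layout described earlier for a single
FASTA component, in which each gap line is followed directly by more
sequence lines.

```python
def join_with_gaps(seq_id, segments, gap_length=100):
    """Return FASTA text with a '>?unk<length>' line between segments."""
    lines = [">" + seq_id]
    for index, segment in enumerate(segments):
        if index > 0:
            # Gap line: '>', '?', the letters "unk", then the gap length.
            lines.append(">?unk%d" % gap_length)
        lines.append(segment)
    return "\n".join(lines) + "\n"
```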

The -g flag causes tbl2asn to generate a genomic product set.  Within the
set, the products of each related mRNA and CDS are packaged together in an
internal nuc-prot set.  The feature table must provide reciprocal
protein_id and transcript_id qualifiers in order to correctly identify each
mRNA/CDS pair.  From the resulting .sqn file, the genomic sequence, all
transcripts, and all proteins will be entered into the database and given
accessions.  Note, however, that -g cannot be used for records submitted to
GenBank.  It is only suitable for records going into RefSeq.

If a feature table is not given, the -k c flag tells tbl2asn to annotate the
longest Open Reading Frame (ORF) on each record.  The -k m flag allows
alternative start codons to be used when finding the ORF.  The protein will
be named 'unknown' unless the name is present in the .fsa file definition
line, e.g., [protein=helicase].  The two flags can be combined as -k cm.

Data records will be validated when the -V v flag is indicated.  Output is
saved to files with a .val suffix.  The validator checks for many things,
including internal stops in CDS features and mismatches between the CDS
translation and the supplied protein sequence.  Errors need to be corrected
before submitting files to GenBank.

GenBank format output is generated when the -V b flag is used.  Resulting
files have a .gbf suffix.  The validation and flatfile generation flags can
be combined as -V vb.
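
When tbl2asn is driven from a script, the flags described in this section
can be collected into an argument list.  The Python sketch below is
illustrative: the function name build_command and its parameter names are
hypothetical, and it uses only the flags documented above.  The exact
spelling an installed copy accepts should be checked against that copy's
own usage message.

```python
def build_command(source_dir, template, results_dir=None, fasta_set=None,
                  genomic_product_set=False, orf_flags="", verify_flags=""):
    """Assemble a tbl2asn argument list from the flags described above."""
    if fasta_set not in (None, "s", "d"):
        raise ValueError("fasta_set must be None, 's', or 'd'")
    cmd = ["tbl2asn", "-p", str(source_dir), "-t", str(template)]
    if results_dir is not None:
        cmd += ["-r", str(results_dir)]   # where .sqn files are written
    if fasta_set is not None:
        cmd += ["-a", fasta_set]          # 's' = set, 'd' = delta sequence
    if genomic_product_set:
        cmd.append("-g")                  # RefSeq only, not GenBank
    if orf_flags:
        cmd += ["-k", orf_flags]          # e.g. "c", "m", or "cm"
    if verify_flags:
        cmd += ["-V", verify_flags]       # e.g. "v", "b", or "vb"
    return cmd
```

The returned list can be handed to subprocess.run, which avoids shell
quoting problems with directory names that contain spaces.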


NUCLEOTIDE SEQUENCE FORMAT

tbl2asn can read nucleotide sequences of any size in FASTA format.  A FASTA
record consists of a single definition line, beginning with a '>' and
followed by optional text, and subsequent lines of sequence.  At minimum,
all definition lines must contain an identifier for the sequence, called
the SeqID.  The SeqID cannot begin with "assembly", as this is reserved for
entry of accession lists in Sequin.  Other optional information about the
biological source of the organism can also be encoded in brackets on the
definition line.  A sample definition line is

>Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]

Other elements include [topology=circular] and [location=mitochondrion].
RNA viruses would be indicated by [molecule=rna] and [moltype=genomic]. The
sequencing technique can be supplied as [tech=fli cDNA].  Many other source
qualifiers, such as map, clone, isolate, cell-line, and cultivar, can be
used.  For organisms that are not commonly submitted with tbl2asn, the
nuclear and mitochondrial genetic codes can be indicated by [gcode=1] and
[mgcode=3], respectively.  This will ensure proper translation of CDS
features.  Primary accessions of TPA (third party annotation) records are
given by [primary=xxx,xxx,...].  Finally, a general note can be added with
[note=xxx].

Note that the definition line must be a single line, with no return or
newline characters.  Some word-processors will word-wrap text, either during
display or when saving to a file, and care must be taken to avoid unwanted
newlines introduced by the editor.

>slpy [organism=Zea mays] [chromosome=9] Sleepy transposon
TGTAAGATCACTGCTGGGTTGTTGATGAGTTGAGCACCGCTCCCGGCACCCGTCTCCTCTCACGAAGATC
TTTAGGGTATGAAAAGTATCTGGAGTTCTTACACGACGGCGAGCCGCCTCTTCTCCGGACGCAGCCGGCC
AGCCTTCTTCTCCAAGTCACCTTTTACCGACTCCAAACCCCACCTCAAATACTCCACTCAATCCAGATCA
...
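
Definition lines can be checked before submission with a small parser.
The Python sketch below is not part of tbl2asn; parse_defline is a
hypothetical name, and the regular expression is a simplifying assumption
that bracketed values contain no nested brackets.

```python
import re

# [name=value] with no brackets inside the name or the value.
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into SeqID, modifiers, and title."""
    if not line.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    if "\n" in line.rstrip("\n"):
        raise ValueError("definition line must be a single line")
    parts = line[1:].strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    if seq_id.startswith("assembly"):
        raise ValueError("SeqID cannot begin with 'assembly'")
    rest = parts[1] if len(parts) > 1 else ""
    modifiers = {name.strip(): value.strip()
                 for name, value in MODIFIER.findall(rest)}
    # Whatever remains once the bracketed modifiers are removed is the title.
    title = " ".join(MODIFIER.sub(" ", rest).split())
    return seq_id, modifiers, title
```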

Multiple SeqIDs can be indicated in FASTA-style parsable strings.

>gnl|ZGP|chr1|gb|U28041
>gi|54465|emb|X16935|MMTCRAC


FEATURE TABLE FORMAT

tbl2asn reads features from a simple five-column tab-delimited table.  This
is described in more detail at

http://www.ncbi.nlm.nih.gov/Sequin/table.html

The feature table specifies the location and type of each feature, and
tbl2asn processes the feature intervals and translates any CDSs into
proteins. The first line of the table contains the following basic
information.

>Feature SeqId table_name 

The SeqId must be the same as that used on the sequence. The table name is
optional.  Subsequent lines of the table list the features.  Columns are
separated by tabs.

The first and second columns are the start and stop locations of the
feature, respectively.  The third column is the type of feature (the feature
key, e.g., gene, mRNA, CDS), the fourth column is a qualifier name (e.g.,
"product"), and the fifth is a qualifier value (e.g., the name of the protein
or gene).

A simple feature table is

>Feature sde3g
240     4084    gene
                        gene    SDE3
240     1361    mRNA
1450    1641
1730    3184
3275    4084
                        product RNA helicase SDE3

579     1361    CDS
1450    1641
1730    3184
3275    3880
                        product RNA helicase SDE3

If a feature contains multiple intervals, each interval is listed on a
separate line by its start and stop position.   Features that are on the
complementary strand are indicated by reversing the interval locations. 
Locations of partial (incomplete) features are indicated with a '>' or '<'
next to the number.
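
The table layout described above can be read with a few lines of Python.
This is a sketch, not tbl2asn's parser: the function names and the
dictionary layout are hypothetical, and the input is assumed to use real
tab characters between columns.

```python
def parse_location(text):
    """Return (position, is_partial) for '240', '<1', or '>500'."""
    partial = text[0] in "<>"
    return int(text.lstrip("<>")), partial


def parse_feature_table(text):
    """Parse a five-column, tab-delimited table into (seq_id, features)."""
    seq_id, features = None, []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if raw.startswith(">Feature"):
            seq_id = raw.split()[1]
            continue
        cols = raw.split("\t")
        cols += [""] * (5 - len(cols))      # pad short lines to five columns
        start, stop, key, qual, value = cols[:5]
        if start and stop:
            (s, s_part), (e, e_part) = parse_location(start), parse_location(stop)
            interval = {"start": s, "stop": e, "partial": s_part or e_part,
                        # reversed coordinates mean the complementary strand
                        "strand": "minus" if s > e else "plus"}
            if key:                         # a feature key starts a new feature
                features.append({"key": key, "intervals": [], "qualifiers": []})
            features[-1]["intervals"].append(interval)
        elif qual:                          # qualifier line for the last feature
            features[-1]["qualifiers"].append((qual, value))
    return seq_id, features
```

Lines that carry only a start and stop are added as further intervals of
the most recent feature, which mirrors the multi-interval mRNA and CDS
entries in the sample table.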

Gene features are always a single interval, and their location should cover
the intervals of all the relevant features.  If the gene feature spans the
intervals of the CDS or mRNA features for that gene, there is no need to
include gene qualifiers on those features in the table, since they will be
picked up by overlap.  Use of the overlapping gene can be suppressed by
adding a gene qualifier with the value "-".  This is important when, for
example, a tRNA is encoded within an intron of a housekeeping gene.

Translation exception qualifiers are parsed from the same style used in the
GenBank flatfile.

                        transl_except   (pos:591..593,aa:Sec)

The codon recognized and anticodon position of tRNAs can also be given.

                        codon_recognized   TGG
                        anticodon   (pos:7591..7593,aa:Trp)

In addition to the standard qualifiers seen in GenBank format, several other
tokens are used to direct values to specific fields in the ASN.1 data. 
These include gene_syn, gene_desc, locus_tag, prot_desc, prot_note,
region_name, bond_type, and site_type.

Genomic product sets require protein_id and transcript_id qualifiers on each
mRNA and CDS feature.  These are used to associate the correct pair of
features for packaging.

                        protein_id      lcl|sde3p
                        transcript_id   lcl|sde3m

Exceptional biological situations can be annotated by use of the exception
qualifier.  For example

                        exception       ribosomal slippage

The following are legal exception qualifier values

  RNA editing
  reasons given in citation
  ribosomal slippage
  trans-splicing
  alternative processing
  artificial frameshift
  nonconsensus splice site
  rearrangement required for product
  modified codon recognition
  alternative start codon
  dicistronic gene
  transcribed pseudogene

Since the International Nucleotide Sequence Database collaboration only
allows "RNA editing" and "reasons given in citation" to appear in release
mode, other exceptions are mapped to the /note qualifier in the flatfile.
However, each exception text string turns off specific validator tests that
would otherwise produce warning messages, so they should be entered as
exception qualifiers, not as notes.

Gene Ontology (GO) terms can be indicated with the following qualifiers

                        go_component    endoplasmic reticulum|0005783
                        go_process      glycolysis and gluconeogenesis|57|89197757|ACT,TEM
                        go_function     excision repair|93||IPD

The value field is separated by vertical bars '|' into a descriptive
string, the GO identifier (leading zeroes are retained), and optionally a
PubMed ID (or GO Reference number starting with a leading 0) and one or more
GO evidence codes.
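
The vertical-bar layout of a GO qualifier value can be split apart as shown
in the Python sketch below.  The function name parse_go_value and the
returned dictionary keys are hypothetical.  The identifier is kept as a
string so that any leading zeroes survive.

```python
def parse_go_value(value):
    """Split 'description|GO id|reference|evidence codes' into fields."""
    fields = value.split("|")
    fields += [""] * (4 - len(fields))   # reference and evidence are optional
    description, go_id, reference, evidence = fields[:4]
    return {
        "description": description,
        "go_id": go_id,                  # string, so leading zeroes are kept
        "reference": reference or None,  # PubMed ID or GO Reference number
        "evidence": [code for code in evidence.split(",") if code],
    }
```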


SOURCE TABLE FORMAT

For sets of sequences, a source qualifier table can optionally be placed in
a tab-delimited file with a .src extension.  The first line gives the
source qualifier names, separated by tabs.  The first column must be the
sequence identifier.  For example

sequence_id     organism    strain       isolate

The remaining lines each give the source qualifiers for one sequence.  For
example

sde3g           Zea mays    A69Y         JH90.6-2x12

The same information can be provided in the FASTA definition line or in the
source section of the five-column feature table.
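
A source table in this layout can be read into a dictionary keyed by
sequence identifier.  The Python sketch below is illustrative only; the
function name parse_source_table is hypothetical, and the input is assumed
to use real tab characters between columns.

```python
import csv
import io


def parse_source_table(text):
    """Return {sequence_id: {qualifier: value}} from tab-delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)              # first line names the qualifiers
    table = {}
    for row in reader:
        if not row or not row[0].strip():
            continue                   # skip blank lines
        # The first column is the sequence identifier; the rest are values.
        table[row[0]] = dict(zip(header[1:], row[1:]))
    return table
```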


PROTEIN SEQUENCE FORMAT

Protein sequences are FASTA files with a .pep extension that can substitute
for the translated product of a CDS feature.  Supplying these files acts as
a reality check that the CDS intervals do in fact translate to the expected
protein sequence.  The FASTA defline with a '>' and sequence identifier is
required.

>sde3p
MSVSUYKSDDEYSVIADKGEIGFIDYQNDGSSGCYNPFDEGPVVVSVPFPFKKEKPQSVTVGETSFDSFT
VKNTMDEPVDLWTKIYASNPEDSFTLSILKPPSKDSDLKERQCFYETFTLEDRMLEPGDTLTIWVSCKPK
...

The SeqID must match a protein_id in the .tbl file.  In the table above,
the protein_id and transcript_id need to explicitly use a 'lcl|' prefix
before the SeqID string to indicate a local identifier.  A local sequence
identifier is assumed when reading FASTA, but a database accession is
assumed in the feature table.
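
Because the two files spell local identifiers differently, a quick
cross-check can catch mismatches before tbl2asn is run.  The Python sketch
below is an illustration under that assumption; the function name
unmatched_protein_ids is hypothetical.

```python
def unmatched_protein_ids(pep_ids, table_protein_ids):
    """Return .pep SeqIDs with no 'lcl|'-prefixed protein_id in the table."""
    prefix = "lcl|"
    # Strip the explicit local prefix used in the feature table.
    local_ids = {pid[len(prefix):] for pid in table_protein_ids
                 if pid.startswith(prefix)}
    return [seq_id for seq_id in pep_ids if seq_id not in local_ids]
```

An empty result means every protein in the .pep file has a matching
protein_id qualifier.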

Sequin's Suggest Interval functionality, which can determine CDS intervals
from nucleotide and protein sequences plus the genetic code, is not used in
tbl2asn.  Instead, the CDS intervals are required, and the supplied protein
sequence is just used to confirm proper translation.


MESSENGER RNA SEQUENCE FORMAT

mRNA sequences are FASTA files with a .rna extension that can substitute
for the transcribed product of an mRNA feature.  Like the .pep files, they
act as a reality check that the supplied intervals do in fact encode the
expected mRNA sequence.

>sde3m
TTTTCATGTTTCTTCTCCTTTGAAGCCTGCCTGCGTTAGTCTGGCTTCATTGCTTCTCCATTTCTTGGTG
TGATCGAATCAAAGAGTGTAACCCATTTTGCTACTGATTCAGTACGTATGATCAATTCTCTCAATTTCAG
...

The SeqID must match the transcript_id from an mRNA feature.


QUALITY SCORES FORMAT

Phrap/Consed quality scores can be supplied in .qvl files.  These generate
Seq-graph data that will be attached to the nucleotide sequence from the
.fsa file. Programs such as Sequin can display these in a graphical view.

>chr1
 51 63 70 82 82 82 90 90 90 90 86 86 86 86 90 90 90 90 90 86
 86 86 86 86 86 86 86 90 90 90 90 90 90 86 86 78 78 90 90 86
...

These values can be extracted from the output files of the Phrap and Consed
programs used to process raw data from automated sequencing machines.
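
The layout shown above, a '>' line with the SeqID followed by lines of
whitespace-separated integers, can be read with the Python sketch below.
The function name parse_quality_scores is hypothetical, and the sketch
makes no attempt to check that the number of scores matches the sequence
length.

```python
def parse_quality_scores(text):
    """Return {seq_id: [scores]} from Phrap/Consed-style quality text."""
    scores = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            scores[current] = []
        elif current is not None:
            # One integer score per base, separated by whitespace.
            scores[current].extend(int(token) for token in line.split())
    return scores
```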


SUBMISSION TEMPLATE FORMAT

The submission template is an ASN.1 Submit-block that can be generated by
Sequin.  A simple example is shown below.

Submit-block ::= {
  contact {
    contact {
      name
        name {
          last "Darwin" ,
          first "Charles" ,
          initials "C.R." ,
          suffix "" } ,
      affil
        std {
          affil "Oxbridge University" ,
          div "Evolutionary Biology Department" ,
          city "Camford" ,
          country "United Kingdom" ,
          street "1859 Tennis Court Lane" ,
          email "darwin@beagle.edu.uk" ,
          phone "01 44 171-007-1212" ,
          postal-code "OX1 2BH" } } } ,
  cit {
    authors {
      names
        std {
          {
            name
              name {
                last "Darwin" ,
                first "Charles" ,
                initials "C.R." } } } ,
      affil
        std {
          affil "Oxbridge University" ,
          div "Evolutionary Biology Department" ,
          city "Camford" ,
          country "United Kingdom" ,
          street "1859 Tennis Court Lane" ,
          postal-code "OX1 2BH" } } ,
    date
      std {
        year 2003 ,
        month 2 ,
        day 28 } } ,
  subtype new  }

This can be exported from the Desktop view of a template file in Sequin, or
from the initial submission dialogs.  In addition, an unpublished reference
or comment can be generated in Sequin and saved from the Desktop.  The two
files can be concatenated to make a .sbt template with the publication or
comment descriptor after the submit block.