doc/man/fa2htgs.1


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267

.TH FA2HTGS 1 2006-05-29 NCBI "NCBI Tools User's Manual"
.SH NAME
fa2htgs \- formatter for high throughput genome sequencing project submissions
.SH SYNOPSIS
.B fa2htgs
[\|\fB\-\fP\|]
[\|\fB\-6\fP\ \fIstr\fP\|]
[\|\fB\-7\fP\ \fIstr\fP\|]
[\|\fB\-A\fP\ \fIfilename\fP\|]
[\|\fB\-C\fP\ \fIstr\fP\|]
[\|\fB\-D\fP\|]
[\|\fB\-L\fP\ \fIfilename\fP\|]
[\|\fB\-M\fP\ \fIstr\fP\|]
[\|\fB\-N\fP\|]
[\|\fB\-O\fP\ \fIfilename\fP\|]
[\|\fB\-P\fP\ \fIstr\fP\|]
[\|\fB\-Q\fP\ \fIfilename\fP\|]
[\|\fB\-S\fP\ \fIstr\fP\|]
[\|\fB\-T\fP\ \fIfilename\fP\|]
[\|\fB\-X\fP\|]
[\|\fB\-a\fP\ \fIstr\fP\|]
[\|\fB\-b\fP\ \fIN\fP\|]
[\|\fB\-c\fP\ \fIstr\fP\|]
[\|\fB\-d\fP\ \fIstr\fP\|]
[\|\fB\-e\fP\ \fIfilename\fP\|]
[\|\fB\-f\fP\|]
\fB\-g\fP\ \fIstr\fP
[\|\fB\-h\fP\ \fIstr\fP\|]
[\|\fB\-i\fP\ \fIfilename\fP\|]
[\|\fB\-k\fP\ \fIstr\fP\|]
[\|\fB\-l\fP\ \fIN\fP\|]
[\|\fB\-m\fP\|]
[\|\fB\-n\fP\ \fIstr\fP\|]
[\|\fB\-o\fP\ \fIfilename\fP\|]
[\|\fB\-p\fP\ \fIN\fP\|]
[\|\fB\-q\fP\|]
[\|\fB\-r\fP\ \fIstr\fP\|]
\fB\-s\fP\ \fIstr\fP
[\|\fB\-t\fP\ \fIfilename\fP\|]
[\|\fB\-u\fP\|]
[\|\fB\-v\fP\|]
[\|\fB\-w\fP\|]
[\|\fB\-x\fP\ \fIstr\fP\|]
.SH DESCRIPTION
\fBfa2htgs\fP is a program used to generate Seq-submits (an ASN.1
sequence submission file) for high throughput genome sequencing
projects.
.PP
\fBfa2htgs\fP will read a FASTA file (or an Ace Contig file with Phrap
sequence quality values), a Sequin submission template file, (to get
contact and citation information for the submission), and a series of
command line arguments (see below).  This program will then combines
these information to make a submission suitable for GenBank. Once you
have generated your submission file, you need to follow the submission
protocol (see the README present on your FTP account or mailed out to
your Center).
.PP
\fBfa2htgs\fP is intended for the automation by scripts for bulk
submission of unannotated genome sequence. It can easily be extended
from its current simple form to allow more complicated processing.  A
submission prepared with \fBfa2htgs\fP can also be read into
\fBPsequin\fP(1), and then annotated more extensively.
.PP
Questions and concerns about this processing protocol, or how to 
use this tool should be forwarded to <htgs@ncbi.nlm.nih.gov>.
.SH OPTIONS
A summary of options is included below.
.TP
\fB\-\fP
Print usage message
.TP
\fB\-6\fP\ \fIstr\fP
SP6 clone (e.g., Contig1,left)
.TP
\fB\-7\fP\ \fIstr\fP
T7 clone (e.g., Contig2,right)
.TP
\fB\-A\fP\ \fIfilename\fP
Filename for accession list input (mutually exclusive with \fB\-T\fP
and \fB\-i\fP).  The input file contains a tab-delimited table with
three to five columns, which are accession number, start position,
stop position, and (optionally) length and strand.  If start > stop,
the minus strand on the referenced accession is used.  A gap is
indicated by the word "gap" instead of an accession, 0 for the start
and stop positions, and a number for the length.
.TP
\fB\-C\fP\ \fIstr\fP
Clone library name (will appear as \fB/clone-lib="\fP\fIstr\fP\fB"\fP
on the source feature)
.TP
\fB\-D\fP
HTGS_DRAFT sequence
.TP
\fB\-L\fP\ \fIfilename\fP
Read phrap contig order from \fIfilename\fP.  This is a tab-delimited
file that can be used to drive the order of contigs (normally
specified by \fB\-P\fP), as well as indicating the SP6 and T7 ends.  It
can also be used when contigs are known to be in opposite orientation.
For example:
.nf

    Contig2     +       1       SP6     left
    Contig3     +       1
    Contig1     -               T7      right

.fi
The first column is the contig name, the second is the orientation,
the third is the fragment_group, the fourth indicates the SP6 or T7
end, and the fifth says which side of SP6 or T7 end had vector
removed.
.TP
\fB\-M\fP\ \fIstr\fP
Map name (will appear as \fB/map="\fP\fIstr\fP\fB"\fP on the source feature)
.TP
\fB\-N\fP
Annotate assembly_fragments
.TP
\fB\-O\fP\ \fIfilename\fP
Read comment from \fIfilename\fP (100-character-per-line maximum;
\fB~\fP is a linebreak and \fB`~\fP is a literal \fB~\fP.  You can
check the format with \fBPSequin\fP(1).)
.TP
\fB\-P\fP\ \fIstr\fP
Contigs to use, separated by commas.  If \fB\-P\fP is not indicated
with the \fB\-T\fP option, then the fragments will go in in the order
that they are in the ace file (which is appropriate for a phase 1
record, but not for a phase 2 or 3).  If you need to set the order of
the segments of the ace file, you need to set it with the \fB\-P\fP
flag, like this: \fB\-P "Contig1,Contig4,Contig3,Contig2,Contig5"\fP
.TP
\fB\-Q\fP\ \fIfilename\fP
Read quality scores from \fIfilename\fP
.TP
\fB\-S\fP\ \fIstr\fP
Strain name
.TP
\fB\-T\fP\ \fIfilename\fP
Filename for phrap input (mutually exclusive with \fB\-A\fP and \fB\-i\fP)
.TP
\fB\-X\fP
The coordinates in the input file are on the resulting segmented
sequence.  (Bases 1 through \fIn\fP of each accession are used.)
Otherwise, the coordinates are on the individual accessions, which
need not start at base 1 of the record.
.TP
\fB\-a\fP\ \fIstr\fP
GenBank accession; use if and only if updating a sequence.
.TP
\fB\-b\fP\ \fIN\fP
Gap length (default = 100; anything from 0 to 1000000000 is legal)
.TP
\fB\-c\fP\ \fIstr\fP
Clone name (will appear as \fB/clone\fP in the source feature; can be
the same as \fB\-s\fP)
.TP
\fB\-d\fP\ \fIstr\fP
Title for sequence (will appear in GenBank \fBDEFINITION\fP line)
.TP
\fB\-e\fP\ \fIfilename\fP
Log errors to \fIfilename\fP
.TP
\fB\-f\fP
htgs_fulltop keyword
.TP
\fB\-g\fP\ \fIstr\fP
Genome Center tag (probably the same as your login name on the NCBI FTP server)
.TP
\fB\-h\fP\ \fIstr\fP
Chromosome (will appear as \fB/chromosome\fP in the source feature)
.TP
\fB\-i\fP\ \fIfilename\fP
Filename for fasta input (default is stdin; mutually exclusive with
\fB\-A\fP and \fB\-T\fP)
.TP
\fB\-k\fP\ \fIstr\fP
Add the supplied string as a keyword.
.TP
\fB\-l\fP\ \fIN\fP
Length of sequence in bp (default = 0). The length is checked against
the actual number of bases we get. For phase 1 and 2 sequence it is
also used to estimate gap lengths. For phase 1 and 2 records, it is
important to use a number GREATER than the amount of provided
nucleotide, otherwise this will generate false 'gaps'.  Here is
assumed that the putative full length of the BAC or cosmid will be
used.  There should be at least 20 to 30 'n' in between the segments
(you can check for these in Sequin), as this will ensure proper
behavior when this sequence is used with BLAST.  Otherwise
'artifactual' unrelated segment neighbors may be brought into
proximity of each other.
.TP
\fB\-m\fP
Take comment from template
.TP
\fB\-n\fP\ \fIstr\fP
Organism name (default = Homo sapiens)
.TP
\fB\-o\fP\ \fIfilename\fP
Filename for asn.1 output (default = stdout)
.TP
\fB\-p\fP\ \fIN\fP
HTGS phase:
.RS
.PD 0
.IP 1
A collection of unordered contigs with gaps of unknown length.  A
Phase 1 record must at the very least have two segments with one gap.
(default)
.IP 2
A series of ordered contigs, possibly with known gap lengths.  This
could be a single sequence without gaps, if the sequence has
ambiguities to resolve.
.IP 3
A single contiguous sequence.  This sequence is finished, but not
necessarily annotated.
.PD
.RE
.TP
\fB\-q\fP
htgs_cancelled keyword
.TP
\fB\-r\fP\ \fIstr\fP
Remark for update (brief comment describing the nature of the update,
such as "new sequence", "new citation", or "updated features")
.TP
\fB\-s\fP\ \fIstr\fP
Sequence name.  The sequence must have a name that is unique within
the genome center. We use the combination of the genome center name
(\fB\-g\fP argument) and the sequence name (\fB\-s\fP) to track this
sequence and to talk to you about it.  The name can have any form you
like but must be unique within your center.
.TP
\fB\-t\fP\ \fIfilename\fP
Filename for Seq-submit template (default = template.sub)
.TP
\fB\-u\fP
Take biosource from template
.TP
\fB\-v\fP
htgs_activefin keyword
.TP
\fB\-w\fP
Whole Genome Shotgun flag
.TP
\fB\-x\fP\ \fIstr\fP
Secondary accession numbers, separated by commas, s.t. U10000,L11000.
.PP
.RS
In some cases a large segment will supersede another or group of other
accession numbers (records).  These records which are no longer wanted
in GenBank should be made secondary. Using the \fB\-x\fP argument you
can list the Accession Numbers you want to make secondary.  This will
instruct us to remove the accession number(s) from GenBank, and will
no longer be part of the GenBank release. They will nonetheless be
available from Entrez.
.PP
\fBGREAT CARE\fP should be taken when using this argument!!!  Improper
use of accession numbers here will result in the inappropriate
withdrawal of GenBank records from GenBank, EMBL and DDBJ.  We provide
this parameter as a convenience to submitting centers, but this may
need to be removed if it is not used carefully.
.RE
.SH AUTHOR
The National Center for Biotechnology Information.
.SH SEE ALSO
.ad l
.BR Psequin (1),
fa2htgs/README