<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />
<title></title>
</head>
<body>
<pre>
README for standalone MEGABLAST
(last updated 10/20/2000)
Mega BLAST uses the greedy algorithm of Webb Miller et al. for nucleotide
sequence alignment search and concatenates many queries to save time spent
scanning the database. This program is optimized for aligning sequences that
differ slightly as a result of sequencing or other similar "errors". It is up to
10 times faster than more common sequence similarity programs and therefore can
be used to swiftly compare two large sets of sequences against each other.
Most of the options are similar to those of the blastall binary (see the
README.bls file for their descriptions). Note that the megablast binary does not
require the program option. The options that are specific to Mega BLAST, or that
have a different meaning in it, are explained in more detail below:
-----------------------------
-W Word size.
When W is divisible by 4, a Mega BLAST search is guaranteed to find all perfect
matches of length W + 3. Perfect matches as short as W may also be found, but
this is not guaranteed. Any value of W not divisible by 4 is equivalent to the
nearest value divisible by 4 (with 4*i+2 equivalent to 4*i).
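As an illustration of the rounding rule above, the following Python sketch maps
a requested word size to the value that is effectively used. The function names
are hypothetical and are not part of megablast. Applying the W + 3 guarantee to
the rounded value is an assumption drawn from the paragraph above.

```python
def effective_word_size(w: int) -> int:
    """Round w to the nearest multiple of 4; 4*i+2 rounds down to 4*i."""
    remainder = w % 4
    if remainder == 3:
        return w + 1          # 4*i+3 is closest to 4*(i+1)
    return w - remainder      # 4*i, 4*i+1 and 4*i+2 all map to 4*i


def guaranteed_match_length(w: int) -> int:
    """Perfect-match length that is always found (assumed: rounded W + 3)."""
    return effective_word_size(w) + 3


for w in (28, 29, 30, 31):
    print(w, effective_word_size(w), guaranteed_match_length(w))
```

Under these assumptions -W 29 and -W 30 behave like -W 28, while -W 31 behaves
like -W 32.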
-----------------------------
-G, -E Affine gapping penalties.
If these options are not set (both are 0), non-affine gapping is assumed, with
gap opening penalty 0 and gap extension penalty E, which is computed from the
match reward r and the mismatch penalty q by the formula E = r/2 - q. The
affine version of Mega BLAST requires significantly more memory, so it should
be avoided if possible, especially when some of the query or database
sequences are very long.
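The formula can be checked with a few lines of Python. This is a sketch only:
the function name is hypothetical, and q is assumed to be a negative score (for
example -3), so that E comes out as a positive penalty.

```python
from fractions import Fraction


def non_affine_gap_extension(reward: int, mismatch: int) -> Fraction:
    """E = r/2 - q, with q assumed to be a negative mismatch score."""
    return Fraction(reward, 2) - mismatch


print(float(non_affine_gap_extension(1, -3)))   # r = 1, q = -3
print(float(non_affine_gap_extension(2, -3)))   # r = 2, q = -3
```

With r = 1 and q = -3 this gives E = 3.5, and with r = 2 and q = -3 it gives
E = 4.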
-----------------------------
-D Type of the Mega BLAST output.
0: Produce one-line output for each alignment, in the form
'subject-id'=='[+-]query-id' (s_off q_off s_end q_end) score
Here subject(query)-id is a gi number, an accession or some other type of
identifier found in the FASTA definition line of the respective sequence.
+ or - indicates an alignment on the same or on the opposite strand.
With non-affine gapping parameters the score is the total number of
differences (mismatches + gaps). In the affine case it is the actual (raw)
score of the alignment.
1: Show the same output as level 0, plus the endpoints and percentage
of identical nucleotides for each ungapped segment in the alignment.
2: Show the traditional BLAST (blastn) output.
3: Show one-line output for each alignment, with the following fields
tab-separated:
Query id, Subject id, percent of identity, alignment length, number of
mismatches (not including gaps), number of gap openings, start of
alignment in query, end of alignment in query, start of alignment in
subject, end of alignment in subject, expected value, bit score.
If the alignment is from a reverse strand, the subject start and end are
printed in the reverse order, reflecting the actual direction of the
alignment.
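The two one-line formats described above (-D 0 and -D 3) are easy to read back
into a script. The Python sketch below is not part of megablast. It assumes that
the quotes and parentheses of the -D 0 form are printed literally, and the
sample lines are made up for illustration.

```python
import re

# Groups: subject id, strand sign, query id, s_off, q_off, s_end, q_end, score.
D0_LINE = re.compile(
    r"^'(.+)'=='([+-])(.+)' \((\d+) (\d+) (\d+) (\d+)\) (-?\d+)$"
)

D3_FIELDS = (
    "query_id", "subject_id", "pct_identity", "length", "mismatches",
    "gap_openings", "q_start", "q_end", "s_start", "s_end", "evalue",
    "bit_score",
)


def parse_d0(line: str) -> dict:
    """Parse one -D 0 line into a dictionary."""
    match = D0_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a -D 0 line: " + line)
    numbers = [int(g) for g in match.group(4, 5, 6, 7, 8)]
    return {
        "subject_id": match.group(1),
        "same_strand": match.group(2) == "+",
        "query_id": match.group(3),
        "s_off": numbers[0], "q_off": numbers[1],
        "s_end": numbers[2], "q_end": numbers[3],
        "score": numbers[4],
    }


def parse_d3(line: str) -> dict:
    """Parse one tab-separated -D 3 line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(D3_FIELDS):
        raise ValueError("expected 12 tab-separated fields")
    row = dict(zip(D3_FIELDS, values))
    for name in ("length", "mismatches", "gap_openings",
                 "q_start", "q_end", "s_start", "s_end"):
        row[name] = int(row[name])
    for name in ("pct_identity", "evalue", "bit_score"):
        row[name] = float(row[name])
    # Reverse-strand hits are printed with subject start greater than end.
    row["reverse_strand"] = row["s_start"] > row["s_end"]
    return row


# Made-up sample lines, for illustration only.
print(parse_d0("'gi|111'=='-gi|222' (5 1 104 100) 3"))
print(parse_d3("q1\ts1\t98.00\t100\t2\t0\t1\t100\t250\t151\t1e-45\t182"))
```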
-----------------------------
-F Filtering
This option is described in the README.bls file and in general works as it
does in the other BLAST programs. It actually combines two settings: the types
of filtering, and the stages of the search at which the filtered regions are
masked. The option is specified by a string that contains all types of filters
the user wants to apply, separated by semicolons or spaces. The available
filters for nucleotide BLAST or Mega BLAST searches are:
D - dust
R - Human repeats
V - Vector screen
L - low complexity (equivalent to D)
Finally, if the letter 'm' is included in the filter string, all types of
filters are used to mask the query sequence regions only at the word-finding
stage and do not affect the extension stage.
E.g. if the option -F "m D;R" is specified, then both dust and human repeats
filtering will be applied, but the alignments will be extended through the
filtered areas. With option -F "L;V" the dust and vector screen filters will
be applied, and the filtered areas will be masked for all stages of the
search.
The -F m option affects the lower case filtering (specified by the -U option)
as well. Therefore if one wants to use lower case filtering, but allow the
extension through lower case regions of the query sequence, the -F m -U T
combination of options must be used.
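To make the structure of the filter string concrete, here is a small Python
sketch that splits an -F value into the requested filters and the 'm' flag. It
is not the parser used by megablast. It handles only the letters listed above
and assumes that 'm' appears as a separate token.

```python
import re

FILTER_NAMES = {
    "D": "dust",
    "R": "human repeats",
    "V": "vector screen",
    "L": "dust",            # L is documented as equivalent to D
}


def parse_filter_string(option: str):
    """Return (sorted filter names, lookup_only flag) for an -F value."""
    tokens = [t for t in re.split(r"[;\s]+", option.strip()) if t]
    lookup_only = "m" in tokens
    filters = set()
    for token in tokens:
        if token == "m":
            continue
        if token not in FILTER_NAMES:
            raise ValueError("unknown filter: " + token)
        filters.add(FILTER_NAMES[token])
    return sorted(filters), lookup_only


print(parse_filter_string("m D;R"))
print(parse_filter_string("L;V"))
```

For "m D;R" the sketch reports dust and human repeats with masking limited to
the word-finding stage. For "L;V" it reports dust and vector screen with masking
at all stages.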
-----------------------------
-X X-dropoff value.
As in BLAST, this value provides a cutoff threshold for the tree exploration
of the extension algorithm. When the score of a given branch drops below the
current best score minus the X-dropoff, the exploration of this branch
stops.
-y X-dropoff for ungapped extensions.
-Z X-dropoff for the final gapped extension.
Both -y and -Z are used only in conjunction with the -n T option, i.e. when
non-greedy gapped extension is performed, as in blastn.
Note that all of these are raw values, as opposed to the bit values used by
other variations of BLAST.
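The X-dropoff rule itself can be shown on the simplest case, an ungapped
extension to the right. The Python sketch below is not the greedy algorithm
used by Mega BLAST. The function name and the default reward, penalty and
X-dropoff values are assumptions chosen for the example.

```python
def extend_right(query: str, subject: str, q_pos: int = 0, s_pos: int = 0,
                 reward: int = 1, penalty: int = -3, x_drop: int = 10):
    """Extend an ungapped match rightwards until the X-dropoff rule fires.

    Returns (best_score, length) of the best-scoring extension.
    """
    score = best = best_len = 0
    steps = min(len(query) - q_pos, len(subject) - s_pos)
    for i in range(steps):
        same = query[q_pos + i] == subject[s_pos + i]
        score += reward if same else penalty
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x_drop:
            break               # score fell below best minus X: stop here
    return best, best_len


print(extend_right("ACGTACGTTTTTTTTTTTTT", "ACGTACGTAAAAAAAAAAAA"))
```

Here the first 8 bases match, so the best score is 8. After four mismatches the
running score is -4, which is more than 10 below the best, and the extension
stops.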
-----------------------------
-e The cutoff expectation value.
By default this value is set to a very large number, i.e. effectively there
is no expectation value cutoff.
-----------------------------
-v Maximal number of database sequences to report alignments from.
-b Maximal number of reported alignments for a given database sequence.
These options are meaningful only in conjunction with -D 2.
-----------------------------
-J Believe the query defline.
The default is T (TRUE) for all types of output except -D 2. In the latter
case, the default is F (FALSE), unless a SeqAlign ASN.1 output is required,
specified by the -O option.
Note: this option must be set to F (FALSE) if the sequence IDs in the FASTA
file are not unique.
-----------------------------
-M Maximal total length of queries to be concatenated for a single megablast
search.
Setting this value lower than the default (20,000,000) can reduce the memory
footprint of the program for large searches.
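The effect of this limit can be pictured as batching. The Python sketch below
groups consecutive queries so that the total length of each batch stays within
the limit. How megablast splits its input internally is not described here, so
this is an assumed model, and a single query longer than the limit simply gets
a batch of its own.

```python
def batch_queries(lengths, max_total=20_000_000):
    """Group consecutive query lengths into batches of bounded total length."""
    batches, current, total = [], [], 0
    for length in lengths:
        if current and total + length > max_total:
            batches.append(current)
            current, total = [], 0
        current.append(length)
        total += length
    if current:
        batches.append(current)
    return batches


print(batch_queries([400, 300, 500, 200], max_total=800))
```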
-----------------------------
-P Maximal number of positions for a hash value.
This option provides a very simple type of filtering if it is set to a
non-zero value. Any pattern of length 12 (when the word size is greater than or
equal to 16) or of length 8 (for smaller word sizes) that appears more than P
times in all of the query sequences together is masked and not included in the
search look-up table. If such masking occurs, megablast prints a warning
message to the standard output. This can be useful when running megablast on
very long unmasked sequences, for which a search without the -P option might
take a very long time.
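The counting rule can be sketched in Python as follows. This is an
illustration, not the megablast implementation. It counts overlapping
occurrences on the given strand only, which is an assumption, and the function
name is hypothetical.

```python
from collections import Counter


def frequent_patterns(queries, max_positions: int, word_size: int):
    """Return patterns seen more than max_positions times over all queries."""
    if max_positions == 0:
        return set()            # P = 0 disables this filtering
    k = 12 if word_size >= 16 else 8
    counts = Counter(
        seq[i:i + k]
        for seq in queries
        for i in range(len(seq) - k + 1)
    )
    return {pattern for pattern, n in counts.items() if n > max_positions}


# A run of 20 A's contains the 12-mer AAAAAAAAAAAA at 9 positions.
print(frequent_patterns(["A" * 20], max_positions=5, word_size=16))
print(frequent_patterns(["A" * 20], max_positions=9, word_size=16))
```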
-----------------------------
-O ASN.1 Seqalign file.
This option specifies a file name for writing ASN.1 output. It is only
meaningful in conjunction with -D 2. The output consists of a separate
ASN.1 object for each query sequence:
Seq-annot ::= {
All hits for first query
}
Seq-annot ::= {
All hits for second query
}
etc.
-----------------------------
-s Minimal hit score to report.
By default this value is set to W, the word size (-W option), which means
that it is effectively ignored, since all found alignments are extended from
an exact match of length at least W.
-----------------------------
-Q Masked query output.
All regions of the query sequences that were hit by any found alignment are
masked with N's. The output is written to the file specified by the -Q option.
This option can be used only in conjunction with -D 2.
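The masking step amounts to overwriting every aligned query interval with N's.
A minimal Python sketch follows. The function name is hypothetical, and the
coordinates are assumed to be 1-based and inclusive.

```python
def mask_hits(sequence: str, hits) -> str:
    """Replace every base covered by a (start, end) hit with N.

    Coordinates are assumed to be 1-based and inclusive; a reversed
    pair such as (8, 7) is treated as the interval 7..8.
    """
    masked = list(sequence)
    for start, end in hits:
        first, last = min(start, end), max(start, end)
        for i in range(first - 1, last):
            masked[i] = "N"
    return "".join(masked)


print(mask_hits("ACGTACGTAC", [(2, 4), (8, 7)]))
```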
-----------------------------
-f Show full IDs in the output.
By default, for -D 0 and -D 1 outputs, the sequence IDs are reported as GIs
or accession numbers (if GIs are not available). If -f is set to T, full IDs
will be shown, unless -J option is set to F. In the latter case full deflines
will be shown for the query sequences.
-----------------------------
-U Use lower case filtering of FASTA sequences.
As in the blastall binary, this option causes lower case letters in the query
sequences to be treated as masked residues. The default for this option is
FALSE, in which case lower case is treated identically to upper case.
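To see which parts of a query would be treated as masked under -U T, the lower
case runs can be listed with a short Python sketch. The function name is
hypothetical, and the intervals are reported as 0-based, half-open pairs.

```python
def lowercase_regions(sequence: str):
    """Return 0-based half-open (start, end) runs of lower case letters."""
    regions, start = [], None
    for i, base in enumerate(sequence):
        if base.islower():
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(sequence)))
    return regions


print(lowercase_regions("ACGTacgtACgt"))
```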
-----------------------------
-p Cutoff by percentage of identity.
Alignments with an identity percentage below the value of this option are not
reported. This applies to all output formats except -D 0: with -D 0 the
traceback is not performed, so the percentage of identical residues cannot be
calculated.
-----------------------------
-t Discontiguous word template length.
If this is not zero, the discontiguous word approach is used. The supported
template lengths are 16, 18, and 21. The word size (-W parameter) must be 11 or
12 in this case.
-N Discontiguous template type: coding (0), non-coding (1), or both (2).
For each of the three template lengths, two discontiguous templates are
supported. One of them, called coding, is based on the '110' pattern. The
other, called non-coding, is optimal, or close to optimal, according to hit
probability simulations for random sequences.
The exact templates are:
W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
If the 'both' option is chosen, all initial matches satisfying either of the
two template types are extended.
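A template is read as a mask: positions marked 1 must match and positions
marked 0 are ignored. The Python sketch below copies the templates listed above
and checks that each contains exactly W ones. The helper names are hypothetical
and the example windows are made up.

```python
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"): "1111101101101101",
    (12, 16, "non-coding"): "1110110110110111",
    (11, 18, "coding"): "101101100101101101",
    (11, 18, "non-coding"): "111010010110010111",
    (12, 18, "coding"): "101101101101101101",
    (12, 18, "non-coding"): "111010110010110111",
    (11, 21, "coding"): "100101100101100101101",
    (11, 21, "non-coding"): "111010010100010010111",
    (12, 21, "coding"): "100101101101100101101",
    (12, 21, "non-coding"): "111010010110010010111",
}


def template_hit(template: str, query_window: str, subject_window: str) -> bool:
    """True when the two windows agree at every position marked '1'."""
    if not (len(template) == len(query_window) == len(subject_window)):
        raise ValueError("windows must have the template length")
    return all(
        q == s
        for t, q, s in zip(template, query_window, subject_window)
        if t == "1"
    )


def hit_with_both(w: int, t: int, query_window: str, subject_window: str) -> bool:
    """The 'both' setting: accept a match under either template type."""
    return any(
        template_hit(TEMPLATES[(w, t, kind)], query_window, subject_window)
        for kind in ("coding", "non-coding")
    )


# Every template has length t and exactly W positions that must match.
for (w, t, kind), template in TEMPLATES.items():
    assert len(template) == t and template.count("1") == w

coding = TEMPLATES[(11, 16, "coding")]
print(template_hit(coding, "ACGTACGTACGTACGT", "ACTTACGTACGTACGT"))
print(template_hit(coding, "ACGTACGTACGTACGT", "CCGTACGTACGTACGT"))
```

The first comparison differs only at the third base, a 0 position of the coding
template, so it counts as a hit. The second differs at the first base, a 1
position, so it does not.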
-----------------------------
-g Generate words for every base of the database.
In both blastn and traditional megablast, the database sequences are compressed
4:1, and words are looked up only at the beginning of each byte, i.e. at every
4th base. With this option, words are looked up starting at every base of the
database sequence.
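The difference can be seen by listing the database offsets at which a word
look-up starts. The Python sketch below is a simplified model with a
hypothetical function name. It ignores the details of the compressed
representation.

```python
def lookup_starts(subject_length: int, word_size: int, every_base: bool):
    """0-based database offsets at which a word look-up starts."""
    step = 1 if every_base else 4     # without -g, only byte boundaries
    return list(range(0, subject_length - word_size + 1, step))


print(lookup_starts(20, 12, every_base=False))
print(lookup_starts(20, 12, every_base=True))
```

For a database sequence of 20 bases and words of 12 bases, the default scan
starts at offsets 0, 4 and 8, while the every-base scan starts at each offset
from 0 to 8.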
-----------------------------
-H Maximal number of HSPs to save per database sequence.
-----------------------------
</pre>
</body>
</html>