.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.40.10.
.TH MAST "1" "February 2013" "Motif Alignment and Search Tool" "User Commands"
.SH NAME
MAST \- Motif Alignment and Search Tool
.SH SYNOPSIS
.B mast
\fI<motif file>\fR \fI<sequence file>\fR [\fIoptions\fR]
.SH DESCRIPTION
MAST: Motif Alignment and Search Tool
.PP
Inputs
.TP
\fB<motif file>\fR
file containing the motifs to use; normally a MEME output file
.TP
\fB<sequence file>\fR
FASTA\-formatted sequence database to search with the motifs
.TP
\fB\-bfile <file>\fR
read background frequencies from <file>
.TP
\fB\-dblist\fR
read the <sequence file> as a list of FASTA\-formatted databases
.PP
Outputs
.TP
\fB\-o <dir>\fR
directory in which to write MAST results; the directory must not already exist
.TP
\fB\-oc <dir>\fR
directory in which to write MAST results; overwriting is allowed
.TP
\fB\-hit_list\fR
print only a machine\-readable list of all hits; writes to standard output and overrides \fB\-seqp\fR
.PP
Which Motifs To Use
.TP
\fB\-remcorr\fR
remove highly correlated motifs from the query
.TP
\fB\-m <m>+\fR
use only motif number \fB<m>\fR (overrides \fB\-mev\fR); this option can be
repeated to select multiple motifs
.TP
\fB\-c <count>\fR
use only the first \fB<count>\fR motifs, or all motifs when \fB<count>\fR is zero (default: 0)
.TP
\fB\-mev <mev>\fR
use only motifs with E\-values less than \fB<mev>\fR
.TP
\fB\-diag <diag>\fR
nominal order and spacing of the motifs, specified by the block diagram \fB<diag>\fR
.PP
DNA\-Only Options
.TP
\fB\-norc\fR
do not score the reverse complement DNA strand
.TP
\fB\-sep\fR
score the reverse complement DNA strand as a separate sequence
.TP
\fB\-dna\fR
translate DNA sequences to protein; the motifs must be protein and the sequences must be DNA
.TP
\fB\-comp\fR
adjust p\-values and E\-values for sequence composition
.PP
Which Results To Print
.TP
\fB\-ev <ev>\fR
print results for sequences with E\-value < \fB<ev>\fR (default: 10)
.PP
Appearance Of Block Diagrams
.TP
\fB\-mt <mt>\fR
show motif matches with p\-value < \fB<mt>\fR (default: 0.0001)
.TP
\fB\-w\fR
show weak matches (<mt> < p\-value < <mt>*10) in angle brackets in
the hit list or when the XML is converted to text
.TP
\fB\-best\fR
include only the best motif hits in \fB\-hit_list\fR diagrams
.TP
\fB\-seqp\fR
use SEQUENCE p\-values for motif thresholds (default: use
POSITION p\-values)
.PP
Miscellaneous
.TP
\fB\-mf <mf>\fR
in results use <mf> as the motif file name
.TP
\fB\-df <df>\fR
in results use <df> as the database name (ignored when \fB\-dblist\fR is given)
.TP
\fB\-dl <dl>\fR
in results use <dl> as the link to search sequence names; the token
SEQUENCEID is replaced with the FASTA sequence ID; ignored when
\fB\-dblist\fR is given
.TP
\fB\-minseqs <ms>\fR
lower bound on the number of sequences in the database
.TP
\fB\-nostatus\fR
do not print progress report
.TP
\fB\-notext\fR
do not create text output
.TP
\fB\-nohtml\fR
do not create HTML output
.IP
MAST is a tool for searching biological sequence databases for
sequences that contain one or more of a group of known motifs.
.IP
A motif is a sequence pattern that occurs repeatedly in a group of
related protein or DNA sequences. Motifs are represented as
position\-dependent scoring matrices that describe the score of each
possible letter at each position in the pattern. Individual motifs may
not contain gaps. Patterns with variable\-length gaps must be split into
two or more separate motifs before being submitted as input to MAST.
.IP
MAST takes as input a file containing the descriptions of one or more
motifs and searches a sequence database that you select for sequences
that match the motifs. The motif file can be the output of the MEME
motif discovery tool or any file in the appropriate format.
.IP
MAST outputs an XML file, which can then be converted into HTML or text
format. The XML file is designed for machine processing and the HTML
file is designed for human viewing. The text format is available for
backwards compatibility. However, because of design decisions made to
optimise the XML for HTML generation, the text output for separate
scoring mode is not identical to that of earlier versions and some
options were removed. The text format will be unsupported in future
releases, so we recommend that you migrate any programs reading MAST
output to the XML format.
.IP
MAST outputs three things:
.RS
.IP 1. 4
The names of the high\-scoring sequences sorted by the strength of
the combined match of the sequence to all of the motifs in the
group.
.IP 2. 4
Motif diagrams showing the order and spacing of the motifs within
each matching sequence.
.IP 3. 4
Detailed annotation of each matching sequence showing the sequence
and the locations and strengths of matches to the motifs.
.RE
.IP
MAST works by calculating match scores for each sequence in the
database compared with each of the motifs in the group of motifs you
provide. For each sequence, the match scores are converted into various
types of p\-values and these are used to determine the overall match of
the sequence to the group of motifs and the probable order and spacing
of occurrences of the motifs in the sequence.
.IP
MAST generates a human\-readable file from the XML output containing:
.RS
.IP \(bu 2
the version of MAST and the date it was built,
.IP \(bu 2
the reference to cite if you use MAST in your research,
.IP \(bu 2
a description of the databases and motifs used in the search,
.IP \(bu 2
an explanation of the result,
.IP \(bu 2
the identifier and score of each sequence that matches the group of
motifs above a stated level of statistical significance, sorted by score,
.IP \(bu 2
motif diagrams showing the order and spacing of occurrences of the
motifs in the significant sequences, and
.IP \(bu 2
annotated sequences showing the positions and p\-values of all motif
occurrences in each of the high\-scoring sequences.
.RE
.IP
The HTML version is the recommended version for human reading and has
all sections documented; the text version, however, has no documentation
for the first section. That section lists each motif along with the
sequence that would achieve the best possible match score. To
avoid biased scores when multiple motif scores are combined, MAST also
computes the correlation between each pair of motifs. The
correlation between two motifs is the maximum sum of Pearson's
correlation coefficients for aligned columns divided by the width of
the shorter motif. The maximum is found by trying all alignments of the
two motifs. Motifs with correlations below 0.60 have little effect on
the accuracy of the combined scores. Pairs of motifs with higher
correlations should be removed from the query.
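.IP
The correlation described above can be sketched in Python. This is an
illustration only, not MAST's implementation. The function names are
hypothetical, each motif is assumed to be a list of per\-position score
rows, "all alignments" is assumed to include partial overlaps, and a
position whose scores are all equal is assumed to contribute zero.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # assumption: a constant position contributes nothing
    return cov / math.sqrt(var_x * var_y)


def motif_correlation(motif_a, motif_b):
    """Maximum over alignments of the summed per-position correlations,
    divided by the width of the shorter motif."""
    shorter, longer = sorted((motif_a, motif_b), key=len)
    best = -math.inf
    # Slide the shorter motif across the longer one (assumption: partial
    # overlaps count; only overlapping positions are summed).
    for offset in range(-(len(shorter) - 1), len(longer)):
        total = 0.0
        for i, position in enumerate(shorter):
            j = offset + i
            if 0 <= j < len(longer):
                total += pearson(position, longer[j])
        best = max(best, total)
    return best / len(shorter)


# Two toy motifs over the alphabet A, C, G, T (values are made up).
motif_1 = [[1.0, -1.0, -2.0, 0.5], [0.2, 1.3, -3.0, -1.0]]
motif_2 = [[-2.0, 0.5, 1.0, -1.0], [1.3, -1.0, 0.2, -3.0], [0.1, 0.2, 0.3, 0.4]]
print(motif_correlation(motif_1, motif_2) > 0.60)
```

.IP
A pair for which the printed comparison is true would be a candidate for
removal under the 0.60 guideline given above.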
.PP
Match Scores
.IP
The match score of a motif to a position in a sequence is the sum,
over the columns of the motif, of the score that the position\-dependent
scoring matrix assigns to the letter at the corresponding position in
the sequence. For example, if the sequence is
.IP
.nf
TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
   ========
.fi
.IP
and the motif is represented by the position\-dependent scoring matrix
(where each row of the matrix corresponds to a position in the motif)
.IP
.nf
Position    A       C       G       T
    1      1.447   0.188  \-4.025  \-4.095
    2      0.739   1.339  \-3.945  \-2.325
    3      1.764  \-3.562  \-4.197  \-3.895
    4      1.574  \-3.784  \-1.594  \-1.994
    5      1.602  \-3.935  \-4.054  \-1.370
    6      0.797  \-3.647  \-0.814   0.215
    7     \-1.280   1.873  \-0.607  \-1.993
    8     \-3.076   1.035   1.414  \-3.913
.fi
.IP
then the match score of the fourth position in the sequence
(underlined) would be found by summing the score for T in position 1, G
in position 2 and so on until G in position 8. So the match score would
be
.IP
.nf
score = \-4.095 + \-3.945 + \-3.895 + \-1.994
      + \-4.054 + \-0.814 + \-1.993 + 1.414
      = \-19.376
.fi
.IP
The match scores for other positions in the sequence are calculated in
the same way. Match scores are calculated only where the motif fits
completely within the sequence; they are not calculated where the motif
would overhang either end of the sequence.
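.IP
The worked example can be reproduced with a short Python sketch. The
names MATRIX, SEQUENCE and match_score are hypothetical and chosen for
illustration. The sketch reads every term from the scoring matrix shown
in this section, including \-1.993 for T at motif position 7, and uses
1\-based start positions as the text does.

```python
# Scoring matrix from the example: one row per motif position.
MATRIX = [
    {"A": 1.447, "C": 0.188, "G": -4.025, "T": -4.095},
    {"A": 0.739, "C": 1.339, "G": -3.945, "T": -2.325},
    {"A": 1.764, "C": -3.562, "G": -4.197, "T": -3.895},
    {"A": 1.574, "C": -3.784, "G": -1.594, "T": -1.994},
    {"A": 1.602, "C": -3.935, "G": -4.054, "T": -1.370},
    {"A": 0.797, "C": -3.647, "G": -0.814, "T": 0.215},
    {"A": -1.280, "C": 1.873, "G": -0.607, "T": -1.993},
    {"A": -3.076, "C": 1.035, "G": 1.414, "T": -3.913},
]

SEQUENCE = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"


def match_score(matrix, sequence, start):
    """Score the motif against the window beginning at 1-based `start`.

    Returns None when the motif would overhang either end of the sequence.
    """
    width = len(matrix)
    if start < 1 or start - 1 + width > len(sequence):
        return None
    window = sequence[start - 1:start - 1 + width]
    return sum(row[letter] for row, letter in zip(matrix, window))


print(f"{match_score(MATRIX, SEQUENCE, 4):.3f}")  # prints -19.376
```

.IP
Calling match_score with a start of 37 or more returns None for this
43\-letter sequence, because an 8\-column motif no longer fits.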
.PP
P\-values
.IP
MAST reports all matches of a sequence to a motif or group of motifs in
terms of the p\-value of the match. MAST considers the p\-values of four
types of events:
.RS
.IP \(bu 2
position p\-value: the match of a single position within a sequence
to a given motif,
.IP \(bu 2
sequence p\-value: the best match of any position within a sequence
to a given motif,
.IP \(bu 2
combined p\-value: the combined best matches of a sequence to a
group of motifs, and
.IP \(bu 2
E\-value: observing a combined p\-value at least as small in a random
database of the same size.
.RE
.IP
All p\-values are based on a random sequence model that assumes each
position in a random sequence is generated according to the average
letter frequencies of all sequences in the appropriate (peptide or
nucleotide) non\-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/)
as of September 22, 1996. This can be overridden by specifying the \fB\-bfile\fR
or \fB\-comp\fR option (see below). For DNA sequences, unless \fB\-norc\fR is given,
the positive and reverse complement strand frequencies are averaged
together.
.RS
.IP 1. 4
\fB\-bfile\fR <bfile>: The random model uses the letter frequencies given
in <bfile> instead of the non\-redundant database frequencies. The
format of <bfile> is the same as that for the MEME \fB\-bfile\fR option;
see the MEME documentation for details. You can create files in the
appropriate format based on the base/residue composition of your
own FASTA sequence files using the command "fasta\-get\-markov"
included in the MEME distribution. Type fasta\-get\-markov on the
command line for documentation. (Sample files are also given in
directory tests: tests/nt.freq and tests/na.freq.)
.IP 2. 4
\fB\-comp\fR: The random model uses the letter frequencies in the current
target sequence instead of the non\-redundant database frequencies.
This causes p\-values and E\-values to be compensated individually
for the actual composition of each sequence in the database. This
option can increase search time substantially due to the need to
compute a different score distribution for each high\-scoring
sequence. With this option and DNA sequences, the positive and
reverse complement strand frequencies are not averaged together.
.RE
.IP
Position p\-value
.IP
The p\-value of a match of a given position within a sequence to a motif
is defined as the probability of a randomly selected position in a
randomly generated sequence having a match score at least as large as
that of the given position. Note: if MAST is combining reverse
complement DNA strands, the position p\-value is not corrected for
multiple tests.
.IP
Sequence p\-value
.IP
The p\-value of a match of a sequence to a motif is defined as the
probability of a randomly generated sequence of the same length having
a match score at least as large as the largest match score of any
position in the sequence.
.IP
Combined p\-value
.IP
The p\-value of a match of a sequence to a group of motifs is defined as
the probability of a randomly generated sequence of the same length
having sequence p\-values whose product is at least as small as the
product of the sequence p\-values of the matches of the motifs to the
given sequence.
.IP
E\-value
.IP
The E\-value of the match of a sequence in a database to a group of
motifs is defined as the expected number of sequences in a random
database of the same size that would match the motifs as well as the
sequence does. It is equal to the combined p\-value of the sequence
times the number of sequences in the database.
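.IP
The last two definitions can be illustrated numerically. The sketch
below is not taken from the MAST source. It assumes that, for a random
sequence, the sequence p\-values of the motifs are independent and
uniformly distributed, in which case the probability that the product of
n such values is at most p is p times the sum of (\-ln p)^i / i! for i
from 0 to n\-1. The function names are hypothetical. The E\-value line
follows the definition above directly.

```python
import math


def combined_p_value(sequence_p_values):
    """P(product of n independent uniform p-values <= observed product).

    Assumption: independence and uniformity under the random model.
    """
    n = len(sequence_p_values)
    product = math.prod(sequence_p_values)
    if product <= 0.0:
        return 0.0
    log_term = -math.log(product)
    total = 0.0
    term = 1.0  # (-ln p)^i / i!, starting at i = 0
    for i in range(n):
        total += term
        term *= log_term / (i + 1)
    return min(1.0, product * total)


def e_value(combined_p, num_sequences):
    """Combined p-value times the number of sequences in the database."""
    return combined_p * num_sequences


p = combined_p_value([0.1, 0.1])
print(p, e_value(p, 1000))
```

.IP
With a single motif the combined p\-value reduces to the sequence
p\-value itself, which is a quick sanity check on the formula.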
.PP
High\-scoring Sequences
.IP
MAST lists the names and part of the descriptive text of all sequences
whose E\-value is less than E. Sequences shorter than one or more of the
motifs are skipped. The sequences are sorted by increasing E\-value. The
value of E is set to 10 for the web server but is user\-selectable (with
\fB\-ev\fR) in the downloadable version of MAST.
.PP
Motif Diagrams
.IP
Motif diagrams show the order and spacing of non\-overlapping matches to
the motifs in each high\-scoring sequence. Motif occurrences are
determined based on the position p\-value of matches to the motif.
Strong matches (p\-value < M) are shown in square brackets (`[ ]'), weak
matches (M < p\-value < M x 10) are shown in angle brackets (`< >') and
the length of non\-motif sequence ("spacer") is shown between
underscores (`_'). For example,
.IP
27_[3]_44_<4>_99_[1]_7
.IP
shows an initial spacer of length 27, followed by a strong match to
motif 3, a spacer of length 44, a weak match to motif 4, a spacer of
length 99, a strong match to motif 1 and a final non\-motif sequence of
length 7. The value of M is 0.0001 for the web server but is
user\-selectable (with \fB\-mt\fR) in the downloadable version of MAST.
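.IP
A diagram in the form shown above can be taken apart with a few lines of
Python. This is a sketch under the assumption that diagrams contain only
the three element types in the example (spacer lengths, `[n]' and `<n>');
the function names are hypothetical, and match_strength simply restates
the strong and weak thresholds from the text.

```python
def parse_motif_diagram(diagram):
    """Split a motif diagram into (kind, value) pairs.

    kind is "spacer" (value = length), "strong" or "weak" (value = motif number).
    """
    elements = []
    for part in diagram.split("_"):
        if part.startswith("[") and part.endswith("]"):
            elements.append(("strong", int(part[1:-1])))
        elif part.startswith("<") and part.endswith(">"):
            elements.append(("weak", int(part[1:-1])))
        else:
            elements.append(("spacer", int(part)))
    return elements


def match_strength(p_value, m=0.0001):
    """Classify a position p-value against the threshold M."""
    if p_value < m:
        return "strong"
    if m < p_value < m * 10:
        return "weak"
    return None


for kind, value in parse_motif_diagram("27_[3]_44_<4>_99_[1]_7"):
    print(kind, value)
```

.IP
The loop prints seven elements, alternating spacers with the matches to
motifs 3, 4 and 1, in the same order as the prose description.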
.PP
Annotated Sequences
.IP
MAST annotates each high\-scoring sequence by printing the sequence
along with the position and strength of all the non\-overlapping motif
occurrences. The four lines above each motif occurrence contain,
respectively,
.RS
.IP \(bu 2
the motif number of the occurrence,
.IP \(bu 2
the position p\-value of the occurrence,
.IP \(bu 2
the best possible match to the motif, and
.IP \(bu 2
a plus sign (`+') above each letter in the occurrence that has a
positive match score to the motif.
.RE
.IP
The best possible match to a motif is the sequence of letters which
would achieve the highest match score.
.PP
Hit List
.IP
If you specify the \fB\-hit_list\fR switch, MAST outputs ONLY a list
of "hits" in an easily machine\-readable format. Each line corresponds to
one motif occurrence in one sequence. The format of the hit lines is
.IP
[<sequence_name> <strand><motif> <start> <end> <score> <p\-value>]+
.IP
where
.IP
.nf
<sequence_name> is the name of the sequence containing the hit,
<strand>        is the strand (+ or \- for DNA, blank for protein),
<motif>         is the motif number,
<start>         is the starting position of the hit,
<end>           is the ending position of the hit,
<score>         is the score of the hit, and
<p\-value>       is the position p\-value of the hit.
.fi
.IP
Two comment lines (starting with "#") are written above the list of
hits, and the MAST command line is printed as a comment line after the
list. An example of the output using the \fB\-hit_list\fR switch is:
.IP
.nf
# All non\-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p\-value
ce1cg \-2 8 22 1459.90 1.67e\-06
ara +2 2 16 1661.18 5.04e\-08
bglr1 +2 1 15 1274.97 1.42e\-05
cya \-2 19 33 1101.37 6.64e\-05
gale +2 5 19 1076.21 8.11e\-05
ilv \-2 6 20 1098.85 6.78e\-05
malk +2 37 51 1085.02 7.56e\-05
ompa +2 5 19 1583.18 2.43e\-07
# mast tests/meme/meme.crp0.oops tests/common/crp0.s \-hit_list \-m 2
.fi
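.IP
Because each hit occupies one line, the hit list is simple to read from
a script. The Python sketch below is illustrative: the Hit record and
parse_hit_list are hypothetical names, and it assumes that fields are
separated by whitespace and that sequence names contain no spaces.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_name: str
    strand: str  # "+" or "-" for DNA, "" for protein
    motif: int
    start: int
    end: int
    score: float
    p_value: float


def parse_hit_list(text):
    """Parse hit-list output, skipping comment lines that start with '#'."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, motif_field, start, end, score, p_value = line.split()
        # The strand sign, when present, is fused to the motif number.
        strand = motif_field[0] if motif_field[0] in "+-" else ""
        motif = int(motif_field[len(strand):])
        hits.append(Hit(name, strand, motif, int(start), int(end),
                        float(score), float(p_value)))
    return hits


SAMPLE = """# All non-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p-value
ce1cg -2 8 22 1459.90 1.67e-06
ara +2 2 16 1661.18 5.04e-08
# mast tests/meme/meme.crp0.oops tests/common/crp0.s -hit_list -m 2
"""

for hit in parse_hit_list(SAMPLE):
    print(hit.sequence_name, hit.strand, hit.motif, hit.p_value)
```

.IP
Keeping the strand separate from the motif number makes it easy to
filter hits by strand or to sort them by p\-value afterwards.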
.PP
Loading Multiple Sequence Databases
.IP
MAST can load multiple sequence databases: put the database file
names into a list file and specify that file in place of the sequence
database, together with the option \fB\-dblist\fR.
.IP
The list file has one database file name on each line, optionally
followed by a name and a link, as follows:
.IP
.nf
<file> [<name> <link>]
\&...
\&...
.fi
.IP
If the name is specified, it is used instead of the file name
in the output. If the link is specified, all sequences for that
database in the HTML output will have a hyperlink to the URL specified,
with the text SEQUENCEID replaced with the FASTA sequence ID.
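.IP
The list\-file layout and the SEQUENCEID substitution can be sketched as
follows. This is an illustration rather than MAST's own parser: it
assumes whitespace\-separated fields, the function names are
hypothetical, and the file names and URL in the sample are invented.

```python
def parse_dblist_line(line):
    """Return (file, name, link) for one line of a database list file.

    The name defaults to the file name and the link to None when omitted.
    """
    fields = line.split()
    path = fields[0]
    name = fields[1] if len(fields) > 1 else path
    link = fields[2] if len(fields) > 2 else None
    return path, name, link


def sequence_link(link_template, sequence_id):
    """Replace the SEQUENCEID token with a FASTA sequence ID."""
    return link_template.replace("SEQUENCEID", sequence_id)


# Hypothetical list-file contents.
DBLIST = """proteins_a.fasta SetA https://example.org/entry/SEQUENCEID
proteins_b.fasta
"""

for line in DBLIST.splitlines():
    path, name, link = parse_dblist_line(line)
    print(path, name, sequence_link(link, "P12345") if link else "(no link)")
```

.IP
The second sample line shows the fallback: with no name or link given,
the file name stands in for the name and no hyperlink is produced.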
.PP
Examples
.IP
The following examples assume that the file "meme.results" is the output of
a MEME run containing at least 3 motifs, created from the
training set "training.fasta"; that the file SwissProt is a copy of the
Swiss\-Prot database on your local disk; and that DNA_DB is a copy of a DNA
database on your local disk.
.RS
.IP 1. 4
Annotate the training set:
.IP
mast meme.results training.fasta
.IP 2. 4
Find sequences matching the motifs in the SwissProt database and
annotate them:
.IP
mast meme.results SwissProt
.IP 3. 4
Show sequences with weaker combined matches to the motifs:
.IP
mast meme.results SwissProt \-ev 200
.IP 4. 4
Include a nominal order and spacing of the first three motifs in
the calculation of the sequence p\-values to increase the
sensitivity of the search for matching sequences:
.IP
mast meme.results SwissProt \-diag "9\-[2]\-61\-[1]\-62\-[3]\-91"
.IP 5. 4
Use only the first and third motifs in the search:
.IP
mast meme.results SwissProt \-m 1 \-m 3
.IP 6. 4
Use only the first two motifs in the search:
.IP
mast meme.results SwissProt \-c 2
.IP 7. 4
Search DNA sequences using protein motifs, adjusting p\-values and
E\-values for each sequence by that sequence's composition:
.IP
mast meme.results DNA_DB \-dna \-comp
.RE
453 | .PP |
---|
454 | Usage |
---|
455 | .IP |
---|
456 | mast <motif file> <sequence file> [options] |
---|
457 | .IP |
---|
458 | Inputs |
---|
459 | <motif file> file containing motifs to use; normally a MEME output |
---|
460 | file |
---|
461 | <sequence file> search sequences in FASTA\-formatted database with |
---|
462 | motifs; |
---|
463 | .TP\fB\-bfile <file>\fR read background frequencies from <file> |
---|
464 | .TP\fB\-dblist\fR read the <sequence file> as a list of FASTA\-formatted |
---|
465 | databases |
---|
466 | Outputs |
---|
467 | .TP\fB\-o <dir>\fR directory to output mast results; directory must not exist |
---|
468 | .TP\fB\-oc <dir>\fR directory to output mast results with overwriting allowed |
---|
469 | .TP\fB\-hit_list\fR print a machine\-readable list of all hits only; outputs to |
---|
470 | standard out and overrides \fB\-seqp\fR |
---|
471 | Which Motifs To Use |
---|
472 | .TP\fB\-remcorr\fR remove highly correlated motifs from query |
---|
473 | .TP\fB\-m <m>\fR+ use only motif number <m> (overrides \fB\-mev\fR); this can be |
---|
474 | repeated to select multiple motifs |
---|
475 | .TP\fB\-c <count>\fR only use the first <count> motifs or all motifs when |
---|
476 | <count> is zero (default: 0) |
---|
477 | .TP\fB\-mev <mev>\fR use only motifs with E\-values less than <mev> |
---|
478 | .TP\fB\-diag <diag>\fR nominal order and spacing of motifs is specified by |
---|
479 | <diag> which is a block diagram |
---|
480 | DNA\-Only Options |
---|
481 | .TP\fB\-norc\fR do not score reverse complement DNA strand |
---|
482 | .TP\fB\-sep\fR score reverse complement DNA strand as a separate sequence |
---|
483 | .TP\fB\-dna\fR translate DNA sequences to protein; motifs must be protein; |
---|
484 | sequences must be DNA |
---|
485 | .TP\fB\-comp\fR adjust p\-values and E\-values for sequence composition |
---|
486 | Which Results To Print |
---|
487 | .TP\fB\-ev <ev>\fR print results for sequences with E\-value < <ev> (default: |
---|
488 | 10) |
---|
489 | Appearance Of Block Diagrams |
---|
490 | .TP\fB\-mt <mt>\fR show motif matches with p\-value < <mt> (default: 0.0001) |
---|
491 | .TP\fB\-w\fR show weak matches (<mt> < p\-value < <mt>*10) in angle brackets in |
---|
492 | the hit list or when the xml is converted to text |
---|
493 | .TP\fB\-best\fR include only the best motif hits in \fB\-hit_list\fR diagrams |
---|
494 | .TP\fB\-seqp\fR use SEQUENCE p\-values for motif thresholds (default: use |
---|
495 | POSITION p\-values) |
---|
496 | Miscellaneous |
---|
497 | .TP\fB\-mf <mf>\fR in results use <mf> as motif file name |
---|
498 | .TP\fB\-df <df>\fR in results use <df> as database name (ignored when \fB\-dblist\fR) |
---|
499 | .TP\fB\-dl <dl>\fR in results use <dl> as link to search sequence names; token |
---|
500 | SEQUENCEID is replaced with the FASTA sequence ID; ignored when |
---|
501 | \fB\-dblist\fR; |
---|
502 | .TP\fB\-minseqs <ms>\fR lower bound on number of sequences in db |
---|
503 | .TP\fB\-nostatus\fR do not print progress report |
---|
504 | .TP\fB\-notext\fR do not create text output |
---|
505 | .TP\fB\-nohtml\fR do not create html output |
---|
506 | .IP |
---|
507 | MAST is a tool for searching biological sequence databases for |
---|
508 | sequences that contain one or more of a group of known motifs. |
---|
509 | .IP |
---|
510 | A motif is a sequence pattern that occurs repeatedly in a group of |
---|
511 | related protein or DNA sequences. Motifs are represented as |
---|
512 | position\-dependent scoring matrices that describe the score of each |
---|
513 | possible letter at each position in the pattern. Individual motifs may |
---|
514 | not contain gaps. Patterns with variable\-length gaps must be split into |
---|
515 | two or more separate motifs before being submitted as input to MAST. |
---|
516 | .IP |
---|
517 | MAST takes as input a file containing the descriptions of one or more |
---|
518 | motifs and searches a sequence database that you select for sequences |
---|
519 | that match the motifs. The motif file can be the output of the MEME |
---|
520 | motif discovery tool or any file in the appropriate format. |
---|
521 | .IP |
---|
522 | MAST outputs an xml file which can then be converted into html or text |
---|
523 | format. The xml file is designed for machine processing and the html |
---|
524 | file is designed for human viewing. The text format is available for |
---|
525 | backwards compatibility though due to design decisions made to optimise |
---|
526 | the xml for html generation the output for separate scoring mode is not |
---|
527 | identical and some options were removed. The text format will be |
---|
528 | unsupported in future releases and so we recommend you migrate any |
---|
529 | programs reading mast output to the xml format. |
---|
530 | .IP |
---|
531 | MAST outputs three things: |
---|
532 | .IP |
---|
533 | 1. The names of the high\-scoring sequences sorted by the strength of |
---|
534 | .IP |
---|
535 | the combined match of the sequence to all of the motifs in the |
---|
536 | group. |
---|
537 | .IP |
---|
538 | 2. Motif diagrams showing the order and spacing of the motifs within |
---|
539 | .IP |
---|
540 | each matching sequence. |
---|
541 | .IP |
---|
542 | 3. Detailed annotation of each matching sequence showing the sequence |
---|
543 | .IP |
---|
544 | and the locations and strengths of matches to the motifs. |
---|
545 | .IP |
---|
546 | MAST works by calculating match scores for each sequence in the |
---|
547 | database compared with each of the motifs in the group of motifs you |
---|
548 | provide. For each sequence, the match scores are converted into various |
---|
549 | types of p\-values and these are used to determine the overall match of |
---|
550 | the sequence to the group of motifs and the probable order and spacing |
---|
551 | of occurrences of the motifs in the sequence. |
---|
552 | .IP |
---|
553 | MAST generates a human readable file from the xml output containing: |
---|
554 | .IP |
---|
555 | * the version of MAST and the date it was built, |
---|
556 | * the reference to cite if you use MAST in your research, |
---|
557 | * a description of the databases and motifs used in the search, |
---|
558 | * an explanation of the result, |
---|
559 | * the sequences identifier and score sorted by score matching the |
---|
560 | .IP |
---|
561 | group of motifs above a stated level of statistical significance, |
---|
562 | .IP |
---|
563 | * motif diagrams showing the order and spacing of occurrences of the |
---|
564 | .IP |
---|
565 | motifs in the significant sequences and, |
---|
566 | .IP |
---|
567 | * annotated sequences showing the positions and p\-values of all motif |
---|
568 | .IP |
---|
569 | occurrences in each of the high\-scoring sequences. |
---|
570 | .IP |
---|
571 | The html version is the recommended version for human reading and has |
---|
572 | all sections documented however the text version has no documentation |
---|
573 | for the first section. That section lists each motif along with the |
---|
574 | sequence that would achieve the best possible match score. In order to |
---|
575 | avoid biased scores when multiple motif scores are combined, MAST also |
---|
576 | computes the pairwise correlations between each pair of motifs. The |
---|
577 | correlation between two motifs is the maximum sum of Pearson's |
---|
578 | correlation coefficients for aligned columns divided by the width of |
---|
579 | the shorter motif. The maximum is found by trying all alignments of the |
---|
580 | two motifs. Motifs with correlations below 0.60 have little effect on |
---|
581 | the accuracy of the combined scores. Pairs of motifs with higher |
---|
582 | correlations should be removed from the query. |
---|
583 | .PP |
---|
584 | Match Scores |
---|
585 | .IP |
---|
586 | The match score of a motif to a position in a sequence is the sum of |
---|
587 | the score from each column of the position\-dependent scoring matrix |
---|
588 | corresponding to the letter at that position in the sequence. For |
---|
589 | example, if the sequence is |
---|
590 | .IP |
---|
591 | TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC |
---|
592 | .IP |
---|
593 | ======== |
---|
594 | .IP |
---|
595 | and the motif is represented by the position\-dependent scoring matrix |
---|
596 | (where each row of the matrix corresponds to a position in the motif) |
---|
597 | .TP |
---|
598 | Position |
---|
599 | A C G T |
---|
600 | .TP |
---|
601 | 1 |
---|
602 | 1.447 0.188 \fB\-4\fR.025 \fB\-4\fR.095 |
---|
603 | .TP |
---|
604 | 2 |
---|
605 | 0.739 1.339 \fB\-3\fR.945 \fB\-2\fR.325 |
---|
606 | .TP |
---|
607 | 3 |
---|
608 | 1.764 \fB\-3\fR.562 \fB\-4\fR.197 \fB\-3\fR.895 |
---|
609 | .TP |
---|
610 | 4 |
---|
611 | 1.574 \fB\-3\fR.784 \fB\-1\fR.594 \fB\-1\fR.994 |
---|
612 | .TP |
---|
613 | 5 |
---|
614 | 1.602 \fB\-3\fR.935 \fB\-4\fR.054 \fB\-1\fR.370 |
---|
615 | .TP |
---|
616 | 6 |
---|
617 | 0.797 \fB\-3\fR.647 \fB\-0\fR.814 0.215 |
---|
618 | .TP |
---|
619 | 7 |
---|
620 | \fB\-1\fR.280 1.873 \fB\-0\fR.607 \fB\-1\fR.993 |
---|
621 | .TP |
---|
622 | 8 |
---|
623 | \fB\-3\fR.076 1.035 1.414 \fB\-3\fR.913 |
---|
624 | .IP |
---|
625 | then the match score of the fourth position in the sequence |
---|
626 | (underlined) would be found by summing the score for T in position 1, G |
---|
627 | in position 2 and so on until G in position 8. So the match score would |
---|
628 | be |
---|
629 | .IP |
---|
630 | score = \fB\-4\fR.095 + \fB\-3\fR.945 + \fB\-3\fR.895 + \fB\-1\fR.994 |
---|
631 | .IP |
---|
632 | + \fB\-4\fR.054 + \fB\-0\fR.814 + \fB\-1\fR.933 + 1.414 |
---|
633 | .IP |
---|
634 | = \fB\-19\fR.316 |
---|
635 | .IP |
---|
636 | The match scores for other positions in the sequence are calculated in |
---|
637 | the same way. Match scores are only calculated if the match completely |
---|
638 | fits within the sequence. Match scores are not calculated if the motif |
---|
639 | would overhang either end of the sequence. |
---|
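The calculation above can be sketched in Python. This is a minimal illustration, not MAST's implementation: the names PSSM, SEQ and match_score are hypothetical, positions are 1-based as in the text, and the T entry of row 7 is taken as -1.933, the value used in the worked sum.

```python
# Position-dependent scoring matrix from the example above: one entry per
# motif position, keyed by letter (A, C, G, T).
# Assumption: row 7 uses -1.933 for T, the value used in the worked sum.
PSSM = [
    {"A": 1.447, "C": 0.188, "G": -4.025, "T": -4.095},
    {"A": 0.739, "C": 1.339, "G": -3.945, "T": -2.325},
    {"A": 1.764, "C": -3.562, "G": -4.197, "T": -3.895},
    {"A": 1.574, "C": -3.784, "G": -1.594, "T": -1.994},
    {"A": 1.602, "C": -3.935, "G": -4.054, "T": -1.370},
    {"A": 0.797, "C": -3.647, "G": -0.814, "T": 0.215},
    {"A": -1.280, "C": 1.873, "G": -0.607, "T": -1.933},
    {"A": -3.076, "C": 1.035, "G": 1.414, "T": -3.913},
]

SEQ = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"


def match_score(sequence, start, pssm):
    """Return the match score at 1-based position `start`.

    Returns None when the motif would overhang either end of the sequence,
    since no score is calculated in that case.
    """
    begin = start - 1
    if begin < 0 or begin + len(pssm) > len(sequence):
        return None
    window = sequence[begin:begin + len(pssm)]
    # Sum the matrix entry selected by the letter at each motif position.
    return sum(row[letter] for row, letter in zip(pssm, window))


score = match_score(SEQ, 4, PSSM)
print(f"{score:.3f}")  # -19.316
```

Calling match_score for a start position within the last seven letters of SEQ returns None, which mirrors the rule that overhanging matches are not scored.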
640 | .PP |
---|
641 | P\-values |
---|
642 | .IP |
---|
643 | MAST reports all matches of a sequence to a motif or group of motifs in |
---|
644 | terms of the p\-value of the match. MAST considers the p\-values of four |
---|
645 | types of events: |
---|
646 | .IP |
---|
647 | * position p\-value: the match of a single position within a sequence |
---|
648 | .IP |
---|
649 | to a given motif, |
---|
650 | .IP |
---|
651 | * sequence p\-value: the best match of any position within a sequence |
---|
652 | .IP |
---|
653 | to a given motif, |
---|
654 | .IP |
---|
655 | * combined p\-value: the combined best matches of a sequence to a |
---|
656 | .IP |
---|
657 | group of motifs, and |
---|
658 | .IP |
---|
659 | * E\-value: observing a combined p\-value at least as small in a random |
---|
660 | .IP |
---|
661 | database of the same size. |
---|
662 | .IP |
---|
663 | All p\-values are based on a random sequence model that assumes each |
---|
664 | position in a random sequence is generated according to the average |
---|
665 | letter frequencies of all sequences in the appropriate (peptide or |
---|
666 | nucleotide) non\-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/) |
---|
667 | on September 22, 1996. This can be overridden by specifying the \fB\-bfile\fR |
---|
668 | or \fB\-comp\fR options (see below). For DNA sequences, unless \fB\-norc\fR is given, |
---|
669 | the positive and reverse complement strand frequencies are averaged |
---|
670 | together. |
---|
671 | .IP |
---|
672 | 1. \fB\-bfile\fR <bfile> The random model uses the letter frequencies given |
---|
673 | .IP |
---|
674 | in <bfile> instead of the non\-redundant database frequencies. The |
---|
675 | format of <bfile> is the same as that for the MEME \fB\-bfile\fR option; |
---|
676 | see the MEME documentation for details. You can create files in the |
---|
677 | appropriate format based on the base/residue composition of your |
---|
678 | own FASTA sequence files using the command "fasta\-get\-markov" |
---|
679 | included in the MEME distribution. Type fasta\-get\-markov on the |
---|
680 | command line for documentation. (Sample files are also given in |
---|
681 | directory tests: tests/nt.freq and tests/na.freq.) |
---|
682 | .IP |
---|
683 | 2. \fB\-comp\fR The random model uses the letter frequencies in the current |
---|
684 | .IP |
---|
685 | target sequence instead of the non\-redundant database frequencies. |
---|
686 | This causes p\-values and E\-values to be compensated individually |
---|
687 | for the actual composition of each sequence in the database. This |
---|
688 | option can increase search time substantially due to the need to |
---|
689 | compute a different score distribution for each high\-scoring |
---|
690 | sequence. With this option and DNA sequences, the positive and |
---|
691 | reverse complement strand frequencies are not averaged together. |
---|
692 | .IP |
---|
693 | Position p\-value |
---|
694 | .IP |
---|
695 | The p\-value of a match of a given position within a sequence to a motif |
---|
696 | is defined as the probability of a randomly selected position in a |
---|
697 | randomly generated sequence having a match score at least as large as |
---|
698 | that of the given position. Note: If MAST is combining reverse |
---|
699 | complement DNA strands, the position p\-value is not corrected for |
---|
700 | multiple tests. |
---|
701 | .IP |
---|
702 | Sequence p\-value |
---|
703 | .IP |
---|
704 | The p\-value of a match of a sequence to a motif is defined as the |
---|
705 | probability of a randomly generated sequence of the same length having |
---|
706 | a match score at least as large as the largest match score of any |
---|
707 | position in the sequence. |
---|
708 | .IP |
---|
709 | Combined p\-value |
---|
710 | .IP |
---|
711 | The p\-value of a match of a sequence to a group of motifs is defined as |
---|
712 | the probability of a randomly generated sequence of the same length |
---|
713 | having sequence p\-values whose product is at least as small as the |
---|
714 | product of the sequence p\-values of the matches of the motifs to the |
---|
715 | given sequence. |
---|
716 | .IP |
---|
717 | E\-value |
---|
718 | .IP |
---|
719 | The E\-value of the match of a sequence in a database to a group of |
---|
720 | motifs is defined as the expected number of sequences in a random |
---|
721 | database of the same size that would match the motifs as well as the |
---|
722 | sequence does and is equal to the combined p\-value of the sequence |
---|
723 | times the number of sequences in the database. |
---|
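The relationship between sequence p\-values, the combined p\-value and the E\-value can be sketched as follows. The text does not give the formula MAST uses for the combined p\-value, so this sketch rests on a stated assumption: the sequence p\-values are independent and uniformly distributed under the random model. In that case the probability that the product of n such values is at most p equals p times the sum of (-ln p)^i / i! for i from 0 to n-1. The function names are hypothetical.

```python
import math


def combined_p_value(sequence_p_values):
    """Probability that n independent uniform(0,1) values have a product
    at least as small as the product of the given sequence p-values.

    Assumption: independence and uniformity under the random model; this
    is an illustration, not necessarily MAST's exact computation.
    """
    if not sequence_p_values:
        raise ValueError("at least one sequence p-value is required")
    product = math.prod(sequence_p_values)
    if product <= 0.0:
        return 0.0
    log_term = -math.log(product)
    term = 1.0   # (-ln p)^i / i!, starting at i = 0
    total = 0.0
    for i in range(len(sequence_p_values)):
        total += term
        term *= log_term / (i + 1)
    return min(1.0, product * total)


def e_value(combined_p, num_sequences):
    """E-value: combined p-value times the number of sequences in the database."""
    return combined_p * num_sequences
```

With a single motif the combined p\-value reduces to the sequence p\-value itself, and e_value(1e-6, 1000) gives an expected 0.001 equally good matches in a random database of 1000 sequences.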
724 | .PP |
---|
725 | High\-scoring Sequences |
---|
726 | .IP |
---|
727 | MAST lists the names and part of the descriptive text of all sequences |
---|
728 | whose E\-value is less than E. Sequences shorter than one or more of the |
---|
729 | motifs are skipped. The sequences are sorted by increasing E\-value. The |
---|
730 | value of E is set to 10 for the web server but is user\-selectable in |
---|
731 | the downloadable version of MAST. |
---|
732 | .PP |
---|
733 | Motif Diagrams |
---|
734 | .IP |
---|
735 | Motif diagrams show the order and spacing of non\-overlapping matches to |
---|
736 | the motifs in each high\-scoring sequence. Motif occurrences are |
---|
737 | determined based on the position p\-value of matches to the motif. |
---|
738 | Strong matches (p\-value < M) are shown in square brackets (`[ ]'), weak |
---|
739 | matches (M < p\-value < M x 10) are shown in angle brackets (`< >') and |
---|
740 | the length of non\-motif sequence ("spacer") is shown between |
---|
741 | underscores (`_'). For example, |
---|
742 | .IP |
---|
743 | 27_[3]_44_<4>_99_[1]_7 |
---|
744 | .IP |
---|
745 | shows an initial spacer of length 27, followed by a strong match to |
---|
746 | motif 3, a spacer of length 44, a weak match to motif 4, a spacer of |
---|
747 | length 99, a strong match to motif 1 and a final non\-motif sequence of |
---|
748 | length 7. The value of M is 0.0001 for the web server but is |
---|
749 | user\-selectable in the downloadable version of MAST. |
---|
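A motif diagram in the underscore\-separated form shown above can be read back into its parts with a short parser. This is a sketch under assumptions: the names parse_diagram and classify_match are hypothetical, only the plain motif\-number form from the example is handled, and a p\-value exactly equal to M is treated as weak, a boundary case the text leaves open.

```python
import re

# One diagram element: [n] strong match, <n> weak match, or a bare spacer length.
TOKEN = re.compile(r"\[(\d+)\]|<(\d+)>|(\d+)")


def parse_diagram(diagram):
    """Split a diagram such as '27_[3]_44_<4>_99_[1]_7' into (kind, number) pairs."""
    parts = []
    for piece in diagram.split("_"):
        match = TOKEN.fullmatch(piece)
        if match is None:
            raise ValueError(f"unrecognised diagram element: {piece!r}")
        strong, weak, spacer = match.groups()
        if strong is not None:
            parts.append(("strong", int(strong)))
        elif weak is not None:
            parts.append(("weak", int(weak)))
        else:
            parts.append(("spacer", int(spacer)))
    return parts


def classify_match(position_p_value, m=0.0001):
    """Return 'strong', 'weak' or None for a position p-value and threshold M."""
    if position_p_value < m:
        return "strong"
    if position_p_value < m * 10:
        return "weak"
    return None


print(parse_diagram("27_[3]_44_<4>_99_[1]_7"))
```

The default of 0.0001 for m is the web server value quoted above.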
750 | .PP |
---|
751 | Annotated Sequences |
---|
752 | .IP |
---|
753 | MAST annotates each high\-scoring sequence by printing the sequence |
---|
754 | along with the position and strength of all the non\-overlapping motif |
---|
755 | occurrences. The four lines above each motif occurrence contain, |
---|
756 | respectively, |
---|
757 | .IP |
---|
758 | * the motif number of the occurrence, |
---|
759 | * the position p\-value of the occurrence, |
---|
760 | * the best possible match to the motif, and |
---|
761 | * a plus sign (`+') above each letter in the occurrence that has a |
---|
762 | .IP |
---|
763 | positive match score to the motif. |
---|
764 | .IP |
---|
765 | The best possible match to a motif is the sequence of letters which |
---|
766 | would achieve the highest match score. |
---|
767 | .PP |
---|
768 | Hit List |
---|
769 | .IP |
---|
770 | If you specify the \fB\-hit_list\fR switch to MAST, MAST outputs ONLY a list |
---|
771 | of "hits" in easily machine\-readable format. Each line corresponds to |
---|
772 | one motif occurrence in one sequence. The format of the hit lines is |
---|
773 | .IP |
---|
774 | [<sequence_name> <strand><motif> <start> <end> <score> <p\-value>]+ |
---|
775 | .IP |
---|
776 | where |
---|
777 | .IP |
---|
778 | <sequence_name> is the name of the sequence containing the hit, |
---|
779 | <strand> is the strand (+ or \- for DNA, blank for protein), |
---|
780 | <motif> is the motif number, |
---|
781 | <start> is the starting position of the hit, |
---|
782 | <end> is the ending position of the hit, |
---|
783 | <score> is the score of the hit, and |
---|
784 | <p\-value> is the position p\-value of the hit. |
---|
785 | .IP |
---|
786 | Two comment lines (starting with "#") are written above the list of |
---|
787 | hits, and the MAST command line is printed as a comment line after the |
---|
788 | list. An example of the output using the \fB\-hit_list\fR switch to MAST is: |
---|
789 | .IP |
---|
790 | # All non\-overlapping hits in all sequences. |
---|
791 | # sequence_name motif hit_start hit_end score hit_p\-value |
---|
792 | ce1cg \fB\-2\fR 8 22 1459.90 1.67e\-06 |
---|
793 | ara +2 2 16 1661.18 5.04e\-08 |
---|
794 | bglr1 +2 1 15 1274.97 1.42e\-05 |
---|
795 | cya \fB\-2\fR 19 33 1101.37 6.64e\-05 |
---|
796 | gale +2 5 19 1076.21 8.11e\-05 |
---|
797 | ilv \fB\-2\fR 6 20 1098.85 6.78e\-05 |
---|
798 | malk +2 37 51 1085.02 7.56e\-05 |
---|
799 | ompa +2 5 19 1583.18 2.43e\-07 |
---|
800 | # mast tests/meme/meme.crp0.oops tests/common/crp0.s \fB\-hit_list\fR \fB\-m\fR 2 |
---|
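Because the hit list is whitespace\-separated with "#" comment lines, it can be read with a few lines of Python. This is a minimal sketch based only on the format described above; the Hit type and parse_hit_list function are hypothetical names, and the sample text is taken from the example output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_name: str
    strand: str      # "+" or "-" for DNA, "" for protein
    motif: int
    start: int
    end: int
    score: float
    p_value: float


def parse_hit_list(text):
    """Parse hit-list output into Hit records, skipping '#' comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, motif_field, start, end, score, p_value = line.split()
        # The strand, when present, is a sign glued to the motif number.
        strand = motif_field[0] if motif_field[0] in "+-" else ""
        motif = int(motif_field[len(strand):])
        hits.append(Hit(name, strand, motif, int(start), int(end),
                        float(score), float(p_value)))
    return hits


SAMPLE = """\
# All non-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p-value
ce1cg -2 8 22 1459.90 1.67e-06
ara +2 2 16 1661.18 5.04e-08
ompa +2 5 19 1583.18 2.43e-07
# mast tests/meme/meme.crp0.oops tests/common/crp0.s -hit_list -m 2
"""

for hit in parse_hit_list(SAMPLE):
    print(hit.sequence_name, hit.strand, hit.motif, hit.p_value)
```

Keeping the strand separate from the motif number makes it easy to filter hits by strand or to handle protein output, where the strand field is blank.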
801 | .PP |
---|
802 | Loading Multiple Sequence Databases |
---|
803 | .IP |
---|
804 | Multiple sequence databases can be loaded by MAST by putting the file |
---|
805 | names into a file and specifying that file instead of the sequence |
---|
806 | database with the option \fB\-dblist\fR. |
---|
807 | .IP |
---|
808 | The file list has one file name on each line with the optional name and |
---|
809 | link as follows: |
---|
810 | .IP |
---|
811 | <file> [<name> <link>] |
---|
812 | \&... |
---|
813 | \&... |
---|
814 | .IP |
---|
815 | If the name is specified then it will be used instead of the file name |
---|
816 | in the output. If the link is specified then all sequences for that |
---|
817 | database in the HTML output will have a hyperlink to the URL specified |
---|
818 | with the text SEQUENCEID replaced with the FASTA sequence id. |
---|
819 | .PP |
---|
820 | EXAMPLES: |
---|
821 | .IP |
---|
822 | The following examples assume that file "meme.results" is the output of |
---|
823 | a MEME run containing at least 3 motifs which was created on the |
---|
824 | training set "training.fasta" and file SwissProt is a copy of the |
---|
825 | Swiss\-Prot database on your local disk. DNA_DB is a copy of a DNA |
---|
826 | database on your local disk. |
---|
827 | .IP |
---|
828 | 1. Annotate the training set: |
---|
829 | .IP |
---|
830 | mast meme.results training.fasta |
---|
831 | .IP |
---|
832 | 2. Find sequences matching the motif and annotate them in the |
---|
833 | .IP |
---|
834 | SwissProt database: |
---|
835 | .IP |
---|
836 | mast meme.results SwissProt |
---|
837 | .IP |
---|
838 | 3. Show sequences with weaker combined matches to motifs: |
---|
839 | .IP |
---|
840 | mast meme.results SwissProt \fB\-ev\fR 200 |
---|
841 | .IP |
---|
842 | 4. Include a nominal order and spacing of the first three motifs in |
---|
843 | .IP |
---|
844 | the calculation of the sequence p\-values to increase the |
---|
845 | sensitivity of the search for matching sequences: |
---|
846 | .IP |
---|
847 | mast meme.results SwissProt \fB\-diag\fR "9\-[2]\-61\-[1]\-62\-[3]\-91" |
---|
848 | .IP |
---|
849 | 5. Use only the first and third motifs in the search: |
---|
850 | .IP |
---|
851 | mast meme.results SwissProt \fB\-m\fR 1 \fB\-m\fR 3 |
---|
852 | .IP |
---|
853 | 6. Use only the first two motifs in the search: |
---|
854 | .IP |
---|
855 | mast meme.results SwissProt \fB\-c\fR 2 |
---|
856 | .IP |
---|
857 | 7. Search DNA sequences using protein motifs, adjusting p\-values and |
---|
858 | .IP |
---|
859 | E\-values for each sequence by that sequence's composition: |
---|
860 | .IP |
---|
861 | mast meme.results DNA_DB \fB\-dna\fR \fB\-comp\fR |
---|
862 | .PP |
---|