Ticket #2966: mast.1

File mast.1, 32.3 KB (added by onlyjob, 10 years ago)
.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.40.10.
.TH MAST "1" "February 2013" "Motif Alignment and Search Tool" "User Commands"
.SH NAME
MAST \- Motif Alignment and Search Tool
.SH SYNOPSIS
.B mast <motif file> <sequence file>
[\fIoptions\fR]
.SH DESCRIPTION
MAST: Motif Alignment and Search Tool
.PP
Inputs
.TP
\fB<motif file>\fR
file containing motifs to use; normally a MEME output file
.TP
\fB<sequence file>\fR
search sequences in FASTA\-formatted database with motifs
.TP
\fB\-bfile <file>\fR
read background frequencies from <file>
.TP
\fB\-dblist\fR
read the <sequence file> as a list of FASTA\-formatted databases
.PP
Outputs
.TP
\fB\-o <dir>\fR
directory to output mast results; directory must not exist
.TP
\fB\-oc <dir>\fR
directory to output mast results with overwriting allowed
.TP
\fB\-hit_list\fR
print a machine\-readable list of all hits only; outputs to standard out and overrides \fB\-seqp\fR
.PP
Which Motifs To Use
.TP
\fB\-remcorr\fR
remove highly correlated motifs from query
.TP
\fB\-m <m>+\fR
use only motif number \fB<m>\fR (overrides \fB\-mev\fR); this can be
repeated to select multiple motifs
.TP
\fB\-c <count>\fR
only use the first \fB<count>\fR motifs or all motifs when \fB<count>\fR is zero (default: 0)
.TP
\fB\-mev <mev>\fR
use only motifs with E\-values less than \fB<mev>\fR
.TP
\fB\-diag <diag>\fR
nominal order and spacing of motifs is specified by \fB<diag>\fR which is a block diagram
.PP
DNA\-Only Options
.TP
\fB\-norc\fR
do not score reverse complement DNA strand
.TP
\fB\-sep\fR
score reverse complement DNA strand as a separate sequence
.TP
\fB\-dna\fR
translate DNA sequences to protein; motifs must be protein; sequences must be DNA
.TP
\fB\-comp\fR
adjust p\-values and E\-values for sequence composition
.PP
Which Results To Print
.TP
\fB\-ev <ev>\fR
print results for sequences with E\-value < \fB<ev>\fR (default: 10)
.PP
Appearance Of Block Diagrams
.TP
\fB\-mt <mt>\fR
show motif matches with p\-value < \fB<mt>\fR (default: 0.0001)
.TP
\fB\-w\fR
show weak matches (<mt> < p\-value < <mt>*10) in angle brackets in
the hit list or when the xml is converted to text
.TP
\fB\-best\fR
include only the best motif hits in \fB\-hit_list\fR diagrams
.TP
\fB\-seqp\fR
use SEQUENCE p\-values for motif thresholds (default: use
POSITION p\-values)
.PP
Miscellaneous
.TP
\fB\-mf <mf>\fR
in results use <mf> as motif file name
.TP
\fB\-df <df>\fR
in results use <df> as database name (ignored when \fB\-dblist\fR)
.TP
\fB\-dl <dl>\fR
in results use <dl> as link to search sequence names; token
SEQUENCEID is replaced with the FASTA sequence ID; ignored when
\fB\-dblist\fR
.TP
\fB\-minseqs <ms>\fR
lower bound on number of sequences in db
.TP
\fB\-nostatus\fR
do not print progress report
.TP
\fB\-notext\fR
do not create text output
.TP
\fB\-nohtml\fR
do not create html output
.IP
MAST is a tool for searching biological sequence databases for
sequences that contain one or more of a group of known motifs.
.IP
A motif is a sequence pattern that occurs repeatedly in a group of
related protein or DNA sequences. Motifs are represented as
position\-dependent scoring matrices that describe the score of each
possible letter at each position in the pattern. Individual motifs may
not contain gaps. Patterns with variable\-length gaps must be split into
two or more separate motifs before being submitted as input to MAST.
.IP
MAST takes as input a file containing the descriptions of one or more
motifs and searches a sequence database that you select for sequences
that match the motifs. The motif file can be the output of the MEME
motif discovery tool or any file in the appropriate format.
.IP
MAST outputs an XML file, which can then be converted into HTML or text
format. The XML file is designed for machine processing and the HTML
file for human viewing. The text format is available for backwards
compatibility. However, because the XML was optimised for HTML
generation, the text output for separate scoring mode is not identical
to the older text output, and some options were removed. The text format
will be unsupported in future releases, so we recommend that you migrate
any programs reading MAST output to the XML format.
.IP
MAST outputs three things:
.IP
1. The names of the high\-scoring sequences sorted by the strength of
the combined match of the sequence to all of the motifs in the
group.
.IP
2. Motif diagrams showing the order and spacing of the motifs within
each matching sequence.
.IP
3. Detailed annotation of each matching sequence showing the sequence
and the locations and strengths of matches to the motifs.
.IP
MAST works by calculating match scores for each sequence in the
database compared with each of the motifs in the group of motifs you
provide. For each sequence, the match scores are converted into various
types of p\-values and these are used to determine the overall match of
the sequence to the group of motifs and the probable order and spacing
of occurrences of the motifs in the sequence.
.IP
MAST generates a human\-readable file from the XML output containing:
.IP
* the version of MAST and the date it was built,
.IP
* the reference to cite if you use MAST in your research,
.IP
* a description of the databases and motifs used in the search,
.IP
* an explanation of the result,
.IP
* the identifiers and scores of the sequences that match the group of
motifs above a stated level of statistical significance, sorted by score,
.IP
* motif diagrams showing the order and spacing of occurrences of the
motifs in the significant sequences, and
.IP
* annotated sequences showing the positions and p\-values of all motif
occurrences in each of the high\-scoring sequences.
.IP
The HTML version is recommended for human reading and documents all of
its sections; the text version has no documentation for the first
section. That section lists each motif along with the
sequence that would achieve the best possible match score. To
avoid biased scores when multiple motif scores are combined, MAST also
computes the correlation between each pair of motifs. The
correlation between two motifs is the maximum sum of Pearson's
correlation coefficients for aligned columns divided by the width of
the shorter motif. The maximum is found by trying all alignments of the
two motifs. Motifs with correlations below 0.60 have little effect on
the accuracy of the combined scores. Pairs of motifs with higher
correlations should be removed from the query (see \fB\-remcorr\fR).
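.IP
The motif correlation can be sketched in Python as follows. This is one illustrative reading of the definition, not MAST's actual code. The helper names are hypothetical, each motif is assumed to be a list of columns (one score vector per motif position), every offset with at least one overlapping column is assumed to count as an alignment, and a column with zero variance is assigned a correlation of 0.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def motif_correlation(motif_a, motif_b):
    """Maximum over alignments of the summed column correlations,
    divided by the width of the shorter motif."""
    width_a, width_b = len(motif_a), len(motif_b)
    best = -math.inf
    # offset is the position in motif_a aligned with column 0 of motif_b
    for offset in range(-(width_b - 1), width_a):
        total = 0.0
        for j in range(width_b):
            i = offset + j
            if 0 <= i < width_a:
                total += pearson(motif_a[i], motif_b[j])
        best = max(best, total)
    return best / min(width_a, width_b)


# Hypothetical two-column DNA motif (scores for A, C, G, T in each column).
toy = [[1.0, -1.0, -1.0, -1.0], [-1.0, 2.0, -1.0, -1.0]]
too_similar = motif_correlation(toy, toy) > 0.60
```

A motif compared with itself gives a correlation of 1, so too_similar is True here and one of the two copies would be a candidate for removal.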
.PP
Match Scores
.IP
The match score of a motif to a position in a sequence is the sum, over
the positions of the motif, of the score that the position\-dependent
scoring matrix assigns to the sequence letter aligned with that motif
position. For example, if the sequence is
.IP
.nf
TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
   ========
.fi
.IP
and the motif is represented by the position\-dependent scoring matrix
(where each row of the matrix corresponds to a position in the motif)
.IP
.nf
Position       A       C       G       T
   1       1.447   0.188  \-4.025  \-4.095
   2       0.739   1.339  \-3.945  \-2.325
   3       1.764  \-3.562  \-4.197  \-3.895
   4       1.574  \-3.784  \-1.594  \-1.994
   5       1.602  \-3.935  \-4.054  \-1.370
   6       0.797  \-3.647  \-0.814   0.215
   7      \-1.280   1.873  \-0.607  \-1.933
   8      \-3.076   1.035   1.414  \-3.913
.fi
.IP
then the match score of the fourth position in the sequence
(underlined) is found by summing the score for T in position 1, G
in position 2 and so on until G in position 8. So the match score is
.IP
.nf
score = \-4.095 + \-3.945 + \-3.895 + \-1.994
      + \-4.054 + \-0.814 + \-1.933 + 1.414
      = \-19.316
.fi
.IP
The match scores for other positions in the sequence are calculated in
the same way. Match scores are calculated only where the motif fits
completely within the sequence, not where it
would overhang either end of the sequence.
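.IP
The worked example can be reproduced with a short Python sketch. The function name match_score is hypothetical and positions are taken to be 1\-based, as in the text. The T entry for motif position 7 is taken as \-1.933, the value used in the worked sum.

```python
# Scoring matrix from the example: one row per motif position.
MATRIX = [
    {"A": 1.447, "C": 0.188, "G": -4.025, "T": -4.095},
    {"A": 0.739, "C": 1.339, "G": -3.945, "T": -2.325},
    {"A": 1.764, "C": -3.562, "G": -4.197, "T": -3.895},
    {"A": 1.574, "C": -3.784, "G": -1.594, "T": -1.994},
    {"A": 1.602, "C": -3.935, "G": -4.054, "T": -1.370},
    {"A": 0.797, "C": -3.647, "G": -0.814, "T": 0.215},
    {"A": -1.280, "C": 1.873, "G": -0.607, "T": -1.933},
    {"A": -3.076, "C": 1.035, "G": 1.414, "T": -3.913},
]

SEQUENCE = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"


def match_score(sequence, start, matrix=MATRIX):
    """Score the motif placed at 1-based position `start`.

    Returns None when the motif would overhang either end of the sequence.
    """
    begin = start - 1
    end = begin + len(matrix)
    if begin < 0 or end > len(sequence):
        return None
    window = sequence[begin:end]
    return sum(row[letter] for row, letter in zip(matrix, window))


score_at_4 = match_score(SEQUENCE, 4)      # window TGTTGGTG
no_score = match_score(SEQUENCE, 37)       # would overhang the end
```

Here score_at_4 is approximately \-19.316 and no_score is None, because the sequence has 43 letters and an 8\-wide motif cannot start after position 36.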
.PP
P\-values
.IP
MAST reports all matches of a sequence to a motif or group of motifs in
terms of the p\-value of the match. MAST considers the p\-values of four
types of events:
.IP
* position p\-value: the match of a single position within a sequence
to a given motif,
.IP
* sequence p\-value: the best match of any position within a sequence
to a given motif,
.IP
* combined p\-value: the combined best matches of a sequence to a
group of motifs, and
.IP
* E\-value: observing a combined p\-value at least as small in a random
database of the same size.
.IP
All p\-values are based on a random sequence model that assumes each
position in a random sequence is generated according to the average
letter frequencies of all sequences in the appropriate (peptide or
nucleotide) non\-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/)
on September 22, 1996. This can be overridden by specifying the \fB\-bfile\fR
or \fB\-comp\fR options (see below). For DNA sequences, unless \fB\-norc\fR is given,
the positive and reverse complement strand frequencies are averaged
together.
.IP
1. \fB\-bfile\fR <bfile>: The random model uses the letter frequencies given
in <bfile> instead of the non\-redundant database frequencies. The
format of <bfile> is the same as that for the MEME \fB\-bfile\fR option;
see the MEME documentation for details. You can create files in the
appropriate format based on the base/residue composition of your
own FASTA sequence files using the command "fasta\-get\-markov"
included in the MEME distribution. Type fasta\-get\-markov on the
command line for documentation. (Sample files are also given in
directory tests: tests/nt.freq and tests/na.freq.)
.IP
2. \fB\-comp\fR: The random model uses the letter frequencies in the current
target sequence instead of the non\-redundant database frequencies.
This causes p\-values and E\-values to be compensated individually
for the actual composition of each sequence in the database. This
option can increase search time substantially due to the need to
compute a different score distribution for each high\-scoring
sequence. With this option and DNA sequences, the positive and
reverse complement strand frequencies are not averaged together.
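.IP
As an illustration of what per\-sequence letter frequencies and strand averaging mean, here is a minimal Python sketch. The function names are hypothetical, the sequence is assumed to contain only the uppercase letters A, C, G and T, and the code does not read or write the <bfile> format.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def letter_frequencies(sequence):
    """Observed frequency of each DNA letter in one sequence."""
    counts = Counter(sequence)
    total = sum(counts[letter] for letter in "ACGT")
    return {letter: counts[letter] / total for letter in "ACGT"}


def strand_averaged(freqs):
    """Average the frequencies of the given strand and its reverse complement.

    On the reverse complement strand every A becomes a T (and so on), so the
    averaged frequency of a letter is the mean of its own frequency and that
    of its complement.
    """
    return {letter: (freqs[letter] + freqs[COMPLEMENT[letter]]) / 2
            for letter in freqs}


single_strand = letter_frequencies("AAAC")
both_strands = strand_averaged(single_strand)
```

For the sequence AAAC the single\-strand frequencies are A 0.75 and C 0.25, while the averaged model gives A and T 0.375 each and C and G 0.125 each.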
.IP
Position p\-value
.IP
The p\-value of a match of a given position within a sequence to a motif
is defined as the probability of a randomly selected position in a
randomly generated sequence having a match score at least as large as
that of the given position. Note: If MAST is combining reverse
complement DNA strands, the position p\-value is not corrected for
multiple tests.
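.IP
Under a random model in which letters are drawn independently, this probability can be computed exactly by building the score distribution one motif position at a time. The Python sketch below shows the idea on a hypothetical two\-position toy motif with a uniform background. The function names are illustrative and MAST's own implementation may differ, for example in how it discretises scores.

```python
from collections import defaultdict


def score_distribution(matrix, background):
    """Return {score: probability} for a random word as wide as the motif."""
    dist = {0.0: 1.0}
    for row in matrix:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for letter, freq in background.items():
                # Round so that equal scores reached by different paths merge.
                nxt[round(score + row[letter], 6)] += prob * freq
        dist = dict(nxt)
    return dist


def position_p_value(score, matrix, background):
    """Probability that a random position scores at least `score`."""
    dist = score_distribution(matrix, background)
    return sum(p for s, p in dist.items() if s >= score - 1e-9)


# Hypothetical toy motif and background, for illustration only.
TOY_MATRIX = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p_best = position_p_value(3.0, TOY_MATRIX, UNIFORM)   # only the word AC
p_zero = position_p_value(0.0, TOY_MATRIX, UNIFORM)   # 7 of the 16 words
```

Only the word AC reaches the top score of 3, so p_best is 1/16, and seven of the sixteen possible words score at least 0, so p_zero is 7/16.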
.IP
Sequence p\-value
.IP
The p\-value of a match of a sequence to a motif is defined as the
probability of a randomly generated sequence of the same length having
a match score at least as large as the largest match score of any
position in the sequence.
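.IP
The text does not give a formula for this quantity. One common approximation, assumed here purely for illustration, treats the positions at which the motif fits as independent trials, so that the sequence p\-value is 1 \- (1 \- p)^n for best position p\-value p and n candidate positions. The function name is hypothetical.

```python
import math


def sequence_p_value(best_position_p, seq_len, motif_width):
    """Approximate sequence p-value from the best position p-value.

    Assumes the (seq_len - motif_width + 1) motif placements are independent.
    """
    positions = seq_len - motif_width + 1
    if positions <= 0:
        raise ValueError("sequence is shorter than the motif")
    if best_position_p >= 1.0:
        return 1.0
    # 1 - (1 - p)**n, computed in a way that stays accurate for tiny p.
    return -math.expm1(positions * math.log1p(-best_position_p))


example = sequence_p_value(0.5, seq_len=2, motif_width=1)   # 1 - 0.5**2
```

The log1p/expm1 form avoids the loss of precision that the direct expression suffers when p is very small, which is the usual case for good matches.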
.IP
Combined p\-value
.IP
The p\-value of a match of a sequence to a group of motifs is defined as
the probability of a randomly generated sequence of the same length
having sequence p\-values whose product is at least as small as the
product of the sequence p\-values of the matches of the motifs to the
given sequence.
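.IP
For n independent, uniformly distributed p\-values with product P, the probability of a product at least as small as P has the standard closed form P * sum over i from 0 to n\-1 of (\-ln P)^i / i!. The Python sketch below assumes that this closed form applies; the text itself does not state how MAST evaluates the probability, and the function name is hypothetical.

```python
import math


def combined_p_value(sequence_p_values):
    """P-value of the product of independent uniform p-values (each in (0, 1])."""
    n = len(sequence_p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    log_product = sum(math.log(p) for p in sequence_p_values)
    term = 1.0      # the i = 0 term of the sum
    total = 1.0
    for i in range(1, n):
        term *= -log_product / i
        total += term
    return math.exp(log_product) * total


single = combined_p_value([0.03])        # one motif: unchanged
pair = combined_p_value([0.5, 0.5])      # 0.25 * (1 + ln 4)
```

With a single motif the combined p\-value equals the sequence p\-value. With two motifs the result is larger than the raw product, which reflects the fact that small products arise by chance more easily when more p\-values are multiplied.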
.IP
E\-value
.IP
The E\-value of the match of a sequence in a database to a group of
motifs is defined as the expected number of sequences in a random
database of the same size that would match the motifs as well as the
sequence does. It is equal to the combined p\-value of the sequence
times the number of sequences in the database.
.PP
High\-scoring Sequences
.IP
MAST lists the names and part of the descriptive text of all sequences
whose E\-value is less than E. Sequences shorter than one or more of the
motifs are skipped. The sequences are sorted by increasing E\-value. The
value of E is set to 10 for the web server but is user\-selectable
(option \fB\-ev\fR) in the downloadable version of MAST.
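.IP
The selection of high\-scoring sequences can be sketched in Python as follows. The function name and the input dictionary are hypothetical, and the database is assumed to consist of exactly the sequences passed in, so that the E\-value is the combined p\-value times the number of entries.

```python
def high_scoring(combined_p_values, ev_threshold=10.0):
    """Return (name, E-value) pairs with E-value below the threshold,
    sorted by increasing E-value."""
    database_size = len(combined_p_values)
    e_values = {name: p * database_size
                for name, p in combined_p_values.items()}
    kept = [(name, e) for name, e in e_values.items() if e < ev_threshold]
    return sorted(kept, key=lambda item: item[1])


# Hypothetical combined p-values for a three-sequence database.
results = high_scoring({"seqA": 0.5, "seqB": 1e-6, "seqC": 1e-3},
                       ev_threshold=1.0)
names = [name for name, _ in results]
```

With a threshold of 1.0, seqA (E\-value 1.5) is dropped and the remaining sequences are listed as seqB, then seqC.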
.PP
Motif Diagrams
.IP
Motif diagrams show the order and spacing of non\-overlapping matches to
the motifs in each high\-scoring sequence. Motif occurrences are
determined based on the position p\-value of matches to the motif.
Strong matches (p\-value < M) are shown in square brackets (`[ ]'), weak
matches (M < p\-value < M x 10) are shown in angle brackets (`< >') and
the length of non\-motif sequence ("spacer") is shown between
underscores (`_'). For example,
.IP
27_[3]_44_<4>_99_[1]_7
.IP
shows an initial spacer of length 27, followed by a strong match to
motif 3, a spacer of length 44, a weak match to motif 4, a spacer of
length 99, a strong match to motif 1 and a final spacer of
length 7. The value of M is 0.0001 for the web server but is
user\-selectable (option \fB\-mt\fR) in the downloadable version of MAST.
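.IP
A diagram in this notation is easy to take apart programmatically. The Python sketch below is a minimal, hypothetical parser, not part of MAST. It keeps the motif label as a string because labels may carry a strand sign.

```python
import re

# One field is a strong match [m], a weak match <m>, or a spacer length.
FIELD = re.compile(r"\[(?P<strong>[^\]]+)\]|<(?P<weak>[^>]+)>|(?P<spacer>\d+)")


def parse_diagram(diagram):
    """Split a motif diagram into (kind, value) pairs."""
    parts = []
    for field in diagram.split("_"):
        match = FIELD.fullmatch(field)
        if match is None:
            raise ValueError(f"unrecognised diagram field: {field!r}")
        if match.group("spacer") is not None:
            parts.append(("spacer", int(field)))
        elif match.group("strong") is not None:
            parts.append(("strong", match.group("strong")))
        else:
            parts.append(("weak", match.group("weak")))
    return parts


parsed = parse_diagram("27_[3]_44_<4>_99_[1]_7")
spacer_total = sum(value for kind, value in parsed if kind == "spacer")
```

For the example diagram, parsed starts with ("spacer", 27), ("strong", "3") and the spacers add up to 177 letters of non\-motif sequence.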
.PP
Annotated Sequences
.IP
MAST annotates each high\-scoring sequence by printing the sequence
along with the position and strength of all the non\-overlapping motif
occurrences. The four lines above each motif occurrence contain,
respectively,
.IP
* the motif number of the occurrence,
.IP
* the position p\-value of the occurrence,
.IP
* the best possible match to the motif, and
.IP
* a plus sign (`+') above each letter in the occurrence that has a
positive match score to the motif.
.IP
The best possible match to a motif is the sequence of letters which
would achieve the highest match score.
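.IP
Both the best possible match and the line of plus signs follow directly from the scoring matrix. The Python sketch below uses the first three rows of the example matrix from the Match Scores section; the function names are hypothetical.

```python
# First three rows of the example scoring matrix (one row per motif position).
MATRIX = [
    {"A": 1.447, "C": 0.188, "G": -4.025, "T": -4.095},
    {"A": 0.739, "C": 1.339, "G": -3.945, "T": -2.325},
    {"A": 1.764, "C": -3.562, "G": -4.197, "T": -3.895},
]


def best_possible_match(matrix):
    """Highest-scoring letter at each motif position."""
    return "".join(max(row, key=row.get) for row in matrix)


def plus_line(occurrence, matrix):
    """A '+' under each letter with a positive score, a space otherwise."""
    return "".join("+" if row[letter] > 0 else " "
                   for row, letter in zip(matrix, occurrence))


best = best_possible_match(MATRIX)      # "ACA"
marks = plus_line("AAT", MATRIX)        # "++ "
```

For the occurrence AAT the first two letters score positively (1.447 and 0.739) and the third does not (\-3.895), hence two plus signs followed by a blank.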
.PP
Hit List
.IP
If you specify the \fB\-hit_list\fR switch to MAST, MAST outputs ONLY a list
of "hits" in an easily machine\-readable format. Each line corresponds to
one motif occurrence in one sequence. The format of the hit lines is
.IP
[<sequence_name> <strand><motif> <start> <end> <score> <p\-value>]+
.IP
where
.IP
.nf
<sequence_name> is the name of the sequence containing the hit,
<strand>        is the strand (+ or \- for DNA, blank for protein),
<motif>         is the motif number,
<start>         is the starting position of the hit,
<end>           is the ending position of the hit,
<score>         is the score of the hit, and
<p\-value>       is the position p\-value of the hit.
.fi
.IP
Two comment lines (starting with "#") are written above the list of
hits, and the MAST command line is printed as a comment line after the
list. An example of the output using the \fB\-hit_list\fR switch to MAST is:
.IP
.nf
# All non\-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p\-value
ce1cg \-2 8 22  1459.90 1.67e\-06
ara +2 2 16  1661.18 5.04e\-08
bglr1 +2 1 15  1274.97 1.42e\-05
cya \-2 19 33  1101.37 6.64e\-05
gale +2 5 19  1076.21 8.11e\-05
ilv \-2 6 20  1098.85 6.78e\-05
malk +2 37 51  1085.02 7.56e\-05
ompa +2 5 19  1583.18 2.43e\-07
# mast tests/meme/meme.crp0.oops tests/common/crp0.s \-hit_list \-m 2
.fi
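.IP
Because the hit list is whitespace\-separated with "#" comment lines, a few lines of Python are enough to read it. The sketch below is a hypothetical reader, not a tool shipped with MAST, and it assumes sequence names contain no whitespace.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_name: str
    strand: str       # "+", "-" or "" for protein
    motif: int
    start: int
    end: int
    score: float
    p_value: float


def parse_hit_list(text):
    """Parse -hit_list output, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, motif, start, end, score, p_value = line.split()
        strand = motif[0] if motif[0] in "+-" else ""
        hits.append(Hit(name, strand, int(motif.lstrip("+-")),
                        int(start), int(end), float(score), float(p_value)))
    return hits


SAMPLE = """\
# All non-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p-value
ce1cg -2 8 22  1459.90 1.67e-06
ara +2 2 16  1661.18 5.04e-08
# mast tests/meme/meme.crp0.oops tests/common/crp0.s -hit_list -m 2
"""

hits = parse_hit_list(SAMPLE)
best_hit = min(hits, key=lambda hit: hit.p_value)
```

In this sample best_hit is the hit in sequence ara, which has the smallest position p\-value.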
.PP
Loading Multiple Sequence Databases
.IP
MAST can load multiple sequence databases if you put their file
names into a file and specify that file in place of the sequence
database, together with the option \fB\-dblist\fR.
.IP
The file list has one file name on each line, optionally followed by a
name and a link, as follows:
.IP
.nf
<file> [<name> <link>]
\&...
\&...
.fi
.IP
If the name is specified then it will be used instead of the file name
in the output. If the link is specified then all sequences for that
database in the HTML output will have a hyperlink to the URL specified,
with the text SEQUENCEID replaced by the FASTA sequence ID.
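.IP
The list format and the SEQUENCEID substitution can be illustrated with a short Python sketch. The function names, file names and URL below are hypothetical, and the fields are assumed to be separated by whitespace, which the text does not state explicitly.

```python
def parse_dblist(text):
    """Parse lines of the form '<file> [<name> <link>]'."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        entries.append({
            "file": fields[0],
            "name": fields[1] if len(fields) > 1 else None,
            "link": fields[2] if len(fields) > 2 else None,
        })
    return entries


def sequence_link(link, sequence_id):
    """Fill the SEQUENCEID token in a link template."""
    return link.replace("SEQUENCEID", sequence_id)


# Hypothetical list file contents.
DBLIST = """\
proteins.fasta MyProteins http://example.com/entry/SEQUENCEID
extra.fasta
"""

entries = parse_dblist(DBLIST)
url = sequence_link(entries[0]["link"], "P12345")
```

The second entry has neither a name nor a link, so the output would fall back to the file name and contain no hyperlinks for its sequences.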
.PP
EXAMPLES:
.IP
The following examples assume that file "meme.results" is the output of
a MEME run containing at least 3 motifs which was created on the
training set "training.fasta" and file SwissProt is a copy of the
Swiss\-Prot database on your local disk. DNA_DB is a copy of a DNA
database on your local disk.
.IP
1. Annotate the training set:
.IP
mast meme.results training.fasta
.IP
2. Find sequences in the SwissProt database that match the motifs and
annotate them:
.IP
mast meme.results SwissProt
.IP
3. Show sequences with weaker combined matches to motifs:
.IP
mast meme.results SwissProt \-ev 200
.IP
4. Include a nominal order and spacing of the first three motifs in
the calculation of the sequence p\-values to increase the
sensitivity of the search for matching sequences:
.IP
mast meme.results SwissProt \-diag "9\-[2]\-61\-[1]\-62\-[3]\-91"
.IP
5. Use only the first and third motifs in the search:
.IP
mast meme.results SwissProt \-m 1 \-m 3
.IP
6. Use only the first two motifs in the search:
.IP
mast meme.results SwissProt \-c 2
.IP
7. Search DNA sequences using protein motifs, adjusting p\-values and
E\-values for each sequence by that sequence's composition:
.IP
mast meme.results DNA_DB \-dna \-comp
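.IP
When MAST is driven from a script, it can help to assemble the argument list programmatically. The Python sketch below builds command lines equivalent to examples 3 and 5 using only options documented above. The helper names are hypothetical, and nothing is executed unless run_mast is called on a system where mast is installed.

```python
import subprocess


def mast_command(motif_file, sequence_file, motifs=(), ev=None, extra=()):
    """Build an argument list for mast from documented options."""
    command = ["mast", motif_file, sequence_file]
    for motif in motifs:
        command += ["-m", str(motif)]       # -m may be repeated
    if ev is not None:
        command += ["-ev", str(ev)]
    command += list(extra)
    return command


def run_mast(command):
    """Run mast and raise if it exits with a non-zero status."""
    return subprocess.run(command, check=True)


example_3 = mast_command("meme.results", "SwissProt", ev=200)
example_5 = mast_command("meme.results", "SwissProt", motifs=[1, 3])
```

Passing the arguments as a list avoids shell quoting problems, which matters for values such as the \-diag block diagram that contain square brackets.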
859E\-values for each sequence by that sequence's composition:
860.IP
861mast meme.results DNA_DB \fB\-dna\fR \fB\-comp\fR
862.PP