Ticket #2966: mast.1

File mast.1, 32.3 KB (added by onlyjob, 11 years ago)

1	.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.40.10.
2	.TH MAST "1" "February 2013" "Motif Alignment and Search Tool" "User Commands"
3	.SH NAME
4	MAST \- Motif Alignment and Search Tool
5	.SH SYNOPSIS
6	.B mast <motif file> <sequence file>
7	[\fIoptions\fR]
8	.SH DESCRIPTION
9	MAST: Motif Alignment and Search Tool
10	.PP
11	Inputs
12	.TP
13	\fB<motif file>\fR
14	file containing motifs to use; normally a MEME output file
15	.TP
16	\fB<sequence file>\fR
17	FASTA\-formatted sequence database to search with the motifs
18	.TP
19	\fB\-bfile <file>\fR
20	read background frequencies from <file>
21	.TP
22	\fB\-dblist\fR
23	read the <sequence file> as a list of FASTA\-formatted databases
24	.PP
25	Outputs
26	.TP
27	\fB\-o <dir>\fR
28	directory to output mast results; directory must not exist
29	.TP
30	\fB\-oc <dir>\fR
31	directory to output mast results with overwriting allowed
32	.TP
33	\fB\-hit_list\fR
34	print a machine\-readable list of all hits only; outputs to standard out and overrides \fB\-seqp\fR
35	.PP
36	Which Motifs To Use
37	.TP
38	\fB\-remcorr\fR
39	remove highly correlated motifs from query
40	.TP
41	\fB\-m <m>+\fR
42	use only motif number \fB<m>\fR (overrides \fB\-mev\fR); this can be
43	repeated to select multiple motifs
44	.TP
45	\fB\-c <count>\fR
46	only use the first \fB<count>\fR motifs or all motifs when \fB<count>\fR is zero (default: 0)
47	.TP
48	\fB\-mev <mev>\fR
49	use only motifs with E\-values less than \fB<mev>\fR
50	.TP
51	\fB\-diag <diag>\fR
52	nominal order and spacing of motifs is specified by \fB<diag>\fR which is a block diagram
53	.PP
54	DNA\-Only Options
55	.TP
56	\fB\-norc\fR
57	do not score reverse complement DNA strand
58	.TP
59	\fB\-sep\fR
60	score reverse complement DNA strand as a separate sequence
61	.TP
62	\fB\-dna\fR
63	translate DNA sequences to protein; motifs must be protein; sequences must be DNA
64	.TP
65	\fB\-comp\fR
66	adjust p\-values and E\-values for sequence composition
67	.PP
68	Which Results To Print
69	.TP
70	\fB\-ev <ev>\fR
71	print results for sequences with E\-value < \fB<ev>\fR (default: 10)
72	.PP
73	Appearance Of Block Diagrams
74	.TP
75	\fB\-mt <mt>\fR
76	show motif matches with p\-value < \fB<mt>\fR (default: 0.0001)
.TP
\fB\-w\fR
show weak matches (<mt> < p\-value < <mt>*10) in angle brackets in the hit list or when the xml is converted to text
.TP
\fB\-best\fR
include only the best motif hits in \fB\-hit_list\fR diagrams
.TP
\fB\-seqp\fR
use SEQUENCE p\-values for motif thresholds (default: use POSITION p\-values)
.PP
Miscellaneous
.TP
\fB\-mf <mf>\fR
in results use <mf> as motif file name
.TP
\fB\-df <df>\fR
in results use <df> as database name (ignored when \fB\-dblist\fR)
.TP
\fB\-dl <dl>\fR
in results use <dl> as link to search sequence names; token SEQUENCEID is replaced with the FASTA sequence ID; ignored when \fB\-dblist\fR
.TP
\fB\-minseqs <ms>\fR
lower bound on number of sequences in db
.TP
\fB\-nostatus\fR
do not print progress report
.TP
\fB\-notext\fR
do not create text output
.TP
\fB\-nohtml\fR
do not create html output
92	.IP
93	MAST is a tool for searching biological sequence databases for
94	sequences that contain one or more of a group of known motifs.
95	.IP
96	A motif is a sequence pattern that occurs repeatedly in a group of
97	related protein or DNA sequences. Motifs are represented as
98	position\-dependent scoring matrices that describe the score of each
99	possible letter at each position in the pattern. Individual motifs may
100	not contain gaps. Patterns with variable\-length gaps must be split into
101	two or more separate motifs before being submitted as input to MAST.
102	.IP
103	MAST takes as input a file containing the descriptions of one or more
104	motifs and searches a sequence database that you select for sequences
105	that match the motifs. The motif file can be the output of the MEME
106	motif discovery tool or any file in the appropriate format.
107	.IP
108	MAST outputs an xml file which can then be converted into html or text
109	format. The xml file is designed for machine processing and the html
110	file is designed for human viewing. The text format is available for
111	backwards compatibility. However, because the xml was designed to optimise
112	html generation, the text output for separate scoring mode is not
113	identical to earlier versions and some options were removed. The text format
114	will be unsupported in future releases, so we recommend you migrate any
115	programs reading mast output to the xml format.
116	.IP
117	MAST outputs three things:
118	.IP
119	1. The names of the high\-scoring sequences sorted by the strength of
121	the combined match of the sequence to all of the motifs in the
122	group.
123	.IP
124	2. Motif diagrams showing the order and spacing of the motifs within
126	each matching sequence.
127	.IP
128	3. Detailed annotation of each matching sequence showing the sequence
130	and the locations and strengths of matches to the motifs.
131	.IP
132	MAST works by calculating match scores for each sequence in the
133	database compared with each of the motifs in the group of motifs you
134	provide. For each sequence, the match scores are converted into various
135	types of p\-values and these are used to determine the overall match of
136	the sequence to the group of motifs and the probable order and spacing
137	of occurrences of the motifs in the sequence.
138	.IP
139	MAST generates a human readable file from the xml output containing:
140	.IP
141	* the version of MAST and the date it was built,
142	* the reference to cite if you use MAST in your research,
143	* a description of the databases and motifs used in the search,
144	* an explanation of the result,
145	* the identifier and score of each sequence that matches the group of
147	motifs above a stated level of statistical significance, sorted by score,
148	.IP
149	* motif diagrams showing the order and spacing of occurrences of the
151	motifs in the significant sequences, and
152	.IP
153	* annotated sequences showing the positions and p\-values of all motif
155	occurrences in each of the high\-scoring sequences.
156	.IP
157	The html version is the recommended version for human reading and has
158	all sections documented; the text version, however, has no documentation
159	for the first section. That section lists each motif along with the
160	sequence that would achieve the best possible match score. In order to
161	avoid biased scores when multiple motif scores are combined, MAST also
162	computes the correlation between each pair of motifs. The
163	correlation between two motifs is the maximum sum of Pearson's
164	correlation coefficients for aligned columns divided by the width of
165	the shorter motif. The maximum is found by trying all alignments of the
166	two motifs. Pairs of motifs with correlations below 0.60 have little
167	effect on the accuracy of the combined scores. Pairs of motifs with higher
168	correlations should be removed from the query (see \fB\-remcorr\fR).
169	.PP
170	Match Scores
171	.IP
172	The match score of a motif to a position in a sequence is the sum of
173	the score from each column of the position\-dependent scoring matrix
174	corresponding to the letter at that position in the sequence. For
175	example, if the sequence is
176	.IP
177	TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
178	.IP
179	   ========
180	.IP
181	and the motif is represented by the position\-dependent scoring matrix
182	(where each row of the matrix corresponds to a position in the motif)
183	.TP
184	Position
185	A C G T
186	.TP
187	1
188	1.447 0.188 \-4.025 \-4.095
189	.TP
190	2
191	0.739 1.339 \-3.945 \-2.325
192	.TP
193	3
194	1.764 \-3.562 \-4.197 \-3.895
195	.TP
196	4
197	1.574 \-3.784 \-1.594 \-1.994
198	.TP
199	5
200	1.602 \-3.935 \-4.054 \-1.370
201	.TP
202	6
203	0.797 \-3.647 \-0.814 0.215
204	.TP
205	7
206	\-1.280 1.873 \-0.607 \-1.933
207	.TP
208	8
209	\-3.076 1.035 1.414 \-3.913
210	.IP
211	then the match score of the fourth position in the sequence
212	(underlined) would be found by summing the score for T in position 1, G
213	in position 2 and so on until G in position 8. So the match score would
214	be
215	.IP
216	score = \-4.095 + \-3.945 + \-3.895 + \-1.994
217	.IP
218	+ \-4.054 + \-0.814 + \-1.933 + 1.414
219	.IP
220	= \-19.316
221	.IP
222	The match scores for other positions in the sequence are calculated in
223	the same way. Match scores are only calculated if the match completely
224	fits within the sequence. Match scores are not calculated if the motif
225	would overhang either end of the sequence.
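.IP
The worked example above can be reproduced with a short Python sketch. This is an illustration only, not MAST's own code: the names MATRIX, SEQUENCE, match_score and all_match_scores are made up for the example, and the T score in row 7 is taken as \-1.933, the value that appears in the worked sum.

```python
# Scoring matrix from the example: one row per motif position, columns A, C, G, T.
# Assumption: the T score in row 7 is -1.933, the value used in the worked sum.
MATRIX = [
    {"A": 1.447, "C": 0.188, "G": -4.025, "T": -4.095},
    {"A": 0.739, "C": 1.339, "G": -3.945, "T": -2.325},
    {"A": 1.764, "C": -3.562, "G": -4.197, "T": -3.895},
    {"A": 1.574, "C": -3.784, "G": -1.594, "T": -1.994},
    {"A": 1.602, "C": -3.935, "G": -4.054, "T": -1.370},
    {"A": 0.797, "C": -3.647, "G": -0.814, "T": 0.215},
    {"A": -1.280, "C": 1.873, "G": -0.607, "T": -1.933},
    {"A": -3.076, "C": 1.035, "G": 1.414, "T": -3.913},
]

SEQUENCE = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"


def match_score(sequence, matrix, position):
    """Score the motif against `sequence` starting at 1-based `position`.

    Returns None when the motif would overhang either end of the sequence.
    """
    start = position - 1
    if start < 0 or start + len(matrix) > len(sequence):
        return None
    return sum(row[sequence[start + i]] for i, row in enumerate(matrix))


def all_match_scores(sequence, matrix):
    """Return (position, score) for every position where the motif fits."""
    last = len(sequence) - len(matrix) + 1
    return [(p, match_score(sequence, matrix, p)) for p in range(1, last + 1)]


print(round(match_score(SEQUENCE, MATRIX, 4), 3))  # -19.316
```

The sequence has 43 letters and the motif is 8 wide, so all_match_scores returns scores for positions 1 to 36 only.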
226	.PP
227	P\-values
228	.IP
229	MAST reports all matches of a sequence to a motif or group of motifs in
230	terms of the p\-value of the match. MAST considers the p\-values of four
231	types of events:
232	.IP
233	* position p\-value: the match of a single position within a sequence
235	to a given motif,
236	.IP
237	* sequence p\-value: the best match of any position within a sequence
239	to a given motif,
240	.IP
241	* combined p\-value: the combined best matches of a sequence to a
243	group of motifs, and
244	.IP
245	* E\-value: observing a combined p\-value at least as small in a random
247	database of the same size.
248	.IP
249	All p\-values are based on a random sequence model that assumes each
250	position in a random sequence is generated according to the average
251	letter frequencies of all sequences in the appropriate (peptide or
252	nucleotide) non\-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/)
253	as of September 22, 1996. These frequencies can be overridden by specifying the \fB\-bfile\fR
254	or \fB\-comp\fR options (see below). For DNA sequences, unless \fB\-norc\fR is given,
255	the positive and reverse complement strand frequencies are averaged
256	together.
257	.IP
258	1. \fB\-bfile\fR <bfile>: The random model uses the letter frequencies given
260	in <bfile> instead of the non\-redundant database frequencies. The
261	format of <bfile> is the same as that for the MEME \fB\-bfile\fR option;
262	see the MEME documentation for details. You can create files in the
263	appropriate format based on the base/residue composition of your
264	own FASTA sequence files using the command "fasta\-get\-markov"
265	included in the MEME distribution. Type fasta\-get\-markov on the
266	command line for documentation. (Sample files are also given in
267	directory tests: tests/nt.freq and tests/na.freq.)
268	.IP
269	2. \fB\-comp\fR: The random model uses the letter frequencies in the current
271	target sequence instead of the non\-redundant database frequencies.
272	This causes p\-values and E\-values to be compensated individually
273	for the actual composition of each sequence in the database. This
274	option can increase search time substantially due to the need to
275	compute a different score distribution for each high\-scoring
276	sequence. With this option and DNA sequences, the positive and
277	reverse complement strand frequencies are not averaged together.
278	.IP
279	Position p\-value
280	.IP
281	The p\-value of a match of a given position within a sequence to a motif
282	is defined as the probability of a randomly selected position in a
283	randomly generated sequence having a match score at least as large as
284	that of the given position. Note: If MAST is combining reverse
285	complement DNA strands, the position p\-value is not corrected for
286	multiple tests.
287	.IP
288	Sequence p\-value
289	.IP
290	The p\-value of a match of a sequence to a motif is defined as the
291	probability of a randomly generated sequence of the same length having
292	a match score at least as large as the largest match score of any
293	position in the sequence.
294	.IP
295	Combined p\-value
296	.IP
297	The p\-value of a match of a sequence to a group of motifs is defined as
298	the probability of a randomly generated sequence of the same length
299	having sequence p\-values whose product is at least as small as the
300	product of the sequence p\-values of the matches of the motifs to the
301	given sequence.
302	.IP
303	E\-value
304	.IP
305	The E\-value of the match of a sequence in a database to a group of
306	motifs is defined as the expected number of sequences in a random
307	database of the same size that would match the motifs as well as the
308	sequence does. It is equal to the combined p\-value of the sequence
309	times the number of sequences in the database.
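.IP
The last two definitions can be sketched in Python. The E\-value line follows directly from the text above. The combined p\-value function is an assumption rather than MAST's confirmed algorithm: it uses the standard closed form for the product of n independent p\-values that are uniform under the random model. The function names are invented for this example.

```python
import math


def combined_p_value(sequence_p_values):
    """P-value of the product of n sequence p-values.

    Assumes the p-values are independent and uniform on (0, 1] under the
    random model; then P(product <= x) = x * sum_{i<n} (-ln x)^i / i!.
    """
    n = len(sequence_p_values)
    x = math.prod(sequence_p_values)
    if x <= 0.0:
        return 0.0
    log_term = -math.log(x)
    return x * sum(log_term ** i / math.factorial(i) for i in range(n))


def e_value(combined_p, num_sequences):
    """Expected number of equally good matches in a random database."""
    return combined_p * num_sequences


# One motif: the combined p-value is just the sequence p-value.
print(combined_p_value([0.5]))
# A combined p-value of 0.001 in a database of 5000 sequences.
print(e_value(0.001, 5000))
```

With a single motif the sum has one term, so the combined p\-value reduces to the sequence p\-value itself.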
310	.PP
311	High\-scoring Sequences
312	.IP
313	MAST lists the names and part of the descriptive text of all sequences
314	whose E\-value is less than E. Sequences shorter than one or more of the
315	motifs are skipped. The sequences are sorted by increasing E\-value. The
316	value of E is set to 10 for the web server but is user\-selectable
317	(option \fB\-ev\fR) in the downloadable version of MAST.
318	.PP
319	Motif Diagrams
320	.IP
321	Motif diagrams show the order and spacing of non\-overlapping matches to
322	the motifs in each high\-scoring sequence. Motif occurrences are
323	determined based on the position p\-value of matches to the motif.
324	Strong matches (p\-value < M) are shown in square brackets (`[ ]'), weak
325	matches (M < p\-value < M x 10) are shown in angle brackets (`< >') and
326	the length of non\-motif sequence ("spacer") is shown between
327	underscores (`_'). For example,
328	.IP
329	27_[3]_44_<4>_99_[1]_7
330	.IP
331	shows an initial spacer of length 27, followed by a strong match to
332	motif 3, a spacer of length 44, a weak match to motif 4, a spacer of
333	length 99, a strong match to motif 1 and a final non\-motif sequence of
334	length 7. The value of M is 0.0001 for the web server but is
335	user\-selectable (option \fB\-mt\fR) in the downloadable version of MAST.
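.IP
A diagram in the form shown above can be split into its parts with a few lines of Python. This is a minimal sketch that handles only the syntax described in this section (spacers, square brackets and angle brackets separated by underscores); parse_diagram and TOKEN are names invented for the example.

```python
import re

# A field is a strong match "[n]", a weak match "<n>", or a spacer length "n".
TOKEN = re.compile(r"\[(\d+)\]|<(\d+)>|(\d+)")


def parse_diagram(diagram):
    """Split a motif diagram into (kind, number) pairs.

    kind is "strong" or "weak" (number is the motif number) or
    "spacer" (number is the spacer length).
    """
    parts = []
    for field in diagram.split("_"):
        m = TOKEN.fullmatch(field)
        if m is None:
            raise ValueError(f"unrecognised diagram field: {field!r}")
        strong, weak, spacer = m.groups()
        if strong is not None:
            parts.append(("strong", int(strong)))
        elif weak is not None:
            parts.append(("weak", int(weak)))
        else:
            parts.append(("spacer", int(spacer)))
    return parts


for kind, number in parse_diagram("27_[3]_44_<4>_99_[1]_7"):
    print(kind, number)
```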
336	.PP
337	Annotated Sequences
338	.IP
339	MAST annotates each high\-scoring sequence by printing the sequence
340	along with the position and strength of all the non\-overlapping motif
341	occurrences. The four lines above each motif occurrence contain,
342	respectively,
343	.IP
344	* the motif number of the occurrence,
345	* the position p\-value of the occurrence,
346	* the best possible match to the motif, and
347	* a plus sign (`+') above each letter in the occurrence that has a
349	positive match score to the motif.
350	.IP
351	The best possible match to a motif is the sequence of letters which
352	would achieve the highest match score.
353	.PP
354	Hit List
355	.IP
356	If you specify the \fB\-hit_list\fR switch to MAST, MAST outputs ONLY a list
357	of "hits" in easily machine\-readable format. Each line corresponds to
358	one motif occurrence in one sequence. The format of the hit lines is
359	.IP
360	[<sequence_name> <strand><motif> <start> <end> <score> <p\-value>]+
361	.IP
362	where
363	.IP
364	<sequence_name> is the name of the sequence containing the hit,
365	<strand> is the strand (+ or \- for DNA, blank for protein),
366	<motif> is the motif number,
367	<start> is the starting position of the hit,
368	<end> is the ending position of the hit,
369	<score> is the score of the hit, and
370	<p\-value> is the position p\-value of the hit.
371	.IP
372	Two comment lines (starting with "#") are written above the list of
373	hits, and the MAST command line is printed as a comment line after the
374	list. An example of the output using the \fB\-hit_list\fR switch to MAST is:
375	.IP
376	# All non\-overlapping hits in all sequences.
377	# sequence_name motif hit_start hit_end score hit_p\-value
378	ce1cg \-2 8 22 1459.90 1.67e\-06
379	ara +2 2 16 1661.18 5.04e\-08
380	bglr1 +2 1 15 1274.97 1.42e\-05
381	cya \-2 19 33 1101.37 6.64e\-05
382	gale +2 5 19 1076.21 8.11e\-05
383	ilv \-2 6 20 1098.85 6.78e\-05
384	malk +2 37 51 1085.02 7.56e\-05
385	ompa +2 5 19 1583.18 2.43e\-07
386	# mast tests/meme/meme.crp0.oops tests/common/crp0.s \-hit_list \-m 2
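.IP
Because the hit list is meant for machine reading, a small parser is easy to write. The Python sketch below assumes that fields are separated by whitespace and that sequence names contain no spaces. Hit and parse_hit_list are names invented for this example, and the sample uses plain hyphens where this page shows roff\-escaped ones.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    sequence_name: str
    strand: Optional[str]  # "+" or "-" for DNA, None for protein
    motif: int
    start: int
    end: int
    score: float
    p_value: float


def parse_hit_list(text):
    """Parse hit-list output, skipping blank lines and '#' comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, motif, start, end, score, p_value = line.split()
        strand = motif[0] if motif[0] in "+-" else None
        number = motif[1:] if strand else motif
        hits.append(Hit(name, strand, int(number), int(start), int(end),
                        float(score), float(p_value)))
    return hits


SAMPLE = """\
# All non-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p-value
ce1cg -2 8 22 1459.90 1.67e-06
ara +2 2 16 1661.18 5.04e-08
ompa +2 5 19 1583.18 2.43e-07
# mast tests/meme/meme.crp0.oops tests/common/crp0.s -hit_list -m 2
"""

for hit in parse_hit_list(SAMPLE):
    print(hit.sequence_name, hit.strand, hit.motif, hit.p_value)
```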
387	.PP
388	Loading Multiple Sequence Databases
389	.IP
390	Multiple sequence databases can be loaded by MAST by putting the file
391	names into a file and specifying that file instead of the sequence
392	database with the option \fB\-dblist\fR.
393	.IP
394	The file list has one file name on each line with the optional name and
395	link as follows:
396	.IP
397	<file> [<name> <link>]
398	\&...
399	\&...
400	.IP
401	If the name is specified then it will be used instead of the file name
402	in the output. If the link is specified then all sequences for that
403	database in the html output will have a hyperlink to the URL specified
404	with the text SEQUENCEID replaced with the FASTA sequence id.
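.IP
The list file and the SEQUENCEID substitution can be sketched in Python as follows. This is an illustration under stated assumptions: fields are taken to be separated by whitespace, the name and link are both treated as optional, and the file names and URL in the sample are hypothetical. parse_db_list and sequence_url are names invented for the example.

```python
def parse_db_list(text):
    """Parse lines of the form "<file> [<name> <link>]" into dictionaries.

    Assumes whitespace-separated fields; when no name is given the file
    name is used, and when no link is given the link is None.
    """
    databases = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        databases.append({
            "file": fields[0],
            "name": fields[1] if len(fields) > 1 else fields[0],
            "link": fields[2] if len(fields) > 2 else None,
        })
    return databases


def sequence_url(link, sequence_id):
    """Replace the SEQUENCEID token with a FASTA sequence id."""
    return link.replace("SEQUENCEID", sequence_id)


# Hypothetical list file contents.
SAMPLE = """\
proteins.fasta MyProteins https://example.org/entry/SEQUENCEID
extra.fasta
"""

for db in parse_db_list(SAMPLE):
    print(db["file"], db["name"], db["link"])
```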
405	.PP
406	EXAMPLES:
407	.IP
408	The following examples assume that file "meme.results" is the output of
409	a MEME run on the training set "training.fasta" that contains at least
410	3 motifs, and that file SwissProt is a copy of the
411	Swiss\-Prot database on your local disk. DNA_DB is a copy of a DNA
412	database on your local disk.
413	.IP
414	1. Annotate the training set:
415	.IP
416	mast meme.results training.fasta
417	.IP
418	2. Find sequences matching the motifs and annotate them in the
420	SwissProt database:
421	.IP
422	mast meme.results SwissProt
423	.IP
424	3. Show sequences with weaker combined matches to motifs:
425	.IP
426	mast meme.results SwissProt \fB\-ev\fR 200
427	.IP
428	4. Include a nominal order and spacing of the first three motifs in
430	the calculation of the sequence p\-values to increase the
431	sensitivity of the search for matching sequences:
432	.IP
433	mast meme.results SwissProt \fB\-diag\fR "9\-[2]\-61\-[1]\-62\-[3]\-91"
434	.IP
435	5. Use only the first and third motifs in the search:
436	.IP
437	mast meme.results SwissProt \fB\-m\fR 1 \fB\-m\fR 3
438	.IP
439	6. Use only the first two motifs in the search:
440	.IP
441	mast meme.results SwissProt \fB\-c\fR 2
442	.IP
443	7. Search DNA sequences using protein motifs, adjusting p\-values and
445	E\-values for each sequence by that sequence's composition:
446	.IP
447	mast meme.results DNA_DB \fB\-dna\fR \fB\-comp\fR
453	.PP
454	Usage
455	.IP
456	mast <motif file> <sequence file> [options]
457	.IP
458	Inputs
459	<motif file> file containing motifs to use; normally a MEME output
460	file
461	<sequence file> search sequences in FASTA\-formatted database with
462	motifs;
463	.TP\fB\-bfile <file>\fR read background frequencies from <file>
464	.TP\fB\-dblist\fR read the <sequence file> as a list of FASTA\-formatted
465	databases
466	Outputs
467	.TP\fB\-o <dir>\fR directory to output mast results; directory must not exist
468	.TP\fB\-oc <dir>\fR directory to output mast results with overwriting allowed
469	.TP\fB\-hit_list\fR print a machine\-readable list of all hits only; outputs to
470	standard out and overrides \fB\-seqp\fR
471	Which Motifs To Use
472	.TP\fB\-remcorr\fR remove highly correlated motifs from query
473	.TP\fB\-m <m>\fR+ use only motif number <m> (overrides \fB\-mev\fR); this can be
474	repeated to select multiple motifs
475	.TP\fB\-c <count>\fR only use the first <count> motifs or all motifs when
476	<count> is zero (default: 0)
477	.TP\fB\-mev <mev>\fR use only motifs with E\-values less than <mev>
478	.TP\fB\-diag <diag>\fR nominal order and spacing of motifs is specified by
479	<diag> which is a block diagram
480	DNA\-Only Options
481	.TP\fB\-norc\fR do not score reverse complement DNA strand
482	.TP\fB\-sep\fR score reverse complement DNA strand as a separate sequence
483	.TP\fB\-dna\fR translate DNA sequences to protein; motifs must be protein;
484	sequences must be DNA
485	.TP\fB\-comp\fR adjust p\-values and E\-values for sequence composition
486	Which Results To Print
487	.TP\fB\-ev <ev>\fR print results for sequences with E\-value < <ev> (default:
488	10)
489	Appearance Of Block Diagrams
490	.TP\fB\-mt <mt>\fR show motif matches with p\-value < <mt> (default: 0.0001)
491	.TP\fB\-w\fR show weak matches (<mt> < p\-value < <mt>*10) in angle brackets in
492	the hit list or when the xml is converted to text
493	.TP\fB\-best\fR include only the best motif hits in \fB\-hit_list\fR diagrams
494	.TP\fB\-seqp\fR use SEQUENCE p\-values for motif thresholds (default: use
495	POSITION p\-values)
496	Miscellaneous
497	.TP\fB\-mf <mf>\fR in results use <mf> as motif file name
498	.TP\fB\-df <df>\fR in results use <df> as database name (ignored when \fB\-dblist\fR)
499	.TP\fB\-dl <dl>\fR in results use <dl> as link to search sequence names; token
500	SEQUENCEID is replaced with the FASTA sequence ID; ignored when
501	\fB\-dblist\fR;
502	.TP\fB\-minseqs <ms>\fR lower bound on number of sequences in db
503	.TP\fB\-nostatus\fR do not print progress report
504	.TP\fB\-notext\fR do not create text output
505	.TP\fB\-nohtml\fR do not create html output
506	.IP
507	MAST is a tool for searching biological sequence databases for
508	sequences that contain one or more of a group of known motifs.
509	.IP
510	A motif is a sequence pattern that occurs repeatedly in a group of
511	related protein or DNA sequences. Motifs are represented as
512	position\-dependent scoring matrices that describe the score of each
513	possible letter at each position in the pattern. Individual motifs may
514	not contain gaps. Patterns with variable\-length gaps must be split into
515	two or more separate motifs before being submitted as input to MAST.
516	.IP
517	MAST takes as input a file containing the descriptions of one or more
518	motifs and searches a sequence database that you select for sequences
519	that match the motifs. The motif file can be the output of the MEME
520	motif discovery tool or any file in the appropriate format.
521	.IP
522	MAST outputs an xml file which can then be converted into html or text
523	format. The xml file is designed for machine processing and the html
524	file is designed for human viewing. The text format is available for
525	backwards compatibility though due to design decisions made to optimise
526	the xml for html generation the output for separate scoring mode is not
527	identical and some options were removed. The text format will be
528	unsupported in future releases and so we recommend you migrate any
529	programs reading mast output to the xml format.
530	.IP
531	MAST outputs three things:
532	.IP
533	1. The names of the high\-scoring sequences sorted by the strength of
534	.IP
535	the combined match of the sequence to all of the motifs in the
536	group.
537	.IP
538	2. Motif diagrams showing the order and spacing of the motifs within
539	.IP
540	each matching sequence.
541	.IP
542	3. Detailed annotation of each matching sequence showing the sequence
543	.IP
544	and the locations and strengths of matches to the motifs.
545	.IP
546	MAST works by calculating match scores for each sequence in the
547	database compared with each of the motifs in the group of motifs you
548	provide. For each sequence, the match scores are converted into various
549	types of p\-values and these are used to determine the overall match of
550	the sequence to the group of motifs and the probable order and spacing
551	of occurrences of the motifs in the sequence.
552	.IP
553	MAST generates a human readable file from the xml output containing:
554	.IP
555	* the version of MAST and the date it was built,
556	* the reference to cite if you use MAST in your research,
557	* a description of the databases and motifs used in the search,
558	* an explanation of the result,
559	* the sequences identifier and score sorted by score matching the
560	.IP
561	group of motifs above a stated level of statistical significance,
562	.IP
563	* motif diagrams showing the order and spacing of occurrences of the
564	.IP
565	motifs in the significant sequences and,
566	.IP
567	* annotated sequences showing the positions and p\-values of all motif
568	.IP
569	occurrences in each of the high\-scoring sequences.
570	.IP
571	The html version is the recommended version for human reading and has
572	all sections documented however the text version has no documentation
573	for the first section. That section lists each motif along with the
574	sequence that would achieve the best possible match score. In order to
575	avoid biased scores when multiple motif scores are combined, MAST also
576	computes the pairwise correlations between each pair of motifs. The
577	correlation between two motifs is the maximum sum of Pearson's
578	correlation coefficients for aligned columns divided by the width of
579	the shorter motif. The maximum is found by trying all alignments of the
580	two motifs. Motifs with correlations below 0.60 have little effect on
581	the accuracy of the combined scores. Pairs of motifs with higher
582	correlations should be removed from the query.
583	.PP
584	Match Scores
585	.IP
586	The match score of a motif to a position in a sequence is the sum of
587	the score from each column of the position\-dependent scoring matrix
588	corresponding to the letter at that position in the sequence. For
589	example, if the sequence is
590	.IP
591	TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
592	.IP
593	========
594	.IP
595	and the motif is represented by the position\-dependent scoring matrix
596	(where each row of the matrix corresponds to a position in the motif)
597	.TP
598	Position
599	A C G T
600	.TP
601	1
602	1.447 0.188 \fB\-4\fR.025 \fB\-4\fR.095
603	.TP
604	2
605	0.739 1.339 \fB\-3\fR.945 \fB\-2\fR.325
606	.TP
607	3
608	1.764 \fB\-3\fR.562 \fB\-4\fR.197 \fB\-3\fR.895
609	.TP
610	4
611	1.574 \fB\-3\fR.784 \fB\-1\fR.594 \fB\-1\fR.994
612	.TP
613	5
614	1.602 \fB\-3\fR.935 \fB\-4\fR.054 \fB\-1\fR.370
615	.TP
616	6
617	0.797 \fB\-3\fR.647 \fB\-0\fR.814 0.215
618	.TP
619	7
620	\fB\-1\fR.280 1.873 \fB\-0\fR.607 \fB\-1\fR.993
621	.TP
622	8
623	\fB\-3\fR.076 1.035 1.414 \fB\-3\fR.913
624	.IP
625	then the match score of the fourth position in the sequence
626	(underlined) would be found by summing the score for T in position 1, G
627	in position 2 and so on until G in position 8. So the match score would
628	be
629	.IP
630	score = \fB\-4\fR.095 + \fB\-3\fR.945 + \fB\-3\fR.895 + \fB\-1\fR.994
631	.IP
632	+ \fB\-4\fR.054 + \fB\-0\fR.814 + \fB\-1\fR.933 + 1.414
633	.IP
634	= \fB\-19\fR.316
635	.IP
636	The match scores for other positions in the sequence are calculated in
637	the same way. Match scores are only calculated if the match completely
638	fits within the sequence. Match scores are not calculated if the motif
639	would overhang either end of the sequence.
640	.PP
641	P\-values
642	.IP
643	MAST reports all matches of a sequence to a motif or group of motifs in
644	terms of the p\-value of the match. MAST considers the p\-values of four
645	types of events:
646	.IP
647	* position p\-value: the match of a single position within a sequence
648	.IP
649	to a given motif,
650	.IP
651	* sequence p\-value: the best match of any position within a sequence
652	.IP
653	to a given motif,
654	.IP
655	* combined p\-value: the combined best matches of a sequence to a
656	.IP
657	group of motifs, and
658	.IP
659	* E\-value: observing a combined p\-value at least as small in a random
660	.IP
661	database of the same size.
662	.IP
663	All p\-values are based on a random sequence model that assumes each
664	position in a random sequence is generated according to the average
665	letter frequencies of all sequences in the appropriate (peptide or
666	nucleotide) non\-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/)
667	on September 22, 1996. This can be overridden by specifying the \fB\-bfile\fR
668	or \fB\-comp\fR options (see below). For DNA sequences, unless \fB\-norc\fR is given,
669	the positive and reverse complement strand frequencies are averaged
670	together.
671	.IP
672	1. \fB\-bfile\fR <bfile> The random model uses the letter frequencies given
673	.IP
674	in <bfile> instead of the non\-redundant database frequencies. The
675	format of <bfile> is the same as that for the MEME \fB\-bfile\fR option;
676	see the MEME documentation for details. You can create files in the
677	appropriate format based on the base/residue composition of your
678	own FASTA sequence files using the command "fasta\-get\-markov"
679	included in the MEME distribution. Type fasta\-get\-markov on the
680	command line for documentation. (Sample files are also given in
681	directory tests: tests/nt.freq and tests/na.freq.)
682	.IP
683	2. \fB\-comp\fR The random model uses the letter frequencies in the current
684	.IP
685	target sequence instead of the non\-redundant database frequencies.
686	This causes p\-values and E\-values to be compensated individually
687	for the actual composition of each sequence in the database. This
688	option can increase search time substantially due to the need to
689	compute a different score distribution for each high\-scoring
690	sequence. With this option and DNA sequences, the positive and
691	reverse complement strand frequencies are not averaged together.
692	.IP
693	Position p\-value
694	.IP
695	The p\-value of a match of a given position within a sequence to a motif
696	is defined as the probability of a randomly selected position in a
697	randomly generated sequence having a match score at least as large as
698	that of the given position. Note:If MAST is combining reverse
699	complement DNA strands, the position p\-value is not corrected for
700	multiple tests.
701	.IP
702	Sequence p\-value
703	.IP
704	The p\-value of a match of a sequence to a motif is defined as the
705	probability of a randomly generated sequence of the same length having
706	a match score at least as large as the largest match score of any
707	position in the sequence.
708	.IP
709	Combined p\-value
710	.IP
711	The p\-value of a match of a sequence to a group of motifs is defined as
712	the probability of a randomly generated sequence of the same length
713	having sequence p\-values whose product is at least as small as the
714	product of the sequence p\-values of the matches of the motifs to the
715	given sequence.
716	.IP
717	E\-value
718	.IP
719	The E\-value of the match of a sequence in a database to a a group of
720	motifs is defined as the expected number of sequences in a random
721	database of the same size that would match the motifs as well as the
722	sequence does and is equal to the combined p\-value of the sequence
723	times the number of sequences in the database.
724	.PP
725	High\-scoring Sequences
726	.IP
727	MAST lists the names and part of the descriptive text of all sequences
728	whose E\-value is less than E. Sequences shorter than one or more of the
729	motifs are skipped. The sequences are sorted by increasing E\-value. The
730	value of E is set to 10 for the WEB server but is user\-selectable in
731	the down\-loadable version of MAST.
732	.PP
733	Motif Diagrams
734	.IP
735	Motif diagrams show the order and spacing of non\-overlapping matches to
736	the motifs in each high\-scoring sequence. Motif occurrences are
737	determined based on the position p\-value of matches to the motif.
738	Strong matches (p\-value < M) are shown in square brackets (`[ ]'), weak
739	matches (M < p\-value < M x 10) are shown in angle brackets (`< >') and
740	the length of non\-motif sequence ("spacer") is shown between
741	underscores (`_'). For example,
742	.IP
743	27_[3]_44_<4>_99_[1]_7
744	.IP
745	shows an initial spacer of length 27, followed by a strong match to
746	motif 3, a spacer of length 44, a weak match to motif 4, a spacer of
747	length 99, a strong match to motif 1 and a final non\-motif sequence of
748	length 7. The value of M is 0.0001 for the WEB server but is
749	user\-selectable in the downloadable version of MAST.
750	.PP
751	Annotated Sequences
752	.IP
753	MAST annotates each high\-scoring sequence by printing the sequence
754	along with the position and strength of all the non\-overlapping motif
755	occurrences. The four lines above each motif occurrence contain,
756	respectively,
757	.IP
758	* the motif number of the occurrence,
759	* the position p\-value of the occurrence,
760	* the best possible match to the motif, and
761	* a plus sign (`+') above each letter in the
762	occurrence that has a positive match score
763	to the motif.
764	.IP
765	The best possible match to a motif is the sequence of letters which
766	would achieve the highest match score.
767	.PP
768	Hit List
769	.IP
770	If you specify the \fB\-hit_list\fR switch to MAST, MAST outputs ONLY a list
771	of "hits" in easily machine\-readable format. Each line corresponds to
772	one motif occurrence in one sequence. The format of the hit lines is
773	.IP
774	[<sequence_name> <strand><motif> <start> <end> <score> <p\-value>]+
775	.IP
776	where
777	.IP
778	<sequence_name> is the name of the sequence containing the hit,
779	<strand> is the strand (+ or \- for DNA, blank for protein),
780	<motif> is the motif number,
781	<start> is the starting position of the hit,
782	<end> is the ending position of the hit,
783	<score> is the score of the hit, and
784	<p\-value> is the position p\-value of the hit.
785	.IP
786	Two comment lines (starting with "#") are written above the list of
787	hits, and the MAST command line is printed as a comment line after the
788	list. An example of the output using the \fB\-hit_list\fR switch to MAST is:
789	.IP
790	# All non\-overlapping hits in all sequences.
791	# sequence_name motif hit_start hit_end score hit_p\-value
792	ce1cg \-2 8 22 1459.90 1.67e\-06
793	ara +2 2 16 1661.18 5.04e\-08
794	bglr1 +2 1 15 1274.97 1.42e\-05
795	cya \-2 19 33 1101.37 6.64e\-05
796	gale +2 5 19 1076.21 8.11e\-05
797	ilv \-2 6 20 1098.85 6.78e\-05
798	malk +2 37 51 1085.02 7.56e\-05
798	malk +2 37 51 1085.02 7.56e\-05
799	ompa +2 5 19 1583.18 2.43e\-07
800	# mast tests/meme/meme.crp0.oops tests/common/crp0.s \fB\-hit_list\fR \fB\-m\fR 2
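.IP
Because each hit occupies one whitespace\-separated line, the format can be read with a few lines of code. The Python sketch below is an illustration, not part of MAST: the names Hit and parse_hit_list are hypothetical, and it assumes sequence names contain no whitespace and that comment lines start with "#".

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence_name: str
    strand: str      # "+" or "-" for DNA, "" for protein
    motif: int
    start: int
    end: int
    score: float
    p_value: float


def parse_hit_list(lines):
    """Parse hit lines, skipping blank lines and "#" comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, strand_motif, start, end, score, p_value = line.split()
        # The strand sign, when present, is fused to the motif number.
        if strand_motif[0] in "+-":
            strand, motif = strand_motif[0], strand_motif[1:]
        else:
            strand, motif = "", strand_motif
        hits.append(Hit(name, strand, int(motif), int(start), int(end),
                        float(score), float(p_value)))
    return hits
```

Reading the first hit of the example output this way gives sequence ce1cg, strand "\-", motif 2, positions 8 to 22, score 1459.90 and p\-value 1.67e\-06.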
801	.PP
802	Loading Multiple Sequence Databases
803	.IP
804	Multiple sequence databases can be loaded by MAST by putting the file
805	names into a file and specifying that file instead of the sequence
806	database with the option \fB\-dblist\fR.
807	.IP
808	The file list has one file name on each line with the optional name and
809	link as follows:
810	.IP
811	<file> [<name> <link>]
812	\&...
813	\&...
814	.IP
815	If the name is specified then it will be used instead of the file name
816	in the output. If the link is specified then all sequences for that
817	database in the html output will have a hyperlink to the URL specified
818	with the text SEQUENCEID replaced with the FASTA sequence id.
819	.PP
820	EXAMPLES:
821	.IP
822	The following examples assume that file "meme.results" is the output of
823	a MEME run containing at least 3 motifs which was created on the
824	training set "training.fasta" and file SwissProt is a copy of the
825	Swiss\-Prot database on your local disk. DNA_DB is a copy of a DNA
826	database on your local disk.
827	.IP
828	1. Annotate the training set:
829	.IP
830	mast meme.results training.fasta
831	.IP
832	2. Find sequences matching the motif and
833	annotate them in the
834	SwissProt database:
835	.IP
836	mast meme.results SwissProt
837	.IP
838	3. Show sequences with weaker combined matches to motifs:
839	.IP
840	mast meme.results SwissProt \fB\-ev\fR 200
841	.IP
842	4. Include a nominal order and spacing of the
843	first three motifs in
844	the calculation of the sequence p\-values to increase the
845	sensitivity of the search for matching sequences:
846	.IP
847	mast meme.results SwissProt \fB\-diag\fR "9\-[2]\-61\-[1]\-62\-[3]\-91"
848	.IP
849	5. Use only the first and third motifs in the search:
850	.IP
851	mast meme.results SwissProt \fB\-m\fR 1 \fB\-m\fR 3
852	.IP
853	6. Use only the first two motifs in the search:
854	.IP
855	mast meme.results SwissProt \fB\-c\fR 2
856	.IP
857	7. Search DNA sequences using protein motifs,
858	adjusting p\-values and
859	E\-values for each sequence by that sequence's composition:
860	.IP
861	mast meme.results DNA_DB \fB\-dna\fR \fB\-comp\fR
862	.PP
