Discriminative Motif Discovery
An approach to motif discovery in which you attempt
to find a motif that is specific to the sequences
of interest and not to other, similar
sequences.
An example is finding a DNA
transcription factor binding site, where you may have
some sequences that are likely to contain binding sites,
and other sequences that are unlikely to contain binding
sites (perhaps all found in one experiment). In this
case, you want to use the likely
(‘positive’) sequences to search for the
motif, while using the unlikely (‘negative’)
sequences to guide the search away from motifs that
occur in both the positive and the negative
sequences. We do this by calculating a
position-specific prior (PSP), which we pass
to MEME as an additional input to tell the search how
probable it is that a motif starts at any given
position in the positive sequences. Our approach is
based on the discriminative prior devised by
Narlikar et al. We modify their approach to
search for the "best" initial motif width, and to handle
protein sequences using spaced
triples.
The principle behind a discriminative
prior arises from the question:
What fraction of all occurrences of the w-mer
(a subsequence of length w) at that site are found in the
positive set?
We score each position by counting
how often the w-mer starting there occurs in the
positive set (call this count #X) and how often
it occurs in the negative set (this count is
#Y). The score for the location is then:
$$\frac{\#X}{\#X + \#Y}$$
If a w-mer contains an ambiguous letter
(e.g. N, which stands for any base in
DNA), we count that w-mer as if it occurs once.
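The w-mer scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by the tool: the function names are hypothetical, every occurrence at every position is counted, pseudocounts are left out, and the special handling of ambiguous letters is omitted (the example sequences contain none).

```python
from collections import Counter


def count_wmers(seqs, w):
    """Count every occurrence of every w-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    return counts


def score_positions(positives, negatives, w):
    """Return, for each positive sequence, the list of #X / (#X + #Y) scores."""
    pos_counts = count_wmers(positives, w)
    neg_counts = count_wmers(negatives, w)
    scores = []
    for seq in positives:
        row = []
        for i in range(len(seq) - w + 1):
            wmer = seq[i:i + w]
            x = pos_counts[wmer]  # #X: at least 1, since the w-mer comes from the positive set
            y = neg_counts[wmer]  # #Y: 0 if the w-mer never occurs in the negative set
            row.append(x / (x + y))
        scores.append(row)
    return scores


# Hypothetical toy data: ACGT occurs 3 times in the positives and once in the negatives.
positives = ["ACGTACGT", "TTACGTAA"]
negatives = ["ACGTTTTT", "GGGGCCCC"]
scores = score_positions(positives, negatives, w=4)
print(scores[0])  # [0.75, 1.0, 1.0, 1.0, 0.75]
```

In this toy data, positions that start an ACGT score 3 / (3 + 1) = 0.75, while w-mers found only in the positive set score 1.0.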
For protein sequences, instead of requiring an exact
match of a w-wide subsequence, we break the
subsequence down into spaced triples, in which we
only care about the first and last letter, and one of
the interior letters. See below for a more detailed
description of spaced triples.
The actual formula we calculate includes pseudocounts so
that small numbers of occurrences in the positive set without
any occurrences in the negative set do not cause false
peaks in the PSP. Once we have calculated scores, we
scale them to lie between 0.1 and 0.9 inclusive. We
convert the scaled scores to probabilities to complete
the calculation of the PSP. As with DNA sequences, we
treat any triple containing an ambiguous symbol as if it
only occurs once.
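The exact pseudocount formula and the conversion to probabilities are not given here, so the following Python sketch only illustrates the three steps under loudly stated assumptions: a symmetric pseudocount added to both counts, a linear rescaling to the range 0.1 to 0.9, and a simple normalisation so that the values for one sequence sum to 1. All function names and the example counts are hypothetical, and the real tool may use different formulas.

```python
def smoothed_score(x, y, pseudocount=1.0):
    """Assumed pseudocount form: (#X + p) / (#X + #Y + 2p)."""
    return (x + pseudocount) / (x + y + 2 * pseudocount)


def scale_scores(scores, lo=0.1, hi=0.9):
    """Linearly rescale scores so the smallest maps to lo and the largest to hi."""
    s_min, s_max = min(scores), max(scores)
    if s_max == s_min:
        # All scores equal: place them in the middle of the target range.
        return [(lo + hi) / 2 for _ in scores]
    return [lo + (hi - lo) * (s - s_min) / (s_max - s_min) for s in scores]


def to_probabilities(scaled):
    """Assumed conversion: normalise so the values for one sequence sum to 1."""
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical (#X, #Y) counts for four start positions in one sequence.
counts = [(1, 0), (3, 1), (10, 0), (2, 6)]
raw = [smoothed_score(x, y) for x, y in counts]
scaled = scale_scores(raw)
psp = to_probabilities(scaled)
```

With the assumed pseudocount, a w-mer seen once in the positive set and never in the negative set scores 2/3 rather than 1, so it no longer ties with a w-mer seen ten times in the positive set (11/12). This is the kind of false peak the pseudocounts are meant to suppress.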
More detail follows.
Position-Specific Priors
Position-specific priors (PSPs) assign a probability
that a motif starts at each possible location in
your sequence data. MEME uses PSPs to guide its
search, biasing the search towards sites that have
higher values in the PSP file. PSPs in general
assist MEME by increasing the probability of start
sites containing a subsequence that is commonly
found in sequences of interest, as well as
lowering the probability of start sites that are
characteristic of sequences that are unlikely to
contain the feature of interest. PSPs are calculated
here using a combination of the sequences you supply
to MEME (the positive set) and an additional
negative set of sequences.
Negative Set
The negative set is a sequence file (in any of the
formats allowed for MEME input) containing sequences that are used to
help MEME discriminate a motif in the positive
set. The negative set should contain sequences that
are in some sense a contrast to likely sites for
motifs (e.g. sequences rejected as unlikely to
contain a transcription factor binding site), but
otherwise similar to the positive set.
Positive Set
The positive set is the set of sequences you provide
to MEME to search for motifs. We call these sequences
the positive set when doing discriminative motif
finding, to contrast them with the negative set.
Spaced Triples
Spaced triples are sub-sequences in which only the
first and last letter (residue or amino acid for
protein) and one interior letter are used in matches.
For example, the subsequence MTFEKI contains the
following triples:
MT...I M.F..I M..E.I M...KI
where "." matches anything. We use spaced triples
for protein because the probability of exact matches
is much lower than for DNA, given the much larger
amino acid alphabet used to represent protein
sequences. Any triple containing an ambiguous
letter is scored as zero.
To score a subsequence of width w
using triples, we count how often each triple
contained in that subsequence occurs in the positive
and negative sets. We then apply the same formula as
for scoring a w-mer, except that #X and #Y are now the
highest triple counts at the current location in the
positive and negative sets respectively, rather than
the number of times the whole subsequence occurs:
$$\frac{\#X}{\#X + \#Y}$$
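As a rough illustration of triple-based scoring, the Python sketch below enumerates the spaced triples of a window and plugs the maximum counts into the formula. It is a sketch under assumptions rather than the tool's implementation: the function names are hypothetical, the maxima for the positive and negative sets are taken independently, windows are assumed to be at least three letters wide, and pseudocounts and ambiguous letters are ignored.

```python
from collections import Counter


def spaced_triples(subseq):
    """Return the spaced triples of subseq: first letter, one interior letter, last letter."""
    w = len(subseq)
    triples = []
    for j in range(1, w - 1):
        pattern = ["."] * w
        pattern[0], pattern[j], pattern[-1] = subseq[0], subseq[j], subseq[-1]
        triples.append("".join(pattern))
    return triples


def count_triples(seqs, w):
    """Count each spaced triple over every w-wide window of every sequence."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - w + 1):
            counts.update(spaced_triples(seq[i:i + w]))
    return counts


def triple_score(subseq, pos_counts, neg_counts):
    """Score a window taken from the positive set, using the highest triple counts."""
    triples = spaced_triples(subseq)
    x = max(pos_counts[t] for t in triples)  # highest count in the positive set
    y = max(neg_counts[t] for t in triples)  # highest count in the negative set
    return x / (x + y)


print(spaced_triples("MTFEKI"))  # ['MT...I', 'M.F..I', 'M..E.I', 'M...KI']

# Hypothetical toy protein data.
pos_counts = count_triples(["MTFEKI", "MAFEQI"], w=6)
neg_counts = count_triples(["MTAAAI"], w=6)
score = triple_score("MTFEKI", pos_counts, neg_counts)  # 2 / (2 + 1)
```

Here M.F..I and M..E.I each occur twice in the positive set, and MT...I occurs once in the negative set, so the window MTFEKI scores 2 / (2 + 1) even though the exact 6-mer appears only once.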
This paper contains results illustrating the benefits of PSPs:
Timothy L. Bailey, Mikael Bodén, Tom Whitington, and Philip Machanick,
"The value of position-specific priors in motif discovery using MEME",
BMC Bioinformatics, 11(1):179, 2010.