• Discriminative Motif Discovery
    An approach to motif discovery in which you try to find a motif that is specific to the sequences of interest and is not also characteristic of other, similar sequences.

    For example, when searching for a DNA transcription factor binding site, you may have some sequences that are likely to contain binding sites and others that are unlikely to contain them (perhaps all from one experiment). In this case, you want to use the likely (‘positive’) sequences to search for the motif, while using the unlikely (‘negative’) sequences to steer the search away from motifs that are common to both the positive and the negative sequences. The approach used here is to calculate a position-specific prior and supply it as an additional input to MEME, informing the search of the probability that a motif starts at each position in the positive sequences. It is based on the discriminative prior devised by Narlikar et al., which we modify to search for the "best" initial motif width and to handle protein sequences using spaced triples.

    The principle behind a discriminative prior arises from the question:
    What fraction of all occurrences of the w-mer (subsequence w long) at that site are found in the positive set?
    We score each position by counting how often the w-mer starting there occurs in the positive set (call this count #X) and how often it occurs in the negative set (this count is #Y). The score for the location is then:

    #X / (#X + #Y)

    If we find any ambiguous letters in a w-mer (e.g. N, which stands for any base in DNA), we count any such w-mer as if it occurs once.
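    The raw ratio described above, the fraction of a w-mer's occurrences that fall in the positive set, can be sketched in Python as follows. This is a minimal illustration, not the code MEME uses: the function names are hypothetical, matching is exact, and pseudocounts and ambiguous letters are deliberately left out.

```python
from collections import Counter


def count_wmers(sequences, w):
    """Count every overlapping w-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1
    return counts


def raw_scores(positive, negative, w):
    """For each positive sequence, score every start site as #X / (#X + #Y)."""
    pos_counts = count_wmers(positive, w)
    neg_counts = count_wmers(negative, w)
    scores = []
    for seq in positive:
        row = []
        for i in range(len(seq) - w + 1):
            wmer = seq[i:i + w]
            x = pos_counts[wmer]  # #X: at least 1, since the w-mer comes from the positive set
            y = neg_counts[wmer]  # #Y: 0 if the w-mer never occurs in the negative set
            row.append(x / (x + y))
        scores.append(row)
    return scores


# Toy data, invented for illustration only.
positive = ["ACGTACGT", "TTACGTAA"]
negative = ["ACGTTTTT", "GGGGCCCC"]
scores = raw_scores(positive, negative, 4)
```

    In this toy data ACGT occurs three times in the positive set and once in the negative set, so the first site of the first sequence scores 3 / (3 + 1) = 0.75, while CGTA never occurs in the negative set and scores 1.0.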

    For protein sequences, instead of requiring an exact match of a w-wide subsequence, we break the subsequence down into spaced triples, in which we only care about the first and last letter, and one of the interior letters. See below for a more detailed description of spaced triples.

    The actual formula we calculate includes pseudocounts so that small numbers of occurrences in the positive set without any occurrences in the negative set do not cause false peaks in the PSP. Once we have calculated scores, we scale them to lie between 0.1 and 0.9 inclusive. We convert the scaled scores to probabilities to complete the calculation of the PSP. As with DNA sequences, we treat any triple containing an ambiguous symbol as if it only occurs once.
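    The steps in the preceding paragraph (pseudocounts, scaling, conversion to probabilities) can be sketched as below. The exact formula is not given here, so everything in this sketch is an assumption made for illustration: the symmetric pseudocount form, the min-max scaling over a single list of scores, and the normalisation that makes the values sum to 1. The function names are hypothetical.

```python
def pseudocount_score(x, y, pseudo=1.0):
    """Smoothed version of #X / (#X + #Y); `pseudo` is an assumed pseudocount."""
    return (x + pseudo) / (x + y + 2 * pseudo)


def scale_scores(scores, lo=0.1, hi=0.9):
    """Linearly rescale scores so the smallest maps to `lo` and the largest to `hi`."""
    smin, smax = min(scores), max(scores)
    if smax == smin:
        # All scores are equal: place them at the midpoint of the range.
        return [(lo + hi) / 2] * len(scores)
    return [lo + (s - smin) * (hi - lo) / (smax - smin) for s in scores]


def to_probabilities(scaled):
    """Assumed normalisation: divide by the total so the values sum to 1."""
    total = sum(scaled)
    return [s / total for s in scaled]


# A w-mer seen once in the positive set and never in the negative set
# no longer receives the maximum possible score of 1.0.
damped = pseudocount_score(1, 0)
scaled = scale_scores([0.5, 0.75, 1.0])
probs = to_probabilities(scaled)
```

    With the assumed pseudocount of 1, a single positive occurrence scores 2/3 rather than 1, while ten positive occurrences with none in the negative set score 11/12, which is the damping effect the pseudocounts are meant to provide.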

    More detail follows.

    • Position-Specific Priors
      Position-specific priors (PSPs) assign a probability that a motif starts at each possible location in your sequence data. MEME uses PSPs to guide its search, biasing the search towards sites that have higher values in the PSP file. PSPs in general assist MEME by increasing the probability of start sites containing a subsequence that is commonly found in sequences of interest, as well as lowering the probability of start sites that are characteristic of sequences that are unlikely to contain the feature of interest. PSPs are calculated here using a combination of the sequences you supply to MEME (the positive set) and an additional negative set of sequences.
    • Negative Set
      The negative set is a sequence file (in any of the formats allowed for MEME input) containing sequences that are used to help MEME discriminate a motif in the positive set. The negative set should contain sequences that are in some sense a contrast to likely sites for motifs (e.g. sequences rejected as unlikely to contain a transcription factor binding site), but otherwise similar to the positive set.
    • Positive Set
      The positive set consists of the same sequences that you provide to MEME to search for motifs. We call these sequences the positive set when doing discriminative motif finding, to contrast them with the negative set.
    • Spaced Triples
      Spaced triples are sub-sequences in which only the first and last letter (residue or amino acid for protein) and one interior letter are used in matches. For example, the subsequence MTFEKI contains the following triples:
        MT...I
        M.F..I
        M..E.I
        M...KI
      where "." matches anything. We use spaced triples for protein because the probability of exact matches is much lower than for DNA, with the much larger amino acid alphabet we use to represent protein sequences. Any triple containing an ambiguous letter is scored as zero.

      To score a subsequence w wide using triples, we count how often each triple contained in that subsequence occurs in the positive and negative sets. We then use the same formula as for scoring a w-mer, except that #X and #Y now refer to the highest triple count at the current location in the positive and negative sets respectively, rather than to how many times the whole subsequence occurs:

      #X / (#X + #Y)
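      The triple-based scoring described above can be sketched in Python as follows. This is an illustrative reading of the description, not MEME's implementation: the function names are hypothetical, every triple is assumed to span the full width w, the maxima over the positive and negative counts are taken independently, and pseudocounts and ambiguous letters are ignored.

```python
from collections import Counter


def triples_at(seq, start, w):
    """Spaced triples of the w-wide window at `start` (requires w >= 3).

    Each triple is (offset of interior letter, first, interior, last).
    """
    window = seq[start:start + w]
    return [(j, window[0], window[j], window[-1]) for j in range(1, w - 1)]


def render(triple, w):
    """Display a triple with '.' at the positions that match anything."""
    j, first, mid, last = triple
    chars = ["."] * w
    chars[0], chars[j], chars[-1] = first, mid, last
    return "".join(chars)


def count_triples(sequences, w):
    """Count every spaced triple over all w-wide windows in the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - w + 1):
            counts.update(triples_at(seq, i, w))
    return counts


def triple_score(seq, start, w, pos_counts, neg_counts):
    """Score a site as #X / (#X + #Y) using the highest triple counts."""
    triples = triples_at(seq, start, w)
    x = max(pos_counts[t] for t in triples)  # highest count in the positive set
    y = max(neg_counts[t] for t in triples)  # highest count in the negative set
    return x / (x + y)


# Toy data, invented for illustration only.
positive = ["MTFEKI", "MAFEKI"]
negative = ["MTGGGI"]
w = 6
rendered = [render(t, w) for t in triples_at("MTFEKI", 0, w)]
score = triple_score("MTFEKI", 0, w,
                     count_triples(positive, w),
                     count_triples(negative, w))
```

      Here the triple M.F..I occurs twice in the positive set, and the only triple of MTFEKI found in the negative set is MT...I (once), so the site scores 2 / (2 + 1) even though the full 6-mer occurs only once.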

    This paper contains results illustrating the benefits of PSPs:
    Timothy L. Bailey, Mikael Bodén, Tom Whitington, and Philip Machanick, "The value of position-specific priors in motif discovery using MEME", BMC Bioinformatics, 11(1):179, 2010.