FITTING A MIXTURE MODEL BY
EXPECTATION MAXIMIZATION
TO DISCOVER MOTIFS IN BIOPOLYMERS
UCSD Technical Report CS94-351
TIMOTHY L. BAILEY
TBAILEY@CS.UCSD.EDU
Department of Computer Science and Engineering,
University of California at San Diego, La Jolla, California 92093-0114
phone: (619) 534-8187
fax: (619) 534-7029
CHARLES ELKAN
ELKAN@CS.UCSD.EDU
Department of Computer Science and Engineering,
University of California at San Diego, La Jolla, California 92093-0114
phone: (619) 534-8897
fax: (619) 534-7029
ABSTRACT:
The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture model to the set of sequences. Multiple motifs are found by fitting a two-component finite mixture model to the data, probabilistically erasing the occurrences of the motif thus found, and repeating the process to find successive motifs. The algorithm requires only a set of sequences and a number specifying the width of the motifs as input. It returns a model of each motif and a threshold which together can be used as a Bayes-optimal classifier for searching for occurrences of the motif in other databases. The algorithm estimates how many times each motif occurs in the input dataset and outputs an alignment of the occurrences of the motif. The algorithm is capable of discovering several different motifs with differing numbers of occurrences in a single dataset. Motifs are discovered by treating the set of sequences as though they were created by a stochastic process which can be modelled as a mixture of two densities, one of which generated the occurrences of the motif, and the other the rest of the positions in the sequences. Expectation maximization is used to estimate the parameters of the two densities and the mixing parameter.
KEYWORDS:
Unsupervised learning, expectation maximization, mixture model,
consensus pattern, motif, biopolymer, binding site.
Contents
1 Introduction
2 The finite mixture model
3 Expectation maximization in finite mixture models
3.1 The MM algorithm
3.2 Computing the expected likelihood of a model
4 Implementation of MM
4.1 The MEME+ algorithm
4.2 Output of MEME+
4.3 Finding good starting points for MM
5 Experimental Results
5.1 MEME+ discovers multiple motifs
5.2 Sensitivity to prior
5.3 Sensitivity to noise
5.4 Sensitivity to subsampling
6 Discussion
Acknowledgements:
Timothy Bailey is supported by NIH Genome Analysis Pre-
Doctoral Training Grant No. HG00005. Charles Elkan is supported in part by the
National Science Foundation under Award No. IRI-9110813.
1 Introduction
Finding a cluster of numerous similar subsequences in a set of biopolymer sequences is evidence that the subsequences occur not by chance but because they share some biological function. For example, the shared biological function which accounts for the similarity of a subset of subsequences might be a common protein binding site or splice junction in the case of DNA sequences, or the active sites of related enzymes in the case of protein sequences. This paper describes an algorithm called MM which, given a dataset of possibly related biopolymer sequences, estimates the parameters of a probabilistic model which could have generated the dataset. The probabilistic model is a two-component finite mixture model. One component describes a set of similar subsequences of fixed width (the "motif"), while the other component describes all other positions in the sequences (the "background"). Fitting the model includes estimating the relative frequency of motif occurrences. This estimate can be used to select the threshold for a Bayes-optimal classifier for finding occurrences of the motif in other databases.
The MM algorithm is an extension of the expectation maximization technique for fitting finite mixture models developed by Aitkin and Rubin [1985]. It is related to the algorithm based on expectation maximization described by Lawrence and Reilly [1990], but it relaxes the assumption that each sequence in the dataset contains one occurrence of the motif. In other words, MM solves an unsupervised learning problem: it is intended to be useful for discovering new motifs in datasets that may or may not contain motifs, treating each subsequence of the given width in the dataset as an unlabeled sample. On the other hand, the algorithm of Lawrence and Reilly [1990] treats each sequence as a labeled sample, and solves a supervised learning problem.
MM is also related to the algorithm described in [Bailey and Elkan, 1993]. Unlike that algorithm, MM estimates cluster size (the number of occurrences of the motif) at the same time it is learning models of the motif and the background. This removes the need for the user of the algorithm to know in advance the number of times the motif occurs in the dataset, and makes it possible to search for motifs in datasets about which very little is known.
The MM algorithm has been implemented as an option to the MEME software for discovering multiple motifs in biopolymer sequences [Bailey and Elkan, 1993]. MM can therefore be used to discover multiple motifs in a dataset. Briefly, this is done by repeatedly applying MM to the dataset and then probabilistically erasing all occurrences of the discovered motif. Because MM estimates the number of occurrences of each motif, MEME using MM is able to find motifs with different numbers of occurrences in a single dataset. This increases the usefulness of MEME as a tool for exploring datasets that contain more than one motif.
The rest of this paper is organized as follows. Section 2 describes the finite mixture model used by MM, and Section 3 gives the analysis needed to apply the expectation maximization idea to this type of model. Section 4 describes the implementation of MM in the context of MEME. Section 5 presents experimental results of using the MM algorithm to discover motifs in several DNA and protein datasets. Finally, Section 6 concludes the paper by discussing the strengths and limitations of the MM algorithm.
2 The finite mixture model
The MM algorithm searches for maximum likelihood estimates of the parameters of a finite mixture model which could have generated a given dataset of biopolymer sequences. We will refer to the dataset as $Y = Y_1, Y_2, \ldots, Y_N$, where $N$ is the number of sequences in the dataset. The sequences $Y_i$ are assumed to be over some fixed alphabet, say $\mathcal{A} = \{a_1, a_2, \ldots, a_L\}$, which is given as input to the algorithm. The mixture model used by MM does not actually model the dataset directly. Instead, the dataset is broken up conceptually into all (overlapping) subsequences of length $W$ which it contains. This new dataset will be referred to as $X = X_1, X_2, \ldots, X_n$. MM learns a finite mixture model which models the new dataset. Although this model does not, strictly speaking, model the original dataset, in practice it is a good approximation, especially when care is taken to ensure that the model does not predict that two overlapping subsequences in the new dataset both were generated by the motif. This is done by enforcing a constraint on the estimated probabilities of overlapping subsequences being motif occurrences. How this constraint is enforced is discussed later.
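As an illustration of this decomposition, the sketch below (illustrative Python, not the MM implementation; the toy sequences and the function name are invented for the example) enumerates the overlapping length-$W$ windows that form the new dataset from a set of input sequences.

def overlapping_subsequences(sequences, W):
    """Conceptually flatten a set of sequences into the new dataset:
    every (overlapping) length-W window of every input sequence."""
    X = []
    for seq in sequences:
        for start in range(len(seq) - W + 1):
            X.append(seq[start:start + W])
    return X

# Toy DNA example: 3 sequences, motif width W = 4.
dataset = ["ACGTACGT", "TTACGTAA", "GGGACGTC"]
X = overlapping_subsequences(dataset, W=4)
print(len(X), X[:3])   # 15 windows; the first few are 'ACGT', 'CGTA', 'GTAC'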
The model for the new dataset consists of two components which model the motif and background (non-motif) subsequences respectively. The motif model used by MM says that each position in a subsequence which is an occurrence of the motif is generated by an independent random variable describing a multinomial trial with parameter $f_i = (f_{i1}, \ldots, f_{iL})$. That is, the probability of letter $a_j$ appearing in position $i$ in the motif is $f_{ij}$. The parameters $f_{ij}$ for $i = 1, \ldots, W$ and $j = 1, \ldots, L$ must be estimated from the data. The background model says that each position in a subsequence which is not part of a motif occurrence is generated independently, by a multinomial trial random variable with a common parameter $f_0 = (f_{01}, \ldots, f_{0L})$. In other words, MM assumes that a sequence of length $W$ generated by the background model is a sequence of $W$ independent samples from a single background distribution. The overall model for the new dataset which MM uses is that either the motif model (with probability $\lambda_1$) or the background model (with probability $\lambda_2 = 1 - \lambda_1$) is chosen by nature and then a sequence of length $W$ is generated according to the probability distribution governing the model chosen. In summary, the parameters for the overall model of the data assumed by MM are the mixing parameter $\lambda = (\lambda_1, \lambda_2)$, vectors of letter frequencies for the motif model $\theta_1 = (f_1, f_2, \ldots, f_W)$, and a single vector of letter frequencies for the background model $\theta_2 = f_0$.
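To make the two-component model concrete, the following sketch (illustrative Python with invented toy frequencies, not the MM implementation) evaluates a length-$W$ subsequence under the motif component, under the background component, and under the mixture.

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def p_motif(x, f):
    """Probability of subsequence x under the motif component: position i
    is an independent multinomial trial with parameter f[i]."""
    p = 1.0
    for i, letter in enumerate(x):
        p *= f[i][ALPHABET[letter]]
    return p

def p_background(x, f0):
    """Probability of x under the background component: every position is
    an independent sample from the common distribution f0."""
    p = 1.0
    for letter in x:
        p *= f0[ALPHABET[letter]]
    return p

def p_mixture(x, f, f0, lambda1):
    """Two-component mixture: the motif is chosen with probability lambda1,
    the background with probability 1 - lambda1."""
    return lambda1 * p_motif(x, f) + (1.0 - lambda1) * p_background(x, f0)

# Toy W = 3 motif that prefers A, then C, then G, with a uniform background.
f = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.1, 0.1, 0.7, 0.1]]
f0 = [0.25, 0.25, 0.25, 0.25]
print(p_mixture("ACG", f, f0, lambda1=0.05))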
In order to find more than one motif in a dataset, a value called an "erasing factor" is associated with each letter in the dataset. These values are all initially set to 1 to indicate that no erasing has occurred. After MM converges to a motif, the erasing factors are all adjusted downward towards 0 in proportion to the probability that each character is part of an occurrence of the motif just found. The erasing factors are used by MM in the reestimation of the model parameters and effectively remove occurrences of motifs already found from the dataset. This method of finding multiple motifs is discussed in more detail in [Bailey and Elkan, 1993].
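The exact erasing-factor update is deferred to [Bailey and Elkan, 1993]; the sketch below shows one plausible multiplicative scheme consistent with the description above (an assumed illustration, not the published formula): every letter covered by a likely occurrence of the motif just found has its factor scaled down towards 0.

def update_erasing_factors(erasing, site_probs, W):
    """Plausible multiplicative erasing update (assumed, not the exact
    formula from Bailey and Elkan [1993]).  erasing[s][k] is the erasing
    factor for letter k of sequence s; site_probs[s][j] is the estimated
    probability that an occurrence of the motif just found starts at
    position j of sequence s."""
    for s, probs in enumerate(site_probs):
        for j, p in enumerate(probs):
            for k in range(j, j + W):      # letters covered by this window
                erasing[s][k] *= 1.0 - p
    return erasing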
The motif discovered by MM can be used to search for motif occurrences in other datasets. It can be used directly to estimate the conditional probability of any subsequence of length $W$ given the motif. It can also be used to compute the likelihood ratio for testing whether the subsequence was generated by the motif or the background component of the model. This can be done by using the estimated values of the parameter $\theta$ to create a log-odds matrix of the type referred to as a "specificity matrix" by Hertz et al. [1990]. The estimated value of the mixing parameter, $\lambda_1$, can be used to calculate the optimum threshold for using the log-odds matrix as a classifier. In addition, the log-odds matrix can be used as a profile (without gaps) [Gribskov et al., 1990] of the motif in the other applications of profile analysis.
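The following sketch (illustrative Python, reusing the toy frequencies from the earlier example) builds a log-odds matrix of this kind and derives the decision threshold from the estimated mixing parameter: under the two-component model, the Bayes-optimal rule classifies a subsequence as a motif occurrence when its summed log-odds score exceeds $\log((1 - \lambda_1)/\lambda_1)$.

import math

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_odds_matrix(f, f0):
    """Specificity-matrix-style scores: log(f[i][j] / f0[j]) for motif
    position i and letter j."""
    return [[math.log(fi[j] / f0[j]) for j in range(len(f0))] for fi in f]

def score(x, matrix):
    """Summed log-odds score of a length-W subsequence x."""
    return sum(matrix[i][ALPHABET[c]] for i, c in enumerate(x))

def bayes_threshold(lambda1):
    """Bayes-optimal threshold on the log likelihood ratio: classify x as a
    motif occurrence when score(x) exceeds this value."""
    return math.log((1.0 - lambda1) / lambda1)

# Toy W = 3 motif and uniform background from the earlier sketch.
f = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
f0 = [0.25, 0.25, 0.25, 0.25]
m = log_odds_matrix(f, f0)
print(score("ACG", m) > bayes_threshold(lambda1=0.05))   # True: classified as motif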
3 Expectation maximization in finite mixture models
The MM algorithm uses expectation maximization (EM) to discover motifs in datasets of sequences. The next two sections describe how EM can be applied to the problem at hand and how the likelihoods of the motifs found are calculated.
3.1 The MM algorithm
The MM algorithm does maximum likelihood estimation: its objective is to discover those values of the parameters of the overall model which maximize the likelihood of the data. To do this, the expectation maximization algorithm for finite mixture models of Aitkin and Rubin [1985] is used. This iterative procedure finds values for $\lambda = (\lambda_1, \lambda_2)$ and $\theta = (\theta_1, \theta_2)$ which (locally) maximize the likelihood of the data given the model.

A finite mixture model assumes that data $X = X_1, X_2, \ldots, X_n$ arises from two or more groups with known distributional forms but different, unknown parameters. Expectation maximization (EM) can be used to find maximum likelihood estimates of the unknown parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_g)$, where $g$ is the number of groups.
The likelihood of such a mixture model given the data is

$$L(\theta, \lambda \mid X) = \prod_{i=1}^{n} \sum_{j=1}^{g} \lambda_j\, p(X_i \mid \theta_j),$$

where $n$ is the number of samples.
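The following sketch shows one EM iteration for the two-component case described in Section 2. It is a generic illustration of the E-step and M-step for this likelihood, not MM's exact reestimation formulas (it omits, for example, the overlap constraint and the erasing factors described earlier).

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def em_step(X, f, f0, lambda1, W):
    """One EM iteration for the two-component mixture over length-W
    subsequences X (a sketch under the model above, not MM's exact rules).
    E-step: posterior probability that each subsequence was generated by
    the motif component rather than the background.
    M-step: reestimate the mixing parameter and letter frequencies from
    those posteriors, with small pseudocounts to avoid zero frequencies."""
    L = len(f0)
    # E-step
    z = []
    for x in X:
        pm, pb = lambda1, 1.0 - lambda1
        for i, c in enumerate(x):
            pm *= f[i][ALPHABET[c]]
            pb *= f0[ALPHABET[c]]
        z.append(pm / (pm + pb))
    # M-step
    lambda1 = sum(z) / len(X)
    f = [[0.01] * L for _ in range(W)]
    f0 = [0.01] * L
    for x, zi in zip(X, z):
        for i, c in enumerate(x):
            f[i][ALPHABET[c]] += zi
            f0[ALPHABET[c]] += 1.0 - zi
    f = [[v / sum(row) for v in row] for row in f]
    total = sum(f0)
    f0 = [v / total for v in f0]
    return f, f0, lambda1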