FITTING A MIXTURE MODEL BY
EXPECTATION MAXIMIZATION
TO DISCOVER MOTIFS IN BIOPOLYMERS
UCSD Technical Report CS94-351
TIMOTHY L. BAILEY
TBAILEY@CS.UCSD.EDU
Department of Computer Science and Engineering,
University of California at San Diego, La Jolla, California 92093-0114
phone: (619) 534-8187
fax: (619) 534-7029
CHARLES ELKAN
ELKAN@CS.UCSD.EDU
Department of Computer Science and Engineering,
University of California at San Diego, La Jolla, California 92093-0114
phone: (619) 534-8897
fax: (619) 534-7029
ABSTRACT:
The algorithm described in this paper discovers one or more motifs in a collection of DNA or protein sequences by using the technique of expectation maximization to fit a two-component finite mixture model to the set of sequences. Multiple motifs are found by fitting a two-component finite mixture model to the data, probabilistically erasing the occurrences of the motif thus found, and repeating the process to find successive motifs. The algorithm requires only a set of sequences and a number specifying the width of the motifs as input. It returns a model of each motif and a threshold which together can be used as a Bayes-optimal classifier for searching for occurrences of the motif in other databases. The algorithm estimates how many times each motif occurs in the input dataset and outputs an alignment of the occurrences of the motif. The algorithm is capable of discovering several different motifs with differing numbers of occurrences in a single dataset. Motifs are discovered by treating the set of sequences as though they were created by a stochastic process which can be modelled as a mixture of two densities, one of which generated the occurrences of the motif, and the other the rest of the positions in the sequences. Expectation maximization is used to estimate the parameters of the two densities and the mixing parameter.
KEYWORDS:
Unsupervised learning, expectation maximization, mixture model,
consensus pattern, motif, biopolymer, binding site.
Contents
1 Introduction
2 The finite mixture model
3 Expectation maximization in finite mixture models
3.1 The MM algorithm
3.2 Computing the expected likelihood of a model
4 Implementation of MM
4.1 The MEME+ algorithm
4.2 Output of MEME+
4.3 Finding good starting points for MM
5 Experimental Results
5.1 MEME+ discovers multiple motifs
5.2 Sensitivity to prior
5.3 Sensitivity to noise
5.4 Sensitivity to subsampling
6 Discussion
Acknowledgements:
Timothy Bailey is supported by NIH Genome Analysis Pre-
Doctoral Training Grant No. HG00005. Charles Elkan is supported in part by the
National Science Foundation under Award No. IRI-9110813.
1 Introduction
Finding a cluster of numerous similar subsequences in a set of biopolymer sequences is evidence that the subsequences occur not by chance but because they share some biological function. For example, the shared biological function which accounts for the similarity of a subset of subsequences might be a common protein binding site or splice junction in the case of DNA sequences, or the active sites of related enzymes in the case of protein sequences. This paper describes an algorithm called MM which, given a dataset of possibly related biopolymer sequences, estimates the parameters of a probabilistic model which could have generated the dataset. The probabilistic model is a two-component finite mixture model. One component describes a set of similar subsequences of fixed width (the "motif"), while the other component describes all other positions in the sequences (the "background"). Fitting the model includes estimating the relative frequency of motif occurrences. This estimate can be used to select the threshold for a Bayes-optimal classifier for finding occurrences of the motif in other databases.
The MM algorithm is an extension of the expectation maximization technique for fitting finite mixture models developed by Aitkin and Rubin [1985]. It is related to the algorithm based on expectation maximization described by Lawrence and Reilly [1990], but it relaxes the assumption that each sequence in the dataset contains one occurrence of the motif. In other words, MM solves an unsupervised learning problem: it is intended to be useful for discovering new motifs in datasets that may or may not contain motifs, treating each subsequence of the given width in the dataset as an unlabeled sample. On the other hand, the algorithm of Lawrence and Reilly [1990] treats each sequence as a labeled sample, and solves a supervised learning problem.
MM is also related to the algorithm described in [Bailey and Elkan, 1993]. Unlike that algorithm, MM estimates cluster size (the number of occurrences of the motif) at the same time it is learning models of the motif and the background. This removes the need for the user of the algorithm to know in advance the number of times the motif occurs in the dataset, and makes it possible to search for motifs in datasets about which very little is known.
The MM algorithm has been implemented as an option to the MEME software for discovering multiple motifs in biopolymer sequences [Bailey and Elkan, 1993]. MM can therefore be used to discover multiple motifs in a dataset. Briefly, this is done by repeatedly applying MM to the dataset and then probabilistically erasing all occurrences of the discovered motif. Because MM estimates the number of occurrences of each motif, MEME using MM is able to find motifs with different numbers of occurrences in a single dataset. This increases the usefulness of MEME as a tool for exploring datasets that contain more than one motif.
The rest of this paper is organized as follows. Section 2 describes the finite mixture model used by MM, and Section 3 gives the analysis needed to apply the expectation maximization idea to this type of model. Section 4 describes the implementation of MM in the context of MEME. Section 5 presents experimental results of using the MM algorithm to discover motifs in several DNA and protein datasets. Finally, Section 6 concludes the paper by discussing the strengths and limitations of the MM algorithm.
2 The finite mixture model
The MM algorithm searches for maximum likelihood estimates of the parameters of a finite mixture model which could have generated a given dataset of biopolymer sequences. We will refer to the dataset as $Y = Y_1, Y_2, \ldots, Y_N$, where $N$ is the number of sequences in the dataset. The sequences $Y_i$ are assumed to be over some fixed alphabet, say $\mathcal{A} = \{a_1, a_2, \ldots, a_L\}$, which is given as input to the algorithm. The mixture model used by MM does not actually model the dataset directly. Instead, the dataset is broken up conceptually into all (overlapping) subsequences of length $W$ which it contains. This new dataset will be referred to as $X = X_1, X_2, \ldots, X_n$. MM learns a finite mixture model which models the new dataset. Although this model does not, strictly speaking, model the original dataset, in practice it is a good approximation, especially when care is taken to ensure that the model does not predict that two overlapping subsequences in the new dataset both were generated by the motif. This is done by enforcing a constraint on the estimated probabilities of overlapping subsequences being motif occurrences. How this constraint is enforced is discussed later.
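As an illustration of this decomposition, the sketch below (illustrative Python, not the MM implementation; the toy sequences and the function name are invented for the example) enumerates the overlapping length-$W$ windows that form the new dataset from a set of input sequences.

def overlapping_subsequences(sequences, W):
    """Conceptually flatten a set of sequences into the new dataset:
    every (overlapping) length-W window of every input sequence."""
    X = []
    for seq in sequences:
        for start in range(len(seq) - W + 1):
            X.append(seq[start:start + W])
    return X

# Toy DNA example: 3 sequences, motif width W = 4.
dataset = ["ACGTACGT", "TTACGTAA", "GGGACGTC"]
X = overlapping_subsequences(dataset, W=4)
print(len(X), X[:3])   # 15 windows; the first few are 'ACGT', 'CGTA', 'GTAC'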
The model for the new dataset consists of two components which model the motif and background (non-motif) subsequences respectively. The motif model used by MM says that each position in a subsequence which is an occurrence of the motif is generated by an independent random variable describing a multinomial trial with parameter $f_i = (f_{i1}, \ldots, f_{iL})$. That is, the probability of letter $a_j$ appearing in position $i$ in the motif is $f_{ij}$. The parameters $f_{ij}$ for $i = 1, \ldots, W$ and $j = 1, \ldots, L$ must be estimated from the data. The background model says that each position in a subsequence which is not part of a motif occurrence is generated independently, by a multinomial trial random variable with a common parameter $f_0 = (f_{01}, \ldots, f_{0L})$. In other words, MM assumes that a sequence of length $W$ generated by the background model is a sequence of $W$ independent samples from a single background distribution. The overall model for the new dataset which MM uses is that either the motif model (with probability $\lambda_1$) or the background model (with probability $\lambda_2 = 1 - \lambda_1$) is chosen by nature and then a sequence of length $W$ is generated according to the probability distribution governing the model chosen. In summary, the parameters for the overall model of the data assumed by MM are the mixing parameter $\lambda = (\lambda_1, \lambda_2)$, vectors of letter frequencies for the motif model $\theta_1 = (f_1, f_2, \ldots, f_W)$, and a single vector of letter frequencies for the background model $\theta_2 = f_0$.
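To make the two-component model concrete, the following sketch (illustrative Python with invented toy frequencies, not the MM implementation) evaluates a length-$W$ subsequence under the motif component, under the background component, and under the mixture.

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def p_motif(x, f):
    """Probability of subsequence x under the motif component: position i
    is an independent multinomial trial with parameter f[i]."""
    p = 1.0
    for i, letter in enumerate(x):
        p *= f[i][ALPHABET[letter]]
    return p

def p_background(x, f0):
    """Probability of x under the background component: every position is
    an independent sample from the common distribution f0."""
    p = 1.0
    for letter in x:
        p *= f0[ALPHABET[letter]]
    return p

def p_mixture(x, f, f0, lambda1):
    """Two-component mixture: the motif is chosen with probability lambda1,
    the background with probability 1 - lambda1."""
    return lambda1 * p_motif(x, f) + (1.0 - lambda1) * p_background(x, f0)

# Toy W = 3 motif that prefers A, then C, then G, with a uniform background.
f = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.1, 0.1, 0.7, 0.1]]
f0 = [0.25, 0.25, 0.25, 0.25]
print(p_mixture("ACG", f, f0, lambda1=0.05))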
In order to find more than one motif in a dataset, a value called an "erasing factor" is associated with each letter in the dataset. These values are all initially set to 1 to indicate that no erasing has occurred. After MM converges to a motif, the erasing factors are all adjusted downward towards 0 in proportion to the probability that each character is part of an occurrence of the motif just found. The erasing factors are used by MM in the reestimation of the model parameters and effectively remove occurrences of motifs already found from the dataset. This method of finding multiple motifs is discussed in more detail in [Bailey and Elkan, 1993].
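The exact erasing-factor update is deferred to [Bailey and Elkan, 1993]; the sketch below shows one plausible multiplicative scheme consistent with the description above (an assumed illustration, not the published formula): every letter covered by a likely occurrence of the motif just found has its factor scaled down towards 0.

def update_erasing_factors(erasing, site_probs, W):
    """Plausible multiplicative erasing update (assumed, not the exact
    formula from Bailey and Elkan [1993]).  erasing[s][k] is the erasing
    factor for letter k of sequence s; site_probs[s][j] is the estimated
    probability that an occurrence of the motif just found starts at
    position j of sequence s."""
    for s, probs in enumerate(site_probs):
        for j, p in enumerate(probs):
            for k in range(j, j + W):      # letters covered by this window
                erasing[s][k] *= 1.0 - p
    return erasing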
The motif discovered by MM can be used to search for motif occurrences in other datasets. It can be used directly to estimate the conditional probability of any subsequence of length $W$ given the motif. It can also be used to compute the likelihood ratio for testing whether the subsequence was generated by the motif or the background component of the model. This can be done by using the estimated values of the parameter $\theta$ to create a log-odds matrix of the type referred to as a "specificity matrix" by Hertz et al. [1990]. The estimated value of the mixing parameter, $\lambda_1$, can be used to calculate the optimum threshold for using the log-odds matrix as a classifier. In addition, the log-odds matrix can be used as a profile (without gaps) [Gribskov et al., 1990] of the motif in the other applications of profile analysis.
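The following sketch (illustrative Python, reusing the toy frequencies from the earlier example) builds a log-odds matrix of this kind and derives the decision threshold from the estimated mixing parameter: under the two-component model, the Bayes-optimal rule classifies a subsequence as a motif occurrence when its summed log-odds score exceeds $\log((1 - \lambda_1)/\lambda_1)$.

import math

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def log_odds_matrix(f, f0):
    """Specificity-matrix-style scores: log(f[i][j] / f0[j]) for motif
    position i and letter j."""
    return [[math.log(fi[j] / f0[j]) for j in range(len(f0))] for fi in f]

def score(x, matrix):
    """Summed log-odds score of a length-W subsequence x."""
    return sum(matrix[i][ALPHABET[c]] for i, c in enumerate(x))

def bayes_threshold(lambda1):
    """Bayes-optimal threshold on the log likelihood ratio: classify x as a
    motif occurrence when score(x) exceeds this value."""
    return math.log((1.0 - lambda1) / lambda1)

# Toy W = 3 motif and uniform background from the earlier sketch.
f = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
f0 = [0.25, 0.25, 0.25, 0.25]
m = log_odds_matrix(f, f0)
print(score("ACG", m) > bayes_threshold(lambda1=0.05))   # True: classified as motif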
3 Expectation maximization in finite mixture models
The MM algorithm uses expectation maximization (EM) to discover motifs in datasets of sequences. The next two sections describe how EM can be applied to the problem at hand and how the likelihoods of the motifs found are calculated.
3.1 The MM algorithm
The MM algorithm does maximum likelihood estimation: its objective is to discover those values of the parameters of the overall model which maximize the likelihood of the data. To do this, the expectation maximization algorithm for finite mixture models of Aitkin and Rubin [1985] is used. This iterative procedure finds values for $\lambda = (\lambda_1, \lambda_2)$ and $\theta = (\theta_1, \theta_2)$ which (locally) maximize the likelihood of the data given the model.

A finite mixture model assumes that data $X = X_1, X_2, \ldots, X_n$ arises from two or more groups with known distributional forms but different, unknown parameters. Expectation maximization (EM) can be used to find maximum likelihood estimates of the unknown parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_g)$, where $g$ is the number of groups.
The likelihood of such a mixture model given the data is

$$L(\theta, \lambda \mid X) = \prod_{i=1}^{n} \sum_{j=1}^{g} \lambda_j\, p(X_i \mid \theta_j),$$

where $n$ is the number of samples.
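The following sketch shows one EM iteration for the two-component case described in Section 2. It is a generic illustration of the E-step and M-step for this likelihood, not MM's exact reestimation formulas (it omits, for example, the overlap constraint and the erasing factors described earlier).

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def em_step(X, f, f0, lambda1, W):
    """One EM iteration for the two-component mixture over length-W
    subsequences X (a sketch under the model above, not MM's exact rules).
    E-step: posterior probability that each subsequence was generated by
    the motif component rather than the background.
    M-step: reestimate the mixing parameter and letter frequencies from
    those posteriors, with small pseudocounts to avoid zero frequencies."""
    L = len(f0)
    # E-step
    z = []
    for x in X:
        pm, pb = lambda1, 1.0 - lambda1
        for i, c in enumerate(x):
            pm *= f[i][ALPHABET[c]]
            pb *= f0[ALPHABET[c]]
        z.append(pm / (pm + pb))
    # M-step
    lambda1 = sum(z) / len(X)
    f = [[0.01] * L for _ in range(W)]
    f0 = [0.01] * L
    for x, zi in zip(X, z):
        for i, c in enumerate(x):
            f[i][ALPHABET[c]] += zi
            f0[ALPHABET[c]] += 1.0 - zi
    f = [[v / sum(row) for v in row] for row in f]
    total = sum(f0)
    f0 = [v / total for v in f0]
    return f, f0, lambda1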