THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
Lecture 8: Methods of pattern recognition
for identification of sequence features, secondary structures etc.
Bioinformatics
Paweł Kędzierski
Contents of the lecture

1 Statistical methods
  1.1 Bayesian statistics
  1.2 Hidden Markov Models
  1.3 Profile (knowledge) based recognition
2 Heuristic machine learning
  2.1 Neural Networks
  2.2 Genetic Algorithms
3 Applications
  3.1 Performance in practice
  3.2 Classification of methods
Introduction

The purpose: recognition of
- secondary structures,
- tertiary structures,
- functional residues, motifs, domains etc.

The methods of recognition of sequence features (which are expressed in the form of some patterns in the sequence) can be divided into two groups:
- Knowledge-based methods. The recognition takes into account the knowledge of patterns in characterized sequences and their similarity to the uncharacterized ones.
- Machine learning methods. These may be based on probability theory or on heuristics.
1 Statistical methods

1.1 Bayesian statistics

Bayesian interpretation of probability

Thomas Bayes (1702-1761) interpreted probability as a "state of knowledge", allowing the application of probability theory in cases when the true probabilities are unknown a priori, or inaccurate. This interpretation is valuable in cases when the rule governing the probability of some event is unknown or uncertain, and we want to build up or improve our knowledge with observed evidence.

Bayesian statistics provide a sound scientific basis for many machine-learning methods.

Further reading:
http://en.wikipedia.org/wiki/Bayesian_probability
http://en.wikipedia.org/wiki/Bayesian_inference

Bayes' theorem:

p(H|E) = p(E|H) p(H) / p(E)

where:
- E is an observation (a new piece of evidence);
- H is some hypothesis, and p(H) is the prior probability of this hypothesis (the probability assumed before the observation of E);
- p(E|H) is the conditional probability of E assuming that H holds, i.e. the probability of the event E resulting from the hypothesis H;
- p(E) is the a priori probability of observing E under all possible, mutually exclusive hypotheses:

p(E) = Σ_i p(E|H_i) p(H_i)
p(H|E) is the (estimated) probability that H is true, taking the evidence E into account.

To make use of Bayes' theorem, we must know the probabilities of the evidence under all alternative hypotheses:

p(E) = Σ_i p(E|H_i) p(H_i)

An example: assume we are drawing balls from a box. The first ball turned out to be white. On the basis of this observation we may construct a hypothesis: "the box contains only white balls", i.e. p_W = 1.0.

The second ball turned out black. The initial hypothesis must be revised (it is proven false by the new evidence). We may propose a new hypothesis, for example p_W = p_B = 0.5 (but the "experimental support" is still very weak). If after many iterations the n_W/n_B ratio is close to 1, this new hypothesis is well supported.
In the previous example, our hypothesis H stated "p_W = p_B = 0.5"; but if we are wrong (not H), there is no way to evaluate p(E|not H)...
Another example: there are two boxes, #1 and #2. The first contains 30% white (W) and 70% black (B) balls, and the second 70% white and 30% black. The boxes look identical.

We took one of the boxes and pulled a single white ball out. What is the probability that we pulled it from box #1?

Let H_1 mean "the box tested is #1", and H_2 mean "it is #2". Since both boxes look identical, we assume that initially (without any additional evidence) p(H_1) = p(H_2) = 0.5.

Our experimental evidence is a white ball: E = W. Here p(W|H_1) = 0.3: assuming we draw out of box #1, the probability of choosing a white ball is 0.3.
Applying Bayes' theorem:

p(H_1|W) = p(W|H_1) p(H_1) / [ p(W|H_1) p(H_1) + p(W|H_2) p(H_2) ]

p(H_1|W) = (0.3 · 0.5) / (0.3 · 0.5 + 0.7 · 0.5) = 0.3

p(H_2|W) = 1 - p(H_1|W) = 0.7

Updating p(#1) and p(#2) in the same way after every consecutive draw, the evidence accumulates. Two example runs (n_W and n_B count the white and black balls drawn so far; p(#2) = 1 - p(#1)):

E     |     |  W  |  W  |  W  |  W  |  W  |  W  |  B  | ... |  B  |  B
n_W   |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  6  | ... |  6  |  6
n_B   |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  1  | ... | 13  | 14
p(#1) |0.500|0.300|0.155|0.073|0.033|0.014|0.006|0.014| ... |0.997|0.999
p(#2) |0.500|0.700|0.855|0.927|0.967|0.986|0.994|0.986| ... |0.003|0.001

E     |     |  W  |  B  |  W  | ... |  B  |  B  |  B  |  B  |  B  |  W
n_W   |  0  |  1  |  1  |  2  | ... |  7  |  7  |  7  |  7  |  7  |  8
n_B   |  0  |  0  |  1  |  1  | ... |  8  |  9  | 10  | 11  | 12  | 12
p(#1) |0.500|0.300|0.500|0.300| ... |0.700|0.845|0.927|0.967|0.986|0.967
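This update is easy to put into code. A minimal Python sketch (not part of the lecture) that reproduces the first columns of the first table above; the function names are arbitrary:

    # Iterative Bayesian update for the two-box example:
    # p(W|#1) = 0.3, p(W|#2) = 0.7, equal priors p(#1) = p(#2) = 0.5.
    def update(prior1, draw, p_w1=0.3, p_w2=0.7):
        """Posterior p(#1) after observing a single draw ('W' or 'B')."""
        like1 = p_w1 if draw == "W" else 1.0 - p_w1          # p(E|H1)
        like2 = p_w2 if draw == "W" else 1.0 - p_w2          # p(E|H2)
        evidence = like1 * prior1 + like2 * (1.0 - prior1)   # p(E)
        return like1 * prior1 / evidence                     # Bayes' theorem

    p1 = 0.5
    for ball in "WWWWWWB":          # the first draws of the first run above
        p1 = update(p1, ball)
        print(ball, round(p1, 3))   # 0.3, 0.155, 0.073, 0.033, 0.014, 0.006, 0.014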
Bayesian statistics and bioinformatics

Example 1:
Data: frequencies of residues, probabilities of mutations, an MSA.
Hypotheses: phylogenetic trees.
This is equivalent to tree likelihood evaluation.

Example 2:
Data: dependencies of residue frequencies on protein secondary structure, and a sequence.
Hypotheses: a particular sequence (fragment) forms an α-helix, or it forms a β-sheet, or it is unstructured.

Example 3:
Data: sequences with experimentally characterized secondary structures.
Hypotheses: residue X is part of an α-helix, it is part of a β-sheet, or it is part of an unstructured fragment.
Result: probabilities of the occurrence of X within a specific secondary structure. This is an example of a machine learning model.
1.2 Hidden Markov Models

What is a Markov model?

A Markov Chain is a chain of events whose probabilities depend only on the current state (of the process which generates these events).
A zeroth-order Markov Model is a model describing the probabilities of independent events.

A first-order Markov Model is a model of the probability of events which depend on the (chain of length 1 of) previous events.
Let an event be the occurrence of a base in DNA. Then:
- a 0th-order MM is simply the base frequencies,
- a 1st-order MM defines the conditional probability of occurrence of base b, given that the previous base was a:

r_ab = p(x_i = b | x_{i-1} = a)
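For illustration, a short Python sketch (not from the lecture) that estimates these conditional probabilities from a DNA string by counting neighbouring base pairs:

    # Estimate a 1st-order Markov model of DNA:
    # r_ab = p(x_i = b | x_{i-1} = a), from observed neighbour counts.
    from collections import Counter

    def first_order_mm(seq):
        pair_counts = Counter(zip(seq, seq[1:]))   # counts of pairs (a, b)
        ctx_counts = Counter(seq[:-1])             # counts of the preceding base a
        return {(a, b): n / ctx_counts[a] for (a, b), n in pair_counts.items()}

    r = first_order_mm("ACGCGCGTATACGCGT")
    print(r[("C", "G")])   # estimate of p(x_i = 'G' | x_{i-1} = 'C')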
Hidden Markov Model

A HMM is a model of the probability of events in a Markov chain, where the "current state" is (partially) hidden.

A zeroth-order MM has no "state". For the first-order MM from the previous example, the "state" is the previous base; this is an observable state. A "hidden state" (i.e. one that is not directly evident) may represent information such as: coding/non-coding sequence.
1.3 Profile (knowledge) based recognition

Profile-based recognition

A profile is a notion similar to a PSSM. If a PSSM is built from sequences with characterized features, these features may be recognized in new sequences by similarity.

The advantage of profiles is context-specific comparison; the context is not (necessarily) limited a priori like the length (order) of HMM chains.
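To make this concrete, a minimal Python sketch (not from the lecture) of scoring sequence windows against a toy PSSM; the matrix values are invented for illustration:

    # pssm[i][base] is a log-odds score for observing `base` at profile column i.
    pssm = [
        {"A": 1.2, "C": -0.5, "G": -0.8, "T": -0.5},
        {"A": -0.7, "C": 1.0, "G": -0.7, "T": 0.1},
        {"A": -0.9, "C": -0.9, "G": 1.5, "T": -1.1},
    ]

    def score_window(window, pssm):
        """Sum of per-column scores; higher means a better match to the profile."""
        return sum(col[base] for col, base in zip(pssm, window))

    seq = "TACGGA"
    best = max(range(len(seq) - len(pssm) + 1),
               key=lambda i: score_window(seq[i:i + len(pssm)], pssm))
    print(best, seq[best:best + len(pssm)])   # best-scoring window, here "ACG"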
HMM Profiles

The probabilities of producing a given (observed) output, p_0A, p_0T, p_0G, etc., are called emission probabilities.

The probabilities of a change of the current state, t_0→1 and t_1→0, are called transition probabilities.

Both p and t depend conditionally on the current state.

All the possible hidden states form a complete ensemble of hypotheses; their probabilities can be evaluated using Bayesian statistics on the basis of the observed evidence.
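As an illustration of such an evaluation, a minimal Python sketch (not from the lecture) of the forward algorithm for a toy two-state HMM (hidden states: 0 = non-coding, 1 = coding, as in the earlier example); all probabilities are invented:

    states = (0, 1)
    start = {0: 0.5, 1: 0.5}                              # initial probabilities
    trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}    # t_0→1, t_1→0, ...
    emit = {0: {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # emission probabilities
            1: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

    def forward(seq):
        """Probability of emitting `seq`, summed over all hidden state paths."""
        f = {s: start[s] * emit[s][seq[0]] for s in states}
        for x in seq[1:]:
            f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
                 for s in states}
        return sum(f.values())

    print(forward("GCGCGATATA"))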
The model can be taught with a training set to recognize hidden states in new data (supervised learning). It is also possible that the model itself proposes the hypothesis which has maximum likelihood of being true, given the evidence (unsupervised learning).

An example alignment:

1 2     3
W H . . E n
W H . . Y .
W - . . E .
S H . . E .
T H e . Y .
W H e r E .

No arbitrary gap penalties! HMM profiles:
- can be used to construct profiles (PSSM-like) from a provided MSA;
- can be used to search for, and to evaluate, similarity;
- are able to construct an optimal MSA (of the highest likelihood)!

Example software – HMMER: http://hmmer.janelia.org
GPU-HMMER (a speed-up factor of 30-40x on an Nvidia Tesla C1060):
http://www.nvidia.pl/page/hmmer_on_tesla.html
2 Heuristic machine learning
2.1 Neural Networks

An analogy of brains.

[Figure: a simple neural network]
Comments on neural networks

Three or more "layers" of neurons are used in practice. The input function is usually non-linear; commonly used is a sigmoidal function with some (parametrized) sensitivity threshold.

The network must be trained prior to usage (to generate sensible parameters), and therefore some measure of the result quality must be defined a priori.

Quality:
- applicable to a variety of problems;
- there is no statistical model behind the idea;
- success rate 50-90%, depending on the method and the problem at hand.
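For illustration, a minimal Python sketch (not from the lecture) of a forward pass through such a network, with a sigmoidal input function; all weights are invented:

    import math

    def sigmoid(x, threshold=0.0):
        # sigmoidal input function with a (parametrized) sensitivity threshold
        return 1.0 / (1.0 + math.exp(-(x - threshold)))

    def layer(inputs, weights, biases):
        # one layer: weighted sums of the inputs passed through the sigmoid
        return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
                for ws, b in zip(weights, biases)]

    x = [0.2, 0.9]                                # e.g. encoded sequence features
    hidden = layer(x, [[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2])
    output = layer(hidden, [[1.2, -0.7]], [0.0])
    print(output)                                 # activation of the output neuron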
2.2 Genetic Algorithms

- The "solution" of a problem must be encoded in the form of a "gene" which can be copied fragment-wise;
- an optimality measure of the "solution" must be defined;
- some way to "mutate" the gene-encoded solution must be implemented.

The idea is to have a population of solutions which undergo mutations and "mating" (crossing-over), and to select the best ones from generation to generation (see the sketch after this list):

1. A population of initial solutions is generated;
2. the population undergoes random mutations;
3. the population undergoes "mating", that is, generation of new "genes" by combining the existing ones;
4. a subset of the best "genes" (solutions) is selected as a "new generation";
5. the process is reiterated from step 2.
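A minimal Python sketch of this loop (not from the lecture), with an invented toy fitness function (the number of 1-bits in a 20-bit "gene"):

    import random

    def fitness(gene):                 # the optimality measure (a toy example)
        return sum(gene)

    def mutate(gene, rate=0.05):       # step 2: random point mutations
        return [b ^ (random.random() < rate) for b in gene]

    def mate(a, b):                    # step 3: crossing-over at a random point
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]  # step 1
    for generation in range(50):       # step 5: reiterate
        pop = [mutate(g) for g in pop]
        pop += [mate(*random.sample(pop, 2)) for _ in range(30)]
        pop = sorted(pop, key=fitness, reverse=True)[:30]  # step 4: selection
    print(fitness(pop[0]))             # quality of the best solution found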
3 Applications

3.1 Performance in practice

Recognition of sequence features:
- protein motifs, secondary structures, binding/active sites, domains;
- coding sequences in genomes, gene functional sites, introns/exons, etc.
Recognition of structure (fold).
Recognition of function.

Fold recognition is based on sequence composition and ordering, motivated by the notion that structure is evolutionarily more conserved than sequence. It is:
- reliable for well recognizable patterns (e.g. coils);
- less reliable for secondary structures;
- not too good for the prediction of local features (single important residues);
- worse than comparative methods when highly similar templates are available.

One approach to fold recognition is based on secondary structure prediction and comparison. This subclass of methods rests on the observation that secondary structure similarity can exceed 80% for sequences that exhibit less than 10% sequence similarity: one-dimensional (sequence) features are predicted, and a similar fold is identified by comparing these features to those of known folds. Such methods are only as good as the underlying secondary structure prediction method.

Accuracy of secondary structure predictions: 60% (1990s), 76% (2004), ?% (current).
3.2 Classification of methods

Methods for prediction of sequence features (secondary structures, domains, binding sites etc.):
- comparative (PSSM and other profiles);
- composition based (HMM, neural networks).

Methods for prediction of tertiary structure (which can also be regarded as a "sequence feature"):
- template based (profiles/alignments);
- fold recognition;
- threading;
- fragment-based methods;
- ab initio (methods that do not use database information); these assume the Anfinsen thermodynamic hypothesis (1973) of folding to the global free energy minimum.
Comparative methods fail when no templates of sufficient similarity are available; the quality of predictions decays with decreasing similarity to annotated templates.

Threading methods score how well the sequence fits a particular fold. They may be comparative (based on alignment of the sequence to known template structures), or similar to fold recognition methods but with a scoring function based on predicted structural rather than sequence features.

Threading-based methods can be quite expensive: globally optimal protein threading is known to be NP-hard. Some threading methods ignore pairwise interactions between residues; in this case, the threading problem can be solved with dynamic programming (see the sketch below), although the use of pairwise residue potentials is more accurate.
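To illustrate the dynamic programming case, a minimal Python sketch (not from the lecture): without pairwise terms, threading reduces to an alignment-like recursion of the sequence against per-position environment scores of the fold. All scores here are invented:

    # env[j][res] scores placing residue `res` at structural position j of the fold.
    env = [{"A": 1.0, "L": 2.0, "K": -1.0},   # buried position: prefers hydrophobic
           {"A": 0.0, "L": -1.0, "K": 2.0},   # exposed position: prefers charged
           {"A": 1.0, "L": 1.5, "K": -0.5}]

    def thread(seq, env, gap=-2.0):
        n, m = len(seq), len(env)
        S = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1): S[i][0] = i * gap   # unplaced residues
        for j in range(1, m + 1): S[0][j] = j * gap   # unfilled positions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                S[i][j] = max(S[i-1][j-1] + env[j-1].get(seq[i-1], 0.0),
                              S[i-1][j] + gap,
                              S[i][j-1] + gap)
        return S[n][m]

    print(thread("LKA", env))   # score of threading the sequence LKA onto the fold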
Fragment-based methods use conformer (rotamer) libraries of short peptide fragments to limit the conformational search space. The structure scoring function may be physics-based or knowledge-based. The most prominent example is the Rosetta algorithm.
Project co-financed by European Union within European Social Fund