THE DEVELOPMENT OF THE POTENTIAL AND ACADEMIC PROGRAMMES OF WROCŁAW UNIVERSITY OF TECHNOLOGY
Lecture 8: Methods of pattern recognition
for identification of sequence features, secondary structures etc.
Bioinformatics
Paweł Kędzierski
Contents of the lecture

1 Statistical methods
  1.1 Bayesian statistics
  1.2 Hidden Markov Models
  1.3 Profile (knowledge) based recognition
2 Heuristic machine learning
  2.1 Neural Networks
  2.2 Genetic Algorithms
3 Applications
  3.1 Performance in practice
  3.2 Classification of methods
Introduction

The purpose: recognition of
- secondary structures,
- tertiary structures,
- functional residues, motifs, domains etc.

The methods of recognition of sequence features (which are expressed in the form of some patterns in the sequence) can be divided into two groups:
- Knowledge-based methods. The recognition takes into account the knowledge of patterns in characterized sequences and their similarity to the uncharacterized ones.
- Machine learning methods. These may be based on probability theory or on heuristics.
1 Statistical methods

1.1 Bayesian statistics

Bayesian interpretation of probability

Thomas Bayes (1702-1761) interpreted probability as a "state of knowledge", allowing the application of probability theory in cases when the true probabilities are unknown a priori, or inaccurate. This interpretation is valuable in cases when the rule governing the probability of some event is unknown or uncertain, and we want to build up or improve our knowledge with observed evidence.

Bayesian statistics provide a sound scientific basis for many machine-learning methods.

Further reading:
http://en.wikipedia.org/wiki/Bayesian_probability
http://en.wikipedia.org/wiki/Bayesian_inference

Bayes' theorem:

p(H|E) = p(E|H) p(H) / p(E)

where:
- E is an observation (a new piece of evidence);
- H is some hypothesis, and p(H) is the prior probability of this hypothesis (the probability assumed before the observation of E);
- p(E|H) is the conditional probability of E assuming that H holds, i.e. the probability of the event E resulting from the hypothesis H;
- p(E) is the a priori probability of observing E under all possible, mutually exclusive hypotheses:

p(E) = Σ_i p(E|H_i) p(H_i)
p(H|E) is the (estimated) probability that H is true, taking the evidence E into account.

To make use of Bayes' theorem, we must know the probabilities of the evidence under all alternative hypotheses:

p(E) = Σ_i p(E|H_i) p(H_i)

An example: assume we are drawing balls from a box. The first ball turned out to be white. On the basis of this observation we may construct a hypothesis: "the box contains only white balls", i.e. p_W = 1.0.

The second ball turned out black. The initial hypothesis must be revised (it is proven false by the new evidence). We may propose a new hypothesis, for example p_W = p_B = 0.5 (but the "experimental support" is still very weak). If after many iterations the n_W/n_B ratio is close to 1, this new hypothesis is well supported.
In the previous example, our hypothesis H stated "p_W = p_B = 0.5"; but if we are wrong (not H), there is no way to evaluate p(E|not H)...
Another example: there are two boxes, #1 and #2. The first contains 30% white (W) and 70% black (B) balls, and the second 70% white and 30% black. The boxes look identical.

We took one of the boxes and pulled a single white ball out. What is the probability that we pulled it from box #1?

Let H_1 mean "the box tested is #1", and H_2 mean "it is #2". Since both boxes look identical, we assume that initially (without any additional evidence) p(H_1) = p(H_2) = 0.5.

Our experimental evidence is a white ball: E = W. Here p(W|H_1) = 0.3: assuming we draw out of box #1, the probability of choosing a white ball is 0.3.
Applying Bayes' theorem:

p(H_1|W) = p(W|H_1) p(H_1) / [ p(W|H_1) p(H_1) + p(W|H_2) p(H_2) ]

p(H_1|W) = (0.3 · 0.5) / (0.3 · 0.5 + 0.7 · 0.5) = 0.3

p(H_2|W) = 1 - p(H_1|W) = 0.7

Updating p(#1) and p(#2) in the same way after every consecutive draw, the evidence accumulates. Two example runs (n_W and n_B count the white and black balls drawn so far; p(#2) = 1 - p(#1)):

E     |     |  W  |  W  |  W  |  W  |  W  |  W  |  B  | ... |  B  |  B
n_W   |  0  |  1  |  2  |  3  |  4  |  5  |  6  |  6  | ... |  6  |  6
n_B   |  0  |  0  |  0  |  0  |  0  |  0  |  0  |  1  | ... | 13  | 14
p(#1) |0.500|0.300|0.155|0.073|0.033|0.014|0.006|0.014| ... |0.997|0.999
p(#2) |0.500|0.700|0.855|0.927|0.967|0.986|0.994|0.986| ... |0.003|0.001

E     |     |  W  |  B  |  W  | ... |  B  |  B  |  B  |  B  |  B  |  W
n_W   |  0  |  1  |  1  |  2  | ... |  7  |  7  |  7  |  7  |  7  |  8
n_B   |  0  |  0  |  1  |  1  | ... |  8  |  9  | 10  | 11  | 12  | 12
p(#1) |0.500|0.300|0.500|0.300| ... |0.700|0.845|0.927|0.967|0.986|0.967
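This update is easy to put into code. A minimal Python sketch (not part of the lecture) that reproduces the first columns of the first table above; the function names are arbitrary:

    # Iterative Bayesian update for the two-box example:
    # p(W|#1) = 0.3, p(W|#2) = 0.7, equal priors p(#1) = p(#2) = 0.5.
    def update(prior1, draw, p_w1=0.3, p_w2=0.7):
        """Posterior p(#1) after observing a single draw ('W' or 'B')."""
        like1 = p_w1 if draw == "W" else 1.0 - p_w1          # p(E|H1)
        like2 = p_w2 if draw == "W" else 1.0 - p_w2          # p(E|H2)
        evidence = like1 * prior1 + like2 * (1.0 - prior1)   # p(E)
        return like1 * prior1 / evidence                     # Bayes' theorem

    p1 = 0.5
    for ball in "WWWWWWB":          # the first draws of the first run above
        p1 = update(p1, ball)
        print(ball, round(p1, 3))   # 0.3, 0.155, 0.073, 0.033, 0.014, 0.006, 0.014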
Bayesian statistics and bioinformatics

Example 1:
Data: frequencies of residues, probabilities of mutations, an MSA.
Hypotheses: phylogenetic trees.
This is equivalent to tree likelihood evaluation.

Example 2:
Data: dependencies of residue frequencies on protein secondary structure, and a sequence.
Hypotheses: a particular sequence (fragment) forms an α-helix, or it forms a β-sheet, or it is unstructured.

Example 3:
Data: sequences with experimentally characterized secondary structures.
Hypotheses: residue X is part of an α-helix, it is part of a β-sheet, or it is part of an unstructured fragment.
Result: probabilities of the occurrence of X within a specific secondary structure. This is an example of a machine learning model.
1.2 Hidden Markov Models

What is a Markov model?

A Markov Chain is a chain of events whose probabilities depend only on the current state (of the process which generates these events).
A zeroth-order Markov Model is a model describing the probabilities of independent events.

A first-order Markov Model is a model of the probability of events which depend on the (chain of length 1 of) previous events.
Let an event be the occurrence of a base in DNA. Then:
- a 0th-order MM is simply the base frequencies,
- a 1st-order MM defines the conditional probability of occurrence of base b, given that the previous base was a:

r_ab = p(x_i = b | x_{i-1} = a)
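For illustration, a short Python sketch (not from the lecture) that estimates these conditional probabilities from a DNA string by counting neighbouring base pairs:

    # Estimate a 1st-order Markov model of DNA:
    # r_ab = p(x_i = b | x_{i-1} = a), from observed neighbour counts.
    from collections import Counter

    def first_order_mm(seq):
        pair_counts = Counter(zip(seq, seq[1:]))   # counts of pairs (a, b)
        ctx_counts = Counter(seq[:-1])             # counts of the preceding base a
        return {(a, b): n / ctx_counts[a] for (a, b), n in pair_counts.items()}

    r = first_order_mm("ACGCGCGTATACGCGT")
    print(r[("C", "G")])   # estimate of p(x_i = 'G' | x_{i-1} = 'C')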
Hidden Markov Model

A HMM is a model of the probability of events in a Markov chain, where the "current state" is (partially) hidden.

A zeroth-order MM has no "state". For the first-order MM from the previous example, the "state" is the previous base; this is an observable state. A "hidden state" (i.e. one that is not directly evident) may represent information such as: coding/non-coding sequence.
1.3 Profile (knowledge) based recognition

Profile-based recognition

A profile is a notion similar to a PSSM. If a PSSM is built from sequences with characterized features, these features may be recognized in new sequences by similarity.

The advantage of profiles is context-specific comparison; the context is not (necessarily) limited a priori like the length (order) of HMM chains.
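To make this concrete, a minimal Python sketch (not from the lecture) of scoring sequence windows against a toy PSSM; the matrix values are invented for illustration:

    # pssm[i][base] is a log-odds score for observing `base` at profile column i.
    pssm = [
        {"A": 1.2, "C": -0.5, "G": -0.8, "T": -0.5},
        {"A": -0.7, "C": 1.0, "G": -0.7, "T": 0.1},
        {"A": -0.9, "C": -0.9, "G": 1.5, "T": -1.1},
    ]

    def score_window(window, pssm):
        """Sum of per-column scores; higher means a better match to the profile."""
        return sum(col[base] for col, base in zip(pssm, window))

    seq = "TACGGA"
    best = max(range(len(seq) - len(pssm) + 1),
               key=lambda i: score_window(seq[i:i + len(pssm)], pssm))
    print(best, seq[best:best + len(pssm)])   # best-scoring window, here "ACG"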
HMM Profiles

The probabilities of producing a given (observed) output, p_0A, p_0T, p_0G, etc., are called emission probabilities.

The probabilities of a change of the current state, t_0→1 and t_1→0, are called transition probabilities.

Both p and t depend conditionally on the current state.

All the possible hidden states form a complete ensemble of hypotheses; their probabilities can be evaluated using Bayesian statistics on the basis of the observed evidence.
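As an illustration of such an evaluation, a minimal Python sketch (not from the lecture) of the forward algorithm for a toy two-state HMM (hidden states: 0 = non-coding, 1 = coding, as in the earlier example); all probabilities are invented:

    states = (0, 1)
    start = {0: 0.5, 1: 0.5}                              # initial probabilities
    trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}    # t_0→1, t_1→0, ...
    emit = {0: {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # emission probabilities
            1: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

    def forward(seq):
        """Probability of emitting `seq`, summed over all hidden state paths."""
        f = {s: start[s] * emit[s][seq[0]] for s in states}
        for x in seq[1:]:
            f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states)
                 for s in states}
        return sum(f.values())

    print(forward("GCGCGATATA"))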
The model can be taught with a training set to recognize hidden states in new data (supervised learning). It is also possible that the model itself proposes the hypothesis which has maximum likelihood of being true, given the evidence (unsupervised learning).

An example alignment:

1 2     3
W H . . E n
W H . . Y .
W - . . E .
S H . . E .
T H e . Y .
W H e r E .

No arbitrary gap penalties! HMM profiles:
- can be used to construct profiles (PSSM-like) from a provided MSA;
- can be used to search for, and to evaluate, similarity;
- are able to construct an optimal MSA (of the highest likelihood)!

Example software – HMMER: http://hmmer.janelia.org
GPU-HMMER (a speed-up factor of 30-40x on an Nvidia Tesla C1060):
http://www.nvidia.pl/page/hmmer_on_tesla.html
2 Heuristic machine learning
2.1 Neural Networks

An analogy of brains.

[Figure: a simple neural network]
Comments on neural networks

Three or more "layers" of neurons are used in practice. The input function is usually non-linear; commonly used is a sigmoidal function with some (parametrized) sensitivity threshold.

The network must be trained prior to usage (to generate sensible parameters), and therefore some measure of the result quality must be defined a priori.

Quality:
- applicable to a variety of problems;
- there is no statistical model behind the idea;
- success rate 50-90%, depending on the method and the problem at hand.
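For illustration, a minimal Python sketch (not from the lecture) of a forward pass through such a network, with a sigmoidal input function; all weights are invented:

    import math

    def sigmoid(x, threshold=0.0):
        # sigmoidal input function with a (parametrized) sensitivity threshold
        return 1.0 / (1.0 + math.exp(-(x - threshold)))

    def layer(inputs, weights, biases):
        # one layer: weighted sums of the inputs passed through the sigmoid
        return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
                for ws, b in zip(weights, biases)]

    x = [0.2, 0.9]                                # e.g. encoded sequence features
    hidden = layer(x, [[0.5, -0.4], [0.3, 0.8]], [0.1, -0.2])
    output = layer(hidden, [[1.2, -0.7]], [0.0])
    print(output)                                 # activation of the output neuron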
2.2 Genetic Algorithms

- The "solution" of a problem must be encoded in the form of a "gene" which can be copied fragment-wise;
- an optimality measure of the "solution" must be defined;
- some way to "mutate" the gene-encoded solution must be implemented.

The idea is to have a population of solutions which undergo mutations and "mating" (crossing-over), and to select the best ones from generation to generation (see the sketch after this list):

1. A population of initial solutions is generated;
2. the population undergoes random mutations;
3. the population undergoes "mating", that is, generation of new "genes" by combining the existing ones;
4. a subset of the best "genes" (solutions) is selected as a "new generation";
5. the process is reiterated from step 2.
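A minimal Python sketch of this loop (not from the lecture), with an invented toy fitness function (the number of 1-bits in a 20-bit "gene"):

    import random

    def fitness(gene):                 # the optimality measure (a toy example)
        return sum(gene)

    def mutate(gene, rate=0.05):       # step 2: random point mutations
        return [b ^ (random.random() < rate) for b in gene]

    def mate(a, b):                    # step 3: crossing-over at a random point
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]  # step 1
    for generation in range(50):       # step 5: reiterate
        pop = [mutate(g) for g in pop]
        pop += [mate(*random.sample(pop, 2)) for _ in range(30)]
        pop = sorted(pop, key=fitness, reverse=True)[:30]  # step 4: selection
    print(fitness(pop[0]))             # quality of the best solution found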
3 Applications

3.1 Performance in practice

Recognition of sequence features:
- protein motifs, secondary structures, binding/active sites, domains;
- coding sequences in genomes, gene functional sites, introns/exons, etc.
Recognition of structure (fold).
Recognition of function.

Fold recognition is based on sequence composition and ordering, motivated by the notion that structure is evolutionarily more conserved than sequence. It is:
- reliable for well recognizable patterns (e.g. coils);
- less reliable for secondary structures;
- not too good for the prediction of local features (single important residues);
- worse than comparative methods when highly similar templates are available.

One approach to fold recognition is based on secondary structure prediction and comparison. This subclass of methods rests on the observation that secondary structure similarity can exceed 80% for sequences that exhibit less than 10% sequence similarity: one-dimensional (sequence) features are predicted, and a similar fold is identified by comparing these features to those of known folds. Such methods are only as good as the underlying secondary structure prediction method.

Accuracy of secondary structure predictions: 60% (1990s), 76% (2004), ?% (current).
3.2 Classification of methods

Methods for prediction of sequence features (secondary structures, domains, binding sites etc.):
- comparative (PSSM and other profiles);
- composition based (HMM, neural networks).

Methods for prediction of tertiary structure (which can also be regarded as a "sequence feature"):
- template based (profiles/alignments);
- fold recognition;
- threading;
- fragment-based methods;
- ab initio (methods that do not use database information); these assume the Anfinsen thermodynamic hypothesis (1973) of folding to the global free energy minimum.
Comparative methods fail when no templates of sufficient similarity are available; the quality of predictions decays with decreasing similarity to annotated templates.

Threading methods score how well the sequence fits a particular fold. They may be comparative (based on alignment of the sequence to known template structures), or similar to fold recognition methods but with a scoring function based on predicted structural rather than sequence features.

Threading-based methods can be quite expensive: globally optimal protein threading is known to be NP-hard. Some threading methods ignore pairwise interactions between residues; in this case, the threading problem can be solved with dynamic programming (see the sketch below), although the use of pairwise residue potentials is more accurate.
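To illustrate the dynamic programming case, a minimal Python sketch (not from the lecture): without pairwise terms, threading reduces to an alignment-like recursion of the sequence against per-position environment scores of the fold. All scores here are invented:

    # env[j][res] scores placing residue `res` at structural position j of the fold.
    env = [{"A": 1.0, "L": 2.0, "K": -1.0},   # buried position: prefers hydrophobic
           {"A": 0.0, "L": -1.0, "K": 2.0},   # exposed position: prefers charged
           {"A": 1.0, "L": 1.5, "K": -0.5}]

    def thread(seq, env, gap=-2.0):
        n, m = len(seq), len(env)
        S = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1): S[i][0] = i * gap   # unplaced residues
        for j in range(1, m + 1): S[0][j] = j * gap   # unfilled positions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                S[i][j] = max(S[i-1][j-1] + env[j-1].get(seq[i-1], 0.0),
                              S[i-1][j] + gap,
                              S[i][j-1] + gap)
        return S[n][m]

    print(thread("LKA", env))   # score of threading the sequence LKA onto the fold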
Fragment-based methods use conformer (rotamer) libraries of short peptide fragments to limit the conformational search space. The structure scoring function may be physics-based or knowledge-based. The most prominent example is the Rosetta algorithm.
Project co-financed by European Union within European Social Fund