Lecture 8: Methods of pattern recognition
for identification of sequence features, secondary structures etc.
Bioinformatics
Paweł Kędzierski
Contents of the lecture
1 Statistical methods
  1.1 Bayesian statistics
  1.2 Hidden Markov Models
  1.3 Profile (knowledge) based recognition
2 Heuristic machine learning
  2.1 Neural Networks
  2.2 Genetic Algorithms
3 Applications
  3.1 Performance in practice
  3.2 Classification of methods

Introduction
The purpose:
• recognition of secondary structures,
• recognition of tertiary structures,
• recognition of functional residues, motifs, domains etc.
The methods of recognition of sequence features (which are expressed in the form of some patterns in the sequence) can be divided into two groups:
• Knowledge-based methods. The recognition takes into account the knowledge of patterns in characterized sequences and their similarity to the uncharacterized ones.
• Machine-learning methods. These may be based on probability theory or on heuristics.

1 Statistical methods

1.1 Bayesian statistics

Bayesian interpretation of probability
• Thomas Bayes (1702-1761) interpreted probability as a "state of knowledge", allowing the application of probability theory in cases when the true probabilities are unknown a priori, or are inaccurate.
• This interpretation is valuable in cases when the rule governing the probability of some event is unknown or uncertain, and we want to build up or improve our knowledge with observed evidence.
• Bayesian statistics provide a sound scientific basis for many machine-learning methods.
Further reading:
http://en.wikipedia.org/wiki/Bayesian_probability
http://en.wikipedia.org/wiki/Bayesian_inference

Bayes' theorem:
p(H|E) = p(E|H) p(H) / p(E)
where
• E is an observation (a new piece of evidence);
• H is some hypothesis, and p(H) is the prior probability of this hypothesis (the probability assumed before the observation of E);
• p(E|H) is the conditional probability of E assuming that H holds, i.e. the probability of the event E resulting from the hypothesis H;
• p(E) is the a priori probability of observing E under all possible, mutually exclusive hypotheses:
  p(E) = Σ_i p(E|H_i) p(H_i)
• p(H|E) is the (estimated) probability that H is true in the light of E.
• To make use of Bayes' theorem, we must know the probabilities from all the alternative hypotheses: p(E) = Σ_i p(E|H_i) p(H_i).
Assume we are drawing balls from a box:
• The first ball turned out to be white. On the basis of this observation we may construct a hypothesis: "the box contains only white balls", i.e. p_W = 1.0.
• The second ball turned out to be black. The initial hypothesis must be revised (it is proven false by the new evidence). We may propose a new hypothesis – for example p_W = p_B = 0.5 (but the "experimental support" is still very weak).
• If after many iterations the n_W/n_B ratio is ≈ 1, this new hypothesis is well supported.
• In the previous example our hypothesis H stated "p_W = p_B = 0.5"; but if we are wrong (not H), there is no way to evaluate p(E|not H).
Another example:
• There are two boxes, #1 and #2. The first contains 30% white (W) and 70% black (B) balls, and the second 70% white and 30% black. The boxes look identical.
• We got one of the boxes and we pulled a single white ball out. What is the probability that we pulled it from box #1?
• H_1 – the box tested is #1, H_2 – it is #2.
• Since both boxes look identical, we assume that initially (without any additional evidence) p(H_1) = p(H_2) = 0.5.
• Our experimental evidence is a white ball: E = W.
• p(W|H_1) means the probability of choosing a white ball assuming we draw out of box #1; it equals 0.3.
Applying Bayes' theorem:
p(H_1|W) = p(W|H_1) p(H_1) / [p(W|H_1) p(H_1) + p(W|H_2) p(H_2)] = (0.3 · 0.5) / (0.3 · 0.5 + 0.7 · 0.5) = 0.3
p(H_2|W) = 0.7
Repeating the draws and applying Bayes' theorem after each observation updates the probabilities of the two hypotheses, for example:

E       –     W     W     W     W     W     W     B    ...     B     B
ΣW      0     1     2     3     4     5     6     6    ...     6     6
ΣB      0     0     0     0     0     0     0     1    ...    13    14
p(#1) 0.500 0.300 0.155 0.073 0.033 0.014 0.006 0.014  ...  0.997 0.999
p(#2) 0.500 0.700 0.855 0.927 0.967 0.986 0.994 0.986  ...  0.003 0.001

E       –     W     B     W    ...     B     B     B     B     B     W
ΣW      0     1     1     2    ...     7     7     7     7     7     8
ΣB      0     0     1     1    ...     8     9    10    11    12    12
p(#1) 0.500 0.300 0.500 0.300  ...  0.700 0.845 0.927 0.967 0.986 0.967
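As an illustration, a minimal sketch of this iterative updating; the particular draw sequence and variable names below are arbitrary examples, not values from the tables:

```python
# Minimal sketch of iterative Bayesian updating for the two-box example.
# p_white maps each hypothesis (box) to its probability of yielding a white ball.
p_white = {"#1": 0.3, "#2": 0.7}
prior = {"#1": 0.5, "#2": 0.5}

def update(prior, draw):
    """Apply Bayes' theorem once: p(H|E) = p(E|H) p(H) / sum_i p(E|H_i) p(H_i)."""
    likelihood = {h: (p_white[h] if draw == "W" else 1.0 - p_white[h]) for h in prior}
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / evidence for h in prior}

posterior = dict(prior)
for draw in "WWWWWWB":          # an illustrative sequence of draws
    posterior = update(posterior, draw)
    print(draw, {h: round(p, 3) for h, p in posterior.items()})
```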
Bayesian statistics and bioinformatics
Example 1:
• Data: frequencies of residues, probabilities of mutations, an MSA.
• Hypotheses: phylogenetic trees.
This is equivalent to tree likelihood evaluation.
Example 2:
• Data: dependencies of residue frequencies on protein secondary structure, a sequence.
• Hypotheses:
  – a particular sequence (fragment) forms an α-helix, or
  – it forms a β-sheet, or
  – it is unstructured.
Example 3:
• Data: sequences with experimentally characterized secondary structures.
• Hypotheses:
  – residue X is part of an α-helix, or
  – it is part of a β-sheet, or
  – it is part of an unstructured fragment.
• Result: probabilities of occurrence of X within a specific secondary structure.
This is an example of a machine learning model.
1.2 Hidden Markov Models

What is a Markov model?
• A Markov Chain is a chain of events whose probabilities depend only on the current state (of the process which generates these events).
• A zeroth-order Markov Model is a model describing the probabilities of independent events.
• A first-order Markov Model is a model of the probability of events that depend on the (chain of length 1 of) previous events.
Let an event be the occurrence of a base in DNA. Then,
• a 0th-order MM is simply the base frequencies,
• a 1st-order MM defines the conditional probability of occurrence of base b, given that the previous base was a:
  r_ab = p(x_i = b | x_{i-1} = a)
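As a small illustration, the transition probabilities r_ab can be estimated from dinucleotide counts in a training sequence; the sketch below assumes an arbitrary toy sequence and a simple pseudocount:

```python
def first_order_mm(seq, alphabet="ACGT", pseudocount=1.0):
    """Estimate r_ab = p(x_i = b | x_{i-1} = a) from dinucleotide counts."""
    counts = {a: {b: pseudocount for b in alphabet} for a in alphabet}
    for a, b in zip(seq, seq[1:]):        # all overlapping dinucleotides
        counts[a][b] += 1
    r = {}
    for a in alphabet:
        total = sum(counts[a].values())
        r[a] = {b: counts[a][b] / total for b in alphabet}
    return r

# Example: transition probabilities estimated from a toy training sequence.
r = first_order_mm("ACGTACGGGTACCCGTAACGT")
print(r["A"]["C"])   # p(next base is C | current base is A)
```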
Hidden Markov Model
A HMM is a model of the probability of events in a Markov chain where the "current state" is (partially) hidden.
• A zeroth-order MM has no "state".
• For the first-order MM from the previous example, the "state" is the previous base. This is an observable state.
• A "hidden state" (i.e. one that is not directly evident) may represent information such as coding/non-coding sequence (a small decoding sketch follows below).
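A minimal sketch of such a two-state HMM, with illustrative (assumed, not lecture-given) emission and transition probabilities, and Viterbi decoding of the most probable hidden-state path:

```python
import math

# Hidden states and illustrative parameters (emission and transition probabilities).
states = ("coding", "noncoding")
start = {"coding": 0.5, "noncoding": 0.5}
trans = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit = {"coding":    {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
        "noncoding": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def viterbi(seq):
    """Return the most probable sequence of hidden states for the observed bases."""
    v = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = [{}]
    for x in seq[1:]:
        v.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: v[-2][p] + math.log(trans[p][s]))
            v[-1][s] = v[-2][best_prev] + math.log(trans[best_prev][s]) + math.log(emit[s][x])
            back[-1][s] = best_prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: v[-1][s])]
    for i in range(len(seq) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi("GGCGCGATATAT"))
```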
1.3 Profile (knowledge) based recognition

Profile-based recognition
• A profile is a notion similar to a PSSM.
• If a profile (PSSM) is built from sequences with characterized features, those features may be recognized in new sequences by similarity (a minimal scoring sketch follows this list).
• The advantage of profiles is context-specific comparison.
• The context is not (necessarily) limited a priori, unlike the length (order) of HMM chains.
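A minimal sketch of profile-based recognition with a toy PSSM: slide the profile along a sequence and report the best-scoring window. The scores and the sequence below are illustrative assumptions:

```python
# A toy PSSM: one dict of residue scores (e.g. log-odds) per profile position.
pssm = [
    {"A": 1.2, "C": -0.5, "G": -0.5, "T": -0.8},
    {"A": -0.7, "C": 1.0, "G": -0.2, "T": -0.6},
    {"A": -0.3, "C": -0.3, "G": 1.1, "T": -0.9},
]

def score_window(window):
    """Sum the position-specific scores of the residues in the window."""
    return sum(column[res] for column, res in zip(pssm, window))

def best_match(seq):
    """Return (best score, start position) of the profile along the sequence."""
    scores = [(score_window(seq[i:i + len(pssm)]), i)
              for i in range(len(seq) - len(pssm) + 1)]
    return max(scores)

print(best_match("TTACGGACGT"))
```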
HMM Profiles
• The probabilities of producing a given (observed) output, p_0A, p_0T, p_0G, etc., are called emission probabilities;
• the probabilities of a change of the current state, t_0→1 and t_1→0, are called transition probabilities;
• both p and t depend conditionally on the current state;
• all the possible hidden states form a complete ensemble of hypotheses; their probabilities can be evaluated using Bayesian statistics on the basis of the observed evidence.
• The model can be taught with a training set to recognize hidden states in new data (supervised learning).
An example alignment (of the words WHEN, WHY, WE, SHE, THEY and WHERE):
  1 2 3
  W H . . E n
  W H . . Y .
  W - . . E .
  S H . . E .
  T H e . Y .
  W H e r E .
• No arbitrary gap penalties!
• Can be used to construct profiles (PSSM-like) from a provided MSA.
• Can be used to search for and to evaluate similarity.
• Able to construct an optimal MSA (of the highest likelihood)!
• It is also possible that the model itself proposes which hypothesis has the maximum likelihood of being true, given the evidence (unsupervised learning).
Example software – HMMER: http://hmmer.janelia.org
GPU-HMMER (a speed-up factor of 30-40x on an Nvidia Tesla C1060): http://www.nvidia.pl/page/hmmer_on_tesla.html
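Using the toy alignment above, per-column letter frequencies – the raw material for the emission probabilities of a profile – can be counted as in the sketch below; the pseudocount is an illustrative choice, and real tools such as HMMER additionally model insert and delete states:

```python
from collections import Counter

# The toy alignment from above; '.' and '-' are treated as gaps.
alignment = ["WH..En", "WH..Y.", "W-..E.", "SH..E.", "THe.Y.", "WHerE."]

def column_emissions(alignment, pseudocount=1.0):
    """Per-column letter frequencies, the raw material for emission probabilities."""
    alphabet = sorted(set("".join(alignment).upper()) - set(".-"))
    emissions = []
    for i in range(len(alignment[0])):
        letters = [row[i].upper() for row in alignment if row[i] not in ".-"]
        counts = Counter(letters)
        total = len(letters) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return emissions

for i, column in enumerate(column_emissions(alignment), start=1):
    print(i, {a: round(p, 2) for a, p in column.items()})
```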
2 Heuristic machine learning

2.1 Neural Networks
An analogy of brains
Simple neural network
Comments on neural networks
Quality:
• Applicable to a variety of problems.
• Three or more "layers" of neurons are used in practice.
• The neuron transfer function is usually non-linear; commonly used is a sigmoidal function with some (parametrized) sensitivity threshold (see the sketch after this list).
• The network must be trained prior to usage (to generate sensible parameters), and therefore some measure of the result quality must be defined a priori.
• There is no statistical model behind the idea.
• Success rate is 50-90%, depending on the method and the problem at hand.
2.2 Genetic Algorithms
• The "solution" of a problem must be encoded in the form of a "gene" which can be copied fragment-wise;
• an optimality measure of the "solution" must be defined;
• some way to "mutate" the gene-encoded solution must be implemented;
• the idea is to have a population of solutions which undergo mutations and "mating" (crossing-over), and to select the best ones from generation to generation (see the code sketch after this list).
1. A population of initial solutions is generated;
2. the population undergoes random mutations;
3. the population undergoes "mating" – that is, generation of new "genes" by combining the existing ones;
4. a subset of the best "genes" (solutions) is selected as a "new generation";
5. the process is reiterated from step 2.
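A minimal sketch of this loop for a toy problem (maximizing the number of ones in a bit string); the population size, rates and fitness function are illustrative assumptions:

```python
import random

def fitness(gene):
    """Optimality measure of a 'solution' - here simply the number of ones."""
    return sum(gene)

def mutate(gene, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - bit if random.random() < rate else bit for bit in gene]

def mate(a, b):
    """Crossing-over: combine two 'genes' at a random cut point."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

# 1. Generate a population of initial solutions.
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    mutated = [mutate(g) for g in population]                          # 2. random mutations
    offspring = [mate(*random.sample(mutated, 2)) for _ in range(30)]  # 3. "mating"
    population = sorted(mutated + offspring,
                        key=fitness, reverse=True)[:30]                # 4. selection; 5. reiterate

print(max(fitness(g) for g in population))
```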
3 Applications

3.1 Performance in practice
• Recognition of sequence features:
  – protein motifs, secondary structures, binding/active sites, domains;
  – coding sequences in genomes, gene functional sites, introns/exons, etc.
• Recognition of structure (fold)
• Recognition of function
• Fold recognition is based on sequence composition and ordering.
• It is motivated by the notion that structure is evolutionarily more conserved than sequence.
• Reliable for well recognizable patterns (e.g. coils).
• Less reliable for secondary structures.
• Not too good for the prediction of local features (single important residues).
• Worse than comparative methods when highly similar templates are available.
• One approach to fold recognition is based on secondary structure prediction and comparison.
• This subclass of methods is based on the observation that secondary structure similarity can exceed 80% for sequences that exhibit less than 10% sequence similarity.
• It is only as good as the underlying secondary structure prediction method.
• The approach is based on predicting one-dimensional (sequence) features, and identifying a similar fold by comparing these features to those of known folds.
Accuracy of secondary structure predictions: 60% (1990s), 76% (2004), ?% (current).
3.2 Classification of methods
Methods for prediction of sequence features (secondary structures, domains, binding sites etc.):
• Comparative (PSSM and other profiles)
• Composition based (HMM, neural networks)
Methods for prediction of tertiary structure (which can also be regarded as a "sequence feature"):
• Template based (profiles/alignments)
• Fold recognition
• Threading
• Fragment-based methods
• Ab initio (methods that do not use database information). These assume the Anfinsen thermodynamic hypothesis (1973) of folding to the global free energy minimum.
• Comparative methods fail when no templates of sufficient similarity are available.
• The quality of predictions decays with decreasing similarity to the annotated templates.
• Threading methods score how well the sequence fits a particular fold.
• They may be comparative (based on alignment of the sequence to known template structures) ...
• ... or similar to fold recognition methods, but with a scoring function based on predicted structural rather than sequence features.
• Threading-based methods can be quite expensive: globally optimal protein threading is known to be NP-hard.
• Some threading methods ignore pairwise interactions between residues. In this case the threading problem can be solved with dynamic programming (a minimal sketch follows below); however, the use of pairwise residue potentials is more accurate.
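A minimal sketch of that simplified case: with pairwise terms ignored, threading reduces to aligning a string of residue classes onto a string of structural environments with per-environment scores. The score table, residue/environment classes and gap penalty below are illustrative assumptions:

```python
# Illustrative compatibility scores of residue classes with structural environments:
# 'b' = buried position, 'e' = exposed position (a real method would use a full table).
compat = {("H", "b"): 2.0, ("H", "e"): -1.0,   # hydrophobic residue
          ("P", "b"): -1.0, ("P", "e"): 1.5}   # polar residue
GAP = -2.0

def thread(seq, envs):
    """Global alignment (Needleman-Wunsch style) of a residue-class string
    onto a string of structural environments, ignoring pairwise terms."""
    n, m = len(seq), len(envs)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + compat[(seq[i - 1], envs[j - 1])],
                           dp[i - 1][j] + GAP,
                           dp[i][j - 1] + GAP)
    return dp[n][m]

# Score of threading a hydrophobic/polar pattern onto a buried/exposed template.
print(thread("HHPHPP", "bbebee"))
```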
• Fragment-based methods use conformer (rotamer) libraries of short peptide fragments to limit the conformational search space.
• The structure scoring function may be physics-based or knowledge-based.
• The most prominent example is the Rosetta algorithm.