Artificial Intelligence Techniques for Bioinformatics
A. Narayanan*, E.C. Keedwell* and B. Olsson**
* School of Engineering and Computer Sciences, University of Exeter, Exeter EX4 4QF, UK.
** Department of Computer Science, University of Skövde, Box 408, 54128 Skövde, Sweden.
Abstract
This review article aims to provide an overview of the ways in which techniques from artificial
intelligence can be usefully employed in bioinformatics, both for modelling biological data and
for making new discoveries. The paper covers three techniques: symbolic machine learning
approaches (nearest-neighbour and identification tree techniques); artificial neural networks; and
genetic algorithms. Each technique is introduced and then supported with examples taken from
the bioinformatics literature. These examples include folding prediction, viral protease cleavage
prediction, classification, multiple sequence alignment and microarray gene expression analysis.
Author contact details:
A.Narayanan, hard mail address as above.
A.Narayanan@ex.ac.uk.
Tel: (+UK)1392 264064.
Fax: (+UK)1392 264067.
E.C. Keedwell, hard mail address as above.
E.C.Keedwell@ex.ac.uk.
Tel: same as above.
B. Olsson, hard mail address as above.
bjorn.olsson@ida.his.se.
Tel: (+Sweden) 500-44 83 16.
Fax: (+Sweden) 500-44 83 99.
Keywords:
Artificial intelligence; machine learning; bioinformatics; neural networks; genetic
algorithms; data mining.
Please note that this article has been accepted for publication in Applied Bioinformatics.
1. Introduction
There is growing interest in the application of artificial intelligence (AI) techniques in
bioinformatics. In particular, there is an appreciation that many problems in bioinformatics need to be addressed in new ways, given either the intractability of current approaches or the lack of informed and intelligent ways to exploit biological data. For an instance of the latter,
there is an urgent need to identify new methods for extracting gene and protein networks from the
rapidly proliferating gene expression and proteomic datasets. There is currently very little
understanding of how standard techniques of analysing gene expression and proteomic data, such
as clustering, correlation identification and self organising maps, can contribute directly to the
‘reverse engineering’ of gene networks or metabolic pathways. Instead, such techniques aid in the
first analysis of datasets which are typically very large (thousands of genes can be measured on
one microarray), requiring further biological knowledge from human experts for the identification
of appropriate and relevant causal relationships between two or more genes or proteins. There is a
need to merge biological knowledge with computational techniques for extracting relevant and
appropriate genes from the thousands of genes measured. For an instance of the former,
predicting the way a protein folds from first principles may well be feasible given some
algorithms for protein sequences of 20 or so amino acids, but once the sequences become
biologically plausible (200 or 300 amino acids and more) current protein folding algorithms
which work on first principles rapidly become intractable.
On the other hand, artificial intelligence is an area of computer science that has been around since
the 1950s, specialising in dealing with problems considered intractable by computer scientists
through the use of heuristics and probabilistic approaches. AI approaches excel when dealing
with problems where there is no requirement for ‘the absolutely provably correct or best’ answer
(a ‘strong’ constraint) but where, rather, the requirement is for an answer which is better than one
currently known or which is acceptable within certain defined constraints (a ‘weak’ constraint).
Given that many problems in bioinformatics do not have strong constraints, there is plenty of
scope for the application of AI techniques to a number of bioinformatics problems.
What is interesting is that, despite the apparent suitability of AI techniques for bioinformatics
problems, there is actually very little published on such applications when one considers the vast
and increasing number of papers published in bioinformatics. The aim of this review paper is not
to identify all previous work in bioinformatics that has involved AI techniques. This is because
there will be some AI techniques which have not yet been applied to bioinformatics problems and
a review which covers previous work would therefore be narrow. Instead, the aim of this paper is
to introduce bioinformatics researchers to three particular AI techniques, two of which may be
known (neural networks and symbolic machine learning) and one of which may not be so well
known (genetic algorithms). One of the intriguing aspects of the last technique is the rather
philosophically appealing idea of applying techniques from AI which have been influenced by
developments in our understanding of biological phenomena to biological problems themselves.
The paper consists of three parts. First, classical symbolic machine learning techniques will be
introduced (nearest neighbour and identification tree approaches), including examples of
bioinformatics applications in secondary protein structure prediction and viral protease
cleavability. Next, neural networks will be introduced, with a number of applications described.
Third, genetic algorithms will be covered, with applications in multiple sequence alignment and
RNA folding prediction.
2. Symbolic machine learning
2.1 Nearest neighbour approaches
We first introduce decision trees. A decision tree has the following properties: each leaf node is connected to a set of possible answers; each non-leaf node is connected to a test which splits its set of possible answers into subsets corresponding to different test results; and each branch carries a particular test result’s subset to another node.
2.1.1 Introduction
To see how decision trees are useful for nearest neighbour calculations, consider 8 blocks of known width, height and colour (Winston, 1992). A new block then appears of known size but unknown colour. On the basis of existing information, can we make an informed guess as to what the colour of the new block is? To answer this question, we need to assume a consistency heuristic, as follows: find the most similar case, as measured by known properties, for which the property is known; then guess that the unknown property is the same as the known property. This is the basis of all nearest neighbour calculations. Although such nearest neighbour calculations can be performed by keeping all samples in memory and calculating the nearest neighbour of a new sample only when required, by comparing the new sample with previously stored samples, there are advantages in storing the information about how to calculate nearest neighbours in the form of a decision tree, as will be seen later.
For our example problem above, we first need to calculate, for the 8 blocks of known size and
colour, a decision space using width and height only (since these are the known properties of the
ninth block), where each of the 8 blocks is located as a point within its own unique region of
space. Once the 8 blocks have been assigned a unique region, we then calculate, for the ninth
block of known width and height, a point in the same feature space. Then, depending on what the
colour of its ‘nearest neighbour’ is in the region it occupies, we allocate the colour of the nearest
neighbour to the ninth block. Notice that the problem consisted of asking for an ‘informed guess’,
not a provably correct answer. That is, in many applications it is important to attribute a property
of some sort to an object when the object’s real property is not known. Rather than having to
leave the object out of consideration, an attribution of a property to the object with unknown
property may be useful and desirable, even if it cannot be proved that the attributed property is
the real property. At least, with nearest neighbour calculations, there is some systematicity (the
consistency heuristic) in the way that an unknown property is attributed.
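The consistency heuristic can be sketched as a brute-force nearest-neighbour lookup. The block data below are hypothetical (the actual blocks of Figure 1 are not reproduced in the text), but the procedure is exactly as described: find the closest known sample in width–height space and attribute its colour to the new sample.

```python
import math

# Known blocks: (width, height, colour). Hypothetical data, not Figure 1.
blocks = [
    (1.0, 1.0, "red"),
    (2.0, 2.0, "blue"),
    (4.0, 1.0, "green"),
    (5.0, 2.0, "yellow"),
    (1.0, 5.0, "orange"),
    (3.0, 6.0, "purple"),
    (5.0, 5.0, "red"),
    (6.0, 6.0, "blue"),
]

def nearest_colour(width, height):
    """Guess the unknown colour as that of the nearest known block."""
    best = min(blocks, key=lambda b: math.dist((width, height), (b[0], b[1])))
    return best[2]

print(nearest_colour(1.0, 4.0))  # nearest known block is (1.0, 5.0) -> "orange"
```

Note that every query compares against all stored samples, which is the memory-based approach the text contrasts with the decision-tree approach described next.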
So, for our problem above, we shall divide up the 8 blocks in advance of nearest neighbour
calculation (i.e. before we calculate the nearest neighbour of the ninth block). To do this, we
divide the 8 blocks by height followed by width, then height, width ... until only one block
remains in each set. We ensure that we divide so that an equal number of cases falls on either
side. The eight blocks with known colour are first placed on the feature space, using their width
and height measures as co-ordinates (Figure 1 (a)). The ninth object (width 1, height 4) is also
located in this feature space, but it may not be clear what its nearest neighbour is (i.e. which
region of space it occupies), that is, the object could be orange or red. To decide which, the 8
blocks of known size are first divided into two equal subsets using, say, the height attribute. The
tallest of the shorter subset has height 2 and the shortest of the taller subset has height 5. The
midpoint 3.5 is therefore chosen as the dividing line. The next step is to use the second attribute,
width, to separate each of the two subsets from Figure 1(b) into further subsets (Figure 1 (c)). For
the shorter subset, the wider of the two narrow blocks is 2 and the narrower of the two wide
blocks is 4. The midpoint is therefore 3. For the taller subset, a similar form of reasoning leads to
the midpoint being 3.5. Note that the two subsets have different midpoints. Since each block does
not occupy its own region of space yet, we return to height (since there are only two known
attributes for each object) and split each of the four subsets once more (Figure 1 (d)). For the two
taller subsets, the midpoints are both, coincidentally, 5.5 (between 5 and 6). For the two shorter
subsets, the midpoints are both, coincidentally, 1.5. Each block now has its own region of space.
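The alternating median-split procedure above can be sketched as a short recursive function. The samples here are hypothetical stand-ins for the 8 blocks (Figure 1 is not reproduced), and the dividing midpoint at each split is taken halfway between the closest samples on either side, as in the text:

```python
# Recursively partition (width, height) samples by height, then width,
# then height, ... until each region holds a single sample.

def split(samples, axis=1):
    """Return a nested split structure; axis 0 = width, axis 1 = height."""
    if len(samples) <= 1:
        return samples                    # one sample per region: done
    ordered = sorted(samples, key=lambda s: s[axis])
    half = len(ordered) // 2              # equal numbers on either side
    lo, hi = ordered[:half], ordered[half:]
    midpoint = (lo[-1][axis] + hi[0][axis]) / 2.0
    next_axis = 1 - axis                  # alternate between height and width
    return (axis, midpoint, split(lo, next_axis), split(hi, next_axis))

samples = [(1, 1), (2, 2), (4, 1), (5, 2), (1, 5), (3, 6), (5, 5), (6, 6)]
tree = split(samples)
print(tree[0], tree[1])  # first split is on height (axis 1) at midpoint 3.5
```

With these particular samples the first split falls at height 3.5, as in the worked example, though the later midpoints depend on the (hypothetical) data chosen.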
Once we have divided up the cases, we can then generate a decision tree, using the mid-points
discovered as test nodes (Figure 1 (e)) in the order in which they were found. Once we have the
tree, we can then trace a path down the tree for the ninth block, following the appropriate paths
depending on the outcome of each test, and allocate a colour to this block (orange). While this
conclusion may have been reached through a visual inspection of the feature space alone (Figure
1 (a)), in most cases we will be dealing with a multi-dimensional feature space, so some
systematic method for calculating nearest neighbours will be required which takes into account
many more than two dimensions. Also, nearest neighbour techniques provide only
approximations as to what the missing values may be. Any data entered into a database system, or
any inferences drawn from data obtained from nearest neighbour calculations, should always be
flagged to ensure that these approximations are not taken to have the same status as ‘facts’. New
information may come in later which requires these approximations to be deleted and replaced
with real data.
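Tracing the ninth block down the tree can be sketched as follows. The internal nodes use the midpoints found in the worked example (height 3.5; then widths 3 and 3.5; then heights 1.5 and 5.5); the leaf colours are hypothetical except for the region the ninth block reaches, which the text identifies as orange:

```python
# Each internal node: (attribute, midpoint, below_subtree, above_subtree),
# where attribute 0 = width and 1 = height; leaves are colour strings.
# Leaf colours are hypothetical except "orange" for the ninth block's region.
tree = (1, 3.5,
        (0, 3.0, (1, 1.5, "red", "blue"), (1, 1.5, "green", "yellow")),
        (0, 3.5, (1, 5.5, "orange", "purple"), (1, 5.5, "red", "blue")))

def classify(node, sample):
    """Follow each test's outcome down the tree until a leaf colour is reached."""
    while isinstance(node, tuple):
        attribute, midpoint, below, above = node
        node = below if sample[attribute] <= midpoint else above
    return node

print(classify(tree, (1, 4)))  # ninth block (width 1, height 4) -> "orange"
```

Once the tree is built, classifying a new sample costs only one test per level rather than a comparison against every stored sample.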