Artificial Intelligence Techniques for Bioinformatics
A. Narayanan*, E.C. Keedwell* and B. Olsson**
* School of Engineering and Computer Sciences, University of Exeter, Exeter EX4 4QF, UK.
** Department of Computer Science, University of Skövde, Box 408, 54128 Skövde, Sweden.
Abstract
This review article aims to provide an overview of the ways in which techniques from artificial
intelligence can be usefully employed in bioinformatics, both for modelling biological data and
for making new discoveries. The paper covers three techniques: symbolic machine learning
approaches (nearest-neighbour and identification tree techniques); artificial neural networks; and
genetic algorithms. Each technique is introduced and then supported with examples taken from
the bioinformatics literature. These examples include folding prediction, viral protease cleavage
prediction, classification, multiple sequence alignment and microarray gene expression analysis.
Author contact details:
A.Narayanan, hard mail address as above.
A.Narayanan@ex.ac.uk.
Tel: (+UK)1392 264064.
Fax: (+UK)1392 264067.
E.C. Keedwell, hard mail address as above.
E.C.Keedwell@ex.ac.uk.
Tel: same as above.
B. Olsson, hard mail address as above.
bjorn.olsson@ida.his.se.
Tel: (+Sweden) 500-44 83 16.
Fax: (+Sweden) 500-44 83 99.
Keywords:
Artificial intelligence; machine learning; bioinformatics; neural networks; genetic
algorithms; data mining.
Please note that this article has been accepted for publication in Applied Bioinformatics.
1. Introduction
There is growing interest in the application of artificial intelligence (AI) techniques in
bioinformatics. In particular, there is an appreciation that many problems in bioinformatics need to be addressed in new ways, given either the intractability of current approaches or the lack of informed and intelligent ways to exploit biological data. For an instance of the latter,
there is an urgent need to identify new methods for extracting gene and protein networks from the
rapidly proliferating gene expression and proteomic datasets. There is currently very little
understanding of how standard techniques of analysing gene expression and proteomic data, such
as clustering, correlation identification and self organising maps, can contribute directly to the
‘reverse engineering’ of gene networks or metabolic pathways. Instead, such techniques aid in the
first analysis of datasets which are typically very large (thousands of genes can be measured on
one microarray), requiring further biological knowledge from human experts for the identification
of appropriate and relevant causal relationships between two or more genes or proteins. There is a
need to merge biological knowledge with computational techniques for extracting relevant and
appropriate genes from the thousands of genes measured. For an instance of the former,
predicting the way a protein folds from first principles may well be feasible given some
algorithms for protein sequences of 20 or so amino acids, but once the sequences become
biologically plausible (200 or 300 amino acids and more) current protein folding algorithms
which work on first principles rapidly become intractable.
On the other hand, artificial intelligence is an area of computer science that has been around since
the 1950s, specialising in dealing with problems considered intractable by computer scientists
through the use of heuristics and probabilistic approaches. AI approaches excel when dealing
with problems where there is no requirement for ‘the absolutely provably correct or best’ answer
(a ‘strong’ constraint) but where, rather, the requirement is for an answer which is better than one
currently known or which is acceptable within certain defined constraints (a ‘weak’ constraint).
Given that many problems in bioinformatics do not have strong constraints, there is plenty of
scope for the application of AI techniques to a number of bioinformatics problems.
What is interesting is that, despite the apparent suitability of AI techniques for bioinformatics
problems, there is actually very little published on such applications when one considers the vast
and increasing number of papers published in bioinformatics. The aim of this review paper is not
to identify all previous work in bioinformatics that has involved AI techniques. This is because
there will be some AI techniques which have not yet been applied to bioinformatics problems and
a review which covers previous work would therefore be narrow. Instead, the aim of this paper is
to introduce bioinformatics researchers to three particular AI techniques, two of which may be
known (neural networks and symbolic machine learning) and one of which may not be so well
known (genetic algorithms). One of the intriguing aspects of the last technique is the rather
philosophically appealing idea of applying techniques from AI which have been influenced by
developments in our understanding of biological phenomena to biological problems themselves.
The paper consists of three parts. First, classical symbolic machine learning techniques will be
introduced (nearest neighbour and identification tree approaches), including examples of
bioinformatics applications in secondary protein structure prediction and viral protease
cleavability. Next, neural networks will be introduced, with a number of applications described.
Third, genetic algorithms will be covered, with applications in multiple sequence alignment and
RNA folding prediction.
2. Symbolic machine learning
2.1 Nearest neighbour approaches
We first introduce decision trees. A decision tree has the following properties: each leaf node is connected to a set of possible answers; each non-leaf node is connected to a test which splits its set of possible answers into subsets corresponding to different test results; and each branch carries a particular test result’s subset to another node.
2.1.1 Introduction
To see how decision trees are useful for nearest neighbour calculations, consider 8 blocks of known width, height and colour (Winston, 1992). A new block then appears of known size but unknown colour. On the basis of existing information, can we make an informed guess as to what the colour of the new block is? To answer this question, we need to assume a consistency heuristic, as follows: find the most similar case, as measured by known properties, for which the property is known; then guess that the unknown property is the same as the known property. This is the basis of all nearest neighbour calculations. Although such nearest neighbour calculations can be performed by keeping all samples in memory and calculating the nearest neighbour of a new sample only when required, by comparing the new sample with previously stored samples, there are advantages in storing the information about how to calculate nearest neighbours in the form of a decision tree, as will be seen later.
For our example problem above, we first need to calculate, for the 8 blocks of known size and
colour, a decision space using width and height only (since these are the known properties of the
ninth block), where each of the 8 blocks is located as a point within its own unique region of
space. Once the 8 blocks have been assigned a unique region, we then calculate, for the ninth
block of known width and height, a point in the same feature space. Then, depending on what the
colour of its ‘nearest neighbour’ is in the region it occupies, we allocate the colour of the nearest
neighbour to the ninth block. Notice that the problem consisted of asking for an ‘informed guess’,
not a provably correct answer. That is, in many applications it is important to attribute a property
of some sort to an object when the object’s real property is not known. Rather than having to
leave the object out of consideration, an attribution of a property to the object with unknown
property may be useful and desirable, even if it cannot be proved that the attributed property is
the real property. At least, with nearest neighbour calculations, there is some systematicity (the
consistency heuristic) in the way that an unknown property is attributed.
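The consistency heuristic can be sketched as a brute-force nearest-neighbour lookup. The block data below are hypothetical (the actual blocks of Figure 1 are not reproduced in the text), but the procedure is exactly as described: find the closest known sample in width–height space and attribute its colour to the new sample.

```python
import math

# Known blocks: (width, height, colour). Hypothetical data, not Figure 1.
blocks = [
    (1.0, 1.0, "red"),
    (2.0, 2.0, "blue"),
    (4.0, 1.0, "green"),
    (5.0, 2.0, "yellow"),
    (1.0, 5.0, "orange"),
    (3.0, 6.0, "purple"),
    (5.0, 5.0, "red"),
    (6.0, 6.0, "blue"),
]

def nearest_colour(width, height):
    """Guess the unknown colour as that of the nearest known block."""
    best = min(blocks, key=lambda b: math.dist((width, height), (b[0], b[1])))
    return best[2]

print(nearest_colour(1.0, 4.0))  # nearest known block is (1.0, 5.0) -> "orange"
```

Note that every query compares against all stored samples, which is the memory-based approach the text contrasts with the decision-tree approach described next.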
So, for our problem above, we shall divide up the 8 blocks in advance of nearest neighbour
calculation (i.e. before we calculate the nearest neighbour of the ninth block). To do this, we
divide the 8 blocks by height followed by width, then height, width ... until only one block
remains in each set. We ensure that we divide so that an equal number of cases falls on either
side. The eight blocks with known colour are first placed on the feature space, using their width
and height measures as co-ordinates (Figure 1 (a)). The ninth object (width 1, height 4) is also
located in this feature space, but it may not be clear what its nearest neighbour is (i.e. which
region of space it occupies), that is, the object could be orange or red. To decide which, the 8
blocks of known size are first divided into two equal subsets using, say, the height attribute. The
tallest of the shorter subset has height 2 and the shortest of the taller subset has height 5. The
midpoint 3.5 is therefore chosen as the dividing line. The next step is to use the second attribute,
width, to separate each of the two subsets from Figure 1(b) into further subsets (Figure 1 (c)). For
the shorter subset, the wider of the two narrow blocks is 2 and the narrower of the two wide
blocks is 4. The midpoint is therefore 3. For the taller subset, a similar form of reasoning leads to
the midpoint being 3.5. Note that the two subsets have different midpoints. Since each block does
not occupy its own region of space yet, we return to height (since there are only two known
attributes for each object) and split each of the four subsets once more (Figure 1 (d)). For the two
taller subsets, the midpoints are both, coincidentally, 5.5 (between 5 and 6). For the two shorter
subsets, the midpoints are both, coincidentally, 1.5. Each block now has its own region of space.
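The alternating median-split procedure above can be sketched as a short recursive function. The samples here are hypothetical stand-ins for the 8 blocks (Figure 1 is not reproduced), and the dividing midpoint at each split is taken halfway between the closest samples on either side, as in the text:

```python
# Recursively partition (width, height) samples by height, then width,
# then height, ... until each region holds a single sample.

def split(samples, axis=1):
    """Return a nested split structure; axis 0 = width, axis 1 = height."""
    if len(samples) <= 1:
        return samples                    # one sample per region: done
    ordered = sorted(samples, key=lambda s: s[axis])
    half = len(ordered) // 2              # equal numbers on either side
    lo, hi = ordered[:half], ordered[half:]
    midpoint = (lo[-1][axis] + hi[0][axis]) / 2.0
    next_axis = 1 - axis                  # alternate between height and width
    return (axis, midpoint, split(lo, next_axis), split(hi, next_axis))

samples = [(1, 1), (2, 2), (4, 1), (5, 2), (1, 5), (3, 6), (5, 5), (6, 6)]
tree = split(samples)
print(tree[0], tree[1])  # first split is on height (axis 1) at midpoint 3.5
```

With these particular samples the first split falls at height 3.5, as in the worked example, though the later midpoints depend on the (hypothetical) data chosen.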
Once we have divided up the cases, we can then generate a decision tree, using the mid-points
discovered as test nodes (Figure 1 (e)) in the order in which they were found. Once we have the
tree, we can then trace a path down the tree for the ninth block, following the appropriate paths
depending on the outcome of each test, and allocate a colour to this block (orange). While this
conclusion may have been reached through a visual inspection of the feature space alone (Figure
1 (a)), in most cases we will be dealing with a multi-dimensional feature space, so some
systematic method for calculating nearest neighbours will be required which takes into account
many more than two dimensions. Also, nearest neighbour techniques provide only
approximations as to what the missing values may be. Any data entered into a database system, or
any inferences drawn from data obtained from nearest neighbour calculations, should always be
flagged to ensure that these approximations are not taken to have the same status as ‘facts’. New
information may come in later which requires these approximations to be deleted and replaced
with real data.
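Tracing the ninth block down the tree can be sketched as follows. The internal nodes use the midpoints found in the worked example (height 3.5; then widths 3 and 3.5; then heights 1.5 and 5.5); the leaf colours are hypothetical except for the region the ninth block reaches, which the text identifies as orange:

```python
# Each internal node: (attribute, midpoint, below_subtree, above_subtree),
# where attribute 0 = width and 1 = height; leaves are colour strings.
# Leaf colours are hypothetical except "orange" for the ninth block's region.
tree = (1, 3.5,
        (0, 3.0, (1, 1.5, "red", "blue"), (1, 1.5, "green", "yellow")),
        (0, 3.5, (1, 5.5, "orange", "purple"), (1, 5.5, "red", "blue")))

def classify(node, sample):
    """Follow each test's outcome down the tree until a leaf colour is reached."""
    while isinstance(node, tuple):
        attribute, midpoint, below, above = node
        node = below if sample[attribute] <= midpoint else above
    return node

print(classify(tree, (1, 4)))  # ninth block (width 1, height 4) -> "orange"
```

Once the tree is built, classifying a new sample costs only one test per level rather than a comparison against every stored sample.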