Algorithms and Data Structures for External Memory [Vitter 2008-06-09].pdf

(907 KB) Pobierz

Foundations and Trends

Theoretical Computer Science

Vol. 2, No. 4 (2006) 305–474

c 2008 J. S. Vitter

DOI: 10.1561/0400000014

Algorithms and Data Structures

for External Memory

Jeﬀrey Scott Vitter

Department of Computer Science, Purdue University, West Lafayette,

Indiana, 47907–2107, USA, jsv@purdue.edu

Abstract

Data sets in large applications are often too massive to ﬁt completely

inside the computer’s internal memory. The resulting input/output

communication (or I/O) between fast internal memory and slower

external memory (such as disks) can be a major performance bottle-

neck. In this manuscript, we survey the state of the art in the design

and analysis of algorithms and data structures for

external memory

(or

for short), where the goal is to exploit locality and parallelism in

order to reduce the I/O costs. We consider a variety of EM paradigms

for solving batched and online problems eﬃciently in external memory.

For the batched problem of sorting and related problems like per-

muting and fast Fourier transform, the key paradigms include distribu-

tion and merging. The paradigm of disk striping oﬀers an elegant way

to use multiple disks in parallel. For sorting, however, disk striping can

be nonoptimal with respect to I/O, so to gain further improvements we

discuss distribution and merging techniques for using the disks inde-

pendently. We also consider useful techniques for batched EM problems

involving matrices, geometric data, and graphs.

In the online domain, canonical EM applications include dictionary

lookup and range searching. The two important classes of indexed

data structures are based upon extendible hashing and B-trees. The

paradigms of ﬁltering and bootstrapping provide convenient means in

online data structures to make eﬀective use of the data accessed from

disk. We also re-examine some of the above EM problems in slightly

diﬀerent settings, such as when the data items are moving, when the

data items are variable-length such as character strings, when the data

structure is compressed to save space, or when the allocated amount of

internal memory can change dynamically.

Programming tools and environments are available for simplifying

the EM programming task. We report on some experiments in the

domain of spatial databases using the TPIE system (Transparent Par-

allel I/O programming Environment). The newly developed EM algo-

rithms and data structures that incorporate the paradigms we discuss

are signiﬁcantly faster than other methods used in practice.

Preface

I ﬁrst became fascinated about the tradeoﬀs between computing and

memory usage while a graduate student at Stanford University. Over

the following years, this theme has inﬂuenced much of what I have

done professionally, not only in the ﬁeld of external memory algorithms,

which this manuscript is about, but also on other topics such as data

compression, data mining, databases, prefetching/caching, and random

sampling.

The reality of the computer world is that no matter how fast com-

puters are and no matter how much data storage they provide, there

will always be a desire and need to push the envelope. The solution is

not to wait for the next generation of computers, but rather to examine

the fundamental constraints in order to understand the limits of what

is possible and to translate that understanding into eﬀective solutions.

In this manuscript you will consider a scenario that arises often

in large computing applications, namely, that the relevant data sets

are simply too massive to ﬁt completely inside the computer’s internal

memory and must instead reside on disk. The resulting input/output

communication (or I/O) between fast internal memory and slower

external memory (such as disks) can be a major performance

307

308

bottleneck. This manuscript provides a detailed overview of the design

and analysis of algorithms and data structures for

external memory

(or simply

), where the goal is to exploit locality and parallelism in

order to reduce the I/O costs. Along the way, you will learn a variety

of EM paradigms for solving batched and online problems eﬃciently.

For the batched problem of sorting and related problems like per-

muting and fast Fourier transform, the two fundamental paradigms

are distribution and merging. The paradigm of disk striping oﬀers an

elegant way to use multiple disks in parallel. For sorting, however,

disk striping can be nonoptimal with respect to I/O, so to gain fur-

ther improvements we discuss distribution and merging techniques for

using the disks independently, including an elegant duality property

that yields state-of-the-art algorithms. You will encounter other useful

techniques for batched EM problems involving matrices (such as matrix

multiplication and transposition), geometric data (such as ﬁnding inter-

sections and constructing convex hulls) and graphs (such as list ranking,

connected components, topological sorting, and shortest paths).

In the online domain, which involves constructing data structures

to answer queries, we discuss two canonical EM search applications:

dictionary lookup and range searching. Two important paradigms

for developing indexed data structures for these problems are hash-

ing (including extendible hashing) and tree-based search (including

B-trees). The paradigms of ﬁltering and bootstrapping provide con-

venient means in online data structures to make eﬀective use of the

data accessed from disk. You will also be exposed to some of the above

EM problems in slightly diﬀerent settings, such as when the data items

are moving, when the data items are variable-length (e.g., strings of

text), when the data structure is compressed to save space, and when

the allocated amount of internal memory can change dynamically.

Programming tools and environments are available for simplifying

the EM programming task. You will see some experimental results in

the domain of spatial databases using the TPIE system, which stands

for Transparent Parallel I/O programming Environment. The newly

developed EM algorithms and data structures that incorporate the

paradigms discussed in this manuscript are signiﬁcantly faster than

other methods used in practice.

Plik z chomika:

musli_com

Algorithms and Data Structures for External Memory [Vitter 2008-06-09].pdf

Plik z chomika:

Inne pliki z tego folderu:

Inne foldery tego chomika: