Algorithms and Data Structures for External Memory [Vitter 2008-06-09].pdf

(907 KB) Pobierz
Foundations and Trends
R
in
Theoretical Computer Science
Vol. 2, No. 4 (2006) 305–474
c 2008 J. S. Vitter
DOI: 10.1561/0400000014
Algorithms and Data Structures
for External Memory
Jeffrey Scott Vitter
Department of Computer Science, Purdue University, West Lafayette,
Indiana, 47907–2107, USA, jsv@purdue.edu
Abstract
Data sets in large applications are often too massive to fit completely
inside the computer’s internal memory. The resulting input/output
communication (or I/O) between fast internal memory and slower
external memory (such as disks) can be a major performance bottle-
neck. In this manuscript, we survey the state of the art in the design
and analysis of algorithms and data structures for
external memory
(or
EM
for short), where the goal is to exploit locality and parallelism in
order to reduce the I/O costs. We consider a variety of EM paradigms
for solving batched and online problems efficiently in external memory.
For the batched problem of sorting and related problems like per-
muting and fast Fourier transform, the key paradigms include distribu-
tion and merging. The paradigm of disk striping offers an elegant way
to use multiple disks in parallel. For sorting, however, disk striping can
be nonoptimal with respect to I/O, so to gain further improvements we
discuss distribution and merging techniques for using the disks inde-
pendently. We also consider useful techniques for batched EM problems
involving matrices, geometric data, and graphs.
In the online domain, canonical EM applications include dictionary
lookup and range searching. The two important classes of indexed
data structures are based upon extendible hashing and B-trees. The
paradigms of filtering and bootstrapping provide convenient means in
online data structures to make effective use of the data accessed from
disk. We also re-examine some of the above EM problems in slightly
different settings, such as when the data items are moving, when the
data items are variable-length such as character strings, when the data
structure is compressed to save space, or when the allocated amount of
internal memory can change dynamically.
Programming tools and environments are available for simplifying
the EM programming task. We report on some experiments in the
domain of spatial databases using the TPIE system (Transparent Par-
allel I/O programming Environment). The newly developed EM algo-
rithms and data structures that incorporate the paradigms we discuss
are significantly faster than other methods used in practice.
Preface
I first became fascinated about the tradeoffs between computing and
memory usage while a graduate student at Stanford University. Over
the following years, this theme has influenced much of what I have
done professionally, not only in the field of external memory algorithms,
which this manuscript is about, but also on other topics such as data
compression, data mining, databases, prefetching/caching, and random
sampling.
The reality of the computer world is that no matter how fast com-
puters are and no matter how much data storage they provide, there
will always be a desire and need to push the envelope. The solution is
not to wait for the next generation of computers, but rather to examine
the fundamental constraints in order to understand the limits of what
is possible and to translate that understanding into effective solutions.
In this manuscript you will consider a scenario that arises often
in large computing applications, namely, that the relevant data sets
are simply too massive to fit completely inside the computer’s internal
memory and must instead reside on disk. The resulting input/output
communication (or I/O) between fast internal memory and slower
external memory (such as disks) can be a major performance
307
308
bottleneck. This manuscript provides a detailed overview of the design
and analysis of algorithms and data structures for
external memory
(or simply
EM
), where the goal is to exploit locality and parallelism in
order to reduce the I/O costs. Along the way, you will learn a variety
of EM paradigms for solving batched and online problems efficiently.
For the batched problem of sorting and related problems like per-
muting and fast Fourier transform, the two fundamental paradigms
are distribution and merging. The paradigm of disk striping offers an
elegant way to use multiple disks in parallel. For sorting, however,
disk striping can be nonoptimal with respect to I/O, so to gain fur-
ther improvements we discuss distribution and merging techniques for
using the disks independently, including an elegant duality property
that yields state-of-the-art algorithms. You will encounter other useful
techniques for batched EM problems involving matrices (such as matrix
multiplication and transposition), geometric data (such as finding inter-
sections and constructing convex hulls) and graphs (such as list ranking,
connected components, topological sorting, and shortest paths).
In the online domain, which involves constructing data structures
to answer queries, we discuss two canonical EM search applications:
dictionary lookup and range searching. Two important paradigms
for developing indexed data structures for these problems are hash-
ing (including extendible hashing) and tree-based search (including
B-trees). The paradigms of filtering and bootstrapping provide con-
venient means in online data structures to make effective use of the
data accessed from disk. You will also be exposed to some of the above
EM problems in slightly different settings, such as when the data items
are moving, when the data items are variable-length (e.g., strings of
text), when the data structure is compressed to save space, and when
the allocated amount of internal memory can change dynamically.
Programming tools and environments are available for simplifying
the EM programming task. You will see some experimental results in
the domain of spatial databases using the TPIE system, which stands
for Transparent Parallel I/O programming Environment. The newly
developed EM algorithms and data structures that incorporate the
paradigms discussed in this manuscript are significantly faster than
other methods used in practice.
Zgłoś jeśli naruszono regulamin