Markov Chain Monte Carlo
and Gibbs Sampling
Lecture Notes for EEB 581, version 26 April 2004 © B. Walsh 2004
A major limitation towards more widespread implementation of Bayesian ap-
proaches is that obtaining the posterior distribution often requires the integration
of high-dimensional functions. This can be computationally very difficult, but
several approaches short of direct integration have been proposed (reviewed by
Smith 1991, Evans and Swartz 1995, Tanner 1996). We focus here on
Markov
Chain Monte Carlo
(MCMC) methods, which attempt to simulate direct draws
from some complex distribution of interest. MCMC approaches are so-named be-
cause one uses the previous sample values to randomly generate the next sample
value, generating a
Markov chain
(as the transition probabilities between sample
values are only a function of the most recent sample value).
The realization in the early 1990’s (Gelfand and Smith 1990) that one particu-
lar MCMC method, the
Gibbs sampler,
is very widely applicable to a broad class
of Bayesian problems has sparked a major increase in the application of Bayesian
analysis, and this interest is likely to continue expanding for some time to come.
MCMC methods have their roots in the Metropolis algorithm (Metropolis and
Ulam 1949, Metropolis et al. 1953), an attempt by physicists to compute complex
integrals by expressing them as expectations for some distribution and then
estimating this expectation by drawing samples from that distribution. The Gibbs
sampler (Geman and Geman 1984) has its origins in image processing. It is thus
somewhat ironic that the powerful machinery of MCMC methods had essentially
no impact on the field of statistics until rather recently. Excellent (and detailed)
treatments of MCMC methods are found in Tanner (1996) and Chapter two of
Draper (2000). Additional references are given in the particular sections below.
MONTE CARLO INTEGRATION
The original
Monte Carlo
approach was a method developed by physicists to use
random number generation to compute integrals. Suppose we wish to compute
a complex integral
$$ \int_a^b h(x)\,dx \qquad (1a) $$
If we can decompose $h(x)$ into the product of a function $f(x)$ and a probability
density function $p(x)$ defined over the interval $(a, b)$, then note that
$$ \int_a^b h(x)\,dx = \int_a^b f(x)\,p(x)\,dx = E_{p(x)}[\,f(x)\,] \qquad (1b) $$
so that the integral can be expressed as an expectation of $f(x)$ over the density
$p(x)$. Thus, if we draw a large number $x_1, \cdots, x_n$ of random variables from the
density $p(x)$, then
$$ \int_a^b h(x)\,dx = E_{p(x)}[\,f(x)\,] \simeq \frac{1}{n}\sum_{i=1}^{n} f(x_i) \qquad (1c) $$
This is referred to as
Monte Carlo integration.
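For concreteness, the following is a minimal Python sketch of Monte Carlo integration as in (1c); the particular integrand and the uniform choice of $p(x)$ are assumptions made for this illustration, not part of the notes.

    # Monte Carlo integration, as in (1c).  Illustrative only: the integrand
    # and the uniform choice of p(x) are assumptions for this example.
    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):
        # h(x) = f(x) p(x); with p(x) = Uniform(0, 1), f is simply h
        return np.exp(-x) * np.cos(x)

    n = 100_000
    x = rng.uniform(0.0, 1.0, size=n)   # draws x_1, ..., x_n from p(x)
    estimate = f(x).mean()              # (1/n) * sum of f(x_i)

    print(estimate)                     # close to the exact value, about 0.5554
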
Monte Carlo integration can be used to approximate posterior (or marginal
posterior) distributions required for a Bayesian analysis. Consider the integral
$I(y) = \int f(y\,|\,x)\,p(x)\,dx$, which we approximate by
$$ \widehat{I}(y) = \frac{1}{n}\sum_{i=1}^{n} f(y\,|\,x_i) \qquad (2a) $$
where the $x_i$ are draws from the density $p(x)$. The estimated Monte Carlo standard
error is given by
$$ \mathrm{SE}^2\!\left[\,\widehat{I}(y)\,\right] = \frac{1}{n}\left( \frac{1}{n-1}\sum_{i=1}^{n} \Bigl( f(y\,|\,x_i) - \widehat{I}(y) \Bigr)^2 \right) \qquad (2b) $$
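A small Python sketch of (2a) and (2b) follows; the normal forms chosen for $f(y\,|\,x)$ and $p(x)$ are illustrative assumptions.

    # Approximating I(y) = ∫ f(y|x) p(x) dx and its Monte Carlo standard error,
    # as in (2a)-(2b).  The normal forms of f and p are assumptions.
    import numpy as np

    rng = np.random.default_rng(2)

    n = 50_000
    x = rng.normal(0.0, 1.0, size=n)     # x_i drawn from p(x) = N(0, 1)

    def estimate_I(y):
        fy = np.exp(-0.5 * (y - x)**2) / np.sqrt(2 * np.pi)  # f(y | x_i) = N(y; x_i, 1)
        I_hat = fy.mean()                                     # equation (2a)
        se = np.sqrt(fy.var(ddof=1) / n)                      # square root of (2b)
        return I_hat, se

    print(estimate_I(0.5))   # marginal density estimate at y = 0.5, with its SE
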
Importance Sampling
Suppose the density $p(x)$ roughly approximates the density (of interest) $q(x)$; then
$$ \int f(x)\,q(x)\,dx = \int f(x)\,\frac{q(x)}{p(x)}\,p(x)\,dx = E_{p(x)}\!\left[\, f(x)\,\frac{q(x)}{p(x)} \,\right] \qquad (3a) $$
This forms the basis for the method of
importance sampling,
with
$$ \int f(x)\,q(x)\,dx \simeq \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,\frac{q(x_i)}{p(x_i)} \qquad (3b) $$
where the $x_i$ are drawn from the distribution given by $p(x)$. For example, if we
are interested in a marginal density as a function of $y$, $J(y) = \int f(y\,|\,x)\,q(x)\,dx$,
we approximate this by
$$ J(y) \simeq \frac{1}{n}\sum_{i=1}^{n} f(y\,|\,x_i)\,\frac{q(x_i)}{p(x_i)} \qquad (4) $$
where $x_i$ are drawn from the approximating density $p$.
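A minimal Python sketch of the importance-sampling approximation (4) follows; all of the densities used are assumptions chosen purely for illustration.

    # Importance-sampling approximation (4) of J(y) = ∫ f(y|x) q(x) dx,
    # drawing from an approximating density p(x).  Densities are illustrative.
    import numpy as np

    rng = np.random.default_rng(3)

    n = 100_000
    x = rng.normal(0.0, 2.0, size=n)                 # x_i ~ p(x) = N(0, 2^2)

    def q(x):                                        # density of interest: N(0, 1)
        return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

    def p(x):                                        # approximating density: N(0, 2^2)
        return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

    def f(y, x):                                     # f(y | x) = N(y; x, 1)
        return np.exp(-0.5 * (y - x)**2) / np.sqrt(2 * np.pi)

    y = 1.0
    J_hat = np.mean(f(y, x) * q(x) / p(x))           # equation (4)
    print(J_hat)
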
An alternative formulation of importance sampling is to use
$$ \int f(x)\,q(x)\,dx \simeq \widehat{I} = \frac{\sum_{i=1}^{n} w_i\, f(x_i)}{\sum_{i=1}^{n} w_i}\,, \quad \text{where} \quad w_i = \frac{q(x_i)}{p(x_i)} \qquad (5a) $$
where $x_i$ are drawn from the density $p(x)$. This has an associated Monte Carlo
variance of
$$ \mathrm{Var}\!\left(\,\widehat{I}\,\right) = \frac{\sum_{i=1}^{n} w_i\,\bigl( f(x_i) - \widehat{I}\, \bigr)^2}{\sum_{i=1}^{n} w_i} \qquad (5b) $$
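A short Python sketch of the weighted estimator (5a) and the associated variance (5b): the Cauchy sampling density and the (unnormalized) normal target below are assumptions for illustration; one convenient feature of this formulation is that $q$ need only be known up to a normalizing constant.

    # Weighted importance-sampling estimator (5a) and the variance (5b)
    # as given in the notes.  Proposal and target are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(4)

    n = 100_000
    x = rng.standard_cauchy(size=n)                     # x_i ~ p(x)

    def p(x):                                           # Cauchy proposal density
        return 1.0 / (np.pi * (1.0 + x**2))

    def q_unnorm(x):                                    # target kernel, exp(-x^2 / 2)
        return np.exp(-0.5 * x**2)

    def f(x):                                           # function of interest
        return x**2

    w = q_unnorm(x) / p(x)                              # weights w_i = q(x_i) / p(x_i)
    I_hat = np.sum(w * f(x)) / np.sum(w)                # equation (5a); about 1 here
    var_I = np.sum(w * (f(x) - I_hat)**2) / np.sum(w)   # equation (5b)

    print(I_hat, var_I)
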
INTRODUCTION TO MARKOV CHAINS
Before introducing the Metropolis-Hastings algorithm and the Gibbs sampler, a
few introductory comments on Markov chains are in order. Let $X_t$ denote the value
of a random variable at time $t$, and let the state space refer to the range of possible
$X$ values. The random variable is a Markov process if the transition probabilities
between different values in the state space depend only on the random variable's
current state, i.e.,
$$ \Pr(X_{t+1} = s_j \,|\, X_0 = s_k, \cdots, X_t = s_i) = \Pr(X_{t+1} = s_j \,|\, X_t = s_i) \qquad (6) $$
Thus for a Markov random variable the only information about the past needed
to predict the future is the current state of the random variable; knowledge of the
values of earlier states does not change the transition probability. A Markov chain
refers to a sequence of random variables $(X_0, \cdots, X_n)$ generated by a Markov
process. A particular chain is defined most critically by its
transition probabilities (or the transition kernel), $P(i, j) = P(i \rightarrow j)$,
which is the probability that a process in state $s_i$ moves to state $s_j$ in a single step,
$$ P(i, j) = P(i \rightarrow j) = \Pr(X_{t+1} = s_j \,|\, X_t = s_i) \qquad (7a) $$
We will often use the notation $P(i \rightarrow j)$ to imply a move from $i$ to $j$, as many texts
define $P(i, j) = P(j \rightarrow i)$, so we will use the arrow notation to avoid confusion.
Let
$$ \pi_j(t) = \Pr(X_t = s_j) \qquad (7b) $$
denote the probability that the chain is in state $j$ at time $t$, and let $\pi(t)$ denote the
row vector of the state space probabilities at step $t$.
We start the chain by specifying
a starting vector
π(0).
Often all the elements of
π(0)
are zero except for a single
element of 1, corresponding to the process starting in that particular state. As
the chain progresses, the probability values get spread out over the possible state
space.
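As an aside, simulating a realization of a chain from its transition kernel is straightforward; the following Python sketch uses an arbitrary two-state transition matrix chosen purely for illustration.

    # Simulate a realization (X_0, ..., X_n) of a Markov chain from its
    # transition matrix P, where row i holds Pr(X_{t+1} = j | X_t = i).
    # The two-state matrix below is an arbitrary placeholder.
    import numpy as np

    rng = np.random.default_rng(5)

    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

    def simulate_chain(P, start_state, n_steps):
        states = [start_state]
        for _ in range(n_steps):
            # draw the next state from the row of P given by the current state
            states.append(rng.choice(len(P), p=P[states[-1]]))
        return np.array(states)

    print(simulate_chain(P, start_state=0, n_steps=20))
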
The probability that the chain has state value $s_i$ at time (or step) $t + 1$ is
given by the Chapman-Kolmogorov equation, which sums over the probability
of being in a particular state at the current step and the transition probability from
that state into state $s_i$,
$$
\begin{aligned}
\pi_i(t+1) = \Pr(X_{t+1} = s_i)
  &= \sum_k \Pr(X_{t+1} = s_i \,|\, X_t = s_k) \cdot \Pr(X_t = s_k) \\
  &= \sum_k P(k \rightarrow i)\,\pi_k(t) = \sum_k P(k, i)\,\pi_k(t)
\end{aligned}
\qquad (7)
$$
Successive iteration of the Chapman-Kolmogorov equation describes the evolution
of the chain.
We can more compactly write the Chapman-Kolmogorov equations in matrix
form as follows. Define the probability transition matrix $\mathbf{P}$ as the matrix whose
$i, j$th element is $P(i, j)$, the probability of moving from state $i$ to state $j$, $P(i \rightarrow j)$.
(Note this implies that the rows sum to one, as $\sum_j P(i, j) = \sum_j P(i \rightarrow j) = 1$.)
The Chapman-Kolmogorov equation becomes
$$ \pi(t + 1) = \pi(t)\,\mathbf{P} \qquad (8a) $$
Using the matrix form, we immediately see how to quickly iterate the Chapman-Kolmogorov
equation, as
$$ \pi(t) = \pi(t-1)\,\mathbf{P} = (\pi(t-2)\,\mathbf{P})\,\mathbf{P} = \pi(t-2)\,\mathbf{P}^2 \qquad (8b) $$
Continuing in this fashion shows that
$$ \pi(t) = \pi(0)\,\mathbf{P}^t \qquad (8c) $$
Defining the $n$-step transition probability $p_{ij}^{(n)}$ as the probability that the process
is in state $j$ given that it started in state $i$ $n$ steps ago, i.e.,
$$ p_{ij}^{(n)} = \Pr(X_{t+n} = s_j \,|\, X_t = s_i) \qquad (8d) $$
it immediately follows that $p_{ij}^{(n)}$ is just the $ij$-th element of $\mathbf{P}^n$.
Finally, a Markov chain is said to be irreducible if there exists a positive
integer $n_{ij}$ such that $p_{ij}^{(n_{ij})} > 0$ for all $i, j$. That is, all states communicate with each
other, as one can always go from any state to any other state (although it may take
more than one step). Likewise, a chain is said to be aperiodic when the number
of steps required to move between two states (say $x$ and $y$) is not required to be a
multiple of some integer. Put another way, the chain is not forced into some cycle
of fixed length between certain states.
Example 1.
Suppose the state space is (Rain, Sunny, Cloudy) and weather
follows a Markov process. Thus, the probability of tomorrow's weather simply
depends on today's weather, and not on any previous days. If this is the case, the
observation that it has rained for three straight days does not alter the probability
of tomorrow's weather compared to the situation where (say) it rained today but
was sunny for the last week. Suppose the transition probabilities given that today is
rainy are
P( Rain tomorrow | Rain today ) = 0.5,
P( Sunny tomorrow | Rain today ) = 0.25,
P( Cloudy tomorrow | Rain today ) = 0.25,
The first row of the transition probability matrix thus becomes
(0.5, 0.25, 0.25).
Suppose the rest of the transition matrix is given by
$$ \mathbf{P} = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.5 & 0 & 0.5 \\ 0.25 & 0.25 & 0.5 \end{pmatrix} $$
Note that this Markov chain is irreducible, as all states communicate with each
other.
Suppose today is sunny. What is the expected weather two days from now? Seven
days? Here $\pi(0) = ( 0 \;\; 1 \;\; 0 )$, giving
$$ \pi(2) = \pi(0)\,\mathbf{P}^2 = ( 0.375 \;\; 0.25 \;\; 0.375 ) $$
and
$$ \pi(7) = \pi(0)\,\mathbf{P}^7 = ( 0.4 \;\; 0.2 \;\; 0.4 ) $$
Conversely, suppose today is rainy, so that $\pi(0) = ( 1 \;\; 0 \;\; 0 )$. The expected
weather becomes
$$ \pi(2) = ( 0.4375 \;\; 0.1875 \;\; 0.375 ) \qquad \text{and} \qquad \pi(7) = ( 0.4 \;\; 0.2 \;\; 0.4 ) $$
Note that after a sufficient amount of time, the expected weather is independent of
the starting value. In other words, the chain has reached a
stationary distribution,
where the probability values are independent of the actual starting value.
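The calculations in Example 1 are easy to check numerically; the following short numpy sketch propagates the state-probability vector via equation (8c).

    # Numerical check of Example 1: propagate pi(t) = pi(0) P^t (equation 8c).
    import numpy as np

    P = np.array([[0.5,  0.25, 0.25],    # rows: Rain, Sunny, Cloudy today
                  [0.5,  0.0,  0.5 ],
                  [0.25, 0.25, 0.5 ]])

    pi0_sunny = np.array([0.0, 1.0, 0.0])
    print(pi0_sunny @ np.linalg.matrix_power(P, 2))   # [0.375  0.25   0.375]
    print(pi0_sunny @ np.linalg.matrix_power(P, 7))   # approximately [0.4 0.2 0.4]

    pi0_rainy = np.array([1.0, 0.0, 0.0])
    print(pi0_rainy @ np.linalg.matrix_power(P, 2))   # [0.4375 0.1875 0.375]
    print(pi0_rainy @ np.linalg.matrix_power(P, 7))   # approximately [0.4 0.2 0.4]
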
As the above example illustrates, a Markov chain may reach a
stationary
distribution
$\pi^*$, where the vector of probabilities of being in any particular given
state is independent of the initial condition. The stationary distribution satisfies
$$ \pi^* = \pi^*\,\mathbf{P} \qquad (9) $$
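One way to compute the stationary distribution numerically is as the left eigenvector of $\mathbf{P}$ associated with eigenvalue 1; a short numpy sketch using the transition matrix of Example 1:

    # Stationary distribution pi* = pi* P (equation 9), computed as the
    # left eigenvector of P with eigenvalue 1, for the matrix of Example 1.
    import numpy as np

    P = np.array([[0.5,  0.25, 0.25],
                  [0.5,  0.0,  0.5 ],
                  [0.25, 0.25, 0.5 ]])

    eigvals, eigvecs = np.linalg.eig(P.T)     # right eigenvectors of P^T = left of P
    k = np.argmin(np.abs(eigvals - 1.0))      # index of the eigenvalue closest to 1
    pi_star = np.real(eigvecs[:, k])
    pi_star = pi_star / pi_star.sum()         # rescale to a probability vector

    print(pi_star)                            # [0.4 0.2 0.4], matching Example 1
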