Markov Chain Monte Carlo
and Gibbs Sampling
Lecture Notes for EEB 581, version 26 April 2004 © B. Walsh 2004
A major limitation towards more widespread implementation of Bayesian ap-
proaches is that obtaining the posterior distribution often requires the integration
of high-dimensional functions. This can be computationally very difficult, but
several approaches short of direct integration have been proposed (reviewed by
Smith 1991, Evans and Swartz 1995, Tanner 1996). We focus here on
Markov
Chain Monte Carlo
(MCMC) methods, which attempt to simulate direct draws
from some complex distribution of interest. MCMC approaches are so-named be-
cause one uses the previous sample values to randomly generate the next sample
value, generating a
Markov chain
(as the transition probabilities between sample
values are only a function of the most recent sample value).
The realization in the early 1990’s (Gelfand and Smith 1990) that one particu-
lar MCMC method, the
Gibbs sampler,
is very widely applicable to a broad class
of Bayesian problems has sparked a major increase in the application of Bayesian
analysis, and this interest is likely to continue expanding for some time to come.
MCMC methods have their roots in the Metropolis algorithm (Metropolis and
Ulam 1949, Metropolis et al. 1953), an attempt by physicists to compute complex
integrals by expressing them as expectations for some distribution and then
estimating this expectation by drawing samples from that distribution. The Gibbs
sampler (Geman and Geman 1984) has its origins in image processing. It is thus
somewhat ironic that the powerful machinery of MCMC methods had essentially
no impact on the field of statistics until rather recently. Excellent (and detailed)
treatments of MCMC methods are found in Tanner (1996) and Chapter two of
Draper (2000). Additional references are given in the particular sections below.
MONTE CARLO INTEGRATION
The original
Monte Carlo
approach was a method developed by physicists to use
random number generation to compute integrals. Suppose we wish to compute
a complex integral
$$ \int_a^b h(x)\,dx \qquad (1a) $$
If we can decompose $h(x)$ into the product of a function $f(x)$ and a probability
density function $p(x)$ defined over the interval $(a, b)$, then note that
$$ \int_a^b h(x)\,dx = \int_a^b f(x)\,p(x)\,dx = E_{p(x)}[\,f(x)\,] \qquad (1b) $$
so that the integral can be expressed as an expectation of $f(x)$ over the density
$p(x)$. Thus, if we draw a large number $x_1, \cdots, x_n$ of random variables from the
density $p(x)$, then
$$ \int_a^b h(x)\,dx = E_{p(x)}[\,f(x)\,] \simeq \frac{1}{n}\sum_{i=1}^{n} f(x_i) \qquad (1c) $$
This is referred to as
Monte Carlo integration.
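For concreteness, the following is a minimal Python sketch of Monte Carlo integration as in (1c); the particular integrand and the uniform choice of $p(x)$ are assumptions made for this illustration, not part of the notes.

    # Monte Carlo integration, as in (1c).  Illustrative only: the integrand
    # and the uniform choice of p(x) are assumptions for this example.
    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):
        # h(x) = f(x) p(x); with p(x) = Uniform(0, 1), f is simply h
        return np.exp(-x) * np.cos(x)

    n = 100_000
    x = rng.uniform(0.0, 1.0, size=n)   # draws x_1, ..., x_n from p(x)
    estimate = f(x).mean()              # (1/n) * sum of f(x_i)

    print(estimate)                     # close to the exact value, about 0.5554
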
Monte Carlo integration can be used to approximate posterior (or marginal
posterior) distributions required for a Bayesian analysis. Consider the integral
$I(y) = \int f(y\,|\,x)\,p(x)\,dx$, which we approximate by
$$ \widehat{I}(y) = \frac{1}{n}\sum_{i=1}^{n} f(y\,|\,x_i) \qquad (2a) $$
where the $x_i$ are draws from the density $p(x)$. The estimated Monte Carlo standard
error is given by
$$ \mathrm{SE}^2\!\left[\,\widehat{I}(y)\,\right] = \frac{1}{n}\left( \frac{1}{n-1}\sum_{i=1}^{n} \Bigl( f(y\,|\,x_i) - \widehat{I}(y) \Bigr)^2 \right) \qquad (2b) $$
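A small Python sketch of (2a) and (2b) follows; the normal forms chosen for $f(y\,|\,x)$ and $p(x)$ are illustrative assumptions.

    # Approximating I(y) = ∫ f(y|x) p(x) dx and its Monte Carlo standard error,
    # as in (2a)-(2b).  The normal forms of f and p are assumptions.
    import numpy as np

    rng = np.random.default_rng(2)

    n = 50_000
    x = rng.normal(0.0, 1.0, size=n)     # x_i drawn from p(x) = N(0, 1)

    def estimate_I(y):
        fy = np.exp(-0.5 * (y - x)**2) / np.sqrt(2 * np.pi)  # f(y | x_i) = N(y; x_i, 1)
        I_hat = fy.mean()                                     # equation (2a)
        se = np.sqrt(fy.var(ddof=1) / n)                      # square root of (2b)
        return I_hat, se

    print(estimate_I(0.5))   # marginal density estimate at y = 0.5, with its SE
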
Importance Sampling
Suppose the density $p(x)$ roughly approximates the density (of interest) $q(x)$; then
$$ \int f(x)\,q(x)\,dx = \int f(x)\,\frac{q(x)}{p(x)}\,p(x)\,dx = E_{p(x)}\!\left[\, f(x)\,\frac{q(x)}{p(x)} \,\right] \qquad (3a) $$
This forms the basis for the method of
importance sampling,
with
$$ \int f(x)\,q(x)\,dx \simeq \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,\frac{q(x_i)}{p(x_i)} \qquad (3b) $$
where the $x_i$ are drawn from the distribution given by $p(x)$. For example, if we
are interested in a marginal density as a function of $y$, $J(y) = \int f(y\,|\,x)\,q(x)\,dx$,
we approximate this by
$$ J(y) \simeq \frac{1}{n}\sum_{i=1}^{n} f(y\,|\,x_i)\,\frac{q(x_i)}{p(x_i)} \qquad (4) $$
where $x_i$ are drawn from the approximating density $p$.
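A minimal Python sketch of the importance-sampling approximation (4) follows; all of the densities used are assumptions chosen purely for illustration.

    # Importance-sampling approximation (4) of J(y) = ∫ f(y|x) q(x) dx,
    # drawing from an approximating density p(x).  Densities are illustrative.
    import numpy as np

    rng = np.random.default_rng(3)

    n = 100_000
    x = rng.normal(0.0, 2.0, size=n)                 # x_i ~ p(x) = N(0, 2^2)

    def q(x):                                        # density of interest: N(0, 1)
        return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

    def p(x):                                        # approximating density: N(0, 2^2)
        return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

    def f(y, x):                                     # f(y | x) = N(y; x, 1)
        return np.exp(-0.5 * (y - x)**2) / np.sqrt(2 * np.pi)

    y = 1.0
    J_hat = np.mean(f(y, x) * q(x) / p(x))           # equation (4)
    print(J_hat)
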
An alternative formulation of importance sampling is to use
$$ \int f(x)\,q(x)\,dx \simeq \widehat{I} = \frac{\sum_{i=1}^{n} w_i\, f(x_i)}{\sum_{i=1}^{n} w_i}\,, \quad \text{where} \quad w_i = \frac{q(x_i)}{p(x_i)} \qquad (5a) $$
where $x_i$ are drawn from the density $p(x)$. This has an associated Monte Carlo
variance of
$$ \mathrm{Var}\!\left(\,\widehat{I}\,\right) = \frac{\sum_{i=1}^{n} w_i\,\bigl( f(x_i) - \widehat{I}\, \bigr)^2}{\sum_{i=1}^{n} w_i} \qquad (5b) $$
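A short Python sketch of the weighted estimator (5a) and the associated variance (5b): the Cauchy sampling density and the (unnormalized) normal target below are assumptions for illustration; one convenient feature of this formulation is that $q$ need only be known up to a normalizing constant.

    # Weighted importance-sampling estimator (5a) and the variance (5b)
    # as given in the notes.  Proposal and target are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(4)

    n = 100_000
    x = rng.standard_cauchy(size=n)                     # x_i ~ p(x)

    def p(x):                                           # Cauchy proposal density
        return 1.0 / (np.pi * (1.0 + x**2))

    def q_unnorm(x):                                    # target kernel, exp(-x^2 / 2)
        return np.exp(-0.5 * x**2)

    def f(x):                                           # function of interest
        return x**2

    w = q_unnorm(x) / p(x)                              # weights w_i = q(x_i) / p(x_i)
    I_hat = np.sum(w * f(x)) / np.sum(w)                # equation (5a); about 1 here
    var_I = np.sum(w * (f(x) - I_hat)**2) / np.sum(w)   # equation (5b)

    print(I_hat, var_I)
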
INTRODUCTION TO MARKOV CHAINS
Before introducing the Metropolis-Hastings algorithm and the Gibbs sampler, a
few introductory comments on Markov chains are in order. Let $X_t$ denote the value
of a random variable at time $t$, and let the state space refer to the range of possible
$X$ values. The random variable is a Markov process if the transition probabilities
between different values in the state space depend only on the random variable's
current state, i.e.,
$$ \Pr(X_{t+1} = s_j \,|\, X_0 = s_k, \cdots, X_t = s_i) = \Pr(X_{t+1} = s_j \,|\, X_t = s_i) \qquad (6) $$
Thus for a Markov random variable the only information about the past needed
to predict the future is the current state of the random variable; knowledge of the
values of earlier states does not change the transition probability. A Markov chain
refers to a sequence of random variables $(X_0, \cdots, X_n)$ generated by a Markov
process. A particular chain is defined most critically by its
transition probabilities (or the transition kernel), $P(i, j) = P(i \rightarrow j)$,
which is the probability that a process in state $s_i$ moves to state $s_j$ in a single step,
$$ P(i, j) = P(i \rightarrow j) = \Pr(X_{t+1} = s_j \,|\, X_t = s_i) \qquad (7a) $$
We will often use the notation $P(i \rightarrow j)$ to imply a move from $i$ to $j$, as many texts
define $P(i, j) = P(j \rightarrow i)$, so we will use the arrow notation to avoid confusion.
Let
$$ \pi_j(t) = \Pr(X_t = s_j) \qquad (7b) $$
denote the probability that the chain is in state $j$ at time $t$, and let $\pi(t)$ denote the
row vector of the state space probabilities at step $t$.
We start the chain by specifying
a starting vector
π(0).
Often all the elements of
π(0)
are zero except for a single
element of 1, corresponding to the process starting in that particular state. As
the chain progresses, the probability values get spread out over the possible state
space.
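As an aside, simulating a realization of a chain from its transition kernel is straightforward; the following Python sketch uses an arbitrary two-state transition matrix chosen purely for illustration.

    # Simulate a realization (X_0, ..., X_n) of a Markov chain from its
    # transition matrix P, where row i holds Pr(X_{t+1} = j | X_t = i).
    # The two-state matrix below is an arbitrary placeholder.
    import numpy as np

    rng = np.random.default_rng(5)

    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

    def simulate_chain(P, start_state, n_steps):
        states = [start_state]
        for _ in range(n_steps):
            # draw the next state from the row of P given by the current state
            states.append(rng.choice(len(P), p=P[states[-1]]))
        return np.array(states)

    print(simulate_chain(P, start_state=0, n_steps=20))
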
The probability that the chain has state value $s_i$ at time (or step) $t + 1$ is
given by the Chapman-Kolmogorov equation, which sums over the probability
of being in a particular state at the current step and the transition probability from
that state into state $s_i$,
$$
\begin{aligned}
\pi_i(t+1) = \Pr(X_{t+1} = s_i)
  &= \sum_k \Pr(X_{t+1} = s_i \,|\, X_t = s_k) \cdot \Pr(X_t = s_k) \\
  &= \sum_k P(k \rightarrow i)\,\pi_k(t) = \sum_k P(k, i)\,\pi_k(t)
\end{aligned}
\qquad (7)
$$
Successive iteration of the Chapman-Kolmogorov equation describes the evolution
of the chain.
We can more compactly write the Chapman-Kolmogorov equations in matrix
form as follows. Define the probability transition matrix $\mathbf{P}$ as the matrix whose
$i, j$th element is $P(i, j)$, the probability of moving from state $i$ to state $j$, $P(i \rightarrow j)$.
(Note this implies that the rows sum to one, as $\sum_j P(i, j) = \sum_j P(i \rightarrow j) = 1$.)
The Chapman-Kolmogorov equation becomes
$$ \pi(t + 1) = \pi(t)\,\mathbf{P} \qquad (8a) $$
Using the matrix form, we immediately see how to quickly iterate the Chapman-Kolmogorov
equation, as
$$ \pi(t) = \pi(t-1)\,\mathbf{P} = (\pi(t-2)\,\mathbf{P})\,\mathbf{P} = \pi(t-2)\,\mathbf{P}^2 \qquad (8b) $$
Continuing in this fashion shows that
$$ \pi(t) = \pi(0)\,\mathbf{P}^t \qquad (8c) $$
Defining the $n$-step transition probability $p_{ij}^{(n)}$ as the probability that the process
is in state $j$ given that it started in state $i$ $n$ steps ago, i.e.,
$$ p_{ij}^{(n)} = \Pr(X_{t+n} = s_j \,|\, X_t = s_i) \qquad (8d) $$
it immediately follows that $p_{ij}^{(n)}$ is just the $ij$-th element of $\mathbf{P}^n$.
Finally, a Markov chain is said to be irreducible if there exists a positive
integer $n_{ij}$ such that $p_{ij}^{(n_{ij})} > 0$ for all $i, j$. That is, all states communicate with each
other, as one can always go from any state to any other state (although it may take
more than one step). Likewise, a chain is said to be aperiodic when the number
of steps required to move between two states (say $x$ and $y$) is not required to be a
multiple of some integer. Put another way, the chain is not forced into some cycle
of fixed length between certain states.
Example 1.
Suppose the state space is (Rain, Sunny, Cloudy) and weather
follows a Markov process. Thus, the probability of tomorrow's weather simply
depends on today's weather, and not on any previous days. If this is the case, the
observation that it has rained for three straight days does not alter the probability
of tomorrow's weather compared to the situation where (say) it rained today but
was sunny for the last week. Suppose the transition probabilities given that today is
rainy are
P( Rain tomorrow | Rain today ) = 0.5,
P( Sunny tomorrow | Rain today ) = 0.25,
P( Cloudy tomorrow | Rain today ) = 0.25,
The first row of the transition probability matrix thus becomes
(0.5, 0.25, 0.25).
Suppose the rest of the transition matrix is given by
$$ \mathbf{P} = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.5 & 0 & 0.5 \\ 0.25 & 0.25 & 0.5 \end{pmatrix} $$
Note that this Markov chain is irreducible, as all states communicate with each
other.
Suppose today is sunny. What is the expected weather two days from now? Seven
days? Here $\pi(0) = ( 0 \;\; 1 \;\; 0 )$, giving
$$ \pi(2) = \pi(0)\,\mathbf{P}^2 = ( 0.375 \;\; 0.25 \;\; 0.375 ) $$
and
$$ \pi(7) = \pi(0)\,\mathbf{P}^7 = ( 0.4 \;\; 0.2 \;\; 0.4 ) $$
Conversely, suppose today is rainy, so that $\pi(0) = ( 1 \;\; 0 \;\; 0 )$. The expected
weather becomes
$$ \pi(2) = ( 0.4375 \;\; 0.1875 \;\; 0.375 ) \qquad \text{and} \qquad \pi(7) = ( 0.4 \;\; 0.2 \;\; 0.4 ) $$
Note that after a sufficient amount of time, the expected weather is independent of
the starting value. In other words, the chain has reached a
stationary distribution,
where the probability values are independent of the actual starting value.
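The calculations in Example 1 are easy to check numerically; the following short numpy sketch propagates the state-probability vector via equation (8c).

    # Numerical check of Example 1: propagate pi(t) = pi(0) P^t (equation 8c).
    import numpy as np

    P = np.array([[0.5,  0.25, 0.25],    # rows: Rain, Sunny, Cloudy today
                  [0.5,  0.0,  0.5 ],
                  [0.25, 0.25, 0.5 ]])

    pi0_sunny = np.array([0.0, 1.0, 0.0])
    print(pi0_sunny @ np.linalg.matrix_power(P, 2))   # [0.375  0.25   0.375]
    print(pi0_sunny @ np.linalg.matrix_power(P, 7))   # approximately [0.4 0.2 0.4]

    pi0_rainy = np.array([1.0, 0.0, 0.0])
    print(pi0_rainy @ np.linalg.matrix_power(P, 2))   # [0.4375 0.1875 0.375]
    print(pi0_rainy @ np.linalg.matrix_power(P, 7))   # approximately [0.4 0.2 0.4]
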
As the above example illustrates, a Markov chain may reach a
stationary
distribution
$\pi^*$, where the vector of probabilities of being in any particular given
state is independent of the initial condition. The stationary distribution satisfies
$$ \pi^* = \pi^*\,\mathbf{P} \qquad (9) $$
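One way to compute the stationary distribution numerically is as the left eigenvector of $\mathbf{P}$ associated with eigenvalue 1; a short numpy sketch using the transition matrix of Example 1:

    # Stationary distribution pi* = pi* P (equation 9), computed as the
    # left eigenvector of P with eigenvalue 1, for the matrix of Example 1.
    import numpy as np

    P = np.array([[0.5,  0.25, 0.25],
                  [0.5,  0.0,  0.5 ],
                  [0.25, 0.25, 0.5 ]])

    eigvals, eigvecs = np.linalg.eig(P.T)     # right eigenvectors of P^T = left of P
    k = np.argmin(np.abs(eigvals - 1.0))      # index of the eigenvalue closest to 1
    pi_star = np.real(eigvecs[:, k])
    pi_star = pi_star / pi_star.sum()         # rescale to a probability vector

    print(pi_star)                            # [0.4 0.2 0.4], matching Example 1
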