Mixing Rates of Markov Chains and Learning Dynamics

Abstract

How fast does a finite Markov chain forget where it started? The question has a clean answer in our setting, and a useful one for anything that learns by repeating an experiment. Perron-Frobenius hands us a unique stationary $\pi$ when $M$ has positive entries; the second eigenvalue of $M$ then sets the rate at which the iterates $M^n \mu_0$ relax back to $\pi$. I work this out using a small set of standard tools, primarily Cauchy-Schwarz, Birkhoff's ergodic theorem, Perron-Frobenius, and the mixing definition for measure-preserving transformations, and prove a quantitative form of that mixing definition in which the proof reduces to one application of the Cauchy-Schwarz inequality. The last third of the paper reads the result back into the language of reinforcement learning: a stationary policy on a Markov decision process induces a Markov chain, and the spectral gap of that chain is, more or less, the rate at which the agent can learn anything stable about itself. That last part is interpretive rather than technical. The math stays inside the standard Markov, ergodic, and spectral framework.

1. Introduction

Take a finite Markov chain and watch what happens to it. Start at one specific state, hit it with the transition matrix $M$ a bunch of times, and the distribution of where you are spreads out. Eventually, it forgets where it began. How fast it forgets is the rate I want to compute. The rough form of the answer is the second eigenvalue of $M$. The careful form is Theorem 4.1 below.

The qualitative version of the same statement is the standard mixing definition for measure-preserving transformations [4]. There, a measure-preserving $T$ on $(\Omega, \mathcal{A}, P)$ is called mixing if

for every $A, B \in \mathcal{A}$. That definition is asymptotic and rate-free. The point of this paper is to refine it, in the finite Markov setting, into a quantitative bound with an explicit dependence on the spectrum of $M$. Perron-Frobenius then guarantees that $M$ has the structure we need [7, 8, 6]: a unique stationary $\pi$ at eigenvalue $1$, and the rest of the spectrum strictly inside the unit disk.

The question, written out, is the following: what controls the rate at which a Markov chain forgets its initial distribution, and what does that rate tell us about systems that learn from their own trajectories?

The second half of the question is what I find interesting, and what motivates the reinforcement learning interpretation in Section 5. Briefly: an RL agent following a fixed policy on a Markov decision process induces a Markov chain on the state space [17]. Almost every quantity that agent has to estimate from data, ie value functions, on-policy expectations, policy gradients, is a time average along the trajectories of that chain. So the mixing rate is not abstract for an RL agent. Slow mixing means high-variance estimates, sluggish exploration, and unstable updates. Fast mixing gives the opposite. In a real sense, mixing rate is learning rate.

The plan is short. Section 2 reviews finite Markov chains in the left-stochastic convention and proves Perron-Frobenius. Section 3 hooks the chain into ergodic theory and quotes Birkhoff. Section 4 is the technical core. Section 5 reads the result back as an RL statement. Section 6 closes.

2. Markov Chains and Stationary Distributions

Throughout the paper, $S = \{1, 2, \dots, s\}$ is a finite state space, $(\Omega, \mathcal{A}, P)$ is a probability space, and $X_0, X_1, X_2, \dots$ are $S$-valued random variables [6].

We use the left-stochastic convention [6], ie. the matrix $M$ acts on column probability vectors. Each entry $M_{ij}$ is the probability of going from state $j$ to state $i$, so entries are non-negative and each column sums to $1$:

Equivalently, $M^{\top} \mathbf{1} = \mathbf{1}$, where $\mathbf{1} = (1, \dots, 1)^{\top}$. So $1$ is automatically an eigenvalue of $M^{\top}$, and hence of $M$. The eigenvector of $M$ at eigenvalue $1$, normalized to a probability vector, is what we want.

If $X_0 \sim \pi$ then $X_n \sim \pi$ for all $n \ge 0$, so $\pi$ is the law that the chain leaves invariant. The first basic question is whether such a $\pi$ exists, and whether it is unique. Perron-Frobenius answers both. The proof below follows the approach in [6].

Proof. Let $X = \{x \in \mathbb{R}^s : x_i \ge 0,\ \sum_i x_i = 1\}$ be the standard simplex. It is closed, bounded, and convex. Because $M$ is left-stochastic, for any $v \in X$ the vector $Mv$ has non-negative entries and total mass $\sum_i (Mv)_i = \sum_j v_j = 1$, so $Mv \in X$ already. The map $v \mapsto Mv$ is continuous, so by the Brouwer fixed point theorem there is some $\pi \in X$ with $M \pi = \pi$.

For uniqueness we use a strict-inequality argument. Suppose $Mw = w$ for some $w \in \mathbb{R}^s$ that has at least one negative entry, normalized so $\sum_i |w_i| = 1$. Let $|w|$ denote the vector with entries $|w_j|$. Then

$$\sum_i (M|w|)_i\ =\ \sum_j |w_j|\ =\ 1,$$

so $M|w|$ is also a probability vector. For each $i$,

$$|w_i|\ =\ \Bigl| \sum_j M_{ij} w_j \Bigr|\ \le\ \sum_j M_{ij} |w_j|\ =\ (M|w|)_i.$$

The triangle inequality is strict at any $i$ where the $w_j$ have mixed signs, since $M_{ij} > 0$ for every $j$. But $\sum_i |w_i| = \sum_i (M|w|)_i = 1$ forces equality everywhere. So $w$ cannot have entries of both signs. Hence $w$, after possibly multiplying by $-1$, lies in $X$. Now suppose $\pi'$ is another fixed point in $X$. Then $\pi - \pi'$ is a $1$-eigenvector summing to $0$. If it is nonzero it has entries of both signs, which the argument above just ruled out. So $\pi = \pi'$.

For positivity, $\pi_i = (M\pi)_i = \sum_j M_{ij} \pi_j > 0$, since $M_{ij} > 0$ and $\pi$ is a nonzero probability vector. $\square$

2.1 Worked example: the two-state chain

A single example will return throughout the paper. Fix $a, b \in (0, 1)$ and let

Column $1$ is the transition law from state $1$, and column $2$ from state $2$. Solving $M\pi = \pi$ with $\pi_1 + \pi_2 = 1$ gives

The eigenvalues of $M$ are $\lambda_1 = 1$ and $\lambda_2 = 1 - a - b$, the latter visible from $\lambda_1 + \lambda_2 = \mathrm{tr}(M) = 2 - a - b$. As long as $(a,b) \ne (0,0)$ and $(a,b) \ne (1,1)$, we have $|\lambda_2| < 1$. We will see in Section 4 that $|\lambda_2|$ is the rate at which the chain forgets its initial state.

3. Ergodicity, Birkhoff, and Mixing

A Markov chain also fits inside the standard ergodic-theoretic framework [2, 3, 4], and that fitting is what makes the spectral approach below work. Take the chain $(X_n)$ with transition matrix $M$ and stationary distribution $\pi$. Let $\Omega = S^{\mathbb{N}}$ be the space of one-sided sequences with the product $\sigma$-algebra, and $T : \Omega \to \Omega$ the shift $(T \omega)_n = \omega_{n+1}$. The Kolmogorov extension theorem then hands us a unique probability $P$ on $\Omega$ for which the coordinate maps $X_n(\omega) = \omega_n$ form the chain in question. Stationarity of $\pi$ is exactly the statement that $T$ is measure-preserving [6]. The dictionary back and forth between chain and shift is faithful: irreducibility of the chain matches ergodicity of the shift, and irreducibility plus aperiodicity matches mixing in the sense below.

For a Markov chain at stationarity, Birkhoff says that the time average of any reward functional along a trajectory converges almost surely to the space average $E_{\pi}[X]$. The full proof goes via Hopf's maximal ergodic theorem [2] and is not reproduced here. What matters for us is the negative observation: Birkhoff alone gives no rate. It says "the chain forgets," but not "how fast."

The standard mixing definition for measure-preserving transformations is what carries quantitative content.

and characterizes weak mixing in terms of the Koopman operator $U_T(X) = X(T^{-1})$ on $L^2$. The hard work of relating mixing to spectrum lives there. What we need here is just the definition: $T$ mixing means $P[T^{-n}(A) \cap B] \to P[A]\, P[B]$, with no rate specified.

4. Mixing Rates and the Spectral Gap

Mixing in the sense of Definition 3.3 only says the gap goes to zero. A rate would tell us how fast it gets there. To even formulate a rate we need a distance between probability vectors, a definition of mixing time, and a tool for relating the time to a property of $M$. The first two come immediately. The third is the spectral gap.

For an irreducible aperiodic chain, $\| M^n \delta_x - \pi \|_{TV} \to 0$ for every $x$ [9]. The question, again, is at what rate. The cleanest answer comes in the reversible case.

Reversibility makes $M$ symmetric in the right inner product. Equip $\mathbb{R}^S$ with the $\ell^2(\pi)$ inner product

and consider $L f := M^{\top} f$ acting on functions $f : S \to \mathbb{R}$. Reversibility gives self-adjointness $\langle f, L g \rangle_{\pi} = \langle L f, g \rangle_{\pi}$. (Quick check: $\langle f, L g \rangle_{\pi} = \sum_{i,j} \pi_j M_{ij} f(j) g(i)$, and reversibility rewrites $\pi_j M_{ij} = \pi_i M_{ji}$, which gives $\langle L f, g \rangle_{\pi}$.)

and an orthonormal basis $\{f_k\}_{k=1}^s$ of eigenfunctions in $\ell^2(\pi)$ with $f_1 \equiv 1$. Aperiodicity rules out $\lambda_s = -1$ [9], so

is the absolute spectral gap. Eigenvalues of $L = M^{\top}$ and $M$ coincide, so $\lambda_*$ also bounds the modulus of every non-unit eigenvalue of $M$.

Proof. Define $\rho_n : S \to \mathbb{R}$ by $\rho_n(y) = (M^n \delta_x)(y) / \pi(y)$, the density of the time-$n$ distribution started at $x$ relative to $\pi$. The key observation is that $\rho_n = L^n \rho_0$. To see this, write

$$\frac{(M\mu)(y)}{\pi(y)}\ =\ \sum_z \frac{M_{yz}}{\pi(y)}\, \mu(z)\ =\ \sum_z \frac{M_{zy}}{\pi(z)}\, \mu(z),$$

where the second equality uses reversibility $\pi_y M_{yz} = \pi_z M_{zy}$. The right-hand side is $\sum_z M_{zy} (\mu(z) / \pi(z)) = (L (\mu / \pi))(y)$. Iterating gives $\rho_n = L^n \rho_0$.

Expand $\rho_0$ in the eigenbasis of $L$. The coefficients are $\langle \rho_0, f_k \rangle_{\pi} = \sum_y \pi(y) (\delta_x(y) / \pi(y)) f_k(y) = f_k(x)$, so

$$\rho_0\ =\ 1 + \sum_{k=2}^s f_k(x)\, f_k.$$

Applying $L^n$,

$$\rho_n\ =\ 1 + \sum_{k=2}^s \lambda_k^n\, f_k(x)\, f_k.$$

By orthonormality of $\{f_k\}$ in $\ell^2(\pi)$,

$$\| \rho_n - 1 \|_{\ell^2(\pi)}^2\ =\ \sum_{k=2}^s \lambda_k^{2n}\, f_k(x)^2\ \le\ \lambda_*^{2n} \sum_{k=2}^s f_k(x)^2.$$

Parseval gives $\sum_{k=1}^s f_k(x)^2 = \| \rho_0 \|_{\ell^2(\pi)}^2 = \sum_y \pi(y) (\delta_x(y) / \pi(y))^2 = 1 / \pi(x)$, and since $f_1 \equiv 1$ we get $\sum_{k=2}^s f_k(x)^2 = (1 - \pi(x)) / \pi(x)$. So

$$\| \rho_n - 1 \|_{\ell^2(\pi)}\ \le\ \lambda_*^n\, \sqrt{\frac{1 - \pi(x)}{\pi(x)}}.$$

Now apply the Cauchy-Schwarz inequality [1]: with the constant random variable $1$ on $(S, \pi)$,

$$\sum_y \pi(y)\, |\rho_n(y) - 1|\ \le\ \Bigl( \sum_y \pi(y)\, (\rho_n(y) - 1)^2 \Bigr)^{1/2}\ =\ \| \rho_n - 1 \|_{\ell^2(\pi)}.$$

The left side is $\sum_y |M^n \delta_x(y) - \pi(y)| = 2 \| M^n \delta_x - \pi \|_{TV}$, which finishes the proof. $\square$

Inverting the bound for $n$ gives the standard mixing-time corollary, the form that is most useful when comparing chains of different sizes.

This is the inequality the paper hinges on. The spectral gap $\gamma$ is the rate at which the chain converges. A small gap means slow mixing.

Remark (Cheeger and conductance).

The spectral gap $\gamma$ has a geometric counterpart. The conductance of $M$ is

$$\Phi\ :=\ \min_{\substack{A \subseteq S \\ 0 < \pi(A) \le 1/2}} \frac{1}{\pi(A)} \sum_{i \in A,\, j \notin A} \pi(i)\, M_{ji},$$

the minimum probability of escaping a set, weighted by stationary mass. Cheeger's inequality, in the discrete form due to Sinclair-Jerrum [14] (after [13]), gives

$$\frac{\Phi^2}{2}\ \le\ \gamma\ \le\ 2 \Phi.$$

So bottlenecks in the chain's geometry are the same obstacle as a small spectral gap. This is the rigorous version of the intuition behind the exploration paragraph in Section 5: a state space with a narrow neck mixes slowly and is therefore statistically expensive to learn from.

4.1 Decay of correlations

The same eigenexpansion gives a decay-of-correlations bound. For $f, g \in \ell^2(\pi)$ with $E_{\pi}[f] = E_{\pi}[g] = 0$,

This is the mixing statement of Definition 3.3, with an explicit geometric rate $\lambda_*^n$. So mixing in the qualitative sense [4] is the same statement as mixing at rate $\lambda_*$ in the spectral sense, just with explicit numbers.

4.2 Worked example continued

For the two-state chain of Section 2.1, $\lambda_2 = 1 - a - b$ and $\pi_{\min} = \min(a, b) / (a + b)$. Reversibility checks: $\pi_2 M_{12} = (a / (a+b)) \cdot b = ab/(a+b) = (b/(a+b)) \cdot a = \pi_1 M_{21}$. Theorem 4.4 then gives

When $a + b$ is close to $1$ the gap is large and the chain mixes essentially in one step. When $a + b$ is close to $0$ or $2$ the gap is small and mixing is slow. The interpretation is just what the formula already says: rare transitions, or near-flipping, give long memory.

The bound of Theorem 4.4 is sharp here. A direct computation using $\delta_1 = \pi + (a/(a+b))(1, -1)^{\top}$ and the second eigenvector $(1, -1)^{\top}$ gives

In the symmetric case $a = b = p$, the prefactor $a/(a+b) = 1/2$ matches $\tfrac{1}{2}\sqrt{(1-\pi(1))/\pi(1)} = 1/2$ exactly, so Theorem 4.4 is tight at every $n$. For $a = 0.2,\ b = 0.1$ the formula above gives $(2/3)(0.7)^n$ while the bound gives $(\sqrt{2}/2)(0.7)^n$; the ratio is $4/(3\sqrt{2}) \approx 0.94$, so the bound overshoots by under $7\%$.

5. Reinforcement Learning Interpretation

Now we apply the picture above to reinforcement learning. One annoying notational issue first. Standard RL references use $\pi$ for the policy and also $d^\pi$ for the stationary distribution of the induced chain [17], but in our setting $\pi$ already denotes the stationary distribution of $M$. So in this section let $\rho$ denote the policy and $\mu^{\rho}$ the stationary distribution of the chain it induces. Keeps everything readable.

5.1 Policies as Markov chains

A finite Markov decision process has a state space $S$, an action space $A$, transition kernels $P(s' \mid s, a)$, and a reward function $r : S \times A \to \mathbb{R}$. A stationary policy $\rho : S \to \Delta(A)$ collapses the MDP to a Markov chain on $S$ with transition matrix

left-stochastic in our convention, so $M^{\rho} \mu^{\rho} = \mu^{\rho}$ for the stationary distribution. In the irreducible aperiodic case Theorem 2.3 applies and gives a unique $\mu^{\rho}$, and Theorem 4.4 gives a spectral gap $\gamma^{\rho} = 1 - \lambda_*(M^{\rho})$ that controls the rate at which on-policy trajectories forget where they started [17, 15].

5.2 Three consequences of the spectral gap

Sample complexity of value estimation.

by the Monte Carlo estimator $\hat{J}_n = \tfrac{1}{n} \sum_{k=0}^{n-1} r(S_k, A_k)$ along a single trajectory. By Birkhoff (Theorem 3.2), $\hat{J}_n \to J(\rho)$ almost surely. The rate depends on the chain's autocovariances, which the spectral gap controls.

Proposition 5.1 (Asymptotic variance and effective sample size)

Suppose $M^{\rho}$ is irreducible, aperiodic, and reversible with respect to $\mu^{\rho}$, with absolute spectral gap $\gamma^{\rho} = 1 - \lambda_*(M^{\rho})$. Let $\bar{r} : S \times A \to \mathbb{R}$ satisfy $E_{\mu^{\rho}}[\bar{r}] = 0$. Then for the chain started at stationarity,

$$\mathrm{Var}_{\mu^{\rho}}(\hat{J}_n)\ \le\ \frac{2}{n\, \gamma^{\rho}}\, \| \bar{r} \|_{\ell^2(\mu^{\rho})}^2,$$

where $\hat{J}_n = \tfrac{1}{n} \sum_{k=0}^{n-1} \bar{r}(S_k, A_k)$.

Proof. By stationarity, $\mathrm{Var}_{\mu^{\rho}}(\bar{r}(S_k, A_k)) = \| \bar{r} \|^2$ for every $k$. The decay-of-correlations bound from Section 4 gives $|\mathrm{Cov}_{\mu^{\rho}}(\bar{r}(S_j, A_j), \bar{r}(S_k, A_k))| \le \lambda_*^{|k-j|}\, \| \bar{r} \|^2$. Summing the variances and covariances,

$$n^2\, \mathrm{Var}(\hat{J}_n)\ \le\ n\, \| \bar{r} \|^2\, \Bigl( 1 + 2 \sum_{m=1}^{\infty} \lambda_*^m \Bigr)\ =\ n\, \| \bar{r} \|^2\, \frac{1 + \lambda_*}{1 - \lambda_*}\ \le\ \frac{2n}{\gamma^{\rho}}\, \| \bar{r} \|^2,$$

since $1 + \lambda_* \le 2$. Dividing by $n^2$ gives the claim. $\square$

The effective sample size of $n$ on-policy steps is therefore roughly $n \gamma^{\rho}$, not $n$. Slow-mixing policies are statistically expensive.

Exploration.

The mixing time $t_{\mathrm{mix}}(\varepsilon)$ is the timescale over which the chain forgets where it started. By Corollary 4.5, this scales like $1 / \gamma^{\rho}$. For a tabular agent doing $\varepsilon$-greedy or Boltzmann exploration, this is exactly the timescale at which on-policy data starts to look representative of $\mu^{\rho}$. A small spectral gap is a quantitative version of the qualitative observation that bottlenecked state spaces are hard to explore.

Stability and generalization.

Policy gradient and temporal-difference methods build updates from rolling averages along trajectories [17]. By the decay-of-correlations bound, samples $\Theta(1 / \gamma^{\rho})$ steps apart behave roughly independently, so the empirical estimates that drive these updates concentrate around their on-policy means at a rate set by $\gamma^{\rho}$. A high spectral gap is the Markov-chain analogue of the classical generalization-from-iid story.

5.3 Honest framing

None of the three statements above is a new theorem. Each has a sharper form in the RL literature: asymptotic-variance theorems for Markov-chain Monte Carlo [15, 12, 11], mixing-time-based regret bounds for tabular RL, concentration inequalities for ergodic processes. What I do want to argue is a simple point [4, 6]: in a Markov system, the spectrum of the transition operator is the right object. RL inherits this for free, because every fixed-policy on-policy quantity is, in the end, a function of the spectrum of $M^{\rho}$. Modern reinforcement learning hides this under layers of function approximation. In the tabular case it is bare.

6. Discussion

Looking back at what we did. Perron-Frobenius (Theorem 2.3) gave us existence and uniqueness of $\pi$. Birkhoff (Theorem 3.2) said time averages along a trajectory converge almost surely to space averages under $\pi$. Mixing in the qualitative sense [4] is the decoupling of $A$ and $T^{-n}(B)$. The spectral gap $\gamma = 1 - \lambda_*$ supplied the rate, via Theorem 4.4. So the whole mixing-rate story sits inside the standard Markov-ergodic-spectral framework [6, 4, 9], with one extra ingredient (Cauchy-Schwarz in $\ell^2(\pi)$) glueing everything together.

A few directions go past what we did here. Infinite or continuous state spaces want the Doeblin or Harris recurrence framework [15]. Non-reversible chains lose self-adjointness and need pseudo-spectral or Cheeger-type bounds [14]. On the lattice $\mathbb{Z}^d$ a Fourier-theoretic analysis governs recurrence (Pólya [18], see also [5]), which is the infinite analogue of the finite spectral picture above. On the learning side, function approximation breaks the tabular spectral picture, and policy updates make the induced chain non-stationary in time. These are real open questions, not just gestures.

The takeaway is the one already suggested by the abstract. The spectral gap of an induced Markov chain is the rate at which a learning agent can turn observed trajectories into stable estimates of its own behavior. Well-mixing policies are the ones whose stationary statistics can actually be estimated. The badly-mixing ones are not.

References

[1] O. Knill, Jensen-Hölder-Minkowski Inequalities, lecture notes, Harvard University, 2026.

[2] O. Knill, The Ergodic Theorem, lecture notes, Harvard University, 2026.

[3] O. Knill, Recurrence, lecture notes, Harvard University, 2026.

[4] O. Knill, Mixing, lecture notes, Harvard University, 2026.

[5] O. Knill, Random Walks, lecture notes, Harvard University, 2026.

[6] O. Knill, Markov Chains, lecture notes, Harvard University, 2026.

[7] O. Perron, "Zur Theorie der Matrices," Mathematische Annalen 64 (1907), 248–263.

[8] G. Frobenius, "Über Matrizen aus positiven Elementen," Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin (1908), 471–476.

[9] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times, 2nd ed., American Mathematical Society, 2017.

[10] G. D. Birkhoff, "Proof of the ergodic theorem," Proceedings of the National Academy of Sciences 17 (1931), 656–660.

[11] D. Aldous and J. A. Fill, Reversible Markov Chains and Random Walks on Graphs, unfinished monograph (recompiled 2014).

[12] P. Diaconis and D. Stroock, "Geometric bounds for eigenvalues of Markov chains," Annals of Applied Probability 1 (1991), 36–61.

[13] J. Cheeger, "A lower bound for the smallest eigenvalue of the Laplacian," in Problems in Analysis (R. C. Gunning, ed.), Princeton University Press, 1970, 195–199.

[14] A. Sinclair and M. Jerrum, "Approximate counting, uniform generation and rapidly mixing Markov chains," Information and Computation 82 (1989), 93–133.

[15] S. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2nd ed., Cambridge University Press, 2009.

[16] J. R. Norris, Markov Chains, Cambridge University Press, 1997.

[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.

[18] G. Pólya, "Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz," Mathematische Annalen 84 (1921), 149–160.