How fast does a finite Markov chain forget where it started? The question has a clean answer in our setting, and a useful one for anything that learns by repeating an experiment. Perron-Frobenius hands us a unique stationary $\pi$ when $M$ has positive entries; the second eigenvalue of $M$ then sets the rate at which the iterates $M^n \mu_0$ relax back to $\pi$. I work this out using a small set of standard tools, primarily Cauchy-Schwarz, Birkhoff's ergodic theorem, Perron-Frobenius, and the mixing definition for measure-preserving transformations, and prove a quantitative form of that mixing definition in which the proof reduces to one application of the Cauchy-Schwarz inequality. The last third of the paper reads the result back into the language of reinforcement learning: a stationary policy on a Markov decision process induces a Markov chain, and the spectral gap of that chain is, more or less, the rate at which the agent can learn anything stable about itself. That last part is interpretive rather than technical. The math stays inside the standard Markov, ergodic, and spectral framework.
Take a finite Markov chain and watch what happens to it. Start at one specific state, hit it with the transition matrix $M$ a bunch of times, and the distribution of where you are spreads out. Eventually, it forgets where it began. How fast it forgets is the rate I want to compute. The rough form of the answer is the second eigenvalue of $M$. The careful form is Theorem 4.1 below.
The qualitative version of the same statement is the standard mixing definition for measure-preserving transformations [4]. There, a measure-preserving $T$ on $(\Omega, \mathcal{A}, P)$ is called mixing if
for every $A, B \in \mathcal{A}$. That definition is asymptotic and rate-free. The point of this paper is to refine it, in the finite Markov setting, into a quantitative bound with an explicit dependence on the spectrum of $M$. Perron-Frobenius then guarantees that $M$ has the structure we need [7, 8, 6]: a unique stationary $\pi$ at eigenvalue $1$, and the rest of the spectrum strictly inside the unit disk.
The question, written out, is the following: what controls the rate at which a Markov chain forgets its initial distribution, and what does that rate tell us about systems that learn from their own trajectories?
The second half of the question is what I find interesting, and what motivates the reinforcement learning interpretation in Section 5. Briefly: an RL agent following a fixed policy on a Markov decision process induces a Markov chain on the state space [17]. Almost every quantity that agent has to estimate from data, ie value functions, on-policy expectations, policy gradients, is a time average along the trajectories of that chain. So the mixing rate is not abstract for an RL agent. Slow mixing means high-variance estimates, sluggish exploration, and unstable updates. Fast mixing gives the opposite. In a real sense, mixing rate is learning rate.
The plan is short. Section 2 reviews finite Markov chains in the left-stochastic convention and proves Perron-Frobenius. Section 3 hooks the chain into ergodic theory and quotes Birkhoff. Section 4 is the technical core. Section 5 reads the result back as an RL statement. Section 6 closes.
Throughout the paper, $S = \{1, 2, \dots, s\}$ is a finite state space, $(\Omega, \mathcal{A}, P)$ is a probability space, and $X_0, X_1, X_2, \dots$ are $S$-valued random variables [6].
The sequence $(X_n)_{n \ge 0}$ is a (time-homogeneous) Markov chain on $S$ if for every $n \ge 0$ and every $i_0, \dots, i_{n+1} \in S$,
and the right-hand side does not depend on $n$.
We use the left-stochastic convention [6], ie. the matrix $M$ acts on column probability vectors. Each entry $M_{ij}$ is the probability of going from state $j$ to state $i$, so entries are non-negative and each column sums to $1$:
Equivalently, $M^{\top} \mathbf{1} = \mathbf{1}$, where $\mathbf{1} = (1, \dots, 1)^{\top}$. So $1$ is automatically an eigenvalue of $M^{\top}$, and hence of $M$. The eigenvector of $M$ at eigenvalue $1$, normalized to a probability vector, is what we want.
A probability vector $\pi = (\pi_1, \dots, \pi_s)^{\top}$ on $S$, ie. with $\pi_i \ge 0$ and $\sum_i \pi_i = 1$, is a stationary distribution of $M$ if $M \pi = \pi$.
If $X_0 \sim \pi$ then $X_n \sim \pi$ for all $n \ge 0$, so $\pi$ is the law that the chain leaves invariant. The first basic question is whether such a $\pi$ exists, and whether it is unique. Perron-Frobenius answers both. The proof below follows the approach in [6].
Let $M$ be an $s \times s$ left-stochastic matrix with strictly positive entries, $M_{ij} > 0$. Then there is a unique probability vector $\pi$ with $M \pi = \pi$, and every entry of $\pi$ is strictly positive.
For uniqueness we use a strict-inequality argument. Suppose $Mw = w$ for some $w \in \mathbb{R}^s$ that has at least one negative entry, normalized so $\sum_i |w_i| = 1$. Let $|w|$ denote the vector with entries $|w_j|$. Then
so $M|w|$ is also a probability vector. For each $i$,
The triangle inequality is strict at any $i$ where the $w_j$ have mixed signs, since $M_{ij} > 0$ for every $j$. But $\sum_i |w_i| = \sum_i (M|w|)_i = 1$ forces equality everywhere. So $w$ cannot have entries of both signs. Hence $w$, after possibly multiplying by $-1$, lies in $X$. Now suppose $\pi'$ is another fixed point in $X$. Then $\pi - \pi'$ is a $1$-eigenvector summing to $0$. If it is nonzero it has entries of both signs, which the argument above just ruled out. So $\pi = \pi'$.
For positivity, $\pi_i = (M\pi)_i = \sum_j M_{ij} \pi_j > 0$, since $M_{ij} > 0$ and $\pi$ is a nonzero probability vector. $\square$
The positive-entries hypothesis is stronger than needed. If $M$ is irreducible and aperiodic, some power $M^k$ has all entries positive, and Theorem 2.3 applied to $M^k$ gives the same conclusion for $M$ [16, 9].
A single example will return throughout the paper. Fix $a, b \in (0, 1)$ and let
Column $1$ is the transition law from state $1$, and column $2$ from state $2$. Solving $M\pi = \pi$ with $\pi_1 + \pi_2 = 1$ gives
The eigenvalues of $M$ are $\lambda_1 = 1$ and $\lambda_2 = 1 - a - b$, the latter visible from $\lambda_1 + \lambda_2 = \mathrm{tr}(M) = 2 - a - b$. As long as $(a,b) \ne (0,0)$ and $(a,b) \ne (1,1)$, we have $|\lambda_2| < 1$. We will see in Section 4 that $|\lambda_2|$ is the rate at which the chain forgets its initial state.
A Markov chain also fits inside the standard ergodic-theoretic framework [2, 3, 4], and that fitting is what makes the spectral approach below work. Take the chain $(X_n)$ with transition matrix $M$ and stationary distribution $\pi$. Let $\Omega = S^{\mathbb{N}}$ be the space of one-sided sequences with the product $\sigma$-algebra, and $T : \Omega \to \Omega$ the shift $(T \omega)_n = \omega_{n+1}$. The Kolmogorov extension theorem then hands us a unique probability $P$ on $\Omega$ for which the coordinate maps $X_n(\omega) = \omega_n$ form the chain in question. Stationarity of $\pi$ is exactly the statement that $T$ is measure-preserving [6]. The dictionary back and forth between chain and shift is faithful: irreducibility of the chain matches ergodicity of the shift, and irreducibility plus aperiodicity matches mixing in the sense below.
$T$ is ergodic if every $T$-invariant set is trivial, ie. $T^{-1}(A) = A$ implies $P[A] \in \{0, 1\}$.
Let $T$ be ergodic and $X \in L^1(P)$. Set $X_k = X(T^k)$ and $S_n = \sum_{k=0}^{n-1} X_k$. Then $S_n / n \to E[X]$ almost everywhere.
For a Markov chain at stationarity, Birkhoff says that the time average of any reward functional along a trajectory converges almost surely to the space average $E_{\pi}[X]$. The full proof goes via Hopf's maximal ergodic theorem [2] and is not reproduced here. What matters for us is the negative observation: Birkhoff alone gives no rate. It says "the chain forgets," but not "how fast."
The standard mixing definition for measure-preserving transformations is what carries quantitative content.
$T$ is mixing if for every $A, B \in \mathcal{A}$,
$T$ is weakly mixing if for every $A, B \in \mathcal{A}$,
The standard result [4] establishes the strict hierarchy
and characterizes weak mixing in terms of the Koopman operator $U_T(X) = X(T^{-1})$ on $L^2$. The hard work of relating mixing to spectrum lives there. What we need here is just the definition: $T$ mixing means $P[T^{-n}(A) \cap B] \to P[A]\, P[B]$, with no rate specified.
Poincaré recurrence [3] is the closely related qualitative fact that any set $A$ with $P[A] > 0$ is hit again by the orbit, ie. $P[T^n(A) \cap A] > 0$ for some $n \le 1/P[A]$. Poincaré says trajectories return; mixing says they decorrelate. Section 4 works out how fast.
Mixing in the sense of Definition 3.3 only says the gap goes to zero. A rate would tell us how fast it gets there. To even formulate a rate we need a distance between probability vectors, a definition of mixing time, and a tool for relating the time to a property of $M$. The first two come immediately. The third is the spectral gap.
For probability vectors $\mu, \nu$ on $S$,
For $\varepsilon \in (0, 1)$, the $\varepsilon$-mixing time of $M$ is
where $\delta_x$ is the unit mass at $x$.
For an irreducible aperiodic chain, $\| M^n \delta_x - \pi \|_{TV} \to 0$ for every $x$ [9]. The question, again, is at what rate. The cleanest answer comes in the reversible case.
$M$ is reversible with respect to $\pi$ if $\pi_j M_{ij} = \pi_i M_{ji}$ for all $i, j \in S$.
Reversibility makes $M$ symmetric in the right inner product. Equip $\mathbb{R}^S$ with the $\ell^2(\pi)$ inner product
and consider $L f := M^{\top} f$ acting on functions $f : S \to \mathbb{R}$. Reversibility gives self-adjointness $\langle f, L g \rangle_{\pi} = \langle L f, g \rangle_{\pi}$. (Quick check: $\langle f, L g \rangle_{\pi} = \sum_{i,j} \pi_j M_{ij} f(j) g(i)$, and reversibility rewrites $\pi_j M_{ij} = \pi_i M_{ji}$, which gives $\langle L f, g \rangle_{\pi}$.)
By the spectral theorem, $L$ has $s$ real eigenvalues
and an orthonormal basis $\{f_k\}_{k=1}^s$ of eigenfunctions in $\ell^2(\pi)$ with $f_1 \equiv 1$. Aperiodicity rules out $\lambda_s = -1$ [9], so
is the absolute spectral gap. Eigenvalues of $L = M^{\top}$ and $M$ coincide, so $\lambda_*$ also bounds the modulus of every non-unit eigenvalue of $M$.
Let $M$ be irreducible, aperiodic, and reversible with respect to $\pi$. For every $x \in S$ and every $n \ge 1$,
where the second equality uses reversibility $\pi_y M_{yz} = \pi_z M_{zy}$. The right-hand side is $\sum_z M_{zy} (\mu(z) / \pi(z)) = (L (\mu / \pi))(y)$. Iterating gives $\rho_n = L^n \rho_0$.
Expand $\rho_0$ in the eigenbasis of $L$. The coefficients are $\langle \rho_0, f_k \rangle_{\pi} = \sum_y \pi(y) (\delta_x(y) / \pi(y)) f_k(y) = f_k(x)$, so
Applying $L^n$,
By orthonormality of $\{f_k\}$ in $\ell^2(\pi)$,
Parseval gives $\sum_{k=1}^s f_k(x)^2 = \| \rho_0 \|_{\ell^2(\pi)}^2 = \sum_y \pi(y) (\delta_x(y) / \pi(y))^2 = 1 / \pi(x)$, and since $f_1 \equiv 1$ we get $\sum_{k=2}^s f_k(x)^2 = (1 - \pi(x)) / \pi(x)$. So
Now apply the Cauchy-Schwarz inequality [1]: with the constant random variable $1$ on $(S, \pi)$,
The left side is $\sum_y |M^n \delta_x(y) - \pi(y)| = 2 \| M^n \delta_x - \pi \|_{TV}$, which finishes the proof. $\square$
Inverting the bound for $n$ gives the standard mixing-time corollary, the form that is most useful when comparing chains of different sizes.
$\displaystyle t_{\mathrm{mix}}(\varepsilon)\ \le\ \frac{1}{\gamma}\, \log\!\left( \frac{1}{2 \varepsilon \sqrt{\pi_{\min}}} \right)$, where $\pi_{\min} = \min_{x \in S} \pi(x)$.
This is the inequality the paper hinges on. The spectral gap $\gamma$ is the rate at which the chain converges. A small gap means slow mixing.
The spectral gap $\gamma$ has a geometric counterpart. The conductance of $M$ is
the minimum probability of escaping a set, weighted by stationary mass. Cheeger's inequality, in the discrete form due to Sinclair-Jerrum [14] (after [13]), gives
So bottlenecks in the chain's geometry are the same obstacle as a small spectral gap. This is the rigorous version of the intuition behind the exploration paragraph in Section 5: a state space with a narrow neck mixes slowly and is therefore statistically expensive to learn from.
The same eigenexpansion gives a decay-of-correlations bound. For $f, g \in \ell^2(\pi)$ with $E_{\pi}[f] = E_{\pi}[g] = 0$,
Pulling out $\lambda_*^n$ and applying Cauchy-Schwarz to the remaining sum,
Choosing $f = \mathbf{1}_A - \pi(A)$, $g = \mathbf{1}_B - \pi(B)$ recovers
This is the mixing statement of Definition 3.3, with an explicit geometric rate $\lambda_*^n$. So mixing in the qualitative sense [4] is the same statement as mixing at rate $\lambda_*$ in the spectral sense, just with explicit numbers.
For the two-state chain of Section 2.1, $\lambda_2 = 1 - a - b$ and $\pi_{\min} = \min(a, b) / (a + b)$. Reversibility checks: $\pi_2 M_{12} = (a / (a+b)) \cdot b = ab/(a+b) = (b/(a+b)) \cdot a = \pi_1 M_{21}$. Theorem 4.4 then gives
When $a + b$ is close to $1$ the gap is large and the chain mixes essentially in one step. When $a + b$ is close to $0$ or $2$ the gap is small and mixing is slow. The interpretation is just what the formula already says: rare transitions, or near-flipping, give long memory.
The bound of Theorem 4.4 is sharp here. A direct computation using $\delta_1 = \pi + (a/(a+b))(1, -1)^{\top}$ and the second eigenvector $(1, -1)^{\top}$ gives
In the symmetric case $a = b = p$, the prefactor $a/(a+b) = 1/2$ matches $\tfrac{1}{2}\sqrt{(1-\pi(1))/\pi(1)} = 1/2$ exactly, so Theorem 4.4 is tight at every $n$. For $a = 0.2,\ b = 0.1$ the formula above gives $(2/3)(0.7)^n$ while the bound gives $(\sqrt{2}/2)(0.7)^n$; the ratio is $4/(3\sqrt{2}) \approx 0.94$, so the bound overshoots by under $7\%$.
Now we apply the picture above to reinforcement learning. One annoying notational issue first. Standard RL references use $\pi$ for the policy and also $d^\pi$ for the stationary distribution of the induced chain [17], but in our setting $\pi$ already denotes the stationary distribution of $M$. So in this section let $\rho$ denote the policy and $\mu^{\rho}$ the stationary distribution of the chain it induces. Keeps everything readable.
A finite Markov decision process has a state space $S$, an action space $A$, transition kernels $P(s' \mid s, a)$, and a reward function $r : S \times A \to \mathbb{R}$. A stationary policy $\rho : S \to \Delta(A)$ collapses the MDP to a Markov chain on $S$ with transition matrix
left-stochastic in our convention, so $M^{\rho} \mu^{\rho} = \mu^{\rho}$ for the stationary distribution. In the irreducible aperiodic case Theorem 2.3 applies and gives a unique $\mu^{\rho}$, and Theorem 4.4 gives a spectral gap $\gamma^{\rho} = 1 - \lambda_*(M^{\rho})$ that controls the rate at which on-policy trajectories forget where they started [17, 15].
Consider an agent estimating
by the Monte Carlo estimator $\hat{J}_n = \tfrac{1}{n} \sum_{k=0}^{n-1} r(S_k, A_k)$ along a single trajectory. By Birkhoff (Theorem 3.2), $\hat{J}_n \to J(\rho)$ almost surely. The rate depends on the chain's autocovariances, which the spectral gap controls.
Suppose $M^{\rho}$ is irreducible, aperiodic, and reversible with respect to $\mu^{\rho}$, with absolute spectral gap $\gamma^{\rho} = 1 - \lambda_*(M^{\rho})$. Let $\bar{r} : S \times A \to \mathbb{R}$ satisfy $E_{\mu^{\rho}}[\bar{r}] = 0$. Then for the chain started at stationarity,
where $\hat{J}_n = \tfrac{1}{n} \sum_{k=0}^{n-1} \bar{r}(S_k, A_k)$.
since $1 + \lambda_* \le 2$. Dividing by $n^2$ gives the claim. $\square$
The effective sample size of $n$ on-policy steps is therefore roughly $n \gamma^{\rho}$, not $n$. Slow-mixing policies are statistically expensive.
The mixing time $t_{\mathrm{mix}}(\varepsilon)$ is the timescale over which the chain forgets where it started. By Corollary 4.5, this scales like $1 / \gamma^{\rho}$. For a tabular agent doing $\varepsilon$-greedy or Boltzmann exploration, this is exactly the timescale at which on-policy data starts to look representative of $\mu^{\rho}$. A small spectral gap is a quantitative version of the qualitative observation that bottlenecked state spaces are hard to explore.
Policy gradient and temporal-difference methods build updates from rolling averages along trajectories [17]. By the decay-of-correlations bound, samples $\Theta(1 / \gamma^{\rho})$ steps apart behave roughly independently, so the empirical estimates that drive these updates concentrate around their on-policy means at a rate set by $\gamma^{\rho}$. A high spectral gap is the Markov-chain analogue of the classical generalization-from-iid story.
None of the three statements above is a new theorem. Each has a sharper form in the RL literature: asymptotic-variance theorems for Markov-chain Monte Carlo [15, 12, 11], mixing-time-based regret bounds for tabular RL, concentration inequalities for ergodic processes. What I do want to argue is a simple point [4, 6]: in a Markov system, the spectrum of the transition operator is the right object. RL inherits this for free, because every fixed-policy on-policy quantity is, in the end, a function of the spectrum of $M^{\rho}$. Modern reinforcement learning hides this under layers of function approximation. In the tabular case it is bare.
Looking back at what we did. Perron-Frobenius (Theorem 2.3) gave us existence and uniqueness of $\pi$. Birkhoff (Theorem 3.2) said time averages along a trajectory converge almost surely to space averages under $\pi$. Mixing in the qualitative sense [4] is the decoupling of $A$ and $T^{-n}(B)$. The spectral gap $\gamma = 1 - \lambda_*$ supplied the rate, via Theorem 4.4. So the whole mixing-rate story sits inside the standard Markov-ergodic-spectral framework [6, 4, 9], with one extra ingredient (Cauchy-Schwarz in $\ell^2(\pi)$) glueing everything together.
A few directions go past what we did here. Infinite or continuous state spaces want the Doeblin or Harris recurrence framework [15]. Non-reversible chains lose self-adjointness and need pseudo-spectral or Cheeger-type bounds [14]. On the lattice $\mathbb{Z}^d$ a Fourier-theoretic analysis governs recurrence (Pólya [18], see also [5]), which is the infinite analogue of the finite spectral picture above. On the learning side, function approximation breaks the tabular spectral picture, and policy updates make the induced chain non-stationary in time. These are real open questions, not just gestures.
The takeaway is the one already suggested by the abstract. The spectral gap of an induced Markov chain is the rate at which a learning agent can turn observed trajectories into stable estimates of its own behavior. Well-mixing policies are the ones whose stationary statistics can actually be estimated. The badly-mixing ones are not.
[1] O. Knill, Jensen-Hölder-Minkowski Inequalities, lecture notes, Harvard University, 2026.
[2] O. Knill, The Ergodic Theorem, lecture notes, Harvard University, 2026.
[3] O. Knill, Recurrence, lecture notes, Harvard University, 2026.
[4] O. Knill, Mixing, lecture notes, Harvard University, 2026.
[5] O. Knill, Random Walks, lecture notes, Harvard University, 2026.
[6] O. Knill, Markov Chains, lecture notes, Harvard University, 2026.
[7] O. Perron, "Zur Theorie der Matrices," Mathematische Annalen 64 (1907), 248–263.
[8] G. Frobenius, "Über Matrizen aus positiven Elementen," Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin (1908), 471–476.
[9] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times, 2nd ed., American Mathematical Society, 2017.
[10] G. D. Birkhoff, "Proof of the ergodic theorem," Proceedings of the National Academy of Sciences 17 (1931), 656–660.
[11] D. Aldous and J. A. Fill, Reversible Markov Chains and Random Walks on Graphs, unfinished monograph (recompiled 2014).
[12] P. Diaconis and D. Stroock, "Geometric bounds for eigenvalues of Markov chains," Annals of Applied Probability 1 (1991), 36–61.
[13] J. Cheeger, "A lower bound for the smallest eigenvalue of the Laplacian," in Problems in Analysis (R. C. Gunning, ed.), Princeton University Press, 1970, 195–199.
[14] A. Sinclair and M. Jerrum, "Approximate counting, uniform generation and rapidly mixing Markov chains," Information and Computation 82 (1989), 93–133.
[15] S. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2nd ed., Cambridge University Press, 2009.
[16] J. R. Norris, Markov Chains, Cambridge University Press, 1997.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[18] G. Pólya, "Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz," Mathematische Annalen 84 (1921), 149–160.