StatNLP: Essential Information Theory, Part 3

Wed 28 March 2018

After a bit of a hiatus, it's time to wrap up the Information Theory introduction. In this post I'm going to talk about Mutual Information and KL-Divergence. I'll look at why these two are important, how we get them from what we already know, and what we can do with them (with an example or two for good measure). After this post we can hit the ground running with the StatNLP book, woohoo!

Mutual Information

At the end of the previous post, we looked at the relationship between joint entropy and conditional entropy. Recall that \(H(X,Y) = H(X) + H(Y \vert X)\). In this formula, we can reverse \(X\) and \(Y\) to get \(H(X,Y) = H(Y) + H(X \vert Y)\). The two are equivalent. If we rearrange those formulas we can derive the formula for mutual information:

$$ \begin{align} H(X) + H(Y \vert X) &= H(Y) + H(X \vert Y) \\ H(X) - H(X \vert Y) &= H(Y) - H(Y \vert X) \\ &= I(X;Y) \end{align} $$

Mutual information describes the reduction in uncertainty of one random variable that you get if you know another random variable. For example if \(X\) is a random variable indicating whether it is currently raining or not, and \(Y\) is a random variable indicating whether the ground or not is wet, if you know that the ground is wet (\(Y=1\)), that will reduce the amount of uncertainty in determining whether it is raining (\(X\)).

How do you calculate mutual information? From Equation 2.36 in StatNLP:

$$ \begin{align} I(X;Y) &= H(X) - H(X \vert Y) \\ &= H(X) + H(Y) - H(X,Y) \textit{ by the chain rule} \\ &= \sum_x p(x) \log \frac{1}{p(x)} + \sum_y p(y) \log \frac{1}{p(y)} + \sum_{x,y} p(x,y) \log p(x,y) \\ &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \end{align} $$

In the case where two random variables are independent, then \(I(X;Y) = 0\). Why? Because \(H(X \vert Y) = H(X)\), so we have \(I(X;Y) = H(X) - H(X) = 0\). This makes sense since mutual information is measuring the reduction in uncertainty for one variable when you know another one. If you know \(Y\), but it is independent of \(X\), there is no reduction in uncertainty for \(X\).


The next definition is for Kullback-Leibler divergence, or KL-divergence. KL-divergence measures how far one probability distribution is from another:

$$ D(p \vert \vert q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} $$

It's worth noting that KL-divergence is not a distance metric since it is not symmetric, \(D(p \vert \vert q) \neq D(q \vert \vert p)\). It's useful for determining how good of an estimate q is for p. And looking at our equation for mutual information, we see that it can be expressed as a KL-divergence: \(I(X;Y) = D(p(x,y) \vert \vert p(x)p(y))\). This shows us that mutual information is a way to quantify the dependence of a joint distribution.

Cross Entropy

Imagine that you have some random variable \(X\) with some true probability distribution \(p(x)\). You don't know what \(p\) is, but you have this other distribution \(q\), and you want to know how good of a job it does at approximating \(p(x)\). To answer this question requires cross entropy: \(H(X, q) = H(X) + D(p \vert \vert q)\). Now wait a minute. The formula for cross entropy still requires that we know \(p(x)\). Let's rearrange some terms:

$$ \begin{align} H(X, q) &= H(X) + D(p \vert \vert q) \\ &= - \sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)} \\ &= - \sum_x p(x) \log p(x) + \sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{1}{q(x)} \\ &= \sum_x p(x) \frac{1}{\log q(x)} \\ &= \mathbb{E}_p [\log \frac{1}{q(x)}] \end{align} $$

We can calculate this if we know \(q\).

Wrapping Up

These last few posts had a lot of definitions and math in them, but hopefully they are helpful for defining some key concepts in Information Theory. Entropy is a useful tool for machine learning and NLP, and upcoming posts will be using these concepts.