### StatNLP: Essential Information Theory, Part 2

Fri 23 February 2018

Welcome back! In the last post we started our discussion of Information Theory, and defined some key building blocks related to calculating information for a single random variable. This time we'll look at the relationships between two random variables, and what they can tell us (if anything) about the informativeness of each other.

## Joint Entropy

The first thing we'll look at is joint entropy. For two random variables, $$X$$ and $$Y$$, the joint entropy between them is

\begin{align} H(X, Y) &= - \sum_{x \in X} \sum_{y \in Y} p(x, y) \log p(x, y) \\ &= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{1}{p(x, y)} \end{align}

Joint entropy refers to the amount of information you would need to determine the values of two random variables. As an example, we'll go back to coin flipping from last time. Now, we have two independent coins, both of them unbiased (so $$p(x=H) = p(x=T) = p(y=H) = p(y=T) = 0.5$$).

The joint entropy for these two random variables is as follows (below the order for the variables is $$(x,y)$$, so $$p(H,H)$$ is really $$p(x=H, y=H)$$:

\begin{align} H(X, Y) &= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{1}{ p(x, y)} \\ &= p(H,H)\log \frac{1}{p(H,H)} + p(H,T)\log \frac{1}{p(H,T)} \\ &+ p(T,H)\log \frac{1}{p(T,H)} + p(T,T)\log \frac{1}{p(T,T)} \\ &= \frac{1}{4}\log 4 + \frac{1}{4}\log 4+ \frac{1}{4}\log 4+ \frac{1}{4}\log 4 \\ &= 2 \end{align}

This should make sense, since we showed last time that the entropy of a single unbiased coin is 1 bit. If we have two coins, we then double our information.

## Conditional Entropy

Conditional entropy is related to joint entropy. We're still dealing with two random variables, but with conditional entropy we're interested in how much information is gained when we know one of the outcomes.

\begin{align} H(Y \vert X) &= \sum_{x \in X} p(x) H(Y \vert X = x) \\ &= \sum_{x \in X} p(x) [\sum_{y \in Y} p(y \vert x) \log \frac{1}{p(y \vert x)}] \\ &= \sum_{x \in X} \sum_{y \in Y} p(x, y) log \frac{1}{p(y \vert x)} \end{align}

Ok, so what just happened? Let's take the previous set of equations one at a time. In the first line we define the conditional entropy as a weighted average of the conditional entropies of $$Y$$ given each possible value of $$X$$. In the second line we just expand that out, using the definition of entropy. Finally, $$p(x)p(y|x) = p(x,y)$$ so we can rewrite the equation as in the third line.

Joint and conditional entropy are closely related. With joint entropy, we are measuring the amount of information in two random variables. Conditional entropy we already know one of the two random variables, and so we're measuring how much additional information is in the other. We can also look at it mathematically, starting with the definition of joint entropy:

\begin{align} H(X, Y) &= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{1}{ p(x, y)} \\ &= \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{1}{ p(x) p(y\vert x)} \\ &= \sum_{x \in X} \sum_{y \in Y} p(x,y)[ \log \frac{1}{p(x)} + \log \frac{1}{p(y \vert x)}] \\ &= \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{1}{p(x)} + \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{1}{p(y \vert x)} \\ &= \sum_{x \in X} p(x) \log \frac{1}{p(x)} + \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{1}{p(y \vert x)} \\ &= H(X) + H(Y \vert X) \end{align}

Let's walk through that. In eqn. 12, we rewrite $$p(x,y)$$ in the log. In eqn 13. we split that log product into a sum of logs. In eqn. 14 we move $$p(x,y)$$ into the sum, and in eqn. 15 we use the fact that $$\sum_{y \in Y} p(x, y) = p(x)$$.

So to get the joint entropy of two random variables, you calculate the entropy of one, and add the conditional entropy of the other, given the first one. If the two random variables are independent, then $$H(X, Y) = H(X) + H(Y)$$. We saw that above with our two coin example. If you swap all of the x's and y's in the above, you'll see that the order doesn't matter, and $$H(X,Y) = H(X) + H(Y \vert X) = H(Y) + H(X \vert Y)$$

## Wrapping Up

There were a lot of equations here, but hopefully they all make sense. We're still trying to measure information (as in standard entropy for a single random variable) but now we can look at the information relationship between pairs of random variables. Next time we'll look at two more advanced concepts: mutual information and KL-Divergence.