Information theory sits comfortably between electrical engineering and statistics / machine learning. Its founding came from Claude Shannon's landmark 1948 paper "A Mathematical Theory of Communication" (his 1937 masters thesis, which connected Boolean algebra to circuit design, is separately considered arguably the most important masters thesis ever written). In that paper, he laid the building blocks of information theory, one of which is entropy. Before I talk about that, though, it's necessary to explicitly state the definition of information.

The word "information" can be an easy source of confusion, because its colloquial use doesn't exactly match its mathematical definition. What it is, is simply the answer to a question of some kind. So, 1 bit of information is the answer to 1 yes-or-no question. If you have more signal options than just 0 and 1, such as \(\{-3V, -1V, 1V, 3V\}\), which Edison used in his quadruplex telegraph, then you can make your questions more definitive than yes-or-no ones. In this post, I'm going to stick to binary digits (bits).

Note – We're considering everything in the context of a specific system: one with a message source, a channel, and a receiver. We largely consider things from the point of view of the receiver, but it is important to realize the implications of the message source and the channel.

Before we consider entropy, let's look at the following relationship. If you and a friend are sending messages, and you have formulated a list of *N* yes-or-no questions, then the number of possible messages you can send is \(2^N\). We call the set of all possible messages the message space, and we represent its size with the letter \(S\).

$$\begin{align}
S &= 2^{\text{\# of questions}} \\
&= 2^{\text{\# of bits}}
\end{align}$$

This should make sense. If we have 3 bits, or 3 questions, then there are \(2^3 = 8\) combinations of those bits, so that is the set of all possible messages we could represent.
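To make the message space concrete, here is a quick sketch (not part of the original post) that enumerates every 3-bit message:

```python
from itertools import product

# All messages expressible with 3 bits (3 yes-or-no questions)
bits = 3
messages = ["".join(m) for m in product("01", repeat=bits)]

print(len(messages))  # 2**3 = 8
print(messages)       # ['000', '001', '010', ..., '111']
```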

We now need to inject probability. Entropy is a measure of unpredictability of information, so choice is an important aspect of it. If we had a message source that could only send 1’s (yes’s), no information would be gained for a given message, because to any question we ask, the source can only give 1 answer—yes. Certainty gives no information, and highly probable events only give a little. Improbable events, however, give a lot of information. This is analogous to being a detective and interviewing a crime suspect. If the suspect says what you expect, you aren’t learning much. But, if he surprises you with an answer you didn’t expect, a larger amount of information is gained.

So, the conclusion from the above reasoning is that entropy depends on the probability of the answer. Let's define it now, and do a few examples to provide the intuition behind its definition.

$$H = -\sum_i p_i \cdot \log_2(p_i)$$

In essence, \(\log_2(p_i)\) measures the information gained when response *i* is given, and \(p_i\) is the probability of getting that response. Thinking that way, entropy is an expected value: the expected amount of information acquired from each symbol sent. Lastly, it's negative because log base 2 of any number in \((0, 1]\) yields a nonpositive value, but we are *gaining* information for each symbol (off-on pulses, 0's or 1's, etc.) sent, so we want the final answer to be positive. More on the negative later.
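A minimal sketch of this formula in code (the function name is my own):

```python
import math

def entropy(probs):
    """Shannon entropy in bits per symbol: H = -sum(p_i * log2(p_i)).
    Terms with p_i == 0 are skipped, since 0 * log(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit/symbol
entropy([1.0])              # a source that can only say "yes": 0 bits
```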

Consider a message source, where each bit sent is given by a fair coin toss, so naturally, a 0 (tails) has a probability of \(\frac{1}{2}\), and a 1 (heads) has a probability of \(\frac{1}{2}\). The entropy is then given by

$$\begin{align}
H &= -(p_0 \cdot \log_2(p_0) + p_1 \cdot \log_2(p_1)) \\
&= -\Big( \frac{1}{2} \cdot \log_2 \Big( \frac{1}{2} \Big) + \frac{1}{2} \cdot \log_2 \Big( \frac{1}{2} \Big) \Big) \\
&= -\Big( -\frac{1}{2} - \frac{1}{2} \Big) \\
&= 1 \ \text{bit/symbol}
\end{align}$$

This means that, on average, it will take 1 bit to send each symbol, or, for every heads or tails response we want to send, it will take one bit.

However, entropy is a measure of uncertainty, so probability matters. Remember that the probable is less informative than the improbable. Consider the same scenario, but with an unfair coin, say with \(p_0 = \frac{1}{8}\), which means \(p_1 = \frac{7}{8}\). Calculating the entropy yields:

$$\begin{align}
H &= -(p_0 \cdot \log_2(p_0) + p_1 \cdot \log_2(p_1)) \\
&= -\Big( \frac{1}{8} \cdot \log_2 \Big( \frac{1}{8} \Big) + \frac{7}{8} \cdot \log_2 \Big( \frac{7}{8} \Big) \Big) \\
&= -\big( -0.375 - 0.1686 \big) \\
&= 0.5436 \ \frac{bits}{symbol}
\end{align}$$

Notice how the information yielded by the improbable event, a 0, represented by \(\log_2(\frac{1}{8})\), is substantially larger in magnitude than the information yielded by the probable event, represented by \(\log_2(\frac{7}{8})\). What this means is: if we have a message source that chooses between a 0 and a 1 by flipping an unfair coin with the corresponding probabilities \(p_0 = \frac{1}{8}\) and \(p_1 = \frac{7}{8}\), then on average, we will need about \(0.5436\) bits per symbol that we want to send. You might be thinking: how is this possible?
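Before answering, we can double-check the arithmetic, with all logs taken base 2:

```python
import math

# Unfair coin: P(0) = 1/8, P(1) = 7/8
p0, p1 = 1 / 8, 7 / 8
H = -(p0 * math.log2(p0) + p1 * math.log2(p1))

print(round(H, 4))  # 0.5436 bits/symbol, well under the fair coin's 1.0
```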

It has to do with how we encode the message. You choose the shortest binary codes to represent the most probable messages, and the longest to represent the least probable. This is analogous to choosing the binary digit 1 to represent the word "the" in the English language, or "e" in the English alphabet, and then choosing 11111 to represent an "x", a much less probable letter. This is the essence of Huffman coding, the most efficient way to encode information symbol by symbol.
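As a sketch (not from the original post), here is the standard textbook Huffman construction using a min-heap; the four symbols and their probabilities below are made up for illustration:

```python
import heapq
from itertools import count

def huffman_codes(probs):
    """Build Huffman codewords: probable symbols get short codes,
    improbable symbols get long ones. probs maps symbol -> probability."""
    tiebreak = count()  # unique tag so the heap never compares tree nodes
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # pop the two least probable
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: record the codeword
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

# Hypothetical four-symbol source with power-of-two probabilities
codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(codes)  # the most probable symbol gets a 1-bit code
```

With these probabilities, the expected code length is \(0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75\) bits/symbol, which exactly matches the entropy, since every probability is a power of \(\frac{1}{2}\).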

It's important to see that entropy is maximized when all *n* events have probability exactly \(\frac{1}{n}\). In this case, it's difficult to gain much efficiency with the encoding, since all messages are equally likely. This could be a message source forming a message from 6 equally probable words, perhaps chosen by rolling a die. Let it be clear that, even though the message source may choose among 6 words, in this case, the message is still being sent using only binary digits, as represented in the entropy formula by log *base 2*.

In general, if we have *n* symbols we are trying to represent using bits, and each symbol (word, letter, poker hand, etc.) has an equal probability, then we can use a special case of the entropy formula; below is its derivation. Note that \(p_1 = p_2 = \ldots = p_n = \frac{1}{n}\).

$$\begin{align}
H &= -(p_1 \log_2(p_1) + p_2 \log_2(p_2) + \ldots + p_n \log_2(p_n)) \\
&= -\Big( \frac{1}{n} + \frac{1}{n} + \ldots + \frac{1}{n} \Big) \log_2 \Big( \frac{1}{n} \Big) \\
&= -\Big( 1 \cdot \log_2 \Big( \frac{1}{n} \Big) \Big) \\
&= -\log_2\Big(\frac{1}{n}\Big) = \log_2(n)
\end{align}$$

One of the cooler things about this is the relationship between \(p_i\) and \(\log_2(p_i)\), the information we gain from response *i*. Consider the simple coin toss example, and the varying probabilities we can consider for \(p_0\) (and thus \(p_1\), since \(p_1 = 1 - p_0\)). This table comes from [1].
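Such a table is easy to regenerate numerically; the particular \(p_0\) values below are my own choices, not necessarily those in [1]:

```python
import math

def binary_entropy(p0):
    """Entropy (bits/symbol) of a coin with P(0) = p0, P(1) = 1 - p0."""
    if p0 in (0.0, 1.0):
        return 0.0  # certainty: no information
    p1 = 1 - p0
    return -(p0 * math.log2(p0) + p1 * math.log2(p1))

# Entropy rises toward 1 bit as the coin approaches fairness (p0 = 0.5)
for p0 in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"p0 = {p0:<5}  H = {binary_entropy(p0):.4f}")
```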

I think that table does a great job of explicitly showing the intuition behind the entropy formula. It should also be clear now why there is a negative: it cancels the negative that comes out of the log.

I should probably add that entropy in information theory and entropy in physics are not exactly the same thing. There is a relationship, but it is not entirely straightforward.

In summary, information is a response to a question, and a bit is a response to a yes-or-no question. Entropy is a measure of uncertainty contained within a certain amount of information. And Claude Shannon is a badass.

Sources:

- [1] *An Introduction to Information Theory* by John Pierce. This is a great book, and it's pretty in depth.