Information theory sits comfortably between electrical engineering and statistics / machine learning. Its founding came from Claude Shannon's landmark 1948 paper, "A Mathematical Theory of Communication", in which he laid the building blocks of information theory, one of which is entropy. Before I talk about that, though, it's necessary to explicitly state the definition of information. The word "information" can be an easy source of confusion, because its colloquial use doesn't exactly match its mathematical definition. Mathematically, information is simply the answer to a question of some kind, so 1 bit of information is the answer to 1 yes-or-no question. If you have more signal levels than just 0 and 1, such as \(\{-3V, -1V, 1V, 3V\}\), which was used by Edison, then each answer can convey more than a single yes or no. In this post, I'm going to stick to binary digits (bits).

Note: we're considering everything in the context of a specific system, one with a message source, a channel, and a receiver. We largely consider things from the point of view of the receiver, but it is important to realize the implications of the message source and channel.

Before we consider entropy, let's look at the following relationship. If you and a friend are sending messages, and you have formulated a list of N yes-or-no questions, then the number of possible messages you can send is \(2^N\). We call this the message space, represented by the letter \(S\), and it just means the set of all possible messages. $$\begin{align} S &= 2^{\text{# of questions}} \\ &= 2^{\text{# of bits}} \end{align}$$ This should make sense: if we have 3 bits, or 3 questions, then there are \(2^3 = 8\) combinations of those bits, and that is the set of all possible messages we could represent. We now need to inject probability. Entropy is a measure of the unpredictability of information, so choice is an important aspect of it.
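The message-space relationship can be sketched in a few lines of Python; the function name `message_space` is my own, chosen for illustration:

```python
# Enumerate the message space S for N yes-or-no questions:
# each message is one combination of N bit answers, so |S| = 2**N.
from itertools import product

def message_space(n_bits):
    """Return S, the list of all possible messages of n_bits bits."""
    return list(product((0, 1), repeat=n_bits))

S = message_space(3)
print(len(S))       # 8, i.e. 2**3 messages for 3 bits
print(S[0], S[-1])  # (0, 0, 0) (1, 1, 1)
```

Enumerating the space directly like this only works for small N, of course; the point is that adding one more question doubles the number of distinguishable messages.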
If we had a message source that could only send 1's (yes's), no information would be gained from any message, because to every question we ask, the source can give only one answer: yes. Certainty gives no information, and highly probable events give only a little. Improbable events, however, give a lot of information. This is analogous to being a detective interviewing a crime suspect. If the suspect says what…
