Information Theory


Short Summary: Information theory is the scientific study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley in the 1920s, and Claude Shannon in the 1940s. It sits at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering, and electrical engineering.


Surprise is inversely related to the probability of an event. Learning that a high-probability event has taken place (e.g., the sun rising) is much less of a surprise and gives less information than learning that a low-probability event (e.g., rain on a hot summer day) has occurred. Therefore, the less likely an event is to occur, the more information it conveys. If an event is known a priori to occur with certainty, it conveys no information at all. Conversely, an extremely rare event conveys a great deal of information, as it surprises us and tells us that a very improbable state exists.
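This notion of surprise is the self-information \(-\log{P(x)}\) of an event. A minimal sketch in Python (the function name `surprisal` is illustrative, not standard):

```python
import math

def surprisal(p: float) -> float:
    """Self-information, in bits, of an event with probability p."""
    return -math.log2(p)

# A near-certain event carries almost no information...
print(surprisal(0.99))   # ~0.0145 bits
# ...while a rare event carries much more.
print(surprisal(0.01))   # ~6.64 bits
```

Note that as \(p \to 1\) the surprisal goes to zero, matching the intuition that certain events convey no information.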


Entropy

The entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. It can also be thought of as the amount of information needed to remove uncertainty from the system. Given a discrete random variable \(X\) with possible outcomes \(x_{1}, \dots, x_{n}\), occurring with probabilities \(P(x_{1}), \dots, P(x_{n})\), the entropy of \(X\) is formally defined as:

\begin{equation} H(P) = -\sum_{x \in X} P(x) \log{P(x)} \end{equation}
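The definition above can be sketched directly in Python (using base-2 logarithms, so the result is in bits):

```python
import math

def entropy(probs):
    """Shannon entropy H(P) = -sum p*log2(p), in bits.

    Terms with p == 0 are skipped, following the convention 0*log(0) = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: exactly 1 bit per toss.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is more predictable, so its entropy is lower.
print(entropy([0.9, 0.1]))   # ~0.469
```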

Cross Entropy

Cross entropy has an information interpretation: it quantifies the average total number of bits per message required to transmit values \(x \in X\) whose true distribution is \(P(x)\) when we erroneously encode them according to a distribution \(Q(x)\). The bits in excess of the optimum \(H(P)\) are wasted by using the wrong code. Cross entropy can also be thought of as a measure of how well one probability distribution matches another. It is clear from the formula that cross-entropy calculates the total entropy between the distributions, not merely the excess.

\begin{equation} H_P(Q) = -\sum_{x \in X} P(x) \log{Q(x)} \end{equation}
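A minimal sketch of this formula in Python (the distributions `p` and `q` below are illustrative examples, not from the text):

```python
import math

def cross_entropy(p, q):
    """H_P(Q) = -sum P(x)*log2(Q(x)): average bits per symbol when
    data distributed as P is encoded with a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution
q = [0.25, 0.25, 0.5]   # assumed (wrong) distribution

# Encoding with the wrong distribution costs more bits on average:
print(cross_entropy(p, q))  # 1.75
# Encoding with the true distribution achieves the optimum H(P):
print(cross_entropy(p, p))  # 1.5
```

Note that \(H_P(Q) \geq H(P)\) always, with equality only when \(Q = P\).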

KL Divergence

KL divergence indicates the average number of additional bits per message required to transmit values \(x \in X\) distributed according to \(P(x)\) when we erroneously encode them according to a distribution \(Q(x)\). This makes sense, since you have to "pay" additional bits to compensate for not knowing the true distribution and thus using a code optimized for another distribution. This is one of the reasons the KL divergence is also known as relative entropy. The KL divergence is asymmetric: in general, the cost of encoding \(P(X)\) with a code optimized for \(Q(X)\) differs from the cost of encoding \(Q(X)\) with a code optimized for \(P(X)\), so \(D_{KL}(P \mid\mid Q) \neq D_{KL}(Q \mid\mid P)\). The divergence between two probability distributions measures how far apart they are, but it is important to note that KL divergence is not a distance metric, precisely because it is not symmetric. In neural networks, using KL divergence as the loss function has the same effect as using cross-entropy, since \(H(P)\) is fixed. It is clear from the formula that KL divergence calculates the relative entropy between two probability distributions.

\begin{equation} D_{KL}(P \mid\mid Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} = H_{P}(Q) - H(P) \end{equation}
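The identity \(D_{KL}(P \mid\mid Q) = H_{P}(Q) - H(P)\) and the asymmetry can be checked numerically; a minimal sketch in Python (the example distributions are illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) = sum P(x)*log2(P(x)/Q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]   # true distribution
q = [0.5, 0.5]   # assumed (wrong) distribution

# Extra bits paid per symbol for coding P with a code built for Q:
print(kl_divergence(p, q))  # ~0.531
# Asymmetry: swapping the arguments gives a different value.
print(kl_divergence(q, p))  # ~0.737
# Identical distributions diverge by zero.
print(kl_divergence(p, p))  # 0.0
```

Here \(H_P(Q) = 1.0\) bit and \(H(P) \approx 0.469\) bits, so the first result, \(\approx 0.531\), is exactly their difference, as the formula states.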