16.3. Information Theory¶

For thousnads of years, we are submerging in a world of information. From Shakespeare’s Sonnet to researchers’ paper on Cornell ArXiv, from Van Gogh’s printing Starry Night to Beethoven’s music Symphony No. 5, from the first programming language Plankalkül to the state-of-the-art machine learning algorithms. No matter what are the format of these information, everything follows the rule of information theory for millennium of years. With information theory, we can measure and compare how much information is presented in different signals. In this section, we will illustrate the fundamental concepts of information theory and its applications.

Before we get started, let us understand the relationship between machine learning and information theory. While machine learning is aimed at extracting interesting signals from data and making critical predictions, information theory is a field of study about encoding, decoding, transmitting, and manipulating information. As a result, information theory provides a strong backbone of fundamental machine learning. For example, cross entropy loss in Section 3.4 is used as an objective function in many machine learning algorithms. Such a loss is a direct application of information theory.

16.3.1. Information¶

Let us start with the “soul” of information theory - information. Information can be encoded in anything with a particular sequence of one or more encoding formats. Rather than caring about the knowledge, information represents the degree of surprise or the abstract possibility of the event. For example, if we want to elucidate an unusual event, we need a lot information. Hence, unriddling this event can help us gain plenty of information. Whereas, for a common event, we may not need much information since we already know it pretty well.

In this era of information explosion, how do we quantify the meaningfulness of a piece of information? Are there any fundamental math theory to measure and compare different pieces of information? Our ancestors had tried to solve these questions for years. Until year 1948, Claude E. Shannon published his well-known book A Mathematical Theory of Communication, the discipline of information theory had not been established. Shannon was an American mathematician, electrical engineer, and crytographer known as “the father of information theory”. In his book, Shannon introduced the concept of information entropy at the first time. We will begin our information theory journey here.

16.3.1.1. Self-information¶

Since information embodies the abstract possibility of an event, how do we map the possibility to the number of bits? Shannon introduced the term bit, which Shannon credited to John Tukey, as the unit of information. So what is a “bit” and why do we use it to measure information? Historically, an antique transmitter can only send or receive two codes, $$0$$ and $$1$$. In this way, any information symbol is encoded by a series of $$0$$ and $$1$$, and each digit represents $$1$$ bit. And hence, a series of codes of length $$n$$ contains $$n$$ bits of information.

Now, suppose that for any series of codes, each $$0$$ or $$1$$ occurs with a probability of $$\frac{1}{2}$$. Hence, an event $$X$$ with a series of codes of length $$n$$, occurs with a probability of $$\frac{1}{2^n}$$. At the same time, as we mentioned before, this series contains $$n$$ bits of information. So, can we generalize to a math function which can tranfer the probability $$p$$ to the number of bits? Shannon gave the answer by defining self-information

(16.3.1)$I(X) = - \log_2 (p)$

as the bits of information we have received for this event $$X$$. Note that we will always use base-2 logarithms in this section. For the sake of simplicity, the rest of this section will omit the subscript 2 in the logarithm notation, i.e., $$\log(.)$$ always refers to $$\log_2(.)$$. For example, the code “0010” has a self-information

(16.3.2)$I(\text{0010"}) = - \log (p(\text{0010"})) = - \log \left( \frac{1}{2^4} \right) = 4 \text{ bits}.$

We can simply calculate self information in MXNet as shown below. Before that, let us first import all the necessary packages in this section.

from mxnet import np
from mxnet.ndarray import nansum
from mxnet.metric import NegativeLogLikelihood
import random

def self_information(p):
return -np.log2(p)

self_information(1/64)
6.0

16.3.2. Entropy¶

As self-information only measures the information of a single discrete event, we need a more generalized measure for any random variable of either discrete or continuous distribution.

16.3.2.1. Definition¶

For any random variable $$X$$ that follows a probability distribution $$P$$ with a probability density function (p.d.f.) or a probability mass function (p.m.f.) $$p(x)$$, we measure the expected amount of information through entropy (or Shannon entropy)

(16.3.3)$H(X) = - E_{x \sim P} [\log p(x)].$

To be specific, if $$X$$ is discrete,

(16.3.4)$H(X) = - \sum_i p_i \log p_i \text{, where } p_i = P(X_i).$

Otherwise, if $$X$$ is continuous, we also refer entropy as differential entropy

(16.3.5)$H(X) = - \int_x p(x) \log p(x) \; dx.$

In MXNet, we can define entropy as below.

def entropy(p):
entropy = - p * np.log2(p)
out = nansum(entropy.as_nd_ndarray()) # nansum will sum up the non-nan number
return out

entropy(np.array([0.1, 0.5, 0.2, 0.3]))
[1.8176681]
<NDArray 1 @cpu(0)>

16.3.2.2. Interpretations¶

You may be curious: in the entropy definition, why we use an expectation of a negative logarithm? Here are some intuitions.

First, why do we use a logarithm function $$\log$$? Suppose that $$p(x) = f_1(x) f_2(x) \ldots f_n(x)$$, where each component function $$f_i(x)$$ is independent from each other. This means that each $$f_i(x)$$ contributes independently to the total information obtained from $$p(x)$$. As a result, we need the entropy formula to possess the additive functionality over a mixed independent distributions. Luckily, $$\log$$ can naturally turn a product of probability distributions to a summation of their individual information.

Next, why do we use a negative $$\log$$? Intuitively, more frequent events should contain less information than less common events, since we often gain more information from an unusual case than from a regular one. However, $$\log$$ is monotonically increasing with the probabilities, whereas we need to construct a monotonically decreasing relationship between the probability of events and their entropy. Hence, we add a negative sign in front of $$\log$$ function.

Last, where does the expectation function come from? Consider a random variable $$X$$ follows a given probability distribution, entropy can be interpreted as the average amount of surprise from observing $$X$$. For example, imagine that a slot machine system emits statistical independently symbols $${s_1, \ldots, s_k}$$ with probabilities $${p_1, \ldots, p_k}$$ respectively. Then the entropy of this system equals to the average self-information from observing each output, i.e.,

(16.3.6)$H(S) = \sum_i {p_i \cdot I(s_i)} = - \sum_i {p_i \cdot \log p_i}.$

16.3.2.3. Properties of Entropy¶

By the above examples and interpretations, we can derive the following properties of entropy. Here, we refer to X as an event and P as the probability distribution of X.

• Entropy is non-negative, i.e., $$H(X) \geq 0, \forall X$$.

• If $$X \sim P$$ with a p.d.f. or a p.m.f. $$p(x)$$, and we try to estimate $$P$$ by a new probability distribution $$Q$$ with a p.d.f. or a p.m.f. $$q(x)$$, then

(16.3.7)$H(X) = - E_{x \sim P} [\log p(x)] \leq - E_{x \sim P} [\log q(x)], \text{ with equality if and only if } P = Q.$

Alternatively, $$H(X)$$ gives a lower bound of the average number of bits needed to encode symbols drawn from $$P$$.

• If $$X \sim P$$, then $$x$$ conveys the maximum amount of information if it spreads evenly among all possible outcomes. Specifically, if the probability distribution $$P$$ is discrete with $$k$$-class $$\{p_1, \ldots, p_k \}$$, then

(16.3.8)$H(X) \leq \log(k), \text{ with equality if and only if } p_i = \frac{1}{k}, \forall x_i.$

Otherwise, if $$P$$ is continuous, then $$P$$ has the highest entropy if it is a uniform distribution.

16.3.3. Mutual Information¶

Previouly we defined entropy of a single random variable $$X$$, how about the entropy of a pair random variables $$(X,Y)$$? Does knowing the entropy of one of $$X$$ or $$Y$$ help us with knowing the entropy of this pair?

For the following discussion, we always use $$(X,Y)$$ as a pair of random variables that follows a joint probability distribution $$P$$ with a p.d.f. or a p.m.f. $$p_{X,Y}(x,y)$$, while $$X$$ and $$Y$$ follow probability distribution $$p_X(x)$$ and $$p_Y(y)$$, respectively.

16.3.3.1. Joint Entropy¶

Similar to entropy of a single random variable, the joint entropy $$H(X,Y)$$ of a pair random variables $$(X,Y)$$ is defined as

(16.3.9)$H(X,Y) = −E_{(x,y) \sim P} [\log p_{X,Y}(x, y)].$

Precisely, on the one hand, if $$(X,Y)$$ is a pair of discrete random variables, then

(16.3.10)$H(X,Y) = - \sum_{x} \sum_{y} p_{X,Y}(x,y) \log p_{X,Y}(x,y).$

On the other hand, if $$(X,Y)$$ is a pair of continuous random variables, then the differential joint entropy is defined as

(16.3.11)$H(X,Y) = - \int_{x, y} p_{X,Y}(x,y) \ \log p_{X,Y}(x,y) \;dx \;dy.$

Let us implement joint entropy from scratch in MXNet.

def joint_entropy(p_xy):
joint_ent = -p_xy * np.log2(p_xy)
# nansum will sum up the non-nan number
out = nansum(joint_ent.as_nd_ndarray())
return out

joint_entropy(np.array([0.1, 0.5, 0.2, 0.3]))
[1.8176681]
<NDArray 1 @cpu(0)>

16.3.3.2. Conditional Entropy¶

Now suppose that we know the joint distribution of a pair of random variables $$(X,Y)$$. What is more, we know the probability distribution of X. In this setting, we can calculate the entropy $$H(X)$$ and joint entropy $$H(X, Y)$$. But can we know about the random variable $$Y$$ with the above information? In common-sense point of view, knowing $$X$$ may allude some information about the related variables $$Y$$. In the probability theory, we have the definition of the conditional probability to measure the relationship between variables. As a result, we can define the conditional entropy $$H(Y \mid X)$$ as

(16.3.12)$H(Y \mid X) = - E_{(x,y) \sim P} [\log p(y \mid x)],$

where $$p(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$$ is the conditional probability. Specificly, if $$(X,Y)$$ is a pair of discrete random variables, then

(16.3.13)$H(Y \mid X) = - \sum_{x} \sum_{y} p(x,y) \log p(y \mid x).$

On the flip side, if $$(X,Y)$$ is a pair of continuous random variables, then the differential joint entropy is defined as

(16.3.14)$H(Y \mid X) = - \int_x \int_y p(x,y) \ \log p(y \mid x) \;dx \;dy.$

Back to explain our common-sense intuition, how does the conditional entropy $$H(Y \mid X)$$ relate to the entropy $$H(X)$$ and the joint entropy $$H(X, Y)$$? Unsurprisingly, the naturalness of the above definitions alludes that

(16.3.15)$H(Y \mid X) = H(X,Y) - H(X).$

We can interpret the conditional entropy $$H(Y \mid X)$$ as, given the joint information of $$(X, Y)$$, the uncertainty that removed by knowing the information of $$X$$.

Now, let us implement conditional entropy from scratch in MXNet.

def conditional_entropy(p_xy, p_x):
p_y_given_x = p_xy/p_x
cond_ent = -p_xy * np.log2(p_y_given_x)
# nansum will sum up the non-nan number
out = nansum(cond_ent.as_nd_ndarray())
return out

conditional_entropy(np.array([0.1, 0.5, 0.2, 0.3]), np.array([1, 1, 1, 1]))
[1.8176681]
<NDArray 1 @cpu(0)>

16.3.3.3. Mutual Information¶

Given the previous setting of random variables $$(X, Y)$$, you may wonder: “How much information do $$X$$ and $$Y$$ share?” Luckily, mutual information of $$(X,Y)$$ can give us the answer as

(16.3.16)$I(X,Y) = −E_{x} E_{y} \left\{ p_{X,Y}(x, y) \log\frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)} \right\}.$

As shown in the image Fig. 16.3.1, if we know $$H(X)$$ (i.e., the amount of uncertainty in $$X$$) and $$H(X \mid Y)$$ (i.e., amount of uncertainty in $$X$$ which remains after $$Y$$ is known), mutual information tells us the average reduction in uncertainty about $$X$$ that results from learning the value of $$Y$$. As a result, the following terms are numerically equivalent to $$I(X,Y)$$:

• $$H(X) − H(X \mid Y)$$
• $$H(Y) − H(Y \mid X)$$
• $$H(X) + H(Y) − H(X,Y)$$
• $$H(X,Y) - H(Y \mid X) − H(X \mid Y)$$ Fig. 16.3.1 Mutual information’s relationship with joint entropy and conditional entropy.

Now, let us implement mutual information from scratch.

def mutual_information(p_xy, p_x, p_y):
p = p_xy / (p_x * p_y)
mutual = -p_xy * np.log2(p)
# nansum will sum up the non-nan number
out = nansum(mutual.as_nd_ndarray())
return out

mutual_information(np.array([0.1, 0.5, 0.2, 0.3]), np.array([2, 4, 8, 4]), np.array([0.5, 0.25, 0.125, 0.25]))
[1.8176681]
<NDArray 1 @cpu(0)>

16.3.3.4. Properties of Mutual Information¶

Rather than memorizing the definition of mutual information, you only need to keep in mind its notable properties:

• Mutual information is symmetric, i.e., $$I(X,Y) = I(Y,X)$$.

• Mutual information is non-negative, i.e., $$I(X,Y) \geq 0$$.

• $$I(X,Y) = 0$$ if and only if $$X$$ and $$Y$$ are independent. For example, if $$X$$ and $$Y$$ are independent, then knowing $$Y$$ does not give any information about $$X$$ and vice versa, so their mutual information is zero.

• Alternatively, if $$X$$ is a function of $$Y$$, then all information conveyed by $$Y$$ is shared with $$X$$ as

(16.3.17)$I(X,Y) = H(Y) = H(X).$

16.3.3.5. Applications of Mutual Information¶

Mutual information may be a little abstract in it pure definition. So how does it related to machine leanring? In natural language processing, one of the most difficult problems is the “ambiguation”. For example, recently a headline in the news reported that “Amazon is on fire”. You may wonder whether the company Amazon’s building is on fire, or the rainforest Amazon is on fire.

In this case, mutual information can help us solve the ambiguation problem. To be specific, we first find the group of words that each has a relatively large mutual information with the company Amazon, such as e-commerce, technology, and online. Next, we find another group of words that each has a relatively large mutual information with the rainforest Amazon, such as rain, forest, and tropical. Finally, when we need to interpret “Amazon” again, we can simply compare which group has more occurrence in the context of the word Amazon.

16.3.4. Kullback–Leibler Divergence¶

As what we have discussed in Section 2.4, norms can be used for measuring distance between two points in space of any dimensionality. However, can we measure the distance between two probability distributions, $$P$$ and $$Q$$? Let us paraphrase it using the Kullback–Leibler (KL) divergence, which is quite useful in many machine learning problems.

16.3.4.1. Definition¶

Given a random variable $$X$$ that follows the true probability distribution $$P$$ with a p.d.f. or a p.m.f. $$p(x)$$, and we estimate $$P$$ by another probability distribution $$Q$$ with a p.d.f. or a p.m.f. $$q(x)$$. Then the Kullback–Leibler (KL) divergence (or relative entropy) between $$P$$ and $$Q$$ is

(16.3.18)$D_{\mathrm{KL}}(P\|Q) = E_{x \sim P} \left[ \log \frac{p(x)}{q(x)} \right].$

In MXNet, let us implement the KL divergence from Scratch.

def kl_divergence(p, q):
kl = p * np.log2(p / q)
out = nansum(kl.as_nd_ndarray())
return out.abs().asscalar()

16.3.4.2. KL Divergence Properties¶

Even though the above definition looks formidable, you do not need to memorize it. However, the following properties may be worth your while to peruse.

• KL divergence is non-symmetric, i.e.,

(16.3.19)$D_{\mathrm{KL}}(P\|Q) \neq D_{\mathrm{KL}}(Q\|P), \text{ if } P \neq Q.$
• KL divergence is non-negative, i.e.,

(16.3.20)$D_{\mathrm{KL}}(P\|Q) \geq 0.$

Note that the equality holds only when $$P = Q$$.

• If there exists an $$x$$ such that $$p(x) > 0$$ and $$q(x) = 0$$, then $$D_{\mathrm{KL}}(P\|Q) = \infty$$.

• Relationship between KL divergence and mutual information is also intimating. Besides the relationship shown in Fig. 16.3.1, $$I(X,Y)$$ is also numerically equivalent with the following terms:

1. $D_{:raw-latex:\mathrm{KL}}(P(X, Y) | P(X)P(Y))$;
2. $E_Y { D_{:raw-latex:\mathrm{KL}}(P(X :raw-latex:mid Y) | P(X)) }$;
3. $E_X { D_{:raw-latex:\mathrm{KL}}(P(Y :raw-latex:mid X) | P(Y)) }$.

For the first term, we interpret mutual information as the KL divergence between $$P(X, Y)$$ and the product of $$P(X)$$ and $$P(Y)$$. For the second term, mutual information tells us the average reduction in uncertainty about $$Y$$ that results from learning the value of the $$X$$’s distribution. Similarly to the third term.

16.3.4.3. Example¶

If you are still confused about why KL divergence is non-symmetric, let us go through a toy example.

First, let us generate and sort three ndarrays of length $$10,000$$: an objective ndarray $$p$$ which follows a normal distribution $$N(0,1)$$, and two candidate ndarrays $$q_1$$ and $$q_2$$ which follow normal distributions $$N(-1,1)$$ and $$N(1,1)$$ respectively.

random.seed(1)

nd_length = 10000
p = np.random.normal(loc=0, scale=1, size=(nd_length, ))
q1 = np.random.normal(loc=-1, scale=1, size=(nd_length, ))
q2 = np.random.normal(loc=1, scale=1, size=(nd_length, ))

p = np.array(sorted(p.asnumpy()))
q1 = np.array(sorted(q1.asnumpy()))
q2 = np.array(sorted(q2.asnumpy()))

Since $$q_1$$ and $$q_2$$ are symmetric with respect to the y-axis ($$x=0$$), we expect a similar absolute value of KL divergence between $$D_{\mathrm{KL}}(p\|q_1)$$ and $$D_{\mathrm{KL}}(p\|q_2)$$. As you can see below, there is only a 1% off between $$D_{\mathrm{KL}}(p\|q_1)$$ and $$D_{\mathrm{KL}}(p\|q_2)$$.

kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100

kl_pq1, kl_pq2, similar_percentage
(8549.594, 8729.888, 2.086798218714529)

In contrast, you may find that $$D_{\mathrm{KL}}(q_2 \|p)$$ and $$D_{\mathrm{KL}}(p \| q_2)$$ are off a lot, with around 44% off as shown below.

kl_q2p = kl_divergence(q2, p)
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100

kl_q2p, differ_percentage
(13452.535, 42.58008875270748)

16.3.5. Cross Entropy¶

If you are curious about what are the applications of information theory on deep learning, here is a quick example. We define the true distribution $$P$$ with probability distribution $$p(x)$$, and the estimated distribution $$Q$$ with probability distribution $$q(x)$$, and we will use them in the rest of this section.

Say now we need to solve a binary classification problem based on given $$n$$ data points {$$x_1, \ldots, x_n$$}. Assume that we encode $$1$$ and $$0$$ as the positive and negative class label $$y_i$$ respectively, and our neural network is parameterized by $$\theta$$. If we aim to find a best $$\theta$$ so that $$\hat{y}_i= p_{\theta}(y_i \mid x_i)$$, it is natually to apply the maximum log-likelihood approach. To be specific, for true labels $$y_i$$ and predictions $$\hat{y}_i= p_{\theta}(y_i \mid x_i)$$, the probability to be classified as positive is $$\pi_i= p_{\theta}(y_i = 1 \mid x_i)$$. Hence, the log-likelihood function would be

(16.3.21)\begin{split}\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod_{i=1}^n \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \\ &= \sum_{i=1}^n y_i \log(\pi_i) + (1 - y_i) \log (1 - \pi_i) \\ \end{aligned}\end{split}

Maximizing the log-likelihood function $$l(\theta)$$ is identical to minimizing $$- l(\theta)$$, and hence we can find the best $$\theta$$ from here. To generalize the above loss to any distributions, we also called $$-l(\theta)$$ the cross entropy loss $$\mathrm{CE}(y, \hat{y})$$, where $$y$$ follows the true distribution $$P$$ and $$\hat{y}$$ follows the estimated distribution $$Q$$. Let us see how cross entropy loss is defined!

16.3.5.1. Definition¶

Like KL divergence, for a random variable $$X$$, we can also measure the divergence between the estimating distribution $$Q$$ and the true distribution $$P$$ via cross entropy,

(16.3.22)$\mathrm{CE}(P,Q) = - E_{x \sim P} [\log(q(x))].$

It can also be interpreted as the summation of the entropy $$H(P)$$ and the KL divergence between $$P$$ and $$Q$$, i.e.,

(16.3.23)$\mathrm{CE} (P, Q) = H(P) + D_{\mathrm{KL}}(P\|Q).$

In MXNet, we can implement the cross entropy loss as below.

def cross_entropy(y_hat, y):
ce = -np.log(y_hat[range(len(y_hat)), y])
return ce.mean()

Now define two ndarrays for the labels and predictions, and calculate the cross entropy loss of them.

labels = np.array([0, 2])
preds = np.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])

cross_entropy(preds, labels)
array(0.94856)

16.3.5.2. Properties¶

As being widely accepted, cross entropy can be used to define a loss function in the optimization problem. It turns out that the following are equivalent:

1. Maximizing predictive probability of $$Q$$ for distribution $$P$$, (i.e., $$E_{x \sim P} [\log (q(x))]$$);
2. Minimizing cross entropy $$\mathrm{CE} (P,Q)$$;
3. Minimizing the KL divergence $$D_{\mathrm{KL}}(P\|Q)$$.

The definition of cross entropy indirectly proves the equivalent relationship between objective 2 and objective 3, as long as the entropy of true data $$H(P)$$ is constant. Now, let us elucidate the equivalence between objective 1 and objective 2 through the below example.

16.3.5.3. Cross Entropy as An Objective Function of Multi-class Classification¶

If we dive deep into the classification objective function with cross entropy loss $$\mathrm{CE}$$, you may find minimizing $$\mathrm{CE}$$ is equivalent to maximizing the log-likelihood function $$L$$. Let us work through a toy example together.

To begin with, suppose that we are given a dataset with $$n$$ samples, and it can be classified in $$k$$-class. For each data point $$i$$, we represent any $$k$$-class label $$\mathbf{y}_i = (y_{i1}, \ldots, y_{ik})$$ by the one-hot encoding method. To be specific, if the data point $$i$$ belongs to class $$J$$, then we set the $$j$$-th entry to $$1$$, and all other components to $$0$$, i.e.,

(16.3.24)$\begin{split}y_{ij} = \begin{cases}1 & j \in J; \\ 0 &\text{otherwise.}\end{cases}\end{split}$

For instance, if a multi-class classfication problem contains three classes $$A$$, $$B$$, and $$C$$, then the labels $$\mathbf{y}_i$$ can be encoded in {$$A: (1, 0, 0); B: (0, 1, 0); C: (0, 0, 1)$$}.

Assume that our neural network is parameterized by $$\theta$$. For true label vectors $$\mathbf{y}_i$$ and predictions

(16.3.25)$\hat{\mathbf{y}}_i= p_{\theta}(\mathbf{y}_i \mid \mathbf{x}_i) = \sum_{j=1}^k y_{ij} p_{\theta} (y_{ij} \mid \mathbf{x}_i).$

Hence, the cross entropy loss would be

(16.3.26)$\begin{split}\mathrm{CE}(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{i=1}^n \mathbf{y}_i \log \hat{\mathbf{y}}_i = - \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log{p_{\theta} (y_{ij} \mid \mathbf{x}_i)}.\\\end{split}$

On the other side, we can also demystify the problem through maximum likelihood estimation. To begin with, let us quickly introduce a $$k$$-class multinoulli distribution. It is an extension of the bernoulli distribution from binary class to multi-class. If a random variable $$\mathbf{z} = (z_{1}, \ldots, z_{k})$$ follows a $$k$$-class multinoulli distribution with probabilities $$\mathbf{p} =$$ ($$p_{1}, \ldots, p_{k}$$), i.e.,

(16.3.27)$p(\mathbf{z}) = p(z_1, \ldots, z_k) = \mathrm{Multi} (p_1, \ldots, p_k), \text{ where } \sum_{i=1}^k p_i = 1,$

then the joint probability mass function(p.m.f.) of $$\mathbf{z}$$ is

(16.3.28)$\mathbf{p}^\mathbf{z} = \prod_{j=1}^k p_{j}^{z_{j}}.$

Now back to our example, it is easy to see that each data point, $$\mathbf{y}_i$$, is following a $$k$$-class multinoulli distribution with probabilities $$\boldsymbol{\pi} =$$ ($$\pi_{1}, \ldots, \pi_{k}$$). Therefore, the joint p.m.f. of each data point $$\mathbf{y}_i$$ is $$\mathbf{\pi}^{\mathbf{y}_i} = \prod_{j=1}^k \pi_{j}^{y_{ij}}.$$ Hence, the log-likelihood function would be

(16.3.29)\begin{split}\begin{aligned} l(\theta) = \log L(\theta) = \log \prod_{i=1}^n \boldsymbol{\pi}^{\mathbf{y}_i} = \log \prod_{i=1}^n \prod_{j=1}^k \pi_{j}^{y_{ij}} = \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log{\pi_{j}}.\\ \end{aligned}\end{split}

Since in maximum likelihood estimation, we maximizing the objective function $l(:raw-latex:\theta)$ by having $$\pi_{j} = p_{\theta} (y_{ij} \mid \mathbf{x}_i)$$. Therefore, for any multi-class classification, maximizing the above log-likelihood function $$l(\theta)$$ is equivalent to minimizing the CE loss $$\mathrm{CE}(y, \hat{y})$$.

To test the above proof, let us apply the built-in measure NegativeLogLikelihood in MXNet. Using the same labels and preds as in the earlier example, we will get the same numerical loss as the previous example up to the 5 decimal place.

nll_loss = NegativeLogLikelihood()
nll_loss.update(labels.as_nd_ndarray(), preds.as_nd_ndarray())
nll_loss.get()
('nll-loss', 0.9485599994659424)