# Cross Entropy Loss Explained

A high-level overview of PyTorch's `torch.nn.CrossEntropyLoss`, without all the math.

Cross entropy loss is a loss function that can be used for multi-class classification using neural networks. Chapter 5 of the fast.ai textbook outlines the use of cross entropy loss for binary classification, so in this post, we will take a look at classification for 3 classes.

## Softmax

The softmax function ensures two things:

- activations are all between 0 and 1
- activations sum to 1

For multi-class classification, we need one activation per class in the final layer. Each activation then indicates the relative confidence that its class is the true label. Therefore, we can get the predicted probability that each class is the true label by applying the softmax function to the final layer of activations.

Given $C$ total classes, for any class $k$, let $x_k$ represent the activation for class $k$. Then the softmax activation for an arbitrary class $c$ is equal to

$$\frac{e^{x_c}}{\sum^C_{k=1}e^{x_k}}.$$

In Python code, this would be

```
import torch

def softmax(x): return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)
```

Note that the code version returns a tensor/array of softmax activations.
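As a quick sanity check of those two properties, here's a minimal sketch (the tensor values are hypothetical, chosen just for illustration):

```python
import torch

def softmax(x): return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

# a hypothetical batch: 2 items, 3 classes
x = torch.tensor([[1.0, 2.0, 3.0],
                  [0.5, -1.0, 0.0]])
probs = softmax(x)

probs             # every entry lies strictly between 0 and 1
probs.sum(dim=1)  # each row sums to 1
```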

For demonstration purposes, let's first create a set of activations using `torch.randn`, assuming we have 6 objects to classify into 3 classes.

```
acts = torch.randn((6,3))*2
acts
```

Let's also set our target labels:

```
targ = torch.tensor([0,1,0,2,2,0])
```

To take the softmax of our initial (random) activations, we pass `acts` into `torch.softmax`:

```
sm_acts = torch.softmax(acts, dim=1)
sm_acts
```

By indexing with `idx` and `targ` together, row `i` pairs with column `targ[i]`, which picks out the predicted probability of the correct label for each item:

```
idx = range(6)
sm_acts[idx, targ]
```
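This style of indexing selects exactly one element per row. A quick sketch with hypothetical values makes the row/column pairing easy to verify:

```python
import torch

t = torch.tensor([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1]])
labels = torch.tensor([2, 0])

# row 0 pairs with column 2, row 1 pairs with column 0
picked = t[range(2), labels]
picked  # tensor([0.7000, 0.8000])
```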

`F.nll_loss` does the same thing, but flips the sign of each number in the tensor. PyTorch defaults to taking the mean of the losses; to prevent this, we can pass `reduction='none'` as a parameter.

```
import torch.nn.functional as F

result = -F.nll_loss(sm_acts, targ, reduction='none')
result
```

### Taking the Logarithm

We take the (natural) logarithm of `result` for two reasons:

- it prevents under/overflow when performing mathematical operations
- differences between small numbers are amplified
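The second point is easy to see numerically: two probabilities that differ by a small absolute amount can have logs that differ substantially (the values below are illustrative):

```python
import torch

p = torch.tensor([0.01, 0.001])

p[0] - p[1]                        # absolute difference of only 0.009
torch.log(p)                       # roughly -4.61 and -6.91
torch.log(p)[0] - torch.log(p)[1]  # about 2.30, i.e. log(10)
```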

In our case, `result` reflects the predicted probability of the correct label, so when the prediction is "good" (closer to 1), we want our loss function to return a small value (and vice versa). We can achieve this by taking the negative of the log:

```
loss = -torch.log(result)
loss
```

PyTorch also provides `F.log_softmax`, which combines the softmax and the log in a single, numerically stable step. Passing its output to `F.nll_loss` (which supplies the negative sign) gives the same loss:

```
lsm_acts = F.log_softmax(acts, dim=1)
loss = F.nll_loss(lsm_acts, targ, reduction='none')
loss
```
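It's worth confirming that `F.log_softmax` really is just the composition of the two steps. A quick sketch with a fresh random tensor `a` (hypothetical, separate from our `acts` above):

```python
import torch
import torch.nn.functional as F

a = torch.randn((6, 3)) * 2

# log_softmax matches log(softmax) up to floating-point precision
torch.allclose(F.log_softmax(a, dim=1),
               torch.log(torch.softmax(a, dim=1)), atol=1e-5)  # True
```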

In practice, this is exactly what `nn.CrossEntropyLoss` does:

```
from torch import nn

nn.CrossEntropyLoss(reduction='none')(acts, targ)
```

As expected, the output loss tensors for all three approaches match!
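To tie everything together, here's a self-contained sketch that rebuilds each step from scratch and checks the equivalence with `torch.allclose`:

```python
import torch
import torch.nn.functional as F
from torch import nn

acts = torch.randn((6, 3)) * 2
targ = torch.tensor([0, 1, 0, 2, 2, 0])

# 1. manual: softmax, pick out the correct class, negate the log
sm_acts = torch.softmax(acts, dim=1)
manual = -torch.log(sm_acts[range(6), targ])

# 2. log_softmax followed by nll_loss
via_nll = F.nll_loss(F.log_softmax(acts, dim=1), targ, reduction='none')

# 3. the all-in-one loss
via_ce = nn.CrossEntropyLoss(reduction='none')(acts, targ)

# all three agree up to floating-point precision
torch.allclose(manual, via_nll, atol=1e-5), torch.allclose(manual, via_ce, atol=1e-5)
```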