## required libraries

`from fastai.vision.all import *`

How to build a (very) simple neural network for MNIST digit classification

Tags: ml, projects, kaggle-competition

Author: Seongbin Park

Published: August 5, 2022

This post will cover how to classify handwritten digits of the MNIST dataset using a simple neural network. At the same time, I will be taking a stab at the Kaggle Digit Recognizer contest.

Credits: I will be working off of chapter 4 of the fast.ai book, which covers binary classification of 3’s and 7’s. Other resources are linked.

First, we will have to import the MNIST dataset itself. We can import it using the fast.ai library (`path = untar_data(URLs.MNIST)`), but I will download the dataset from Kaggle instead.

If you are following along and haven’t set up the Kaggle API yet, do so by following the README of the official repo. You will need a Kaggle account to do so. After everything is set up, we can run the following code block:
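For readers working outside a notebook, here is a minimal sketch of the same download done from plain Python. It assumes the `kaggle` command-line tool is installed and configured; in a notebook cell the equivalent would be the command `kaggle competitions download -c digit-recognizer` prefixed with `!`. The helper names `build_download_cmd` and `download_competition` are mine, not part of any library.

```python
import subprocess

def build_download_cmd(competition):
    """Build the Kaggle CLI call that downloads a competition's data as a zip."""
    return ["kaggle", "competitions", "download", "-c", competition]

def download_competition(competition):
    # Needs the kaggle CLI installed and an API token configured.
    subprocess.run(build_download_cmd(competition), check=True)

# Not called here because it needs network access and credentials:
# download_competition("digit-recognizer")
print(" ".join(build_download_cmd("digit-recognizer")))
```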

```
Downloading digit-recognizer.zip to /home/jupyter/projects/digit-classifier
0%| | 0.00/15.3M [00:00<?, ?B/s]
100%|███████████████████████████████████████| 15.3M/15.3M [00:00<00:00, 161MB/s]
```

Note that in Jupyter notebooks, the exclamation mark `!` is used to execute shell commands. The dataset should be downloaded in your project directory as a zip file. Run the following code block to extract the contents to a directory named `MNIST_dataset`:
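If you would rather not shell out, the extraction can also be sketched with Python's standard `zipfile` module. The function name `extract_archive` is a hypothetical helper of mine; the archive and directory names are the ones used in this post.

```python
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir):
    """Unpack every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return sorted(archive.namelist())

# Usage, once the download has finished:
# extract_archive("digit-recognizer.zip", "MNIST_dataset")
```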

```
Archive: digit-recognizer.zip
inflating: MNIST_dataset/sample_submission.csv
inflating: MNIST_dataset/test.csv
inflating: MNIST_dataset/train.csv
```

Let’s take a look at `test.csv` (the test set) and `train.csv` (the training set):
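A small sketch of how the two files can be loaded with pandas. The variable names `df_train` and `df_test` are the ones used later in this post; the wrapper function `load_frames` is just an illustration.

```python
from pathlib import Path
import pandas as pd

def load_frames(data_dir="MNIST_dataset"):
    """Read the Kaggle CSVs: train.csv has a label column, test.csv has pixels only."""
    data_dir = Path(data_dir)
    df_train = pd.read_csv(data_dir / "train.csv")
    df_test = pd.read_csv(data_dir / "test.csv")
    return df_train, df_test

# Usage, once the archive has been extracted:
# df_train, df_test = load_frames()
# df_test.head(3)    # first three rows of the test set
# df_train.head(3)   # first three rows of the training set
```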

| | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

3 rows × 784 columns

| | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

3 rows × 785 columns

Now that we have downloaded the data, we need to shape it for training and validation.

To train our model, we need to separate and normalize the independent (pixels) and dependent (label) variables. The labels will be represented using one-hot encoding.

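```
# df_train is the DataFrame read from train.csv.
# Assumption: pixel values 0-255 are scaled to the range 0-1.
X_train = tensor(df_train.drop(columns='label').values) / 255

# One-hot encode the labels: row i gets a 1 in column label[i].
y_train_numeric = df_train['label'].values
rows = np.arange(y_train_numeric.size)
y_train = tensor(np.zeros((y_train_numeric.size, 10)))
y_train[rows, y_train_numeric] = 1
X_train.shape, y_train.shape
```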

`(torch.Size([42000, 784]), torch.Size([42000, 10]))`

`X_train.shape` and `y_train.shape` tell us that we have 42000 digits in our dataset, each with 784 pixels and a one-hot label of length 10. We will use tensors to take advantage of faster GPU computations.

We want to create a PyTorch `Dataset`, which is required to return a tuple of `(x,y)` when indexed. Python provides a `zip` function which, when combined with `list`, can do this easily:
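A minimal sketch of that idea, using a handful of random tensors as stand-ins for `X_train` and `y_train` (the stand-in names `X_demo` and `y_demo` are assumptions for illustration):

```python
import torch

# Synthetic stand-ins: 6 fake digits instead of the 42000 real ones.
X_demo = torch.rand(6, 784)
y_demo = torch.zeros(6, 10)
y_demo[torch.arange(6), torch.randint(0, 10, (6,))] = 1  # one-hot labels

# zip pairs the i-th image with the i-th label; list makes the pairs indexable.
ds = list(zip(X_demo, y_demo))

x0, y0 = ds[0]
print(len(ds), x0.shape, y0.shape)
```

With the real data, the same line would read `ds = list(zip(X_train, y_train))`.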

Next, we want to split our dataset `ds` into a training and validation set:
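One way to do the split is PyTorch's `random_split`. The sketch below uses a synthetic dataset and holds out 20% for validation; that ratio and the seed are my assumptions, not necessarily the split used for the results in this post.

```python
import torch
from torch.utils.data import random_split

# Synthetic stand-in for ds: 100 (pixels, one-hot label) pairs.
ds = list(zip(torch.rand(100, 784), torch.zeros(100, 10)))

n_val = len(ds) // 5  # assumed 80/20 split
train, val = random_split(
    ds,
    [len(ds) - n_val, n_val],
    generator=torch.Generator().manual_seed(42),  # fixed seed for a repeatable split
)
print(len(train), len(val))
```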

Later, we will be using stochastic gradient descent, which requires that we have “mini-batches” of our dataset. We can create a `DataLoader` from our `train` dataset to do so:
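As a sketch, here is the batching step with PyTorch's own `DataLoader` and a synthetic dataset; fastai's `DataLoader` accepts the same `batch_size` argument. The batch size of 256 matches the shapes printed below.

```python
import torch
from torch.utils.data import DataLoader

# Synthetic stand-in for the training split.
train = list(zip(torch.rand(1000, 784), torch.zeros(1000, 10)))

dl = DataLoader(train, batch_size=256)

xb, yb = next(iter(dl))  # grab the first mini-batch
print(xb.shape, yb.shape)
```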

`(torch.Size([256, 784]), torch.Size([256, 10]))`

We can do the same for our validation (`val`) dataset:

Now that our data is ready, we can start training our classification model. We will start with a linear model, then add some non-linearity to it!

First, we must randomly initialize the bias and all weights for each pixel. Since we have 10 labels (one for each digit), there must be 10 outputs, so our weights matrix is of size `784x10`.

The prediction given a tensor `x` is

\[\text{prediction} = x \cdot \text{weights} + \text{bias}.\]
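The initialization and the formula above can be sketched as follows. The names `init_params` and `linear1` are the ones used later in this post; the bodies are my reconstruction in the style of chapter 4 of the fast.ai book, and using one bias per output is an assumption.

```python
import torch

def init_params(size, std=1.0):
    """Random normal parameters that track gradients."""
    return (torch.randn(size) * std).requires_grad_()

weights = init_params((28 * 28, 10))  # one column of weights per digit
bias = init_params(10)                # assumption: one bias per output

def linear1(xb):
    # (batch, 784) @ (784, 10) + (10,) -> (batch, 10)
    return xb @ weights + bias

batch = torch.rand(4, 28 * 28)  # synthetic stand-in for four images
preds = linear1(batch)
print(preds.shape)
```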

To calculate a gradient, we need a loss function. Since there are more than 2 labels, we will use cross-entropy loss, which applies the softmax function to the activations rather than the sigmoid function used for binary classification.
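To make the link between softmax and cross-entropy concrete, here is a tiny worked example with made-up activations for three classes. It computes the loss by hand and compares it with PyTorch's built-in function; passing a one-hot float target like this needs PyTorch 1.10 or newer.

```python
import torch
import torch.nn.functional as F

acts = torch.tensor([[2.0, 0.5, -1.0]])   # activations for one sample
target = torch.tensor([[1.0, 0.0, 0.0]])  # one-hot label: class 0

probs = F.softmax(acts, dim=1)            # positive values that sum to 1
manual = -(target * probs.log()).sum(dim=1).mean()

builtin = F.cross_entropy(acts, target)   # softmax + log + negative mean in one call
print(manual.item(), builtin.item())
```

The two numbers agree: the loss is simply the negative log of the probability that softmax assigns to the correct class.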

For testing and demonstration purposes, let’s work with a smaller batch than the ones created when shaping our data.

```
tensor([[ -0.5287, 12.6047, -9.0290, -1.7505, 3.7686, 18.8489, -1.8141,
-12.2232, -5.1421, -6.6316],
[ 18.8942, 10.7898, 9.2573, 7.9989, -1.2884, 19.0238, -5.8788,
6.5045, -10.2431, 5.5865],
[ 6.3639, 14.0687, 0.7705, -1.3580, 1.4220, 7.3108, -7.4359,
-6.8101, -5.9212, 23.7016],
[-14.7847, 3.0711, -0.6092, 2.2720, -1.1361, 3.7617, 5.1197,
5.3868, -1.5228, -7.6523]], grad_fn=<AddBackward0>)
```

Now we can calculate the gradients:

`(torch.Size([784, 10]), tensor(2.4328e-10), tensor([5.9605e-08]))`

The following function combines the above code and generalizes to models other than `linear1`.

`(tensor(4.8657e-10), tensor([1.1921e-07]))`

Using the calculated gradients, we can update the weights after each mini-batch. We need to specify a learning rate and reset the gradients to 0 after every update, since `loss.backward` actually adds the gradients of the loss to any gradients that are currently stored.

Finally, we can define a function that trains the model for one epoch:
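Here is a self-contained sketch of such a training step, run on a few synthetic mini-batches. The function names follow the ones used elsewhere in this post; the default learning rate, the fake data, and the helper `fake_batch` are placeholders of mine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def fake_batch(n=8):
    """Synthetic (pixels, one-hot label) mini-batch."""
    xb = torch.rand(n, 784)
    yb = F.one_hot(torch.randint(0, 10, (n,)), num_classes=10).float()
    return xb, yb

dl = [fake_batch() for _ in range(3)]  # stand-in for the training DataLoader

weights = torch.randn(784, 10, requires_grad=True)
bias = torch.randn(10, requires_grad=True)

def linear1(xb):
    return xb @ weights + bias

def calc_grad(xb, yb, model):
    loss = F.cross_entropy(model(xb), yb)
    loss.backward()  # adds to whatever is already stored in .grad

def train_epoch(model, params, lr=0.1):  # lr default is a placeholder
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad * lr  # step against the gradient
            p.grad.zero_()         # reset so the next batch starts from 0

params = weights, bias
train_epoch(linear1, params)
print(weights.grad.abs().sum())  # zero, because gradients were reset after the last step
```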

We also probably want to check the accuracy of our model. The label that the model predicts is the label with the highest activation:

To get the accuracy for the whole epoch, we must call `batch_accuracy` with batches of the validation dataset, then take the mean over all batches.
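Both accuracy steps can be sketched as below. Labels are one-hot, so the true digit is also recovered with `argmax`. Unlike the calls shown in this post, this version of `validate_epoch` takes the validation batches as an explicit second argument, which is a choice I made to keep the example self-contained.

```python
import torch

def batch_accuracy(preds, yb):
    """Share of rows whose highest activation matches the one-hot label."""
    return (preds.argmax(dim=1) == yb.argmax(dim=1)).float().mean()

def validate_epoch(model, valid_dl):
    with torch.no_grad():  # no gradients needed for evaluation
        accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)

# Tiny hand-made check with three classes: three of the four rows are correct.
preds = torch.tensor([[0.1, 2.0, -1.0],
                      [3.0, 0.2, 0.1],
                      [0.0, 0.1, 5.0],
                      [1.0, 0.9, 0.8]])
yb = torch.tensor([[0., 1., 0.],
                   [1., 0., 0.],
                   [0., 1., 0.],
                   [1., 0., 0.]])
print(batch_accuracy(preds, yb))
```

Note that averaging per-batch accuracies gives a smaller final batch the same weight as a full one, so the result can differ very slightly from the accuracy over all samples.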

Finally, we can see if our code works by checking if the accuracy improves!

```
params = weights, bias
for i in range(40):
    train_epoch(linear1, params)
    print(validate_epoch(linear1), end=' ')
```

`0.8213 0.8532 0.8661 0.8736 0.8769 0.8808 0.8848 0.8869 0.8895 0.8914 0.8929 0.894 0.8949 0.8964 0.8973 0.8977 0.8978 0.8979 0.8981 0.8986 0.8998 0.9 0.9004 0.9016 0.9017 0.9027 0.9027 0.9032 0.9037 0.9047 0.9053 0.9058 0.9061 0.9062 0.9061 0.9062 0.906 0.9061 0.906 0.9063 `

`nn.Linear` does the same thing as our `init_params` and `linear1` functions together. Also, fastai’s `SGD` class provides us with methods that take care of updating the parameters and resetting the gradients of our model. By replacing some code, we can boil the training portion of our MNIST classifier down to the following:

```
def mnist_loss(xb, yb):
    # Cross-entropy between the model's activations (xb) and the one-hot labels (yb)
    loss = nn.CrossEntropyLoss()
    return loss(xb, yb)

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

def train_epoch_simple(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()       # update the parameters
        opt.zero_grad()  # reset the gradients for the next batch

def train_model(model, epochs):
    for i in range(epochs):
        train_epoch_simple(model)
        print(validate_epoch(model), end=' ')

linear_model = nn.Linear(28*28,10)
opt = SGD(linear_model.parameters(), lr=1)
train_model(linear_model, 20)
```

`0.8983 0.9057 0.9092 0.9113 0.9143 0.9149 0.9154 0.9154 0.9162 0.9161 0.9166 0.9167 0.9166 0.9166 0.9164 0.9169 0.917 0.9173 0.9173 0.9175 `

Fast.ai provides us with `Learner.fit`, which we can use instead of `train_model` to significantly reduce the amount of code we need to write. To use the function, we must create a `Learner`, which requires a `DataLoaders` of our training and validation datasets:

Then, we pass the `DataLoaders`, the model, the optimization function, the loss function, and optionally any metrics to print into the `Learner` constructor:

Finally, we can call `Learner.fit`:

epoch | train_loss | valid_loss | batch_accuracy | time |
---|---|---|---|---|
0 | 0.413918 | 0.365917 | 0.897400 | 00:00 |
1 | 0.326721 | 0.337396 | 0.905100 | 00:00 |
2 | 0.302027 | 0.325679 | 0.908500 | 00:00 |
3 | 0.289524 | 0.319092 | 0.910000 | 00:00 |
4 | 0.281248 | 0.314855 | 0.912800 | 00:00 |
5 | 0.275084 | 0.311911 | 0.913200 | 00:00 |
6 | 0.270192 | 0.309764 | 0.913500 | 00:00 |
7 | 0.266153 | 0.308146 | 0.913900 | 00:00 |
8 | 0.262723 | 0.306901 | 0.914300 | 00:00 |
9 | 0.259748 | 0.305928 | 0.914900 | 00:00 |

To expand upon our model, we can add another layer on top of what we have now. However, mathematically speaking, the composition of two linear functions is another linear function. Therefore, stacking two linear classifiers on top of each other is equivalent to having just one linear classifier.
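This is easy to check numerically. The sketch below stacks two random linear layers and then collapses them into a single layer with weights `W1 @ W2` and bias `b1 @ W2 + b2`; the layer sizes are arbitrary choices for the demonstration.

```python
import torch

torch.manual_seed(0)
dt = torch.float64  # double precision keeps rounding error negligible

x = torch.rand(5, 784, dtype=dt)
W1, b1 = torch.randn(784, 60, dtype=dt), torch.randn(60, dtype=dt)
W2, b2 = torch.randn(60, 10, dtype=dt), torch.randn(10, dtype=dt)

stacked = (x @ W1 + b1) @ W2 + b2  # two linear layers back to back

W, b = W1 @ W2, b1 @ W2 + b2       # the single equivalent layer
collapsed = x @ W + b

print(torch.allclose(stacked, collapsed))
```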

So we must add some non-linearity between the linear layers. We often do this with activation functions; a common one is the `ReLU` function:
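A sketch of a two-layer network with a `ReLU` in between, built with PyTorch's `nn.Sequential`. The hidden size of 60 is the one described in this post; the variable name `simple_net` and the random input batch are my assumptions.

```python
import torch
from torch import nn

simple_net = nn.Sequential(
    nn.Linear(28 * 28, 60),  # 784 pixels -> 60 hidden activations
    nn.ReLU(),               # replaces every negative value with 0
    nn.Linear(60, 10),       # 60 hidden activations -> one output per digit
)

xb = torch.rand(4, 28 * 28)  # synthetic batch of four images
print(simple_net(xb).shape)
```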

`nn.Sequential` creates a module that will call each of the listed layers or functions in turn.

Our first layer takes in 784 inputs (pixels) and outputs 60 numbers. Those 60 numbers are then each passed into the `ReLU` function before going into the second layer. The second layer has 10 outputs, which, as before, are the activations for each digit; the softmax inside the cross-entropy loss turns them into the probability of each digit being the label.

We can train this model using `Learner.fit` as well (we are using more epochs and a smaller learning rate, since it is a larger model):

The output is omitted to save room; the training process is recorded in `learn.recorder`, with the table of output stored in the `values` attribute, so we can plot the accuracy over training as:

Though our very basic model is far from perfect, we can still submit it to the competition! Recall that we stored the `test.csv` data in the `df_test` `DataFrame`. We first need to normalize the pixels, then plug them into our model:

Finally, we can create a submission file in our current directory:
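The prediction and the file creation can be sketched in one helper. It assumes pixels are scaled by 255 in the same way as the training data, and it writes the `ImageId,Label` layout used by `sample_submission.csv`. The function name `make_submission` and the output file name are hypothetical choices of mine.

```python
import pandas as pd
import torch
from torch import nn

def make_submission(model, df_test, out_path="submission.csv"):
    """Predict a digit for every test row and write an ImageId,Label CSV."""
    # Assumption: same 0-255 -> 0-1 scaling as the training data.
    X_test = torch.tensor(df_test.values, dtype=torch.float32) / 255
    with torch.no_grad():
        labels = model(X_test).argmax(dim=1)  # most activated output = predicted digit
    submission = pd.DataFrame({
        "ImageId": range(1, len(df_test) + 1),  # ids start at 1
        "Label": labels.numpy(),
    })
    submission.to_csv(out_path, index=False)
    return submission

# Usage with the trained model and the real test set:
# make_submission(simple_net, df_test)
```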

Then submit to Kaggle!