required libraries
from fastai.vision.all import *
Seongbin Park
August 5, 2022
This post will cover how to classify handwritten digits of the MNIST dataset using a simple neural network. At the same time, I will be taking a stab at the Kaggle Digit Recognizer contest.
Credits: I will be working off of chapter 4 of the fast.ai book, which covers binary classification of 3’s and 7’s. Other resources are linked.
First, we will have to import the MNIST dataset itself. We can import it using the fast.ai library (path = untar_data(URLs.MNIST)), but I will download the dataset from Kaggle instead.
If you are following along and haven’t set up the Kaggle API yet, do so by following the README of the official repo. You will need a Kaggle account. After everything is set up, we can run the following code block:
Downloading digit-recognizer.zip to /home/jupyter/projects/digit-classifier
0%| | 0.00/15.3M [00:00<?, ?B/s]
100%|███████████████████████████████████████| 15.3M/15.3M [00:00<00:00, 161MB/s]
Note that in Jupyter notebooks, the exclamation mark ! is used to execute shell commands. The dataset should be downloaded to your project directory as a zip file. Run the following code block to extract the contents into a directory named MNIST_dataset:
Archive: digit-recognizer.zip
inflating: MNIST_dataset/sample_submission.csv
inflating: MNIST_dataset/test.csv
inflating: MNIST_dataset/train.csv
Let’s take a look at test.csv (the test set) and train.csv (the training set):
index | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 784 columns
index | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 rows × 785 columns
Now that we have downloaded the data, we need to shape it for training and validation.
To train our model, we need to separate the independent variables (pixels) from the dependent variable (label) and normalize the pixel values. The labels will be represented using one-hot encoding.
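# Assumption: X_train is not defined in the surviving text; scaling the pixel columns to [0, 1] is one plausible definition.
X_train = tensor(df_train.drop(columns='label').to_numpy()) / 255.0
y_train_numeric = df_train['label']
# One-hot encode: row i gets a 1 in the column given by its label.
rows = np.arange(y_train_numeric.size)
y_train = tensor(np.zeros((y_train_numeric.size, 10)))
y_train[rows, y_train_numeric] = 1
X_train.shape, y_train.shape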
(torch.Size([42000, 784]), torch.Size([42000, 10]))
X_train.shape and y_train.shape tell us that we have 42000 digits in our dataset, with each digit having 784 pixels and a one-hot label of length 10. We will use tensors to take advantage of faster GPU computations.
We want to create a PyTorch Dataset, which is required to return a tuple of (x,y) when indexed. Python provides a zip function which, when combined with list, can do this easily:
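As a minimal sketch of the idea, assuming small random stand-ins for X_train and y_train (X_demo, y_demo and ds_demo are hypothetical names, not the ones used in this post):

```python
import torch

# Hypothetical stand-ins for X_train and y_train: 6 "images" of 784 pixels,
# each with a one-hot label over 10 classes.
X_demo = torch.rand(6, 784)
y_demo = torch.zeros(6, 10)
y_demo[torch.arange(6), torch.tensor([3, 1, 4, 1, 5, 9])] = 1

# zip pairs each image with its label; list makes the result indexable.
ds_demo = list(zip(X_demo, y_demo))

x0, y0 = ds_demo[0]
print(len(ds_demo), x0.shape, y0.shape)
```

Indexing ds_demo[i] returns the (x, y) tuple for item i, which is all a map-style dataset needs besides a length.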
Next, we want to split our dataset ds into training and validation sets:
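The split code is not shown here, so the following is only a sketch of one common approach, torch.utils.data.random_split, on a toy dataset. The names ds_demo, train_demo and val_demo and the 8/2 sizes are assumptions for illustration.

```python
import torch
from torch.utils.data import random_split

# Hypothetical toy dataset of 10 (x, y) pairs standing in for ds.
ds_demo = list(zip(torch.rand(10, 784), torch.zeros(10, 10)))

# Split into 8 training and 2 validation items; the sizes must sum to
# len(ds_demo). A seeded generator makes the split reproducible.
train_demo, val_demo = random_split(
    ds_demo, [8, 2], generator=torch.Generator().manual_seed(0)
)
print(len(train_demo), len(val_demo))
```

Because the split is random, every item lands in exactly one of the two subsets, so the validation set never overlaps with the training set.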
Later, we will be using stochastic gradient descent, which requires that we have “mini-batches” of our dataset. We can create a DataLoader from our train dataset to do so:
(torch.Size([256, 784]), torch.Size([256, 10]))
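Here is a rough sketch of how such mini-batches can be produced with PyTorch's DataLoader, using a hypothetical 600-item stand-in for the training set. The batch size of 256 is taken from the shape shown above; shuffle=True is my assumption.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical stand-in for the training set: 600 (x, y) pairs.
train_demo = list(zip(torch.rand(600, 784), torch.zeros(600, 10)))

# batch_size=256 matches the batch shape shown above; shuffle=True reorders
# the items every epoch.
dl_demo = DataLoader(train_demo, batch_size=256, shuffle=True)

xb, yb = next(iter(dl_demo))
batch_sizes = [len(x) for x, _ in dl_demo]
print(xb.shape, yb.shape, batch_sizes)
```

Note that the last batch is smaller whenever the dataset size is not a multiple of the batch size.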
We can do the same for our validation (val) dataset:
Now that our data is ready, we can start training our classification model. We will start with a linear model, then add some non-linearity to it!
First, we must randomly initialize the bias and all weights for each pixel. Since we have 10 labels (one for each digit), there must be 10 outputs, so our weights matrix is of size 784x10.
The prediction given a tensor x is
\[\text{prediction} = x \cdot \text{weights} + \text{bias}.\]
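A minimal sketch of the initialization and the prediction formula follows. The names init_params and linear1 appear later in this post, but their exact signatures are not shown, so the ones below are assumptions modeled on chapter 4 of the fast.ai book.

```python
import torch

def init_params(size, std=1.0):
    # Draw from a normal distribution and track gradients for training.
    return (torch.randn(size) * std).requires_grad_()

weights = init_params((784, 10))  # one weight per pixel per digit
bias = init_params(10)            # one bias per digit

def linear1(xb):
    # Matrix product: (batch, 784) @ (784, 10) -> (batch, 10), plus bias.
    return xb @ weights + bias

xb = torch.rand(4, 784)  # hypothetical batch of 4 normalized images
preds = linear1(xb)
print(preds.shape)
```

Each row of preds holds 10 activations, one per digit.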
To calculate a gradient, we need a loss function. Since there are more than 2 labels, we will use cross entropy loss, which is built on the softmax function rather than the sigmoid function (which is used for binary classification).
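To make the link between softmax and cross entropy concrete, here is a small sketch on made-up activations. The logits and labels are hypothetical, and the labels are written as class indices rather than one-hot vectors to keep the example short.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # hypothetical raw model outputs
labels = torch.tensor([5, 0, 9, 7])  # hypothetical true digits

# Softmax turns each row of logits into probabilities that sum to 1.
probs = F.softmax(logits, dim=1)

# Cross entropy is the mean negative log-probability of the correct class.
manual_loss = -torch.log(probs[torch.arange(4), labels]).mean()

# PyTorch fuses both steps into one numerically stabler call.
builtin_loss = F.cross_entropy(logits, labels)
print(manual_loss.item(), builtin_loss.item())
```

The two numbers agree up to floating point error, which is why the model itself never needs an explicit softmax layer during training.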
For testing and demonstration purposes, let’s work with a smaller batch than the ones created when shaping our data.
tensor([[ -0.5287, 12.6047, -9.0290, -1.7505, 3.7686, 18.8489, -1.8141,
-12.2232, -5.1421, -6.6316],
[ 18.8942, 10.7898, 9.2573, 7.9989, -1.2884, 19.0238, -5.8788,
6.5045, -10.2431, 5.5865],
[ 6.3639, 14.0687, 0.7705, -1.3580, 1.4220, 7.3108, -7.4359,
-6.8101, -5.9212, 23.7016],
[-14.7847, 3.0711, -0.6092, 2.2720, -1.1361, 3.7617, 5.1197,
5.3868, -1.5228, -7.6523]], grad_fn=<AddBackward0>)
Now we can calculate the gradients:
(torch.Size([784, 10]), tensor(2.4328e-10), tensor([5.9605e-08]))
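The gradient step described above can be sketched as follows. This is an illustration on random data with class-index labels, not the exact cell that produced the numbers above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
weights = torch.randn(784, 10, requires_grad=True)
bias = torch.randn(10, requires_grad=True)

xb = torch.rand(4, 784)          # hypothetical mini-batch
yb = torch.tensor([5, 0, 9, 7])  # hypothetical labels (class indices)

preds = xb @ weights + bias
loss = F.cross_entropy(preds, yb)
loss.backward()                  # fills weights.grad and bias.grad

print(weights.grad.shape, bias.grad.shape, bias.grad.sum())
```

Each gradient has the same shape as its parameter. For cross entropy, the bias gradients also sum to roughly zero, because each row of softmax probabilities and each label both sum to 1.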
The following function combines the above code and generalizes to models other than linear1.
(tensor(4.8657e-10), tensor([1.1921e-07]))
Using the calculated gradients, we can update the weights at each step. We need to specify a learning rate and reset the gradients to 0 after every update, since loss.backward actually adds the gradients of loss to any gradients that are currently stored.
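A tiny sketch of why the reset matters, using a made-up two-element parameter w, a made-up loss, and a hypothetical learning rate of 0.1:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # hypothetical parameters
lr = 0.1                                          # hypothetical learning rate

def loss_fn():
    return (w ** 2).sum()   # gradient is 2 * w

loss_fn().backward()
first = w.grad.clone()      # gradient after one backward pass
loss_fn().backward()        # without a reset, gradients accumulate
second = w.grad.clone()     # twice the first gradient

w.grad.zero_()              # reset before the real update
loss_fn().backward()
w.data -= lr * w.grad       # one gradient descent step
w.grad.zero_()              # reset again for the next step
print(first, second, w.data)
```

The second gradient is exactly double the first, which is the accumulation that zeroing guards against.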
Finally, we can define a function that trains the model for one epoch:
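As a rough sketch of such a function on toy data: the call signature train_epoch(model, params) follows how it is used later in this post, while the body, the random data, the learning rate, and the use of class-index labels are my assumptions.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

torch.manual_seed(0)
# Hypothetical toy data: 64 random "images" with random digit labels.
X = torch.rand(64, 784)
y = torch.randint(0, 10, (64,))
dl = DataLoader(list(zip(X, y)), batch_size=16)

weights = (torch.randn(784, 10) * 0.01).requires_grad_()
bias = torch.zeros(10, requires_grad=True)
lr = 0.01  # hypothetical learning rate

def linear1(xb):
    return xb @ weights + bias

def train_epoch(model, params):
    for xb, yb in dl:
        loss = F.cross_entropy(model(xb), yb)
        loss.backward()
        for p in params:
            p.data -= p.grad * lr   # gradient descent step
            p.grad.zero_()          # reset for the next batch

loss_before = F.cross_entropy(linear1(X), y).item()
for _ in range(20):
    train_epoch(linear1, (weights, bias))
loss_after = F.cross_entropy(linear1(X), y).item()
print(loss_before, loss_after)
```

Even on random labels the training loss goes down, since the model can memorize a dataset this small. That makes it a quick sanity check that the update loop is wired correctly.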
We also probably want to check the accuracy of our model. The label that the model predicts is the label with the highest activation:
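A minimal sketch of this check, assuming one-hot labels as in this post. The name batch_accuracy is used later in the post, but this body is my guess at it, and the activations are made up with 4 classes instead of 10 for readability.

```python
import torch

def batch_accuracy(preds, yb):
    # The predicted digit is the index of the largest activation;
    # the true digit is the index of the 1 in the one-hot label.
    correct = preds.argmax(dim=1) == yb.argmax(dim=1)
    return correct.float().mean()

# Hypothetical activations for 3 images over 4 classes.
preds = torch.tensor([[0.1, 2.0, -1.0, 0.3],
                      [1.5, 0.2,  0.1, 0.0],
                      [0.0, 0.1,  0.2, 3.0]])
yb = torch.tensor([[0., 1., 0., 0.],
                   [0., 0., 1., 0.],
                   [0., 0., 0., 1.]])
acc = batch_accuracy(preds, yb)
print(acc)  # 2 of the 3 predictions match their labels
```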
To get the accuracy for the whole epoch, we must call batch_accuracy with batches of the validation dataset, then take the mean over all batches.
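Sketched on a toy validation set, this could look like the following. Unlike the validate_epoch used later in this post, this version takes the validation DataLoader as an explicit argument (my choice, to keep the example self-contained), and the "model" is just the identity function.

```python
import torch
from torch.utils.data import DataLoader

def batch_accuracy(preds, yb):
    return (preds.argmax(dim=1) == yb.argmax(dim=1)).float().mean()

def validate_epoch(model, valid_dl):
    # Accuracy per batch, then the mean of those values, rounded for printing.
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)

# Hypothetical validation set where the "model" is the identity function,
# so the inputs double as activations.
X = torch.eye(4)
y = torch.eye(4)[[0, 1, 2, 0]]   # last label deliberately mismatched
valid_dl = DataLoader(list(zip(X, y)), batch_size=2)
val_acc = validate_epoch(lambda xb: xb, valid_dl)
print(val_acc)
```

One caveat: the mean of per-batch accuracies equals the overall accuracy only when all batches have the same size, so a smaller final batch is slightly over-weighted.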
Finally, we can see if our code works by checking if the accuracy improves!
params = weights, bias
for i in range(40):
    train_epoch(linear1, params)
    print(validate_epoch(linear1), end=' ')
0.8213 0.8532 0.8661 0.8736 0.8769 0.8808 0.8848 0.8869 0.8895 0.8914 0.8929 0.894 0.8949 0.8964 0.8973 0.8977 0.8978 0.8979 0.8981 0.8986 0.8998 0.9 0.9004 0.9016 0.9017 0.9027 0.9027 0.9032 0.9037 0.9047 0.9053 0.9058 0.9061 0.9062 0.9061 0.9062 0.906 0.9061 0.906 0.9063
nn.Linear does the same thing as our init_params and linear1 functions together. Also, fastai’s SGD class provides us with functions that take care of updating the parameters and resetting the gradients of our model. By replacing some code, we can boil the training portion of our MNIST classifier down to the following:
def mnist_loss(xb, yb):
    loss = nn.CrossEntropyLoss()
    return loss(xb, yb)

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

def train_epoch_simple(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()

def train_model(model, epochs):
    for i in range(epochs):
        train_epoch_simple(model)
        print(validate_epoch(model), end=' ')

linear_model = nn.Linear(28*28,10)
opt = SGD(linear_model.parameters(), lr=1)
train_model(linear_model, 20)
0.8983 0.9057 0.9092 0.9113 0.9143 0.9149 0.9154 0.9154 0.9162 0.9161 0.9166 0.9167 0.9166 0.9166 0.9164 0.9169 0.917 0.9173 0.9173 0.9175
Fast.ai provides us with Learner.fit, which we can use instead of train_model to significantly reduce the amount of code we need to write. To use the function, we must create a Learner, which requires a DataLoaders object that wraps our training and validation DataLoaders:
Then, we pass the DataLoaders, the model, the optimization function, the loss function, and optionally any metrics to print to the Learner constructor:
Finally, we can call Learner.fit:
epoch | train_loss | valid_loss | batch_accuracy | time |
---|---|---|---|---|
0 | 0.413918 | 0.365917 | 0.897400 | 00:00 |
1 | 0.326721 | 0.337396 | 0.905100 | 00:00 |
2 | 0.302027 | 0.325679 | 0.908500 | 00:00 |
3 | 0.289524 | 0.319092 | 0.910000 | 00:00 |
4 | 0.281248 | 0.314855 | 0.912800 | 00:00 |
5 | 0.275084 | 0.311911 | 0.913200 | 00:00 |
6 | 0.270192 | 0.309764 | 0.913500 | 00:00 |
7 | 0.266153 | 0.308146 | 0.913900 | 00:00 |
8 | 0.262723 | 0.306901 | 0.914300 | 00:00 |
9 | 0.259748 | 0.305928 | 0.914900 | 00:00 |
To expand upon our model, we can add another layer on top of what we have now. However, mathematically speaking, the composition of two linear functions is another linear function. Therefore, stacking two linear classifiers on top of each other is equivalent to having just one linear classifier.
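This claim is easy to check numerically. The sketch below builds two linear layers and then a single layer with the combined weights; the hidden size of 60 is borrowed from the model used later in this post, and the layer names are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer1 = nn.Linear(784, 60)
layer2 = nn.Linear(60, 10)

# Collapse the two layers into a single one:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
combined = nn.Linear(784, 10)
with torch.no_grad():
    combined.weight.copy_(layer2.weight @ layer1.weight)
    combined.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.rand(5, 784)
same = torch.allclose(layer2(layer1(x)), combined(x), atol=1e-4)
print(same)
```

The stacked pair and the single combined layer give the same outputs up to floating point error, so the extra layer adds parameters but no expressive power.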
So, we must add some non-linearity between the linear layers. We often do this through activation functions; a common one is the ReLU function:
nn.Sequential creates a module that will call each of the listed layers or functions in turn.
Our first layer takes in 784 inputs (pixels) and outputs 60 numbers. Those 60 numbers are then each passed into the ReLU function before going into the second layer. The second layer has 10 outputs, one activation per digit as before; the softmax inside the loss function turns them into the probability of each digit being the label.
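The architecture described above can be sketched in plain PyTorch like this (simple_net is a hypothetical name, and the input batch is random):

```python
import torch
from torch import nn

simple_net = nn.Sequential(
    nn.Linear(28 * 28, 60),  # 784 pixels -> 60 hidden activations
    nn.ReLU(),               # replaces negative values with 0
    nn.Linear(60, 10),       # 60 hidden activations -> 10 digit scores
)

xb = torch.rand(4, 784)      # hypothetical batch of 4 normalized images
hidden = simple_net[1](simple_net[0](xb))  # output of the first layer + ReLU
out = simple_net(xb)
print(hidden.shape, out.shape, hidden.min().item())
```

After the ReLU no hidden activation is negative, and the final layer still produces 10 numbers per image.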
We can train this model using Learner.fit as well (we are using more epochs and a smaller learning rate, since it is a larger model):
The output is omitted to save room; the training process is recorded in learn.recorder, with the table of output stored in the values attribute, so we can plot the accuracy over training as:
Though our very basic model is far from perfect, we can still submit it to the competition! Recall that we stored the test.csv data in the df_test DataFrame. We first need to normalize the pixels, then feed them into our model:
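A sketch of that prediction step, with random pixel values and an untrained model as stand-ins (raw_pixels, model and test_preds are hypothetical names; with the real data the pixels would come from df_test instead):

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical stand-ins: raw pixel values in 0..255 for 5 test images,
# and an untrained model with the same shape as the one in this post.
raw_pixels = torch.randint(0, 256, (5, 784))
model = nn.Sequential(nn.Linear(784, 60), nn.ReLU(), nn.Linear(60, 10))

X_test = raw_pixels.float() / 255.0   # same scaling as the training data

with torch.no_grad():                 # no gradients needed for inference
    test_preds = model(X_test).argmax(dim=1)
print(test_preds)
```

The key point is that the test pixels get exactly the same scaling as the training pixels; otherwise the model sees inputs on a scale it was never trained on.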
Finally, we can create a submission file in our current directory:
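A minimal sketch of writing such a file with pandas, assuming made-up predictions. The ImageId and Label columns follow the layout of sample_submission.csv, and the temporary output path is only for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical predictions for 5 test images.
test_preds = [2, 0, 9, 9, 3]

# Same layout as sample_submission.csv: ImageId starts at 1, Label is the digit.
submission = pd.DataFrame({
    "ImageId": range(1, len(test_preds) + 1),
    "Label": test_preds,
})

# Written to a temporary directory here; use a path in your project instead.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(out_path, index=False)   # index=False drops the row index

with open(out_path) as f:
    csv_text = f.read()
print(csv_text)
```

Passing index=False matters: without it pandas writes an extra unnamed column that is not part of the expected format.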
Then submit to Kaggle!