This post covers how to fine-tune an NLP classification model using Hugging Face Transformers. I will train the model on the cleaned Reddit depression dataset, which labels each post according to whether or not it was made in the r/depression subreddit.

The final model classifies whether or not a block of text was written in the r/depression subreddit with about 98% accuracy. I will also create a demo where users can input text and see whether it shows signs of depression.

The similarity of a block of text to posts in r/depression is not perfectly correlated with the text showing signs of clinical depression, so the accuracy of our demo as a depression detector cannot be quantified. However, it can still provide some insight into what kinds of text might have been written by people experiencing depression.

Credits go to the Hugging Face documentation and fast.ai, both of which are great educational resources.

Obtaining Data

First, I will fetch the dataset using opendatasets:

try:
    import opendatasets as od
except ImportError:
    # Install the package on first use (notebook shell command)
    !pip install opendatasets
    import opendatasets as od

from pathlib import Path

path = Path("depression-reddit-cleaned")

if not path.exists():
    od.download("https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned")

Brief EDA

Now that the dataset is downloaded, we can load it into a DataFrame:

import pandas as pd

df = pd.read_csv(path/'depression_dataset_reddit_cleaned.csv')

df

	clean_text	is_depression
0	we understand that most people who reply immed...	1
1	welcome to r depression s check in post a plac...	1
2	anyone else instead of sleeping more when depr...	1
3	i ve kind of stuffed around a lot in my life d...	1
4	sleep is my greatest and most comforting escap...	1
...	...	...
7726	is that snow	0
7727	moulin rouge mad me cry once again	0
7728	trying to shout but can t find people on the list	0
7729	ughh can t find my red sox hat got ta wear thi...	0
7730	slept wonderfully finally tried swatching for ...	0

7731 rows × 2 columns

The dataset has 7731 examples and 2 columns: clean_text and is_depression.

df.count()

clean_text       7731
is_depression    7731
dtype: int64

Since this is a cleaned dataset, there are no null values or unexpected labels:

df.isnull().sum()

clean_text       0
is_depression    0
dtype: int64

df['is_depression'].unique()

array([1, 0])

Let’s take a look at a text block:

df['clean_text'][2]

'anyone else instead of sleeping more when depressed stay up all night to avoid the next day from coming sooner may be the social anxiety in me but life is so much more peaceful when everyone else is asleep and not expecting thing of you'

Since neural networks expect numbers, not sentences, as inputs, we must convert each text block into a sequence of numbers. Each text block is first split into tokens (tokenization), which are then converted to numbers (numericalization).
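To make the two steps concrete, here is a toy sketch in plain Python. It is only an illustration: it splits on whitespace and builds its own vocabulary, whereas the DistilBERT tokenizer used later splits text into subword units and relies on a pretrained vocabulary. The example texts and the vocab variable are made up for this sketch.

```python
# Toy illustration only: real tokenizers use subword units and a fixed,
# pretrained vocabulary rather than whitespace splitting.
texts = ["life is so much more peaceful", "is that snow"]

# Tokenization: split each text into tokens (here, simply on whitespace).
tokenized = [text.split() for text in texts]

# Build a vocabulary that maps each distinct token to an integer id,
# in order of first appearance.
vocab = {}
for tokens in tokenized:
    for token in tokens:
        vocab.setdefault(token, len(vocab))

# Numericalization: replace each token with its id.
numericalized = [[vocab[token] for token in tokens] for tokens in tokenized]

print(tokenized[1])      # ['is', 'that', 'snow']
print(numericalized[1])  # [1, 6, 7]
```

Note that "is" gets the same id in both texts, which is what lets a model treat it as the same word wherever it appears.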

Tokenization

Before we tokenize our data, we need to convert our DataFrame into a Hugging Face Dataset.

from datasets import Dataset

ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['clean_text', 'is_depression'],
    num_rows: 7731
})

This is for later, but Hugging Face Transformers assumes that the label column is named labels. In our dataset it is currently is_depression, so we need to rename it:

ds = ds.rename_columns({'is_depression':'labels'})
ds

Dataset({
    features: ['clean_text', 'labels'],
    num_rows: 7731
})

To load a tokenizer, we use AutoTokenizer:

from transformers import AutoTokenizer

I will use the DistilBERT base model, which, as the name suggests, is a distilled version of the BERT base model.

model_name = "distilbert-base-uncased"

We will use from_pretrained to instantiate a tokenizer class from a pretrained model vocabulary. The tokenizer class to instantiate is selected based on the model (“distilbert-base-uncased” in our case).

tokenizer = AutoTokenizer.from_pretrained(model_name)

Next, we create a preprocessing function that tokenizes the text and truncates sequences that are longer than the model's maximum input length:

def tokenize_function(examples):
    return tokenizer(examples["clean_text"], truncation=True)

Using the function above and the Datasets map function, we can apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

tokenized_ds = ds.map(tokenize_function, batched=True)

While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient: each batch is padded only to the length of its own longest sequence. data_collator will be used later for this purpose.

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
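To see why dynamic padding saves work, here is a small sketch in plain Python. It does not use DataCollatorWithPadding itself; the token ids, the PAD_ID value, and the pad_batch helper are all made up for illustration. It compares padding everything to the longest sequence overall with padding each batch of two to its own longest sequence.

```python
# Hypothetical token-id sequences of different lengths (not real tokenizer output).
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]
PAD_ID = 0  # assumed padding id for this sketch


def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence in a batch to the length of the longest one in it."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Static padding: every sequence is padded to the longest one in the dataset.
static = pad_batch(sequences)
static_tokens = sum(len(seq) for seq in static)

# Dynamic padding: each batch of two is padded only to its own longest sequence.
batches = [sequences[i:i + 2] for i in range(0, len(sequences), 2)]
dynamic = [pad_batch(batch) for batch in batches]
dynamic_tokens = sum(len(seq) for batch in dynamic for seq in batch)

print(static_tokens, dynamic_tokens)  # 32 26
```

The model has to process fewer padding tokens in the dynamic case, and the gap grows when a dataset contains a few very long texts among many short ones.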

Creating Training and Validation Sets

We can easily split our dataset into training and validation sets using train_test_split. The held-out split is stored under the key test, and I use it as the validation set:

dds = tokenized_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['clean_text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 5798
    })
    test: Dataset({
        features: ['clean_text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 1933
    })
})
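The row counts above follow from the 0.25 test fraction. As a quick check, assuming the test size is rounded up and the remainder goes to training (which is consistent with the output shown):

```python
import math

n_rows = 7731
test_fraction = 0.25

# Assumption for this sketch: the test size is rounded up,
# and the remaining rows form the training set.
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)  # 5798 1933
```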

Train

I will turn off the warnings returned by Hugging Face for readability:

import warnings, logging

warnings.simplefilter('ignore')
logging.disable(logging.WARNING)

Just as we instantiated our tokenizer, we will instantiate our model using from_pretrained.

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

To evaluate our model’s performance, we will use accuracy:

from datasets import load_metric

metric = load_metric("accuracy")

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
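For intuition, here is roughly what that computation amounts to, written with NumPy only. This is a sketch rather than the metric's actual implementation; the function name accuracy_from_logits and the logits below are made up for illustration.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the labels."""
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Hypothetical logits for four examples and two classes.
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```

The third example is predicted as class 1 but labelled 0, so three out of four predictions are correct.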

We will be using the Trainer class, which provides an API for feature-complete training in PyTorch.

Before instantiating a Trainer, we need to create a TrainingArguments object, which holds the hyperparameters and other settings that customize training:

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch"
)

Creating and training a Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dds["train"],
    eval_dataset=dds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

[726/726 02:26, Epoch 2/2]

Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.069008	0.976203
2	0.126100	0.073128	0.981376

TrainOutput(global_step=726, training_loss=0.09943613467466075, metrics={'train_runtime': 160.3661, 'train_samples_per_second': 72.31, 'train_steps_per_second': 4.527, 'total_flos': 1083160046271312.0, 'train_loss': 0.09943613467466075, 'epoch': 2.0})
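The 726 steps in the progress bar can be reproduced from the training set size, batch size, and number of epochs. The sketch below assumes a single device and no gradient accumulation; total_steps is a helper name made up for this example.

```python
import math


def total_steps(n_examples, batch_size, epochs):
    # Assumes one device and no gradient accumulation:
    # the last, smaller batch of each epoch still counts as a step.
    return math.ceil(n_examples / batch_size) * epochs


steps_per_epoch = math.ceil(5798 / 16)
print(steps_per_epoch)           # 363
print(total_steps(5798, 16, 2))  # 726
```

This also explains the "No log" entry for epoch 1: as far as I know, the Trainer logs the training loss every 500 steps by default, and the first epoch ends at step 363, before the first log is written.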

The accuracy is not bad, but I wanted to see if I could tweak the hyperparameters to improve the performance of our model. To make creating trainers easier, I defined a get_trainer function:

def get_trainer(model_name, data_collator=None, lr=2e-5, bs=16, epochs=3, train=dds["train"], test=dds["test"]):
    
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=lr,
        per_device_train_batch_size=bs,
        per_device_eval_batch_size=bs,
        num_train_epochs=epochs,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        fp16=True,
    )
    
    return Trainer(
        model=model,
        args=training_args,
        train_dataset=train,
        eval_dataset=test,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

Since the whole dataset takes a while to train, I selected a smaller subset of the dataset for testing purposes:

small_train = dds["train"].shuffle(seed=42).select(range(1000))
small_eval = dds["test"].shuffle(seed=42).select(range(1000))

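In case our GPU runs out of memory, we can empty the cache. I do this before and after each run while trying learning rates of 1e-5 and 1e-4 on the small subset: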

import torch, gc

gc.collect()
torch.cuda.empty_cache()

for i in range(-5, -3):
    lr = 10**i
    trainer = get_trainer(model_name, lr=lr, bs=32, epochs=2, train=small_train, test=small_eval)
    print(lr)
    trainer.train()
    gc.collect()
    torch.cuda.empty_cache()

1e-05

[64/64 00:12, Epoch 2/2]

Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.459894	0.856000
2	No log	0.383288	0.870000

0.0001

[64/64 00:12, Epoch 2/2]

Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.106224	0.964000
2	No log	0.087845	0.972000

After many trials, I concluded that other models do not provide a performance benefit significant enough to justify their longer training time. Also, the model seems to perform better when the data is padded using data_collator than when it is not.

The best learning rate for a batch size of 32 seems to be 1e-4, so I trained with the whole training dataset using these hyperparameters:

trainer = get_trainer(model_name, lr=1e-4, bs=32, epochs=2, data_collator=data_collator)
trainer.train()

[364/364 01:03, Epoch 2/2]

Epoch	Training Loss	Validation Loss	Accuracy
1	No log	0.070679	0.978272
2	No log	0.056363	0.984480

TrainOutput(global_step=364, training_loss=0.09329656454233023, metrics={'train_runtime': 63.2764, 'train_samples_per_second': 183.26, 'train_steps_per_second': 5.753, 'total_flos': 1309037659010832.0, 'train_loss': 0.09329656454233023, 'epoch': 2.0})

trainer.evaluate()

[61/61 00:03]

{'eval_loss': 0.05636342242360115,
 'eval_accuracy': 0.9844800827728919,
 'eval_runtime': 3.1079,
 'eval_samples_per_second': 621.973,
 'eval_steps_per_second': 19.628,
 'epoch': 2.0}

An accuracy of 98.4% on the validation set! We will save the model using save_model so that we can use it in our demo:

trainer.save_model("./model")

Hugging Face pipelines simplify inference. The code block below uses the model that we trained above to determine whether “Today is a great day!” and “I have no motivation to do anything. I feel useless.” show signs of depression (or more accurately, how similar they are to posts written in r/depression).

from transformers import pipeline

examples = ["Today is a great day!", "I have no motivation to do anything. I feel useless."]
pipe = pipeline("text-classification", model="./model", tokenizer=tokenizer)
pipe(examples)

[{'label': 'LABEL_0', 'score': 0.9913545250892639},
 {'label': 'LABEL_1', 'score': 0.9631195068359375}]
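Each score is the probability the model assigns to the predicted label. As a rough sketch of where it comes from, assuming the pipeline applies a softmax over the model's two output logits (which I believe is its default when there is more than one label), the calculation looks like this. The logits below are made up, not taken from the trained model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-2.1, 2.6]  # hypothetical model outputs for one text
probs = softmax(logits)

# The reported label is the class with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": f"LABEL_{best}", "score": probs[best]}
print(result)
```

Because only the top class is reported, the score is always at least 0.5 for a two-class model.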

In our case, ‘LABEL_1’ means is_depression is 1 and ‘LABEL_0’ means otherwise. I will convert these labels to True and False, then convert the output of the pipe to a {label: score} dictionary, since that is what Gradio requires.

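def is_depression(txt):
    # The pipeline returns a list with one {'label': ..., 'score': ...} dict per input text
    preds = pipe(txt)
    # 'LABEL_0' becomes False and any other label becomes True; the score is kept as the value
    return [{pred['label'] != 'LABEL_0': pred['score']} for pred in preds]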

is_depression(["The tacos I ate today were horrible", "I am losing interest in things I had enjoyed. I hate life"])

[{False: 0.9979074001312256}, {True: 0.9663745164871216}]

Our interface takes a single text input and produces a single output, so the function passed to Gradio should return one dictionary rather than a list:

def predict_gradio(txt):
    return is_depression(txt)[0]

Finally, we can create our interface:

import gradio as gr

title = "Depression Classifier"
description = "An NLP classifier trained with Hugging Face Transformers."
# article is not defined anywhere in this post; an empty placeholder is assumed here
article = ""
interpretation = 'default'
enable_queue = True

gr.Interface(
    fn=predict_gradio,
    inputs=gr.inputs.Textbox(label="Text"),
    outputs=gr.outputs.Label(label="is_depression"),
    title=title,
    description=description,
    article=article,
    examples=examples,
    interpretation=interpretation,
    enable_queue=enable_queue,
).launch(share=True)

I built a web application hosted here using this Gradio API! To learn how to do this, refer to this article.