try:
    import opendatasets as od
except ImportError:
    # Install the package on first run (notebook shell command), then retry the import
    !pip install opendatasets
    import opendatasets as od
This post covers how to fine-tune an NLP classification model using Hugging Face Transformers. I will be using the cleaned Reddit depression dataset, which specifies whether or not a post was made in the r/depression subreddit, to train my model.
The final model will be able to classify whether or not a block of text was written in the r/depression subreddit with 98% accuracy. I will create a demo that lets users input text and see whether it shows signs of depression.
The similarity of a block of text to posts in r/depression is not perfectly correlated to the text showing signs of clinical depression, so the accuracy of our demo cannot be quantified. However, it can still provide some insight into what type of texts might have been written by depressed patients.
Credits go to the Hugging Face documentation as well as fast.ai, which are both great educational resources.
Obtaining Data
First, I will fetch the dataset using opendatasets:
from pathlib import Path
path = Path("depression-reddit-cleaned")
if not path.exists():
    od.download("https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned")
Brief EDA
Now that the dataset is imported, we can create a DataFrame:
import pandas as pd
df = pd.read_csv(path/'depression_dataset_reddit_cleaned.csv')
df
| | clean_text | is_depression |
---|---|---|
0 | we understand that most people who reply immed... | 1 |
1 | welcome to r depression s check in post a plac... | 1 |
2 | anyone else instead of sleeping more when depr... | 1 |
3 | i ve kind of stuffed around a lot in my life d... | 1 |
4 | sleep is my greatest and most comforting escap... | 1 |
... | ... | ... |
7726 | is that snow | 0 |
7727 | moulin rouge mad me cry once again | 0 |
7728 | trying to shout but can t find people on the list | 0 |
7729 | ughh can t find my red sox hat got ta wear thi... | 0 |
7730 | slept wonderfully finally tried swatching for ... | 0 |
7731 rows × 2 columns
It seems like we have 7731 examples in our dataset and 2 columns: clean_text and is_depression.
df.count()
clean_text 7731
is_depression 7731
dtype: int64
Since this is a cleaned dataset, there are no null values or weird labels:
df.isnull().sum()
clean_text 0
is_depression 0
dtype: int64
df['is_depression'].unique()
array([1, 0])
Let’s take a look at a text block:
df['clean_text'][2]
'anyone else instead of sleeping more when depressed stay up all night to avoid the next day from coming sooner may be the social anxiety in me but life is so much more peaceful when everyone else is asleep and not expecting thing of you'
Since neural networks expect numbers, not sentences, as inputs, we must somehow convert text blocks into sequences of numbers. Therefore, each text block is first split up into tokens (through tokenization), which are then converted to numbers (through numericalization).
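To make the two steps concrete, here is a minimal sketch that uses whitespace splitting and a tiny made-up vocabulary. The helpers `toy_tokenize`, `build_vocab`, and `numericalize` and the two-sentence corpus are assumptions for illustration only; the real tokenizer used below splits text into subword units and ships with a fixed vocabulary.

```python
# Toy illustration only: real tokenizers use subword units, not whitespace splitting.
def toy_tokenize(text):
    """Split text into lowercase, whitespace-separated tokens."""
    return text.lower().split()

def build_vocab(texts):
    """Give every distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"[UNK]": 0}
    for text in texts:
        for token in toy_tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Map each token to its id, falling back to the unknown id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in toy_tokenize(text)]

corpus = ["is that snow", "sleep is my escape"]
vocab = build_vocab(corpus)
ids = numericalize("is that sleep today", vocab)
print(ids)  # "today" is not in the toy vocabulary, so it maps to the unknown id
```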
Tokenization
Before we tokenize our data, we need to convert our DataFrame into a Dataset.
from datasets import Dataset
ds = Dataset.from_pandas(df)
ds
Dataset({
features: ['clean_text', 'is_depression'],
num_rows: 7731
})
This is for later, but Hugging Face Transformers always assumes that your label column is named labels. In our dataset it is currently is_depression, so we should rename it:
ds = ds.rename_columns({'is_depression':'labels'})
ds
Dataset({
features: ['clean_text', 'labels'],
num_rows: 7731
})
To import a tokenizer, we need to use AutoTokenizer:
from transformers import AutoTokenizer
I will use the DistilBERT base model, which, as the name suggests, is a distilled version of the BERT base model.
model_name = "distilbert-base-uncased"
We will use from_pretrained to instantiate a tokenizer class from a pretrained model vocabulary. The tokenizer class to instantiate is selected based on the model (“distilbert-base-uncased” in our case).
tokenizer = AutoTokenizer.from_pretrained(model_name)
Creating a preprocessing function to tokenize text and truncate sequences:
def tokenize_function(examples):
    return tokenizer(examples["clean_text"], truncation=True)
Using the function above and the Datasets map function, we can apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:
tokenized_ds = ds.map(tokenize_function, batched=True)
While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient. data_collator will be used later for this purpose.
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
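To illustrate why dynamic padding saves work, here is a toy sketch in plain Python. It compares padding each batch to its own longest sequence against padding every sequence to a single shared length up front. The `pad_batch` helper, the padding id of 0, and the example sequences are assumptions for illustration; this is not how DataCollatorWithPadding is implemented.

```python
# Toy illustration of dynamic padding; PAD_ID = 0 is an assumed padding id.
PAD_ID = 0

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence in one batch to the length of that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

sequences = [[5, 6], [7, 8, 9], [1], [2, 3, 4, 5, 6, 7, 8, 9]]
batches = [sequences[0:2], sequences[2:4]]  # two batches of two sequences each

# Dynamic padding: each batch is padded only to its own longest sequence.
dynamic = [pad_batch(batch) for batch in batches]
dynamic_tokens = sum(len(seq) for batch in dynamic for seq in batch)

# Padding up front: every sequence is padded to the longest sequence overall.
padded_all = pad_batch(sequences)
global_tokens = sum(len(seq) for seq in padded_all)

print(dynamic_tokens, global_tokens)
```

In this toy case the first batch never has to carry the padding that only the long sequence in the second batch requires, so fewer padded positions reach the model.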
Creating Validation and Test Sets
We can easily split our dataset into training and validation sets using train_test_split:
dds = tokenized_ds.train_test_split(0.25, seed=42)
dds
DatasetDict({
train: Dataset({
features: ['clean_text', 'labels', 'input_ids', 'attention_mask'],
num_rows: 5798
})
test: Dataset({
features: ['clean_text', 'labels', 'input_ids', 'attention_mask'],
num_rows: 1933
})
})
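The row counts above can be checked with a little arithmetic. This sketch assumes the test size is rounded up and the remaining rows go to the training set, which is consistent with the 5798 / 1933 split shown in the output.

```python
import math

n_rows = 7731      # rows in the full dataset
test_size = 0.25   # fraction passed to train_test_split

# Assumption: the test size is rounded up and the remainder becomes the training set.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
print(n_train, n_test)
```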
Train
I will turn off the warnings returned by Hugging Face for readability:
import warnings, logging
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)
Similarly to how we instantiated our tokenizer, we will instantiate our model using from_pretrained.
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
To evaluate our model’s performance, we will use accuracy:
from datasets import load_metric
metric = load_metric("accuracy")
import numpy as np
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
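To show what compute_metrics is doing, here is a dependency-free sketch of the same idea: take the index of the largest logit in each row as the predicted class, then count how often it matches the label. The `argmax` and `accuracy` helpers and the toy logits are made up for illustration.

```python
def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four examples and two classes.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # three of the four predictions match their labels
```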
We will be using the Trainer class, which provides an API for feature-complete training in PyTorch.
Before instantiating a Trainer, we need to create a TrainingArguments to access all the points of customization during training:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch"
)
Creating and training a Trainer:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dds["train"],
    eval_dataset=dds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
trainer.train()
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | No log | 0.069008 | 0.976203 |
2 | 0.126100 | 0.073128 | 0.981376 |
TrainOutput(global_step=726, training_loss=0.09943613467466075, metrics={'train_runtime': 160.3661, 'train_samples_per_second': 72.31, 'train_steps_per_second': 4.527, 'total_flos': 1083160046271312.0, 'train_loss': 0.09943613467466075, 'epoch': 2.0})
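The global_step value in the output can be checked by hand. This sketch assumes a single device and no gradient accumulation, so one optimizer step corresponds to one batch of 16 examples.

```python
import math

n_train = 5798     # training rows, from the split above
batch_size = 16    # per_device_train_batch_size
epochs = 2         # num_train_epochs

# Assumption: a single device and no gradient accumulation.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

As far as I understand, the Trainer logs the training loss every 500 steps by default. That would explain why the first epoch, which ends before step 500, shows "No log" in the Training Loss column.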
The accuracy is not bad, but I wanted to see if I could tweak the hyperparameters to improve the performance of our model. To make creating trainers easier, I defined a get_trainer function:
def get_trainer(model_name, data_collator=None, lr=2e-5, bs=16, epochs=3, train=dds["train"], test=dds["test"]):
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=lr,
        per_device_train_batch_size=bs,
        per_device_eval_batch_size=bs,
        num_train_epochs=epochs,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        fp16=True,
    )
    return Trainer(
        model=model,
        args=training_args,
        train_dataset=train,
        eval_dataset=test,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )
Since the whole dataset takes a while to train, I selected a smaller subset of the dataset for testing purposes:
small_train = dds["train"].shuffle(seed=42).select(range(1000))
small_eval = dds["test"].shuffle(seed=42).select(range(1000))
In case our GPU runs out of memory, we can empty the cache:
import torch, gc
gc.collect()
torch.cuda.empty_cache()
for i in range(-5, -3):
    lr = 10**i
    trainer = get_trainer(model_name, lr=lr, bs=32, epochs=2, train=small_train, test=small_eval)
    print(lr)
    trainer.train()
    gc.collect()
    torch.cuda.empty_cache()
1e-05
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | No log | 0.459894 | 0.856000 |
2 | No log | 0.383288 | 0.870000 |
0.0001
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | No log | 0.106224 | 0.964000 |
2 | No log | 0.087845 | 0.972000 |
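Picking the winner from a sweep like this can be done in a couple of lines. The sketch below copies the final-epoch accuracies from the two tables above into a dictionary by hand; the `results` name is just for illustration, and with only two learning rates and a 1000-example evaluation subset the comparison is a rough guide rather than a careful search.

```python
# Final-epoch validation accuracy for each learning rate, copied from the tables above.
results = {1e-5: 0.870, 1e-4: 0.972}

# Pick the learning rate with the highest accuracy.
best_lr = max(results, key=results.get)
print(best_lr, results[best_lr])
```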
After many trials, I concluded that other models do not provide a performance benefit significant enough to justify the extra time they take to train. Also, the model seems to perform better when our data is padded using data_collator than when it is not.
The best learning rate for a batch size of 32 seems to be 1e-4, so I trained with the whole training dataset using these hyperparameters:
trainer = get_trainer(model_name, lr=1e-4, bs=32, epochs=2, data_collator=data_collator)
trainer.train()
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | No log | 0.070679 | 0.978272 |
2 | No log | 0.056363 | 0.984480 |
TrainOutput(global_step=364, training_loss=0.09329656454233023, metrics={'train_runtime': 63.2764, 'train_samples_per_second': 183.26, 'train_steps_per_second': 5.753, 'total_flos': 1309037659010832.0, 'train_loss': 0.09329656454233023, 'epoch': 2.0})
trainer.evaluate()
{'eval_loss': 0.05636342242360115,
'eval_accuracy': 0.9844800827728919,
'eval_runtime': 3.1079,
'eval_samples_per_second': 621.973,
'eval_steps_per_second': 19.628,
'epoch': 2.0}
Accuracy of 98.4%! We will save the model using save_model to use in our demo:
trainer.save_model("./model")
Hugging Face pipelines simplify inference. The code block below uses the model that we trained above to determine whether “Today is a great day!” and “I have no motivation to do anything. I feel useless.” show signs of depression (or more accurately, how similar they are to posts written in r/depression).
from transformers import pipeline
examples = ["Today is a great day!", "I have no motivation to do anything. I feel useless."]
pipe = pipeline("text-classification", model="./model", tokenizer=tokenizer)
pipe(examples)
[{'label': 'LABEL_0', 'score': 0.9913545250892639},
{'label': 'LABEL_1', 'score': 0.9631195068359375}]
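As far as I understand, the score a text-classification pipeline reports for a single-label model like this one is a softmax probability computed from the model's raw logits. The sketch below shows that conversion in plain Python; the `softmax` helper and the logits are made up for illustration and are not taken from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the two classes (LABEL_0, LABEL_1).
probs = softmax([-1.5, 2.0])
label = f"LABEL_{probs.index(max(probs))}"
print(label, max(probs))
```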
In our case, ‘LABEL_1’ means is_depression is 1 and ‘LABEL_0’ means otherwise. I will convert these labels to True and False, then convert the output of the pipe to a {label: score} dictionary, since that is what Gradio requires.
def is_depression(txt):
    pred_dict = pipe(txt)
    for d in pred_dict:
        d['label'] = False if d['label'] == 'LABEL_0' else True
    return [{item['label']: item['score']} for item in pred_dict]
is_depression(["The tacos I ate today were horrible", "I am losing interest in things I had enjoyed. I hate life"])
[{False: 0.9979074001312256}, {True: 0.9663745164871216}]
Our Gradio interface takes one text input and shows one output (as far as I am aware), so our function shouldn’t return a list:
def predict_gradio(txt):
    return is_depression(txt)[0]
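A possible variant, sketched here under assumptions: a label display is often easier to read when it receives string class names and a confidence for every class. The `to_label_dict` helper and the class names "depression" and "not depression" are hypothetical. The sketch relies on the classifier being binary, so the other class gets 1 minus the reported score.

```python
def to_label_dict(prediction):
    """Turn one pipeline-style prediction into confidences for both classes.

    Assumes a binary classifier, so the other class gets 1 - score.
    """
    score = prediction["score"]
    if prediction["label"] == "LABEL_1":
        return {"depression": score, "not depression": 1 - score}
    return {"depression": 1 - score, "not depression": score}

# Example shaped like the pipeline output shown earlier (the score is made up).
example_output = to_label_dict({"label": "LABEL_1", "score": 0.75})
print(example_output)
```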
Finally, we can create our interface:
import gradio as gr
title = "Depression Classifier"
description = "An NLP classifier trained with Hugging Face Transformers."
interpretation = 'default'
enable_queue = True
# Note: `article` is not defined anywhere in this post; define it before running or drop the argument.
# `examples` is the list of sample sentences created for the pipeline above.
gr.Interface(fn=predict_gradio, inputs=gr.inputs.Textbox(label="Text"), outputs=gr.outputs.Label(label="is_depression"), title=title, description=description, article=article, examples=examples, interpretation=interpretation, enable_queue=enable_queue).launch(share=True)
I built a web application hosted here using this Gradio API! On how to do this, refer to this article.