SetFit: Few-Shot Learning for YouTube Comments

Tushar Tiwari · Published in Geek Culture · 4 min read · Dec 22, 2022

Problem Statement:

Most ML done today relies on labeled data. In the real world, labeled data is rarely sitting around waiting for a data scientist to pick it up and start building models.

According to a report, “The global data collection and labeling market size was valued at USD 1.67 billion in 2021 and is expected to expand at a compound annual growth rate (CAGR) of 25.1% from 2022 to 2030.”

There is a notion, even among data scientists, that good performance requires massive amounts of data. This is where techniques like few-shot learning come to the rescue.

What is Few-shot learning?

It is a form of supervised ML where the model is given only a few labeled examples and trained on them to make inferences (predictions) on unseen examples.

The goal is to reduce the amount of training data required to achieve similar performance on a given task.

What is SetFit?

SetFit (Sentence Transformer Fine-tuning) is an efficient framework for few-shot fine-tuning of Sentence Transformers.

Training is divided into a two-stage process:

  1. Sentence transformer fine-tuning: the sentence transformer is fine-tuned in a Siamese manner on sentence pairs, where the goal is to maximize the distance between embeddings of semantically different sentences and minimize the distance between semantically similar ones (a sketch of the pair generation is shown below the figure).
  2. Classification head training: the rich text embeddings, together with the class labels, form the training set for the classification head. A logistic regression model is used as the classifier. (In the future, I think this could be any classification model.)
[Figure: the two-stage training process]
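As a rough sketch of what stage 1 trains on, contrastive sentence pairs might be generated like this (illustrative only; the actual sampling logic lives inside the setfit library, where num_iterations controls how many pairs are drawn per example):

import random

def generate_pairs(texts, labels, num_iterations=20):
    """Sample a positive (same-label) and a negative (different-label)
    partner for every labeled sentence; assumes >= 2 examples per class."""
    pairs = []
    for _ in range(num_iterations):
        for text, label in zip(texts, labels):
            positives = [t for t, l in zip(texts, labels) if l == label and t != text]
            negatives = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, random.choice(positives), 1.0))  # similar pair: target cosine ~ 1
            pairs.append((text, random.choice(negatives), 0.0))  # dissimilar pair: target cosine ~ 0
    return pairs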

For more detail on how this works, and for benchmark results, refer to the Hugging Face blog post and the SetFit paper (see References).

Just by replacing the base sentence transformer with a multilingual one, SetFit gives good results on multilingual data as well.

At inference:

  1. The unseen sample is passed to the sentence transformer (fine-tuned on the few training examples) to generate a dense embedding.
  2. This rich text embedding is passed to the classification head, which returns a class label.
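In code, the two steps look roughly like this for a trained SetFitModel (a sketch assuming, as in the setfit implementation, that the fine-tuned sentence transformer is exposed as model_body and the classifier as model_head):

# The two steps spelled out; equivalent to calling model([...]) directly.
embeddings = model.model_body.encode(["Is there a part 2 of this tutorial?"])  # 1. dense embedding
labels = model.model_head.predict(embeddings)                                  # 2. class label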

Dataset Source:

The dataset was collected using the YouTube API. It contains YouTube comments from data-science-related channels and is available as a Kaggle dataset.
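As a rough sketch, collecting comments with the YouTube Data API v3 might look like this (the API key and video ID are hypothetical placeholders; requires google-api-python-client):

from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')  # hypothetical key
response = youtube.commentThreads().list(
    part='snippet',
    videoId='SOME_VIDEO_ID',  # hypothetical video from a data-science channel
    maxResults=100,
    textFormat='plainText',
).execute()

comments = [
    item['snippet']['topLevelComment']['snippet']['textOriginal']
    for item in response['items']
]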

Note: If running on Kaggle, first uninstall TensorFlow, as there is an issue with the pre-installed version.

%pip uninstall tensorflow -y

Jump to the section “Inference with the Model” if you wish to see the results in action directly.

Install SetFit

%pip install setfit

Let’s jump into the code example.

Problem Statement:

Imagine yourself as an educator teaching on YouTube. You want to see the questions coming from students, but the comments section is a mess. Build a system that filters the questions out of the clutter of comments.

Formally: given a comment, predict whether the text is a question or not.

Load the dataset into a pandas DataFrame and filter out the comment field.

import pandas as pd

comments_df = pd.read_csv('/kaggle/input/youtube-data-science-channels-comments/Coreyms_comments.csv', engine='python')
comments_text_df = pd.DataFrame(comments_df['snippet_topLevelComment_snippet_textOriginal'])

Have a look at the sample comments. Run the cell below to see a few samples from each class (question and not_question).

comments_text_df.sample(5)
[Figure: sample comments]

Data Curation

I have collected a sample of 16 texts for each class. Remember, we don't have labeled training data here.
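For illustration, the two lists might look like this (made-up comments; the actual notebook uses 16 hand-picked comments per class):

# Hypothetical examples, not the actual curated comments.
questions = [
    'How do I install this on Windows?',
    'Why does my for loop never stop?',
]
not_questions = [
    'Great tutorial, thanks for sharing!',
    'This series helped me so much.',
]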

# questions is a list of comments that are questions.
# not_questions is a list of comments that are not questions.
df = pd.DataFrame()
df['text'] = questions
df['label'] = True

df1 = pd.DataFrame()
df1['text'] = not_questions
df1['label'] = False

combined_comments_df = pd.concat([df, df1])
combined_comments_df = combined_comments_df.reset_index(drop=True)

Convert the pandas DataFrame to a Hugging Face Dataset.

from datasets import Dataset

dataset = Dataset.from_pandas(combined_comments_df)
dataset

Then load the base model:

from setfit import SetFitModel

model_id = 'sentence-transformers/all-MiniLM-L6-v2'
model = SetFitModel.from_pretrained(model_id)

The model_id can be changed to any of the pre-trained models from sentence-transformers.
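For example, to handle non-English comments, a multilingual model from the same library could be swapped in (the model choice here is just an illustration):

# Swap in a multilingual sentence transformer (illustrative choice).
model = SetFitModel.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')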

Training SetFit

Let’s define our SetFitTrainer.

from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    num_epochs=10,
    column_mapping={"text": "text", "label": "label"},
)
trainer.train()

I have used the same dataset as the eval_dataset here; the trainer can also be defined without an eval_dataset.

Note: It is better to have a separate eval_dataset, which means labeling a few more data points.
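A minimal sketch of such a split, using the datasets library's built-in train_test_split (the split fraction here is arbitrary):

# Hold out a small labeled eval set, at the cost of fewer training examples.
splits = dataset.train_test_split(test_size=0.25, seed=42)
train_ds, eval_ds = splits['train'], splits['test']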

That’s it: we have trained a classifier with only 16 examples per class.

If you wish, you can push your model to the Hub and share it with others.
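For example, with the trainer's push_to_hub helper (the repository name is illustrative, and you need to be logged in via huggingface-cli login):

# Push the fine-tuned model to the Hugging Face Hub.
trainer.push_to_hub('setfit_youtube_comments_is_a_question')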

Inference with the Model

Serving a SetFit model is not much different from serving any other Hugging Face model.

The following shows two examples from each class.

from setfit import SetFitModel

model = SetFitModel.from_pretrained("tushifire/setfit_youtube_comments_is_a_question")

# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
preds
# array([False, False])

preds = model(["""what video do I watch that takes the html_output and insert it into the actual html page?""",
"Why does for loop end without a break statement"])
preds
# array([ True, True])

The results are quite good.

We can improve the performance of the classifier by:

  1. Using a larger pre-trained model
  2. Doing a hyperparameter search (sketched below)
  3. Finding a pre-trained model from the same domain as the comments, e.g. social media or education
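SetFit's trainer supports an Optuna-backed hyperparameter_search, mirroring the transformers Trainer API. A rough sketch of option 2 (the search space and trial count are illustrative, and optuna must be installed):

from setfit import SetFitModel, SetFitTrainer

def model_init(params=None):
    # A fresh model for every trial.
    return SetFitModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def hp_space(trial):
    # Illustrative search space.
    return {
        'learning_rate': trial.suggest_float('learning_rate', 1e-6, 1e-4, log=True),
        'num_epochs': trial.suggest_int('num_epochs', 1, 5),
        'num_iterations': trial.suggest_categorical('num_iterations', [10, 20, 40]),
    }

trainer = SetFitTrainer(
    model_init=model_init,
    train_dataset=train_ds,  # from the split shown earlier
    eval_dataset=eval_ds,
)
best_run = trainer.hyperparameter_search(direction='maximize', hp_space=hp_space, n_trials=10)
trainer.apply_hyperparameters(best_run.hyperparameters, final_model=True)
trainer.train()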

Have a look at the complete code in the Kaggle notebook.

References

  1. Data collection and labeling market report: https://www.grandviewresearch.com/industry-analysis/data-collection-labeling-market
  2. Hugging Face blog post on SetFit: https://huggingface.co/blog/setfit
  3. Pre-trained sentence-transformers models: https://www.sbert.net/docs/pretrained_models.html
  4. SetFit GitHub repo: https://github.com/huggingface/setfit
  5. SetFit paper, “Efficient Few-Shot Learning Without Prompts”: https://arxiv.org/abs/2209.11055
