Clickbait Headline Detection

01 — Introduction

What is Clickbait?

Clickbait refers to headlines engineered to provoke curiosity and compel clicks — often through sensationalism, vague promises, or emotional manipulation — while the actual content fails to deliver on the implied value.

"You Won't Believe What This Celebrity Did Next!" is the archetypal clickbait — designed to trigger, not to inform.

Detecting clickbait automatically is critical for:

→ Improving news feed quality and user trust
→ Fighting misinformation and low-quality journalism
→ Training media literacy at scale via AI tools

02 — Objective

Project Goals

01 Build a binary text classifier that identifies news headlines as clickbait or non-clickbait with high accuracy using classical ML techniques.
02 Compare the performance of Logistic Regression, Naive Bayes, and SVM on TF-IDF-vectorised headline features.
03 Evaluate each model using Accuracy, Precision, Recall, and F1-score, then validate on real-world custom headlines.

03 — Dataset

Kaggle Clickbait Dataset

Source: Kaggle
Total Headlines: 32,000
Clickbait (label=1): 15,999
Non-Clickbait (label=0): 16,001
Train / Test Split: 80% / 20%
Features: headline, label
Dataset URL: kaggle.com/datasets/amananandrai/clickbait-dataset

The dataset is almost perfectly balanced (15,999 clickbait vs 16,001 non-clickbait headlines), so class imbalance does not bias training and plain accuracy is a meaningful metric. Headlines were sourced from BuzzFeed (clickbait) and mainstream news outlets (non-clickbait).

04 — Methodology

NLP Pipeline

STEP 01

📥

Data Ingestion

Download dataset via Kaggle API; load into Pandas DataFrame and inspect class balance.

Kaggle · Pandas
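
A minimal sketch of the class-balance check, assuming the CSV exposes the `headline` and `label` columns listed in the Dataset section. The inline sample rows are invented stand-ins for the downloaded file, and only the standard library is used so the sketch runs without the dataset present.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for the downloaded CSV (assumption:
# the columns are named `headline` and `label`, as listed under Dataset).
SAMPLE_CSV = """headline,label
You won't believe what this celebrity did next!,1
Government announces new budget for 2025,0
10 shocking things doctors don't want you to know,1
Scientists discover new Alzheimer's treatment,0
"""


def class_balance(csv_file):
    """Return a Counter mapping each label to its number of headlines."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["label"]) for row in reader)


counts = class_balance(io.StringIO(SAMPLE_CSV))
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print(f"label={label}: {n} headlines ({n / total:.1%})")
```

With Pandas, the same tally would come from `pd.read_csv(path)["label"].value_counts()`, where `path` is wherever the Kaggle download was saved.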

STEP 02

🧹

Text Cleaning

Lowercase, strip URLs and punctuation, remove extra whitespace using regex.

re · string
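
The cleaning step could look roughly like the sketch below. The function name `clean_headline` and the exact regular expressions are assumptions for illustration; the original only states that lowercasing, URL and punctuation stripping, and whitespace normalisation are applied.

```python
import re
import string

# Assumed patterns: the source does not give the exact regexes used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_RE = re.compile(r"\s+")
# string.punctuation covers ASCII punctuation only (not curly quotes).
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_headline(text: str) -> str:
    """Lowercase, drop URLs and ASCII punctuation, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs before punctuation
    text = text.translate(PUNCT_TABLE)    # strip punctuation characters
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_headline("You WON'T Believe   THIS! https://example.com/a?b=1"))
```

URLs are removed before punctuation on purpose: stripping punctuation first would break the `://` and `.` that the URL pattern relies on.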

STEP 03

✂️

Train/Test Split

Stratified 80/20 split ensures label proportions are preserved in both sets.

sklearn
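
To illustrate what stratification does, here is a from-scratch sketch that splits each class separately. The helper `stratified_split` and the seed are hypothetical; in scikit-learn the equivalent is `train_test_split(..., test_size=0.2, stratify=labels)`. The synthetic label list only mirrors the class counts reported in the Dataset section.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices), sampling each class separately."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # per-class test size
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with the same class counts as the dataset.
labels = [1] * 15_999 + [0] * 16_001
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because each class contributes about 20% of its own headlines, the test set keeps the same near 50/50 label ratio as the full dataset.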

STEP 04

🔢

TF-IDF Vectorisation

Top 5,000 unigram+bigram features weighted by term frequency–inverse document frequency.

TfidfVectorizer
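
A possible configuration for this step, assuming scikit-learn's `TfidfVectorizer` with all other parameters left at their defaults (the source only specifies 5,000 features and unigrams plus bigrams). The toy headlines are invented and already cleaned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy, already-cleaned headlines (invented for illustration).
train_headlines = [
    "you wont believe what this celebrity did next",
    "10 shocking things doctors dont want you to know",
    "government announces new budget for 2025",
    "scientists discover new alzheimers treatment",
]

# max_features=5000 keeps the 5,000 most frequent terms in the corpus;
# ngram_range=(1, 2) extracts both unigrams and bigrams.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

# Fit on the training split only so the test vocabulary never leaks in.
X_train = vectorizer.fit_transform(train_headlines)
X_new = vectorizer.transform(["you wont believe this budget"])

print(X_train.shape)                             # (n_headlines, n_features)
print("wont believe" in vectorizer.vocabulary_)  # bigram feature lookup
```

Unseen headlines go through `transform`, never `fit_transform`, so they are mapped onto the vocabulary learned from the training data.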

STEP 05

🤖

Model Training

Logistic Regression and Naive Bayes trained; SVM included for comparison.

LR · NB · SVM
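
One way to train the three classifiers on identical features is sketched below. The specific estimator classes (`LogisticRegression`, `MultinomialNB`, `LinearSVC`), the `max_iter` value, and the toy data are assumptions; the source names the model families but not the scikit-learn classes or hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, already-cleaned training data (invented for illustration).
headlines = [
    "you wont believe what this celebrity did next",
    "10 shocking things doctors dont want you to know",
    "this one trick will change your life",
    "what happened next will shock you",
    "government announces new budget for 2025",
    "scientists discover new alzheimers treatment",
    "central bank raises interest rates",
    "parliament passes new data protection law",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Assumed estimator choices; hyperparameters are not given in the source.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM (Linear)": LinearSVC(),
}

fitted = {}
for name, clf in models.items():
    # Each model gets its own vectoriser with the same settings.
    pipe = make_pipeline(
        TfidfVectorizer(max_features=5000, ngram_range=(1, 2)), clf
    )
    pipe.fit(headlines, labels)
    fitted[name] = pipe

for name, pipe in fitted.items():
    print(name, pipe.predict(["you wont believe these shocking photos"]))
```

Wrapping the vectoriser and classifier in one pipeline keeps the comparison fair, since every model sees features built with the same settings.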

STEP 06

📊

Evaluation

Accuracy, Precision, Recall, F1-score and Confusion Matrix compared across models.

sklearn.metrics
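
The four metrics all derive from the confusion matrix. The sketch below computes them by hand on invented predictions, treating clickbait (label 1) as the positive class; `sklearn.metrics` offers `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, and `confusion_matrix` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1   # predicted clickbait, actually not
        elif truth == positive:
            fn += 1   # missed clickbait
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels and predictions for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred), metrics)
```

Precision penalises false positives (legitimate news flagged as clickbait), while recall penalises false negatives (clickbait that slips through).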

05 — Models

Classifiers Used

Logistic Regression

Learns one weight per TF-IDF feature. Terms such as "shocking" and the unigrams and bigrams that make up "you won't believe" receive high positive weights, which makes the model directly interpretable. Outputs probability scores, not just binary labels.

Best for interpretability & probability outputs
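
How a weight vector turns into a probability can be sketched as follows. The weights, bias, and TF-IDF values here are hypothetical numbers chosen for illustration; they are not taken from the trained model.

```python
import math

# Hypothetical weights and bias, NOT the values learned in this project.
WEIGHTS = {
    "shocking": 2.1,
    "wont believe": 2.8,
    "budget": -1.9,
    "announces": -1.4,
}
BIAS = -0.2


def clickbait_probability(features):
    """Sigmoid of bias + sum(weight * tf-idf value) over present features."""
    z = BIAS + sum(
        WEIGHTS.get(term, 0.0) * value for term, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Feature dicts map a term to its (made-up) TF-IDF value in one headline.
p_bait = clickbait_probability({"shocking": 0.6, "wont believe": 0.5})
p_news = clickbait_probability({"budget": 0.7, "announces": 0.6})
print(f"{p_bait:.3f} {p_news:.3f}")
```

Positive weights push the score toward clickbait and negative weights push it away, so sorting features by weight shows which terms the model relies on.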

Naive Bayes

Assumes feature independence; computes P(clickbait | words). Highly efficient on sparse TF-IDF matrices. Achieved slightly higher overall accuracy.
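
The independence assumption means the class score is just a log prior plus a sum of per-word log likelihoods. Below is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing; it uses raw word counts rather than TF-IDF values to keep the arithmetic visible, and the training headlines are invented.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc.split()}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())

    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for c in class_docs:
        model["log_prior"][c] = math.log(class_docs[c] / len(docs))
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_likelihood"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][c]
        # Words outside the training vocabulary are ignored.
        scores[c] = prior + sum(
            likelihood[w] for w in doc.split() if w in model["vocab"]
        )
    return max(scores, key=scores.get)


# Invented, already-cleaned training headlines.
docs = [
    "you wont believe this trick",
    "shocking things you need to know",
    "government announces new budget",
    "scientists discover new treatment",
]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "you wont believe this budget"))
print(predict_nb(model, "government announces new treatment"))
```

Smoothing (`alpha`) keeps a word that never appeared in one class from driving that class's probability to zero.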

SVM (Linear)

Finds the maximum-margin hyperplane in high-dimensional TF-IDF space. Robust to overfitting on sparse text; a strong baseline comparator.

06 — Results & Evaluation

Model Performance

Metric	Logistic Regression	Naive Bayes ▲ Best
Accuracy	96.89%	97.11%
Precision	97.38%	96.43%
Recall	96.37%	97.84%
F1-Score	96.87%	97.13%
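
As a consistency check on the table, F1 can be recomputed from the reported precision and recall as their harmonic mean. The sketch below uses only the figures in the table; the helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    return 2 * precision * recall / (precision + recall)


# Figures copied from the results table, in percent.
reported = {
    "Logistic Regression": {"precision": 97.38, "recall": 96.37, "f1": 96.87},
    "Naive Bayes": {"precision": 96.43, "recall": 97.84, "f1": 97.13},
}

for model_name, m in reported.items():
    recomputed = f1_from(m["precision"], m["recall"])
    print(f"{model_name}: reported F1 {m['f1']:.2f}, "
          f"recomputed {recomputed:.2f}")
```

Both recomputed values agree with the reported F1 scores to two decimal places. The table also shows the trade-off between the models: Logistic Regression has the higher precision, Naive Bayes the higher recall.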

Live Predictions (LR Model)

NOT CLICKBAIT 85.4% Scientists discover new Alzheimer's treatment
CLICKBAIT 92.6% You won't believe what this celebrity did next!
NOT CLICKBAIT 96.9% Government announces new budget for 2025
CLICKBAIT 97.4% 10 shocking things doctors don't want you to know
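
The lines above can be produced from a model's clickbait probability roughly as follows. This sketch assumes the displayed percentage is the confidence of the predicted class and that the decision threshold is 0.5; neither is stated in the source, and `format_prediction` is a hypothetical helper.

```python
def format_prediction(headline, p_clickbait, threshold=0.5):
    """Format a prediction as 'LABEL confidence% headline'."""
    if p_clickbait >= threshold:
        label, confidence = "CLICKBAIT", p_clickbait
    else:
        # Confidence in the negative class is the complement.
        label, confidence = "NOT CLICKBAIT", 1.0 - p_clickbait
    return f"{label} {confidence:.1%} {headline}"


# Probabilities chosen to match the displayed confidences under the
# assumption above; in practice they would come from predict_proba.
print(format_prediction("You won't believe what this celebrity did next!", 0.926))
print(format_prediction("Scientists discover new Alzheimer's treatment", 0.146))
```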

07 — Conclusion

Key Takeaways

🎯

High-Accuracy Results

Both models exceeded 96% accuracy on 6,400 test headlines, validating that TF-IDF + classical ML is highly effective for headline-level clickbait detection.

🏆

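Naive Bayes Wins Narrowly

Naive Bayes (97.11%) edged Logistic Regression (96.89%) on accuracy, recall, and F1-score, while Logistic Regression kept the higher precision (97.38% vs 96.43%). The higher recall of Naive Bayes means fewer false negatives, making it more reliable at catching real clickbait.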