Detecting Fake News in Telugu With BERT

title: "Detecting Fake News in Telugu With BERT"
date: "2024-10-01"
description: "How I built a fake news detection system for Telugu using BERTbase and transfer learning in Python, achieving 88% Precision and 85% Accuracy on a custom-scraped dataset."
tags: ["ml", "nlp", "bert", "python"]
published: true
image: ""

Telugu is spoken by roughly 80 million people, yet NLP research and tooling for it lag far behind English and even Hindi. Fake news in regional languages is a real problem, and very little work has been done on detecting it at scale.

This project was my attempt to change that, even in a small way.

The problem

Fake news detection for English is a well-studied problem. You can find pre-trained classifiers, large labeled datasets, and established benchmarks. For Telugu? Almost nothing.

The core challenges:

No labeled dataset: I had to build one from scratch
Morphological complexity: Telugu is agglutinative and heavily inflected, so a single word can carry several suffixes, which makes tokenization harder
Code-switching: Telugu news often mixes in English words, especially technical terms
Limited compute: I was training on free Colab GPUs

Dataset creation

I scraped ~2,400 news articles from Telugu news websites, split across two categories:

Real news: Articles from established Telugu newspapers (Eenadu, Sakshi)
Fake news: Articles from known misinformation pages flagged by fact-checkers

Labeling was mostly automatic (based on source credibility) with manual review of edge cases.
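
A minimal sketch of what source-based labeling with a manual-review queue could look like. The domain sets and the `label_by_source` helper are assumptions for illustration: the two real-news domains stand in for the newspapers named above, and the flagged domain is a placeholder, not an actual site from my list.

```python
from urllib.parse import urlparse

# Hypothetical source lists. The flagged domain is a placeholder.
REAL_SOURCES = {"eenadu.net", "sakshi.com"}
FLAGGED_SOURCES = {"flagged-example.test"}


def label_by_source(url):
    """Return (label, needs_review) for an article URL.

    label is 0 for real, 1 for fake, or None when the source is unknown.
    """
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host in REAL_SOURCES:
        return 0, False
    if host in FLAGGED_SOURCES:
        return 1, False
    # Unknown source: leave unlabeled and queue it for manual review.
    return None, True


print(label_by_source("https://www.eenadu.net/some-article"))
print(label_by_source("https://unknown.example/post"))
```

Anything that comes back with `needs_review=True` goes to a human instead of getting a guessed label.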

import pandas as pd
from sklearn.model_selection import train_test_split  # for the train/test split

df = pd.read_csv("telugu_news.csv")

# Labels are encoded as 0 = real, 1 = fake.
counts = df["label"].value_counts()
print(f"Total articles: {len(df)}")
print(f"Real: {counts.loc[0]}")
print(f"Fake: {counts.loc[1]}")

# Output:
# Total articles: 2412
# Real: 1289
# Fake: 1123
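
With classes this close to balanced, a stratified split keeps the real/fake ratio the same in the training and test sets. The sketch below uses only the standard library to show the idea; the `stratified_split` helper, the 0.2 test fraction, and the seed are assumptions for illustration, and scikit-learn's `train_test_split` with its `stratify` argument does the same job in one call.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Class counts taken from the output above: 1289 real (0), 1123 fake (1).
labels = [0] * 1289 + [1] * 1123
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```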

Why multilingual BERT?

I chose bert-base-multilingual-cased (mBERT) from Hugging Face. The reasons:

Pre-trained on Telugu: mBERT was trained on 104 languages including Telugu, so it has already seen Telugu text and handles its script and morphology to some degree out of the box
No language-specific preprocessing: Unlike traditional ML approaches, I don't need to build a custom tokenizer or stemmer, because the model's WordPiece tokenizer works on raw text
Transfer learning: Even with a small dataset, fine-tuning a pre-trained model gives much better results than training from scratch

The alternative was xlm-roberta-base, which generally outperforms mBERT on low-resource languages. I tested both: XLM-R was ~2% better but required more compute.

Fine-tuning approach

from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
import torch
 
MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
)
 
def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=256,
    )
 
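training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # newer transformers releases call this eval_strategy
    save_strategy="epoch",  # must match the evaluation strategy to load the best model
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # needs a compute_metrics function that returns an "f1" key
)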

I used max_length=256 instead of 512 to fit within Colab's memory constraints. Since most Telugu news articles are around 150-200 tokens anyway, this didn't hurt performance significantly.
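
One way to sanity-check a `max_length` choice is to measure how many articles would actually be truncated. This is a sketch under stated assumptions: `truncation_rate` is a hypothetical helper, and the token counts are made-up numbers, not measurements from my dataset. Real counts would come from something like `len(tokenizer(text, add_special_tokens=False)["input_ids"])`.

```python
def truncation_rate(token_counts, max_length, special_tokens=2):
    """Fraction of documents that lose tokens at the given max_length.

    special_tokens reserves room for [CLS] and [SEP], which BERT adds
    to every sequence.
    """
    budget = max_length - special_tokens
    truncated = sum(1 for n in token_counts if n > budget)
    return truncated / len(token_counts)


# Made-up token counts for illustration only.
sample_counts = [150, 180, 200, 240, 254, 255, 300, 520]

for limit in (128, 256, 512):
    print(limit, truncation_rate(sample_counts, limit))
```

If the rate at 256 is small, the memory saved over 512 is close to free. If it is large, the model only ever sees the opening of most articles.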

Results

After 4 epochs of fine-tuning on 80% of the dataset, evaluated on the remaining 20%:

Metric	Score
Accuracy	85%
Precision	88%
Recall	83%
F1 Score	85.4%

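In the per-class report further down, recall for the "fake" class (0.89) is higher than its precision (0.84), which means the model slightly over-predicts "fake". It errs on the side of caution, which is arguably the right behavior for a content moderation use case. The summary table above shows the opposite ordering (precision 88%, recall 83%), so that claim rests on the per-class report, not on the table.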
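
For reference, here is a sketch of how these scores can be computed, written in the shape the Trainer expects for `compute_metrics`. It assumes "fake" (label 1) is the positive class, which the training snippet does not state explicitly, and it uses plain Python instead of a metrics library. `binary_metrics` and the example logits are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def compute_metrics(eval_pred):
    """Callback shape used by Trainer: (logits, labels) -> dict of floats."""
    logits, labels = eval_pred
    # Argmax over the two class logits for each example.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    return binary_metrics(list(labels), preds)


# Tiny hand-made example: four articles, label 1 = fake.
example_logits = [[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.1, 0.4]]
example_labels = [0, 1, 0, 1]
print(compute_metrics((example_logits, example_labels)))
```

Passing `compute_metrics=compute_metrics` to `Trainer` is what makes `metric_for_best_model="f1"` resolvable. As a check on the table, plugging its precision and recall into the F1 formula gives 2 × 0.88 × 0.83 / (0.88 + 0.83) ≈ 0.854, which matches the reported F1.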

Error analysis

The most common failure modes:

Satire articles: The model sometimes classifies satirical news as fake (technically true, but not the intent)
Opinion pieces: Opinion columns on divisive topics confused the model
Breaking news: Real breaking news sometimes reads like misinformation before full context is established

# Per-class precision, recall, and F1 on the held-out test set
from sklearn.metrics import classification_report

# y_true: gold labels, y_pred: model predictions (0 = real, 1 = fake)
print(classification_report(y_true, y_pred, target_names=["Real", "Fake"]))

#               precision    recall  f1-score   support
#         Real       0.89      0.85      0.87       258
#         Fake       0.84      0.89      0.87       224
#     accuracy                           0.87       482
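
The report does not print the confusion matrix itself, but the counts can be approximated from it, since recall × support gives the number of correctly classified articles per class. This is a back-of-the-envelope sketch: the inputs are the rounded figures from the report, so each count may be off by one or two.

```python
# Per-class recall and support copied from the report above.
report = {
    "Real": {"recall": 0.85, "support": 258},
    "Fake": {"recall": 0.89, "support": 224},
}

# Recall is the fraction of a class that was predicted correctly, so
# recall * support approximates the diagonal of the confusion matrix.
real_correct = round(report["Real"]["recall"] * report["Real"]["support"])
fake_correct = round(report["Fake"]["recall"] * report["Fake"]["support"])
real_as_fake = report["Real"]["support"] - real_correct
fake_as_real = report["Fake"]["support"] - fake_correct

print("             pred Real  pred Fake")
print(f"true Real    {real_correct:9d}  {real_as_fake:9d}")
print(f"true Fake    {fake_as_real:9d}  {fake_correct:9d}")

total = report["Real"]["support"] + report["Fake"]["support"]
accuracy = (real_correct + fake_correct) / total
fake_precision = fake_correct / (fake_correct + real_as_fake)
print(f"accuracy ~ {accuracy:.2f}, Fake precision ~ {fake_precision:.2f}")
```

That works out to roughly 39 real articles flagged as fake and 25 fake articles that slipped through, and the recomputed accuracy and "fake" precision round to the 0.87 and 0.84 in the report.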

What I'd do better

If I were revisiting this project:

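Use XLM-RoBERTa: more robust on low-resource languages, and already about 2% better in my own comparison
Larger dataset: 2,400 articles is small by NLP standards. My rough guess is that even 10,000 would push F1 past 92%, but I have not tested that
Cross-source validation: My test set comes from the same sources as the training data, and the labels themselves come from source credibility, so the model may be learning each outlet's style, not the truthfulness of an article. Evaluating on sources it never saw during training would be more rigorous
Explainability: LIME or SHAP visualizations to show which parts of the article the model focused on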
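
The cross-source idea can be sketched as folds that hold out one real source and one fake source at a time, so the test set always contains both classes from outlets the model never trained on. The `cross_source_folds` helper and the source names `page_a` and `page_b` are hypothetical placeholders. scikit-learn's `GroupKFold` and `LeaveOneGroupOut` cover the more general grouped-split case.

```python
from itertools import product


def cross_source_folds(sources, labels):
    """Yield folds that hold out one real and one fake source at a time."""
    real_sources = sorted({s for s, y in zip(sources, labels) if y == 0})
    fake_sources = sorted({s for s, y in zip(sources, labels) if y == 1})
    for real_src, fake_src in product(real_sources, fake_sources):
        held_out = {real_src, fake_src}
        train = [i for i, s in enumerate(sources) if s not in held_out]
        test = [i for i, s in enumerate(sources) if s in held_out]
        yield (real_src, fake_src), train, test


# Hypothetical per-article sources and labels (0 = real, 1 = fake).
sources = ["eenadu", "sakshi", "page_a", "page_b", "eenadu", "page_a"]
labels = [0, 0, 1, 1, 0, 1]

for held_out, train_idx, test_idx in cross_source_folds(sources, labels):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

A large gap between in-source and held-out-source scores would suggest the model is recognizing outlets, not misinformation.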

The bigger picture

This was primarily a learning project, but it points to a real need. Regional language NLP in India is years behind English NLP. If you're an NLP researcher or enthusiast, low-resource Indian languages offer genuinely unsolved, high-impact problems.

The dataset and model weights are available on my GitHub if you want to build on this work.