Tags: ml, nlp, bert, python

Detecting Fake News in Telugu With BERT

How I built a fake news detection system for Telugu using BERT-base and transfer learning in Python, achieving 88% precision and 85% accuracy on a custom-scraped dataset.

Tharun Sai Putta
October 1, 2024


Telugu is spoken by roughly 80 million people, yet it remains underserved in NLP research and tooling compared to English or even Hindi. Fake news in regional languages is a real problem, and very little work has been done on detecting it at scale.

This project was my attempt to change that, even in a small way.

The problem

Fake news detection for English is a well-studied problem. You can find pre-trained classifiers, large labeled datasets, and established benchmarks. For Telugu? Almost nothing.

The core challenges:

  1. No labeled dataset: I had to build one from scratch
  2. Morphological complexity: Telugu is agglutinative and heavily inflected, so a single word can carry several suffixes, which makes tokenization harder
  3. Code-switching: Telugu news often mixes in English words, especially technical terms
  4. Limited compute: I was training on free Colab GPUs

Dataset creation

I scraped ~2,400 news articles from Telugu news websites, split across two categories:

  • Real news: Articles from established Telugu newspapers (Eenadu, Sakshi)
  • Fake news: Articles from known misinformation pages flagged by fact-checkers

Labeling was mostly automatic (based on source credibility) with manual review of edge cases.
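
A minimal sketch of what source-based labeling could look like. The domain lists, the function name, and the 0/1 label encoding are assumptions for illustration; the actual source lists used for the dataset are not shown in this post.

```python
from urllib.parse import urlparse

# Hypothetical source lists: these domains are placeholders for illustration,
# not the actual lists used to build the dataset.
CREDIBLE_DOMAINS = {"eenadu.net", "sakshi.com"}
FLAGGED_DOMAINS = {"flagged-site.example"}

REAL, FAKE = 0, 1


def label_from_source(url):
    """Return REAL or FAKE based on the article's domain, or None if unknown."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in CREDIBLE_DOMAINS:
        return REAL
    if host in FLAGGED_DOMAINS:
        return FAKE
    return None  # unknown source: route to manual review


urls = [
    "https://www.eenadu.net/some-article",
    "https://flagged-site.example/viral-claim",
    "https://unknown-blog.example/post",
]
labels = [label_from_source(u) for u in urls]
needs_review = [u for u, lab in zip(urls, labels) if lab is None]
```

Anything that comes back as None falls outside both lists, which is one way to decide which articles count as edge cases for manual review.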

import pandas as pd
from sklearn.model_selection import train_test_split  # used for the 80/20 split

# Assumes a "text" column and an integer "label" column (0 = real, 1 = fake)
df = pd.read_csv("telugu_news.csv")
counts = df["label"].value_counts()
print(f"Total articles: {len(df)}")
print(f"Real: {counts.loc[0]}")
print(f"Fake: {counts.loc[1]}")
 
# Output:
# Total articles: 2412
# Real: 1289
# Fake: 1123
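
With classes this close to balanced, a stratified split keeps the real/fake ratio the same in the training and test sets. The sketch below is a dependency-free illustration of the idea, not the exact split used in this project; the function name, the seed, and the 20% test fraction are assumptions. In practice, scikit-learn's train_test_split with its stratify argument does the same job in one call.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each label split at the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the same class counts as the dataset above (0 = real, 1 = fake).
labels = [0] * 1289 + [1] * 1123
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```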

Why multilingual BERT?

I chose bert-base-multilingual-cased (mBERT) from Hugging Face. The reasons:

  1. Pre-trained on Telugu: mBERT was trained on 104 languages including Telugu, so it has some understanding of Telugu morphology out of the box
  2. No language-specific preprocessing: unlike traditional ML approaches, mBERT doesn't require a custom tokenizer or stemmer
  3. Transfer learning: even with a small dataset, fine-tuning a pre-trained model gives much better results than training from scratch

The alternative was xlm-roberta-base, which generally outperforms mBERT on low-resource languages. I tested both: XLM-R scored about 2% higher but required more compute.

Fine-tuning approach

from transformers import (
    BertTokenizerFast,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
import torch
 
MODEL_NAME = "bert-base-multilingual-cased"
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
)
 
def tokenize_batch(batch):
    return tokenizer(
        batch["text"],
        padding="max_length",
        truncation=True,
        max_length=256,
    )
 
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
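    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",  # needs a compute_metrics function that returns an "f1" key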
)
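
Since metric_for_best_model is set to "f1", the Trainer needs a compute_metrics callback that returns a dictionary with an "f1" key. A minimal sketch follows, assuming label 1 ("fake") is the positive class. It is written in plain Python so it runs without extra dependencies; np.argmax(logits, axis=-1) is the usual shortcut. The dataset names in the trailing comment are hypothetical.

```python
def compute_metrics(eval_pred):
    """Accuracy, precision, recall, and F1 for the positive class (1 = fake)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(y) for y in labels]

    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy check with made-up logits: each row is [real_logit, fake_logit].
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.8]]
toy_labels = [0, 1, 1, 1]
metrics = compute_metrics((toy_logits, toy_labels))

# Wiring it in (train_ds and eval_ds are hypothetical tokenized datasets):
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   compute_metrics=compute_metrics)
```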

I used max_length=256 instead of 512 to fit within Colab's memory constraints. Since most Telugu news articles are around 150-200 tokens anyway, this didn't hurt performance significantly.
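
One way to check how much a 256-token cap actually discards is to look at the token-count distribution. The helper below is a sketch with made-up counts; the function name is hypothetical, and real counts would come from something like len(tokenizer(text)["input_ids"]) for each article.

```python
def truncation_stats(token_counts, max_length=256):
    """Return (share of articles over max_length, share of all tokens cut off)."""
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        return 0.0, 0.0
    truncated = sum(1 for n in token_counts if n > max_length)
    tokens_lost = sum(max(0, n - max_length) for n in token_counts)
    return truncated / len(token_counts), tokens_lost / total_tokens


# Made-up token counts for illustration only.
token_counts = [150, 180, 200, 120, 300, 512]
share_truncated, share_tokens_lost = truncation_stats(token_counts)
print(f"{share_truncated:.0%} of articles truncated, {share_tokens_lost:.0%} of tokens lost")
```

If the share of truncated articles is small, as the 150-200 token estimate suggests, the shorter limit mostly saves memory on padding rather than throwing away content.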

Results

After 4 epochs of fine-tuning on 80% of the dataset:

Metric      Score
Accuracy    85%
Precision   88%
Recall      83%
F1 Score    85.4%
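
F1 is the harmonic mean of precision and recall, so the F1 figure can be checked directly from the other two. A small sketch (the helper name is illustrative):

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f"{f1_from_pr(0.88, 0.83):.1%}")  # prints 85.4%
```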

In the per-class report under Error analysis, recall for the fake class (0.89) is higher than its precision (0.84), which means the model slightly over-predicts "fake". It errs on the side of caution, which is arguably the right behavior for a content moderation use case.

Error analysis

The most common failure modes:

  1. Satire articles: The model sometimes classifies satirical news as fake (technically true, but not the intent)
  2. Opinion pieces: Opinion columns on divisive topics confused the model
  3. Breaking news: Real breaking news sometimes reads like misinformation before full context is established
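
# Per-class precision, recall, and F1 on the held-out test set
# (y_true and y_pred are the test labels and the model's predictions)
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=["Real", "Fake"]))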
 
#               precision    recall  f1-score   support
#         Real       0.89      0.85      0.87       258
#         Fake       0.84      0.89      0.87       224
#     accuracy                           0.87       482
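
The report gives per-class rates; a confusion matrix shows the raw counts behind them. Here is a dependency-free sketch using toy labels, not the project's predictions. scikit-learn's confusion_matrix in sklearn.metrics returns the same counts as an array.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


# Toy labels for illustration only (0 = real, 1 = fake).
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)

names = ["Real", "Fake"]
print(f"{'':>10}{'pred Real':>12}{'pred Fake':>12}")
for true_label, name in enumerate(names):
    print(f"{name:>10}{counts[(true_label, 0)]:>12}{counts[(true_label, 1)]:>12}")
```

The off-diagonal cells are the ones worth reading by hand: real articles predicted as fake, and fake articles that slipped through as real.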

What I'd do better

If I were revisiting this project:

  1. Use XLM-RoBERTa: it is generally more robust on low-resource languages
  2. Larger dataset: 2,400 articles is small by NLP standards. My guess is that even 10,000 would push F1 past 92%, though I haven't tested that
  3. Cross-source validation: my test set comes from the same sources as the training data, and since labels were assigned by source, the model may partly be learning source style. A truly out-of-distribution evaluation would be more rigorous
  4. Explainability: LIME or SHAP visualizations to show which parts of an article the model focused on
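
For the cross-source idea, the split has to be made by source rather than by article, so that no outlet appears on both sides. A minimal sketch with hypothetical source names follows; scikit-learn's GroupShuffleSplit and GroupKFold offer the same kind of grouping for larger setups.

```python
def split_by_source(sources, held_out):
    """Return (train_idx, test_idx) so that no held-out source appears in training."""
    held_out = set(held_out)
    train_idx = [i for i, s in enumerate(sources) if s not in held_out]
    test_idx = [i for i, s in enumerate(sources) if s in held_out]
    return train_idx, test_idx


# Hypothetical sources and labels (0 = real, 1 = fake). Because labels follow the
# source, the held-out set needs at least one source of each class.
sources = ["paper_a", "paper_b", "page_x", "page_y", "paper_a", "page_x"]
labels = [0, 0, 1, 1, 0, 1]

train_idx, test_idx = split_by_source(sources, held_out={"paper_b", "page_y"})
test_labels = {labels[i] for i in test_idx}
assert test_labels == {0, 1}, "held-out sources should cover both classes"
```

A drop in F1 on a split like this, compared with a random split, would indicate how much the model relies on source-specific style rather than on the content of the article.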

The bigger picture

This was primarily a learning project, but it points to a real need. Regional language NLP in India is years behind English NLP. If you're an NLP researcher or enthusiast, low-resource Indian languages are genuinely unsolved and impactful problems.

The dataset and model weights are available on my GitHub if you want to build on this work.


Tharun Sai Putta

Product Engineer @ Protectt.ai

Building Android security SDKs, IDE plugins, and cross-platform tooling. IIITDM Kancheepuram CSE alumnus.