Detecting Fake News in Telugu With BERT
How I built a fake news detection system for Telugu by fine-tuning multilingual BERT (bert-base-multilingual-cased) in Python, achieving 88% precision and 85% accuracy on a custom-scraped dataset.
---
title: "Detecting Fake News in Telugu With BERT"
date: "2024-10-01"
description: "How I built a fake news detection system for Telugu using BERTbase and transfer learning in Python, achieving 88% Precision and 85% Accuracy on a custom-scraped dataset."
tags: ["ml", "nlp", "bert", "python"]
published: true
image: ""
---
Telugu is spoken by ~80 million people, but when it comes to NLP research and tools, it's largely underserved compared to English or even Hindi. Fake news in regional languages is a real problem — and there's very little work done on detecting it at scale.
This project was my attempt to change that, even in a small way.
The problem
Fake news detection for English is a well-studied problem. You can find pre-trained classifiers, large labeled datasets, and established benchmarks. For Telugu? Almost nothing.
The core challenges:
- No labeled dataset: I had to build one from scratch
- Morphological complexity: Telugu is agglutinative and heavily inflected, so a single word can carry several suffixes, which makes tokenization harder
- Code-switching: Telugu news often mixes in English words, especially technical terms
- Limited compute: I was training on free Colab GPUs
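To get a rough feel for how much code-switching a corpus contains, one option is to count characters by script. The sketch below is illustrative and not part of the original pipeline: the function names and the example sentence are made up, and it relies only on the fact that Telugu script occupies the Unicode block U+0C00 to U+0C7F.

```python
# Telugu script occupies the Unicode block U+0C00..U+0C7F.
TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F


def script_counts(text: str) -> dict:
    """Count Telugu-block characters and ASCII Latin letters in `text`."""
    telugu = sum(1 for ch in text if TELUGU_START <= ord(ch) <= TELUGU_END)
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return {"telugu": telugu, "latin": latin}


def latin_ratio(text: str) -> float:
    """Fraction of counted script characters that are Latin (0.0 if none)."""
    counts = script_counts(text)
    total = counts["telugu"] + counts["latin"]
    return counts["latin"] / total if total else 0.0


# Hypothetical example: a Telugu sentence containing one English word.
mixed = "ఈ software చాలా బాగుంది"
print(script_counts(mixed))
print(round(latin_ratio(mixed), 2))
```

A per-article ratio like this could be used to check whether the real and fake classes differ in how much English they mix in, which would be worth knowing before training.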
Dataset creation
I scraped ~2,400 news articles from Telugu news websites, split across two categories:
- Real news: Articles from established Telugu newspapers (Eenadu, Sakshi)
- Fake news: Articles from known misinformation pages flagged by fact-checkers
Labeling was mostly automatic (based on source credibility) with manual review of edge cases.
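Source-based labeling of this kind can be sketched as a simple domain lookup. This is a minimal illustration, not the actual scraper code: the domain names, the `label_by_source` helper, and the "review" fallback are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical domain lists for illustration only. The post names Eenadu and
# Sakshi as real-news sources; the exact domains and the flagged list here
# are assumptions.
TRUSTED_DOMAINS = {"eenadu.net", "sakshi.com"}
FLAGGED_DOMAINS = {"example-misinfo-page.test"}


def label_by_source(url: str) -> str:
    """Return 'real', 'fake', or 'review' based on the article's domain."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in TRUSTED_DOMAINS:
        return "real"
    if host in FLAGGED_DOMAINS:
        return "fake"
    # Unknown or unparseable source: leave it for manual review.
    return "review"


print(label_by_source("https://www.eenadu.net/some-article"))  # real
print(label_by_source("https://unknown-site.test/story"))      # review
```

Anything that falls through to "review" corresponds to the edge cases that were checked by hand.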
Why multilingual BERT?
I chose bert-base-multilingual-cased (mBERT) from HuggingFace. The reasons:
- Pre-trained on Telugu: mBERT was trained on 104 languages including Telugu, so it has some understanding of Telugu morphology out of the box
- No language-specific preprocessing: unlike with traditional ML approaches, I didn't need to build a custom tokenizer or stemmer, because the model ships with its own subword tokenizer
- Transfer learning: even with a small dataset, fine-tuning a pre-trained model gives much better results than training from scratch
The alternative was xlm-roberta-base, which generally outperforms mBERT on low-resource languages. I tested both — XLM-R was ~2% better but required more compute.
Fine-tuning approach
I used max_length=256 instead of 512 to fit within Colab's memory constraints. Since most Telugu news articles are around 150-200 tokens anyway, this didn't hurt performance significantly.
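A fine-tuning setup along these lines might look like the sketch below, written against the HuggingFace `Trainer` API. Only the model name, `max_length=256`, and the 4 epochs come from this post; the learning rate, batch size, output directory, and the 0/1 label encoding are assumptions, and the code is an illustration rather than the exact training script.

```python
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    model_name: str = "bert-base-multilingual-cased"  # from the post
    max_length: int = 256          # from the post
    num_epochs: int = 4            # from the post
    learning_rate: float = 2e-5    # assumption: a common BERT fine-tuning value
    batch_size: int = 16           # assumption: sized for a free Colab GPU


def fine_tune(train_texts, train_labels, cfg=None,
              output_dir="mbert-telugu-fake-news"):
    """Fine-tune a binary classifier; labels assumed 0 (real) / 1 (fake)."""
    # Imported lazily so the config can be inspected without these installed.
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    cfg = cfg or FineTuneConfig()
    tokenizer = AutoTokenizer.from_pretrained(cfg.model_name)
    # Truncate long articles and pad short ones to a fixed length.
    enc = tokenizer(list(train_texts), truncation=True,
                    padding="max_length", max_length=cfg.max_length)

    class ArticleDataset(torch.utils.data.Dataset):
        def __len__(self):
            return len(train_labels)

        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in enc.items()}
            item["labels"] = torch.tensor(train_labels[i])
            return item

    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.model_name, num_labels=2)
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=cfg.num_epochs,
        learning_rate=cfg.learning_rate,
        per_device_train_batch_size=cfg.batch_size,
    )
    trainer = Trainer(model=model, args=args, train_dataset=ArticleDataset())
    trainer.train()
    return trainer
```

Halving `max_length` from 512 to 256 cuts the padded sequence length in half, which is where most of the memory saving comes from. Swapping `model_name` for `xlm-roberta-base` would be the one-line change needed to try the alternative model.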
Results
After 4 epochs of fine-tuning on 80% of the dataset:
| Metric | Score |
|---|---|
| Accuracy | 85% |
| Precision | 88% |
| Recall | 83% |
| F1 Score | 85.4% |
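Precision being higher than recall means the model is slightly under-predicting "fake": when it flags an article it is right 88% of the time, but it misses about 17% of the fake articles. That is the cautious behavior if wrongly flagging real news is the bigger risk. For a content moderation use case where missed misinformation is costlier, lowering the decision threshold to trade some precision for recall would arguably be the better setting.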
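As a sanity check, the F1 score in the table can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper names below are made up for illustration, and the confusion-matrix counts in the second call are invented numbers, not results from this project. "Fake" is assumed to be the positive class.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall from counts, with 'fake' as the positive class.

    tp: fake articles flagged as fake
    fp: real articles flagged as fake
    fn: fake articles that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Reported precision and recall from the table.
print(round(f1_score(0.88, 0.83) * 100, 1))  # → 85.4

# Invented counts, only to show how the two metrics are derived.
print(precision_recall(tp=8, fp=2, fn=2))  # → (0.8, 0.8)
```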
Error analysis
The most common failure modes:
- Satire articles: The model sometimes classifies satirical news as fake (the content is indeed not factual, but satire isn't intended to deceive)
- Opinion pieces: Opinion columns on divisive topics confused the model
- Breaking news: Real breaking news sometimes reads like misinformation before full context is established
What I'd do better
If I were revisiting this project:
- Use XLM-RoBERTa: more robust on low-resource languages
- Larger dataset: 2,400 articles is small by NLP standards. My guess is that even 10,000 could push F1 above 92%, though I haven't tested this
- Cross-source validation: my test set comes from the same sources as the training data, so the model may partly be learning source style rather than truthfulness. Truly out-of-distribution evaluation would be more rigorous
- Explainability: LIME or SHAP visualizations to show which parts of the article the model focused on
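The cross-source idea can be sketched as a split that keeps every source entirely in either train or test. This is a minimal illustration under assumed data shapes: `split_by_source` is a hypothetical helper, and each article is assumed to be a dict with a `"source"` key.

```python
import random
from collections import defaultdict


def split_by_source(articles, test_fraction=0.2, seed=0):
    """Split so that no source appears in both train and test.

    Whole sources are assigned to the test set until it holds roughly
    `test_fraction` of all articles; the remaining sources go to train.
    """
    by_source = defaultdict(list)
    for art in articles:
        by_source[art["source"]].append(art)

    # Sort first so the shuffle is reproducible for a given seed.
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)

    target = test_fraction * len(articles)
    train, test = [], []
    for src in sources:
        bucket = test if len(test) < target else train
        bucket.extend(by_source[src])
    return train, test


# Hypothetical toy data: 5 sources with 4 articles each.
articles = [{"source": f"site{s}", "text": f"article {i}"}
            for s in range(5) for i in range(4)]
train, test = split_by_source(articles)
print(len(train), len(test))  # → 16 4
```

Because labels here are derived from the source, a split like this only works if each class has enough distinct sources to appear on both sides, which is itself an argument for collecting from more outlets.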
The bigger picture
This was primarily a learning project, but it points to a real need. Regional-language NLP in India is years behind English NLP. If you're an NLP researcher or enthusiast, low-resource Indian languages offer problems that are both largely unsolved and impactful.
The dataset and model weights are available on my GitHub if you want to build on this work.