Text classification is one of those machine learning problems that sounds abstract until you realize it is running silently behind almost every digital product you use daily. When Gmail intercepts a phishing email before it reaches your inbox, that is text classification. When Netflix recommends a show based on reviews you have written, that is text classification. When a customer service platform automatically routes your support ticket to the right team, that is text classification. When TikTok decides a piece of content violates community guidelines before a human ever sees it, that is text classification — operating at billions of inferences per day.
In this tutorial, we are going to build a complete, production-aware NLP text classification system from scratch using Python. We will start with raw text data, process it through a full pipeline, train and compare multiple classifiers, evaluate them rigorously, and then peek into the modern transformer era with HuggingFace. We will also discuss how to take a trained model into production.
The NLP Pipeline: From Raw Text to Predictions
Raw text is messy, inconsistent, and structurally variable in a way that most ML algorithms cannot handle directly. The NLP pipeline bridges that gap with six stages:
- Data Collection — loading a labeled dataset
- Exploration — understanding class distribution and text length
- Preprocessing — cleaning, normalizing, and tokenizing text
- Feature Extraction — converting text to numerical vectors
- Model Training — fitting one or more classifiers
- Evaluation — measuring performance with the right metrics
Step 1: Data Collection and Exploration
We will use the 20 Newsgroups dataset: approximately 18,000 newsgroup posts across 20 topic categories. scikit-learn does not ship the data itself; fetch_20newsgroups downloads it on first use and caches it locally. We work with 6 categories for clarity:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from collections import Counter
categories = [
    "sci.space", "rec.sport.hockey", "talk.politics.guns",
    "comp.graphics", "alt.atheism", "soc.religion.christian"
]
train_data = fetch_20newsgroups(
    subset="train", categories=categories,
    remove=("headers", "footers", "quotes"),
    random_state=42
)
test_data = fetch_20newsgroups(
    subset="test", categories=categories,
    remove=("headers", "footers", "quotes"),
    random_state=42
)
class_names = train_data.target_names
lengths = [len(doc.split()) for doc in train_data.data]
print(f"Training: {len(train_data.data)} docs | Test: {len(test_data.data)} docs")
print(f"Mean length: {np.mean(lengths):.0f} words | Median: {np.median(lengths):.0f}")
We remove headers, footers, and quotes to prevent the classifier from learning to identify posters by their email signatures rather than the actual content — which would inflate accuracy while teaching the model nothing useful.
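The exploration stage also calls for a look at class balance, which the snippet above does not print. A minimal sketch, assuming integer targets and a list of class names shaped like train_data.target and train_data.target_names; the helper name class_distribution and the toy values are illustrative only, not taken from the dataset:

```python
from collections import Counter

def class_distribution(targets, names):
    """Return {class name: (document count, share of all documents)}."""
    counts = Counter(targets)
    total = len(targets)
    return {
        names[idx]: (counts.get(idx, 0), counts.get(idx, 0) / total)
        for idx in range(len(names))
    }

# Toy stand-ins for train_data.target and train_data.target_names (made-up values)
toy_targets = [0, 0, 1, 2, 2, 2, 1, 0]
toy_names = ["alt.atheism", "comp.graphics", "sci.space"]

distribution = class_distribution(toy_targets, toy_names)
for name, (count, share) in distribution.items():
    print(f"{name:<15} {count:>3} docs  ({share:.1%})")
```

With the real data, something like class_distribution(train_data.target.tolist(), train_data.target_names) should give the per-category counts. A heavily skewed result would be a reason to look harder at per-class metrics later on.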
Step 2: Text Preprocessing
Raw text contains variation that carries no meaning for classification: uppercase letters, punctuation, stopwords like “the” and “is.” Preprocessing reduces this noise systematically:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"\S+@\S+", "", text)  # emails
    text = re.sub(r"\d+", "", text)  # numbers
    # Strip ASCII punctuation before tokenizing
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    # Drop stopwords and tokens of one or two characters
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]  # default POS is noun
    return " ".join(tokens)
import time
start = time.time()
train_clean = [preprocess_text(doc) for doc in train_data.data]
test_clean = [preprocess_text(doc) for doc in test_data.data]
print(f"Preprocessed {len(train_clean)+len(test_clean)} docs in {time.time()-start:.1f}s")
print("Before:", train_data.data[0][:100])
print("After:", train_clean[0][:100])
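Lemmatization uses morphological analysis to return real dictionary base forms (“studies” → “study”). It is slower than stemming but more linguistically correct and generally produces cleaner features for classification. One caveat: WordNetLemmatizer.lemmatize treats every token as a noun unless you pass a part-of-speech tag, so the call above leaves verb forms such as “running” unchanged; lemmatize("running", pos="v") is what returns “run”.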
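To check the regex-based cleaning steps in isolation, without downloading any NLTK data, the sketch below applies the same substitutions using only the standard library. The function name clean_text and the sample string are illustrative assumptions; the sketch deliberately leaves out tokenization, stopword removal, and lemmatization:

```python
import re
import string

def clean_text(text):
    """Apply only the regex and punctuation steps of the preprocessing function."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"\S+@\S+", "", text)                # drop email addresses
    text = re.sub(r"\d+", "", text)                    # drop runs of digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                      # collapse leftover whitespace

sample = "Contact Jane.Doe@example.com or visit https://example.org/page?id=42 before 2024!"
cleaned = clean_text(sample)
print(cleaned)
```

The order of the steps matters: URLs and email addresses have to be removed before punctuation is stripped, because afterwards the "://" and "@" that the patterns rely on are gone.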
Step 3: Feature Extraction with TF-IDF
TF-IDF (Term Frequency–Inverse Document Frequency) converts cleaned text into numerical vectors, weighting words that are distinctive to specific documents higher than words that appear everywhere. It remains one of the strongest classical text representations and a hard baseline to beat on topic classification:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(
    max_features=50000,   # keep 50k most frequent terms
    ngram_range=(1, 2),   # include both unigrams and bigrams
    min_df=2,             # ignore terms in fewer than 2 documents
    max_df=0.95,          # ignore terms in more than 95% of documents
    sublinear_tf=True,    # apply log normalization to term frequencies
    strip_accents="unicode",
    analyzer="word"
)
X_train = vectorizer.fit_transform(train_clean)
X_test = vectorizer.transform(test_clean)
y_train = train_data.target
y_test = test_data.target
print(f"Train matrix: {X_train.shape}")
print(f"Sparsity: {1 - X_train.nnz / (X_train.shape[0] * X_train.shape[1]):.4f}")
# Top features for sci.space
cat_idx = class_names.index("sci.space")
cat_mask = (y_train == cat_idx)
feature_names = vectorizer.get_feature_names_out()
cat_tfidf = X_train[cat_mask].mean(axis=0).A1
top15 = cat_tfidf.argsort()[-15:][::-1]
print("Top features for sci.space:", [feature_names[i] for i in top15])
sublinear_tf=True replaces the raw count tf with 1 + log(tf), so a word appearing 100 times no longer carries 100x the weight of a word appearing once. This dampened scaling often improves performance on text classification. The resulting matrix is extremely sparse, which is why scikit-learn stores it in a compressed sparse format rather than as a dense array.
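The effect of the log scaling is easier to judge with numbers. The sketch below computes the two ingredients by hand: the sublinear term frequency 1 + ln(tf), and the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1 that scikit-learn documents for its default smooth_idf=True. The function names and the corpus size of 1,000 are assumptions for illustration, and the final L2 normalization that TfidfVectorizer applies per document is left out:

```python
import math

def sublinear_tf(count):
    """Log-scaled term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# How much weight does repetition inside one document earn?
for count in (1, 10, 100):
    print(f"raw tf = {count:>3}   sublinear tf = {sublinear_tf(count):.2f}")

# How much weight does rarity across a hypothetical 1,000-document corpus earn?
n_docs = 1000
for doc_freq in (2, 100, 950):
    print(f"df = {doc_freq:>3}   idf = {smooth_idf(n_docs, doc_freq):.2f}")
```

A count of 100 ends up with a term-frequency factor between 5 and 6 rather than 100, which is the dampening described above.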
Step 4: Training and Comparing Classifiers
Never bet everything on a single algorithm. Train several, compare rigorously with cross-validation, and let the data tell you which wins:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings("ignore")
# Each pipeline bundles vectorization + classifier
pipelines = {
    "Multinomial Naive Bayes": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50000, ngram_range=(1, 2),
                                  min_df=2, max_df=0.95, sublinear_tf=True)),
        ("clf", MultinomialNB(alpha=0.1))
    ]),
    "Linear SVM": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50000, ngram_range=(1, 2),
                                  min_df=2, max_df=0.95, sublinear_tf=True)),
        ("clf", LinearSVC(C=1.0, max_iter=2000, random_state=42))
    ]),
    "Logistic Regression": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50000, ngram_range=(1, 2),
                                  min_df=2, max_df=0.95, sublinear_tf=True)),
        ("clf", LogisticRegression(C=5.0, max_iter=1000,
                                   solver="saga", random_state=42, n_jobs=-1))
    ])
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, pipeline in pipelines.items():
    scores = cross_val_score(pipeline, train_clean, y_train,
                             cv=cv, scoring="f1_weighted", n_jobs=-1)
    results[name] = scores
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
Multinomial Naive Bayes is extremely fast, works well with small datasets, and is interpretable: it learns a smoothed probability of each word per class. Linear SVM (LinearSVC) is a workhorse for high-dimensional sparse text data. Logistic Regression outputs class probabilities that are usually reasonably well calibrated and is interpretable through its feature weights. StratifiedKFold keeps class proportions approximately equal across folds. f1_weighted averages per-class F1 weighted by class support, so it reflects precision and recall for each class rather than raw accuracy; if minority classes matter most, f1_macro, which weights every class equally, is the stricter choice.
Step 5: Evaluation Deep Dive
Accuracy is the most seductive metric and often the most misleading. If 95% of emails are legitimate, a classifier labeling everything as legitimate achieves 95% accuracy while being completely useless. Always use disaggregated metrics:
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# Train best model on full training set, evaluate on test
best_pipeline = pipelines["Linear SVM"]
best_pipeline.fit(train_clean, y_train)
y_pred = best_pipeline.predict(test_clean)
print("Classification Report — Linear SVM:")
print(classification_report(y_test, y_pred, target_names=class_names))
# Confusion matrix heatmap
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=[c.split(".")[-1] for c in class_names],
            yticklabels=[c.split(".")[-1] for c in class_names],
            linewidths=0.5, ax=ax)
ax.set_title("Confusion Matrix — Linear SVM on 20 Newsgroups",
fontweight="bold", fontsize=13)
ax.set_xlabel("Predicted Label")
ax.set_ylabel("True Label")
plt.tight_layout()
plt.show()
Precision: of all documents labeled as class X, what fraction actually were X? Recall: of all true X documents, what fraction did the model find? F1: their harmonic mean, 2PR / (P + R). When classes are imbalanced (as in most real-world text classification), per-class F1 is usually a more informative primary metric than accuracy.
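The 95% example from the start of this step makes a good sanity check for these definitions. The sketch below computes precision, recall, and F1 by hand for a baseline that labels every email as legitimate; the helper name precision_recall_f1 and the 95/5 split are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 95 legitimate emails, 5 spam; the baseline predicts "legit" for everything
y_true = ["legit"] * 95 + ["spam"] * 5
y_pred = ["legit"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_scores = precision_recall_f1(y_true, y_pred, positive="spam")

print(f"accuracy: {accuracy:.2f}")
print("spam precision, recall, F1:", spam_scores)
```

Accuracy looks strong while every score for the spam class is zero, which is exactly the failure that classification_report surfaces per class.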
Beyond TF-IDF: Transformers with HuggingFace
Classical TF-IDF methods with linear classifiers are remarkably powerful and should be your first choice for many production systems: they are fast, interpretable, and require minimal compute. But for tasks involving subtle language understanding (sarcasm detection, nuanced sentiment, complex entity relationships), transformer models like BERT typically perform much better, because they represent each word in the context of the whole sentence rather than as an independent count.
BERT (Bidirectional Encoder Representations from Transformers) reads text in both directions simultaneously, giving it richer context for every word. With HuggingFace, you can use a pre-trained transformer pipeline in a few lines:
from transformers import pipeline
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1  # CPU; device=0 for GPU
)
texts = [
    "This machine learning tutorial completely changed how I think about data science!",
    "I spent three hours debugging and it still does not work.",
    "The results were neither impressive nor terrible, just mediocre."
]
for text in texts:
    result = sentiment(text)[0]
    print(f"Text: {text[:70]}")
    print(f"  Label: {result['label']}, Score: {result['score']:.4f}\n")
When to use classical ML vs transformers: use TF-IDF + LinearSVC when you have fewer than 100k labeled examples, when inference speed or memory is constrained, or when the task is simple (topic classification, spam filtering). Use transformers when you need subtle semantic understanding, when accuracy is paramount and compute is available, or when you need to handle multiple languages.
Production Considerations
import joblib
from sklearn.pipeline import Pipeline
# Save complete pipeline (vectorizer + classifier together)
final_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50000, ngram_range=(1, 2),
                              min_df=2, max_df=0.95, sublinear_tf=True)),
    ("clf", LinearSVC(C=1.0, max_iter=2000, random_state=42))
])
final_pipeline.fit(train_clean, y_train)
joblib.dump(final_pipeline, "newsgroup_classifier.joblib", compress=3)
# Load and use
loaded = joblib.load("newsgroup_classifier.joblib")
new_text = "NASA announced a new mission to explore the outer planets."
prediction = loaded.predict([preprocess_text(new_text)])
print(f"Predicted: {class_names[prediction[0]]}")
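Using scikit-learn’s Pipeline object is non-negotiable for production: it ensures that the exact vectorizer fitted during training is the one applied during inference. A mismatched vectorizer is one of the most common and painful bugs in NLP production. Note that preprocess_text sits outside the saved pipeline here, so the serving code must import and call the same function before predict, as the example above does, and must keep its own copy of class_names, which is not stored in the joblib file. For model drift monitoring, track the distribution of predicted class labels over time. A sudden shift often signals that the incoming text has changed and the model may need retraining.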
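One simple way to put a number on such a shift is to compare the predicted-label distribution of a recent window against a baseline window. The sketch below uses total variation distance for that comparison; the choice of distance, the threshold of 0.2, the helper names, and the label counts are all assumptions made for illustration, not a standard recipe:

```python
from collections import Counter

def label_distribution(labels, classes):
    """Share of predictions per class, including classes that never occur."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts.get(c, 0) / total for c in classes}

def total_variation(p, q):
    """Half the summed absolute difference: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)

classes = ["sci.space", "rec.sport.hockey", "comp.graphics"]

# Hypothetical predicted labels from two time windows
baseline_preds = ["sci.space"] * 40 + ["rec.sport.hockey"] * 40 + ["comp.graphics"] * 20
current_preds = ["sci.space"] * 10 + ["rec.sport.hockey"] * 30 + ["comp.graphics"] * 60

drift = total_variation(
    label_distribution(baseline_preds, classes),
    label_distribution(current_preds, classes),
)

DRIFT_THRESHOLD = 0.2  # hypothetical alert level; tune it on your own traffic
print(f"drift = {drift:.2f}, alert = {drift > DRIFT_THRESHOLD}")
```

A shift in predicted labels can also come from a genuine change in what users write about, so an alert is a prompt to inspect recent inputs rather than proof that the model has degraded.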
Five Portfolio Projects to Build Your NLP Skills
1. SMS Spam Classifier (Beginner) — Use the UCI SMS Spam Collection dataset to build a binary spam/ham classifier. Deploy it as a Flask API. Key skills: preprocessing, TF-IDF, Naive Bayes, model saving.
2. Product Review Sentiment Analyzer (Beginner-Intermediate) — Use Amazon Product Reviews dataset to classify as positive, negative, or neutral. Extend with aspect-based sentiment: not just “is this review good?” but “what does the reviewer think about battery life specifically?” Key skills: multi-class classification, seaborn visualization.
3. News Article Topic Classifier (Intermediate) — Use AG News or BBC News dataset for a 4-6 category topic classifier. Add a Streamlit interface where you paste any article URL and see the predicted category in real time. Key skills: web scraping with BeautifulSoup, full ML pipeline, web app deployment.
4. Toxic Comment Detection (Intermediate-Advanced) — Use the Jigsaw Toxic Comment Classification dataset from Kaggle (multi-label: a comment can simultaneously be toxic, threatening, and obscene). Key skills: multi-label classification, class imbalance handling, threshold tuning, ethics in ML.
5. Resume Screening Classifier (Advanced) — A system that takes a job description and a set of resumes and ranks candidates by fit. Involves both classification and ranking. Key skills: PDF parsing, cosine similarity with TF-IDF, fine-tuning a sentence transformer model.
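For project 5, the ranking step can be prototyped before any PDF parsing or model fine-tuning. A rough sketch, assuming raw term counts in place of TF-IDF weights to stay dependency-free; the job description, resume texts, and names are invented for illustration:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts represented as raw term counts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

job_description = "python machine learning nlp"
resumes = {
    "resume_a": "python nlp machine learning engineer",
    "resume_b": "sales marketing manager",
    "resume_c": "python web developer",
}

# Rank resumes from most to least similar to the job description
ranked = sorted(
    resumes,
    key=lambda name: cosine_similarity(job_description, resumes[name]),
    reverse=True,
)
print(ranked)
```

Swapping the count vectors for TfidfVectorizer output or sentence-transformer embeddings keeps the ranking logic the same and changes only how each text is turned into a vector.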
Every one of these projects, documented well on GitHub with a clear README and live demo, is strong portfolio material. The pipeline is in your hands — go classify the world.