Building Your First Machine Learning Model: A Step-by-Step Guide


Machine learning is no longer a mysterious field reserved for data scientists with PhDs. With Python and scikit-learn, you can build and deploy a working machine learning model in just a few hours. This tutorial walks you step by step from raw data to a working model using the classic Iris dataset, which is simple, readily available, and rich enough to teach the core concepts.

Requirements and Environment Setup

Before you start, make sure you have Python 3.8 or later installed, along with the required libraries (Flask is only needed for the deployment step at the end):

pip install scikit-learn pandas numpy matplotlib seaborn flask

Recommendation: use Jupyter Notebook or Google Colab to run the code step by step and view the results more easily.

Step 1: Load and Explore the Data (EDA)

The Iris dataset contains 150 iris flower samples with 4 features (sepal length, sepal width, petal length, petal width) and 3 species (setosa, versicolor, virginica). It is a typical multiclass classification problem.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Explore the data
print(df.shape)        # (150, 5)
print(df.head())
print(df.describe())
print(df['species'].value_counts())  # 50 samples per species: the dataset is balanced

# Check for missing values
print(df.isnull().sum())  # Iris has no missing values

Step 2: Visualize the Data

Looking at the data before building a model is a good habit for any data scientist:

# Pair plot: shows the relationship between every pair of features
sns.pairplot(df, hue='species', palette='viridis')
plt.suptitle('Iris Dataset — Pair Plot', y=1.02)
plt.tight_layout()
plt.show()

# Heatmap correlation
plt.figure(figsize=(8, 6))
sns.heatmap(df.drop('species', axis=1).corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

From the pair plot, you will see that petal length and petal width are the two features that best separate the iris species: setosa is completely separated from the other two. This is an important insight to have before building a model.
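The separation visible in the pair plot can also be checked numerically. The sketch below is one way to do it, not part of the original walkthrough: it rebuilds the same DataFrame and compares the petal-length ranges per species. The variable names `petal_ranges`, `setosa_max`, and `others_min` are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Min and max of each petal measurement, per species
petal_cols = ["petal length (cm)", "petal width (cm)"]
petal_ranges = df.groupby("species")[petal_cols].agg(["min", "max"])
print(petal_ranges)

# If the longest setosa petal is shorter than the shortest petal of the
# other two species, a single threshold on petal length isolates setosa.
setosa_max = df.loc[df["species"] == "setosa", "petal length (cm)"].max()
others_min = df.loc[df["species"] != "setosa", "petal length (cm)"].min()
print(f"setosa max: {setosa_max}, others min: {others_min}")
print("setosa separable on petal length:", setosa_max < others_min)
```

Versicolor and virginica overlap on these ranges, which is why a trained classifier is still needed to tell them apart.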

Step 3: Prepare the Data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features and target
X = iris.data
y = iris.target

# Split into train/test with stratify so class proportions stay the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
# Train: (120, 4), Test: (30, 4)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train, then transform train
X_test_scaled = scaler.transform(X_test)         # only transform test (do not refit)

Important: always fit the scaler on the training set, then only transform the test set. If you fit on the entire dataset, you introduce data leakage: information from the test set influences training.
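One common way to make the fit-on-train rule hard to break is to bundle the scaler and the model into a scikit-learn `Pipeline`. The following is a minimal sketch under the same split settings as above; the names `pipe` and `pipe_acc` are introduced here and are not part of the original tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on whatever data is passed to fit(),
# then reuses those statistics when predicting or scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_train, y_train)          # scaler statistics come from X_train only

pipe_acc = pipe.score(X_test, y_test)  # X_test is transformed, never fitted
print(f"Pipeline test accuracy: {pipe_acc:.4f}")
```

Because the raw `X_test` is passed straight to the pipeline, there is no separate `scaler.transform` call to forget, and the same object can later be used inside cross-validation.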

Step 4: Build the Model with Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Train Logistic Regression
lr_model = LogisticRegression(max_iter=200, random_state=42)
lr_model.fit(X_train_scaled, y_train)

# Evaluate
y_pred_lr = lr_model.predict(X_test_scaled)
print("Logistic Regression Results:")
print(f"Accuracy: {lr_model.score(X_test_scaled, y_test):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=iris.target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.title('Confusion Matrix — Logistic Regression')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()
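The accuracy and the per-class precision and recall in the classification report can all be read off the confusion matrix. The sketch below shows the arithmetic in plain Python. The 3x3 matrix is made up for illustration and is not the output of the model above; `per_class` is a name introduced here.

```python
# Illustrative confusion matrix: rows are true classes, columns are predictions.
# These counts are invented for the example, not results from the tutorial.
cm = [
    [10, 0, 0],   # true setosa
    [0, 9, 1],    # true versicolor
    [0, 2, 8],    # true virginica
]
labels = ["setosa", "versicolor", "virginica"]

# Accuracy: correct predictions (the diagonal) divided by all predictions
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy = {accuracy:.4f}")

per_class = {}
for i, name in enumerate(labels):
    predicted_as_i = sum(row[i] for row in cm)  # column sum
    actually_i = sum(cm[i])                     # row sum
    precision = cm[i][i] / predicted_as_i if predicted_as_i else 0.0
    recall = cm[i][i] / actually_i if actually_i else 0.0
    per_class[name] = (precision, recall)
    print(f"{name:12s} precision={precision:.3f} recall={recall:.3f}")
```

Reading the matrix this way makes it clear which pairs of classes the model confuses, something a single accuracy number hides.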

Step 5: Compare Multiple Models

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=200, random_state=42),
    'Random Forest':        RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM (RBF Kernel)':     SVC(kernel='rbf', random_state=42),
    'KNN (k=5)':            KNeighborsClassifier(n_neighbors=5),
    'Decision Tree':        DecisionTreeClassifier(max_depth=5, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    acc = model.score(X_test_scaled, y_test)
    results[name] = acc
    print(f"{name:30s}: {acc:.4f}")

# Plot the comparison
plt.figure(figsize=(10, 5))
plt.barh(list(results.keys()), list(results.values()), color='steelblue')
plt.xlim(0.85, 1.01)
plt.xlabel('Accuracy')
plt.title('Model Comparison on Iris Dataset')
plt.tight_layout()
plt.show()
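With only 30 test samples, one misclassified flower shifts accuracy by more than three percentage points, so a ranking based on a single split can be noisy. A possible complement, sketched here as an assumption rather than as part of the original comparison, is to cross-validate each candidate on the training data. The names `candidates` and `cv_summary` are introduced for this example, and only three of the five models are included to keep it short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

# Stratified folds keep the class balance in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, estimator in candidates.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name:22s}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The standard deviation across folds gives a rough sense of whether the difference between two models is larger than the noise. The test set stays untouched until a final model is chosen.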

Step 6: Tune Hyperparameters with GridSearchCV

from sklearn.model_selection import GridSearchCV, cross_val_score

# Search for the best parameters for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth':    [None, 3, 5, 10],
    'min_samples_split': [2, 5, 10],
}

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf, param_grid,
    cv=5,           # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,      # use all CPU cores
    verbose=1
)
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score:   {grid_search.best_score_:.4f}")

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
test_acc = best_model.score(X_test_scaled, y_test)
print(f"Test accuracy:   {test_acc:.4f}")
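Beyond `best_params_`, the fitted search object keeps the score of every combination in `cv_results_`, which helps show whether the winner is clearly better or just one of several near-ties. The sketch below is a reduced illustration: it uses a smaller grid than the one above so that it runs quickly, and it skips scaling because tree-based models split on thresholds and do not depend on feature scale. The names `small_grid`, `search`, and `top` are introduced here.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Deliberately small grid (4 combinations) to keep the example fast
small_grid = {"n_estimators": [50, 100], "max_depth": [None, 3]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# One row per parameter combination, best rank first
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(top)
```

If several rows share nearly the same mean score, choosing the simplest configuration among them (fewer trees, shallower depth) is a reasonable tie-breaker.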

Step 7: Deploy the Model with a Flask API

Once you have a good model, the final step is to package it for use in a real application:

import pickle

# Save the model and the scaler
with open('iris_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

with open('iris_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# app.py: a simple Flask API (save this part as a separate file)
from flask import Flask, request, jsonify
import pickle
import numpy as np

app = Flask(__name__)

with open('iris_model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('iris_scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

iris_classes = ['setosa', 'versicolor', 'virginica']

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = np.array(data['features']).reshape(1, -1)
    features_scaled = scaler.transform(features)
    prediction = model.predict(features_scaled)[0]
    probability = model.predict_proba(features_scaled)[0]
    return jsonify({
        'species': iris_classes[prediction],
        'confidence': float(probability.max())
    })

if __name__ == '__main__':
    # debug=True is for local development only; turn it off in production
    app.run(debug=True)

To test the API: curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"features": [5.1, 3.5, 1.4, 0.2]}'
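The endpoint above assumes every request carries a well-formed `features` list. A request without that key, or with the wrong number of values, would raise an exception inside `predict()`. One possible guard is a small validation helper like the sketch below. `validate_features` and `N_FEATURES` are hypothetical names introduced for this example. The helper uses only the standard library, so it can be tested without starting Flask.

```python
import math
from numbers import Real

N_FEATURES = 4  # sepal length, sepal width, petal length, petal width


def validate_features(payload):
    """Return (features, error). Exactly one of the two is None."""
    if not isinstance(payload, dict) or "features" not in payload:
        return None, "JSON body must be an object with a 'features' key"

    features = payload["features"]
    if not isinstance(features, list) or len(features) != N_FEATURES:
        return None, f"'features' must be a list of {N_FEATURES} numbers"

    for value in features:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, Real):
            return None, "every feature must be a number"
        if not math.isfinite(value):
            return None, "every feature must be finite"

    return [float(v) for v in features], None


print(validate_features({"features": [5.1, 3.5, 1.4, 0.2]}))
print(validate_features({"features": [5.1, 3.5]}))
```

Inside `predict()`, one way to use it would be to call `validate_features(request.get_json(silent=True))` and return `jsonify({'error': error}), 400` whenever `error` is not `None`, before touching the scaler or the model.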

Next Steps

You have just completed a full ML pipeline, from EDA, preprocessing, training, and evaluation to deployment. This is a foundation that applies to most real-world classification problems.

Next steps to go further: learn feature engineering to improve feature quality, study imbalanced datasets and SMOTE for when classes are not balanced, explore XGBoost and LightGBM for tabular data, and finally look at MLflow or Weights & Biases to track experiments professionally.
