Classification: Is This Part Good or Defective?
What Is Classification?
Classification assigns data points to predefined categories. In industry, this means deciding if a part is "pass" or "fail", if a machine state is "normal" or "critical", or if a vibration pattern indicates "balanced" or "bearing fault".
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
Decision Trees: Sequential Decisions
A decision tree splits data step by step, asking yes/no questions about features until it reaches a conclusion. This mirrors how a technician diagnoses faults: "Is the temperature above 80? If yes, is the vibration above 5 mm/s?"
from sklearn.tree import DecisionTreeClassifier, plot_tree
np.random.seed(42)
n = 600
temp = np.random.uniform(60, 95, n)
pressure = np.random.uniform(2.0, 5.0, n)
defect = ((temp > 82) & (pressure < 3.0)).astype(int)
df = pd.DataFrame({"temp_c": temp, "pressure_bar": pressure, "defect": defect})
X = df[["temp_c", "pressure_bar"]]
y = df["defect"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, tree.predict(X_test)):.3f}")
fig, ax = plt.subplots(figsize=(14, 8))
plot_tree(tree, feature_names=["temp_c", "pressure_bar"],
          class_names=["OK", "Defect"], filled=True, rounded=True, ax=ax)
plt.tight_layout()
plt.show()
The tree's transparency is its greatest strength: you can explain every decision to a quality engineer in plain language.
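This transparency can be shown directly: scikit-learn's export_text prints the learned splits as nested if/else rules. The sketch below trains a small tree on synthetic data (the thresholds and feature names are illustrative, not taken from a real process).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic process data: temperature and pressure, defect when hot + low pressure
rng = np.random.default_rng(0)
X = rng.uniform([60, 2.0], [95, 5.0], size=(200, 2))
y = ((X[:, 0] > 82) & (X[:, 1] < 3.0)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the tree as plain-language threshold rules
rules = export_text(tree, feature_names=["temp_c", "pressure_bar"])
print(rules)
```

The printed rules read like a technician's checklist, which makes them easy to validate against domain knowledge.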
Random Forests: Strength in Numbers
A single tree overfits easily. A Random Forest trains hundreds of trees, each on a bootstrap sample of the data and a random subset of the features, then takes a majority vote, dramatically improving robustness.
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    "temp_c": np.random.uniform(60, 95, n),
    "pressure_bar": np.random.uniform(2.0, 5.0, n),
    "humidity_pct": np.random.uniform(30, 80, n),
    "cycle_time_s": np.random.uniform(10, 30, n)
})
df["defect"] = (
    ((df["temp_c"] > 82) & (df["pressure_bar"] < 3.0)) |
    ((df["humidity_pct"] > 70) & (df["cycle_time_s"] > 25))
).astype(int)
X = df.drop("defect", axis=1)
y = df["defect"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test), target_names=["OK", "Defect"]))
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 4))
plt.title("Feature Importance for Defect Prediction")
plt.tight_layout()
plt.show()
SVM: Separating Classes With an Optimal Boundary
Support Vector Machines (SVMs) find the decision boundary that maximizes the margin between classes. For data that is not linearly separable, the kernel trick implicitly maps it into a higher-dimensional space where a linear separator can exist.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm = SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42)
svm.fit(X_train_scaled, y_train)
print(f"SVM Accuracy: {accuracy_score(y_test, svm.predict(X_test_scaled)):.3f}")
SVM and other distance-based algorithms are sensitive to feature scale. Temperature in degrees (60--95) would dominate pressure in bar (2--5) without scaling.
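The effect can be checked empirically. The sketch below, on synthetic data with the same scale mismatch (temperature in tens of degrees, pressure in single-digit bar), compares RBF-SVM accuracy with and without standardization; the scaled model usually performs at least as well, and often noticeably better.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic data: two features on very different numeric scales
rng = np.random.default_rng(42)
n = 500
temp = rng.uniform(60, 95, n)        # tens of degrees
pressure = rng.uniform(2.0, 5.0, n)  # single-digit bar
X = np.column_stack([temp, pressure])
y = ((temp > 82) & (pressure < 3.0)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Unscaled: the temperature axis dominates the RBF distance computation
raw = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

# Scaled: fit the scaler on training data only, then transform both splits
scaler = StandardScaler().fit(X_tr)
scaled = SVC(kernel="rbf").fit(scaler.transform(X_tr), y_tr).score(
    scaler.transform(X_te), y_te)
print(f"unscaled: {raw:.3f}  scaled: {scaled:.3f}")
```

Note that the scaler is fit on the training split only; fitting it on all data would leak test-set statistics into training.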
Choosing the Right Algorithm
| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Decision Tree | Interpretable, fast | Overfits easily | Explainable decisions |
| Random Forest | Robust, feature ranking | Slower, less transparent | General-purpose |
| SVM | Strong with small data | Slow on large datasets | High-dimensional data |
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", random_state=42)
}
for name, model in models.items():
    if "SVM" in name:
        model.fit(X_train_scaled, y_train)
        score = accuracy_score(y_test, model.predict(X_test_scaled))
    else:
        model.fit(X_train, y_train)
        score = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {score:.3f}")
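A single train/test split can be optimistic or pessimistic by chance. Where a more stable comparison is needed, k-fold cross-validation averages accuracy over several splits; the sketch below applies it to a self-contained synthetic dataset (feature names and thresholds are illustrative).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic process data with a simple defect rule
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "temp_c": rng.uniform(60, 95, n),
    "pressure_bar": rng.uniform(2.0, 5.0, n),
})
df["defect"] = ((df["temp_c"] > 82) & (df["pressure_bar"] < 3.0)).astype(int)

# cross_val_score trains and evaluates on 5 different train/test partitions
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, df[["temp_c", "pressure_bar"]], df["defect"], cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes it visible whether two models differ by more than split-to-split noise.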
Practical Example: Automatically Classifying Produced Part Quality
A metal stamping factory inspects parts using four sensor measurements and wants to automate pass/fail decisions.
np.random.seed(42)
n = 2000
df = pd.DataFrame({
    "thickness_mm": np.random.normal(3.00, 0.10, n),
    "hardness_hrc": np.random.normal(58, 2, n),
    "surface_roughness_um": np.random.exponential(1.5, n),
    "press_force_kn": np.random.normal(500, 30, n)
})
df["quality"] = "pass"
df.loc[df["thickness_mm"] < 2.80, "quality"] = "fail"
df.loc[df["thickness_mm"] > 3.20, "quality"] = "fail"
df.loc[df["hardness_hrc"] < 54, "quality"] = "fail"
df.loc[df["surface_roughness_um"] > 4.0, "quality"] = "fail"
X = df.drop("quality", axis=1)
y = df["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
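In quality inspection, the two error types have very different costs: shipping a defective part is usually worse than re-inspecting a good one. A confusion matrix makes that breakdown explicit. The sketch below is self-contained on simplified synthetic data (two of the four sensors, with illustrative tolerance limits).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Simplified synthetic inspection data: in-spec thickness and hardness => pass
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "thickness_mm": rng.normal(3.00, 0.10, n),
    "hardness_hrc": rng.normal(58, 2, n),
})
df["quality"] = np.where(
    df["thickness_mm"].between(2.80, 3.20) & (df["hardness_hrc"] >= 54),
    "pass", "fail")

X_tr, X_te, y_tr, y_te = train_test_split(
    df[["thickness_mm", "hardness_hrc"]], df["quality"],
    test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Rows = true class, columns = predicted class, in the label order given
cm = confusion_matrix(y_te, clf.predict(X_te), labels=["fail", "pass"])
print(cm)
# cm[0, 1] counts defective parts predicted "pass" -- typically the costly error
```

If that cell is too high, options include lowering the decision threshold via predict_proba or weighting the "fail" class more heavily during training.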
Summary
In this lesson you learned three classification algorithms for industrial applications. Decision Trees provide transparent decisions. Random Forests combine many trees for robust predictions and feature importance. SVMs find optimal boundaries with scaled features. You compared all three and applied Random Forest to automate quality inspection. In the next lesson, you will explore unsupervised learning with clustering, where data has no labels.