Classification: Is This Part Good or Defective?
What Is Classification?
Classification assigns data points to predefined categories. In industry, this means deciding if a part is "pass" or "fail", if a machine state is "normal" or "critical", or if a vibration pattern indicates "balanced" or "bearing fault".
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
Decision Trees: Sequential Decisions
A decision tree splits data step by step, asking yes/no questions about features until it reaches a conclusion. This mirrors how a technician diagnoses faults: "Is the temperature above 80? If yes, is the vibration above 5 mm/s?"
from sklearn.tree import DecisionTreeClassifier, plot_tree
np.random.seed(42)
n = 600
temp = np.random.uniform(60, 95, n)
pressure = np.random.uniform(2.0, 5.0, n)
defect = ((temp > 82) & (pressure < 3.0)).astype(int)
df = pd.DataFrame({"temp_c": temp, "pressure_bar": pressure, "defect": defect})
X = df[["temp_c", "pressure_bar"]]
y = df["defect"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, tree.predict(X_test)):.3f}")
fig, ax = plt.subplots(figsize=(14, 8))
plot_tree(tree, feature_names=["temp_c", "pressure_bar"],
          class_names=["OK", "Defect"], filled=True, rounded=True, ax=ax)
plt.tight_layout()
plt.show()
The tree's transparency is its greatest strength: you can explain every decision to a quality engineer in plain language.
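This transparency can be shown directly: scikit-learn's export_text prints the learned splits as nested if/else rules. The sketch below trains a small tree on synthetic data (the thresholds and feature names are illustrative, not taken from a real process).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic process data: temperature and pressure, defect when hot + low pressure
rng = np.random.default_rng(0)
X = rng.uniform([60, 2.0], [95, 5.0], size=(200, 2))
y = ((X[:, 0] > 82) & (X[:, 1] < 3.0)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the tree as plain-language threshold rules
rules = export_text(tree, feature_names=["temp_c", "pressure_bar"])
print(rules)
```

The printed rules read like a technician's checklist, which makes them easy to validate against domain knowledge.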
Random Forests: Strength in Numbers
A single tree overfits easily. A Random Forest trains hundreds of trees, each on a bootstrap sample of the data and a random subset of the features, then takes a majority vote, dramatically improving robustness.
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    "temp_c": np.random.uniform(60, 95, n),
    "pressure_bar": np.random.uniform(2.0, 5.0, n),
    "humidity_pct": np.random.uniform(30, 80, n),
    "cycle_time_s": np.random.uniform(10, 30, n)
})
df["defect"] = (
    ((df["temp_c"] > 82) & (df["pressure_bar"] < 3.0)) |
    ((df["humidity_pct"] > 70) & (df["cycle_time_s"] > 25))
).astype(int)
X = df.drop("defect", axis=1)
y = df["defect"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test), target_names=["OK", "Defect"]))
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 4))
plt.title("Feature Importance for Defect Prediction")
plt.tight_layout()
plt.show()
SVM: Separating Classes With an Optimal Boundary
Support Vector Machines (SVMs) find the decision boundary that maximizes the margin between classes. For data that is not linearly separable, the kernel trick implicitly maps it into a higher-dimensional space where a linear separator can exist.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm = SVC(kernel="rbf", C=1.0, gamma="scale", random_state=42)
svm.fit(X_train_scaled, y_train)
print(f"SVM Accuracy: {accuracy_score(y_test, svm.predict(X_test_scaled)):.3f}")
SVM and other distance-based algorithms are sensitive to feature scale. Temperature in degrees (60--95) would dominate pressure in bar (2--5) without scaling.
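The effect can be checked empirically. The sketch below, on synthetic data with the same scale mismatch (temperature in tens of degrees, pressure in single-digit bar), compares RBF-SVM accuracy with and without standardization; the scaled model usually performs at least as well, and often noticeably better.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic data: two features on very different numeric scales
rng = np.random.default_rng(42)
n = 500
temp = rng.uniform(60, 95, n)        # tens of degrees
pressure = rng.uniform(2.0, 5.0, n)  # single-digit bar
X = np.column_stack([temp, pressure])
y = ((temp > 82) & (pressure < 3.0)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Unscaled: the temperature axis dominates the RBF distance computation
raw = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

# Scaled: fit the scaler on training data only, then transform both splits
scaler = StandardScaler().fit(X_tr)
scaled = SVC(kernel="rbf").fit(scaler.transform(X_tr), y_tr).score(
    scaler.transform(X_te), y_te)
print(f"unscaled: {raw:.3f}  scaled: {scaled:.3f}")
```

Note that the scaler is fit on the training split only; fitting it on all data would leak test-set statistics into training.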
Choosing the Right Algorithm
| Algorithm | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Decision Tree | Interpretable, fast | Overfits easily | Explainable decisions |
| Random Forest | Robust, feature ranking | Slower, less transparent | General-purpose |
| SVM | Strong with small data | Slow on large datasets | High-dimensional data |
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM (RBF)": SVC(kernel="rbf", random_state=42)
}
for name, model in models.items():
    if "SVM" in name:
        model.fit(X_train_scaled, y_train)
        score = accuracy_score(y_test, model.predict(X_test_scaled))
    else:
        model.fit(X_train, y_train)
        score = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {score:.3f}")
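A single train/test split can be optimistic or pessimistic by chance. Where a more stable comparison is needed, k-fold cross-validation averages accuracy over several splits; the sketch below applies it to a self-contained synthetic dataset (feature names and thresholds are illustrative).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic process data with a simple defect rule
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "temp_c": rng.uniform(60, 95, n),
    "pressure_bar": rng.uniform(2.0, 5.0, n),
})
df["defect"] = ((df["temp_c"] > 82) & (df["pressure_bar"] < 3.0)).astype(int)

# cross_val_score trains and evaluates on 5 different train/test partitions
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, df[["temp_c", "pressure_bar"]], df["defect"], cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes it visible whether two models differ by more than split-to-split noise.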
Practical Example: Automatically Classifying Produced Part Quality
A metal stamping factory inspects parts using four sensor measurements and wants to automate pass/fail decisions.
np.random.seed(42)
n = 2000
df = pd.DataFrame({
    "thickness_mm": np.random.normal(3.00, 0.10, n),
    "hardness_hrc": np.random.normal(58, 2, n),
    "surface_roughness_um": np.random.exponential(1.5, n),
    "press_force_kn": np.random.normal(500, 30, n)
})
df["quality"] = "pass"
df.loc[df["thickness_mm"] < 2.80, "quality"] = "fail"
df.loc[df["thickness_mm"] > 3.20, "quality"] = "fail"
df.loc[df["hardness_hrc"] < 54, "quality"] = "fail"
df.loc[df["surface_roughness_um"] > 4.0, "quality"] = "fail"
X = df.drop("quality", axis=1)
y = df["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
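In quality inspection, the two error types have very different costs: shipping a defective part is usually worse than re-inspecting a good one. A confusion matrix makes that breakdown explicit. The sketch below is self-contained on simplified synthetic data (two of the four sensors, with illustrative tolerance limits).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Simplified synthetic inspection data: in-spec thickness and hardness => pass
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "thickness_mm": rng.normal(3.00, 0.10, n),
    "hardness_hrc": rng.normal(58, 2, n),
})
df["quality"] = np.where(
    df["thickness_mm"].between(2.80, 3.20) & (df["hardness_hrc"] >= 54),
    "pass", "fail")

X_tr, X_te, y_tr, y_te = train_test_split(
    df[["thickness_mm", "hardness_hrc"]], df["quality"],
    test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Rows = true class, columns = predicted class, in the label order given
cm = confusion_matrix(y_te, clf.predict(X_te), labels=["fail", "pass"])
print(cm)
# cm[0, 1] counts defective parts predicted "pass" -- typically the costly error
```

If that cell is too high, options include lowering the decision threshold via predict_proba or weighting the "fail" class more heavily during training.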
Summary
In this lesson you learned three classification algorithms for industrial applications. Decision Trees provide transparent decisions. Random Forests combine many trees for robust predictions and feature importance. SVMs find optimal boundaries with scaled features. You compared all three and applied Random Forest to automate quality inspection. In the next lesson, you will explore unsupervised learning with clustering, where data has no labels.