Classification Algorithms: Smart Automated Decisions
What Are Classification Algorithms?
Imagine a quality inspector at a plastic parts factory examining each part and deciding: good or defective. That decision relies on observations (color, dimensions, weight) and accumulated experience. Classification algorithms do the same thing -- they learn from labeled examples to categorize new data into predefined classes.
Classification is a supervised machine learning task where a model learns from labeled data and then predicts the class of new data. Classes can be binary (good/defective) or multi-class (electrical/mechanical/thermal fault).
Decision Tree
The simplest and most interpretable classification algorithm. Think of it as a series of yes/no questions leading to a final decision.
Industrial example: Classifying injection-molded parts -- is the part good?
```
Is mold temperature > 210 °C?
|-- Yes -> Is injection pressure > 80 bar?
|   |-- Yes -> Good (95% confidence)
|   |-- No  -> Defective - internal voids
|-- No -> Is cooling time > 15 seconds?
    |-- Yes -> Good (88% confidence)
    |-- No  -> Defective - surface warping
```
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Training data: [mold temp, injection pressure, cooling time]
X = np.array([
    [220, 85, 18], [200, 70, 12], [215, 90, 16],
    [195, 60, 10], [225, 88, 20], [205, 75, 14],
    [210, 82, 17], [190, 55, 9], [218, 86, 19],
    [198, 65, 11], [222, 92, 21], [208, 78, 15]
])

# Labels: 1 = good, 0 = defective
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)  # fixed seed for reproducibility

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Classify a new part
new_part = [[212, 83, 16]]
result = tree.predict(new_part)
print(f"Classification: {'Good' if result[0] == 1 else 'Defective'}")
```
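Because each split is an explicit threshold test, the learned rules can be printed directly; a minimal sketch using scikit-learn's `export_text` (the data here is a small illustrative subset, not a real process dataset):

```python
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

# Illustrative training data: [mold temp, injection pressure, cooling time]
X = np.array([[220, 85, 18], [200, 70, 12], [215, 90, 16],
              [195, 60, 10], [225, 88, 20], [205, 75, 14]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = good, 0 = defective

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned rules as indented if/else text
rules = export_text(tree, feature_names=["mold_temp", "pressure", "cooling_time"])
print(rules)
```

Process engineers can read this output directly, which is the main reason decision trees are popular on the shop floor.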
Random Forest
The problem with a single tree: it can overfit the training data. The solution? Build hundreds of different trees and take a majority vote.
Imagine 100 quality inspectors with different experience -- each inspects the part and votes. The final decision is the majority opinion. This is far more accurate than relying on a single inspector.
```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_train, y_train)

# Feature importance
feature_names = ["Mold Temp", "Injection Pressure", "Cooling Time"]
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"  {name}: {importance:.2%}")
```
Bonus: Random forest provides feature importance -- which factor affects quality the most. This helps process engineers focus improvement efforts.
Support Vector Machine (SVM)
SVM finds the optimal separating line (or hyperplane in higher dimensions) between classes, maximizing the margin between the closest points of each class.
Picture a 2D map: the x-axis is vibration and the y-axis is temperature. Red points (faults) and green points (normal) overlap slightly. SVM finds the line that separates them with the largest possible margin.
```python
from sklearn.svm import SVC

svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_model.fit(X_train, y_train)

prediction = svm_model.predict(new_part)
print(f"SVM classification: {'Good' if prediction[0] == 1 else 'Defective'}")
```
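Because SVM is distance-based, feature scaling matters in practice (see the comparison table below). A sketch of the usual fix, chaining a `StandardScaler` in front of the SVM with a scikit-learn pipeline (the data is a small illustrative subset):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import numpy as np

# Illustrative data: [mold temp, injection pressure, cooling time]
X_train = np.array([[220, 85, 18], [200, 70, 12], [215, 90, 16],
                    [195, 60, 10], [225, 88, 20], [205, 75, 14]])
y_train = np.array([1, 0, 1, 0, 1, 0])  # 1 = good, 0 = defective

# Scale features to zero mean / unit variance before the RBF kernel sees them
svm_scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_scaled.fit(X_train, y_train)

print(svm_scaled.predict([[212, 83, 16]]))
```

Without scaling, the feature with the largest numeric range (here mold temperature) would dominate the distance calculation.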
| Kernel Type | Use Case |
|---|---|
| Linear | Data separable by a straight line |
| Polynomial (poly) | Curved decision boundaries |
| Radial Basis Function (rbf) | Most flexible -- default choice |
k-Nearest Neighbors (k-NN)
The simplest algorithm of all: to classify a new point, look at the nearest k training points and choose the most common class among them.
Imagine you are a new engineer at a plant and you see a suspicious part. You ask the 5 nearest expert colleagues -- 4 say "defective" and 1 says "good." Decision: defective.
```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(f"k-NN classification: {'Good' if knn.predict(new_part)[0] == 1 else 'Defective'}")
```
Warning: k-NN is slow with large datasets because it computes the distance to every training point at each prediction.
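The choice of k itself is a hyperparameter worth tuning rather than guessing. A sketch of doing this with grid search over odd values of k (on synthetic data standing in for real sensor measurements, with scaling included since k-NN is distance-based):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature data stands in for real measurements
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Scale first, then search odd k values with 5-fold cross-validation
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)

print("Best k:", grid.best_params_["kneighborsclassifier__n_neighbors"])
print(f"CV accuracy: {grid.best_score_:.3f}")
```

Odd values of k avoid ties in binary voting; too small a k overfits to noise, too large a k blurs class boundaries.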
Algorithm Comparison
| Criterion | Decision Tree | Random Forest | SVM | k-NN |
|---|---|---|---|---|
| Interpretability | Very high | Medium | Low | Medium |
| Training speed | Fast | Medium | Slow with large data | No training |
| Prediction speed | Very fast | Fast | Fast | Slow |
| Overfitting resistance | No | Yes | Yes | Depends on k |
| Needs normalization | No | No | Yes | Yes |
Confusion Matrix
After training a model, how do we evaluate its performance? The confusion matrix reveals error types in detail.
| | Predicted Good | Predicted Defective |
|---|---|---|
| Actual Good | TP = 85 | FN = 5 |
| Actual Defective | FP = 3 | TN = 7 |
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Of those predicted "good," how many are truly good? |
| Recall | TP / (TP + FN) | Of the truly good, how many did we catch? |
| F1-Score | 2 x (P x R) / (P + R) | Balance between precision and recall |
| Accuracy | (TP + TN) / Total | Overall correct prediction rate |
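The formulas above can be checked by hand. A quick sketch using hypothetical counts, with "good" as the positive class:

```python
# Hypothetical confusion-matrix counts ("good" is the positive class)
tp, fn, fp, tn = 85, 5, 3, 7

precision = tp / (tp + fp)                       # 85 / 88
recall = tp / (tp + fn)                          # 85 / 90
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)       # 92 / 100

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```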
```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = forest.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["Defective", "Good"]))
```
When Does Recall Matter More Than Precision?
In industry, passing a defective part to the customer (False Negative) is far worse than rejecting a good part (False Positive). Therefore we want high recall for the defective class -- detecting as many defects as possible even if we reject some good parts.
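In scikit-learn this trade-off can be steered by thresholding predicted probabilities instead of calling `predict`. A sketch on synthetic imbalanced data (the 0.2 threshold and class layout are illustrative assumptions, with class 0 playing the defective class):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: class 0 = defective (rare), class 1 = good
X, y = make_classification(n_samples=500, weights=[0.1, 0.9], random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Default rule flags a part only when P(defective) > 0.5.
# Lowering the threshold to 0.2 flags more parts: we reject more good
# parts (lower precision) but let fewer defects escape (higher recall).
p_defective = clf.predict_proba(X)[:, 0]
flagged = p_defective > 0.2

print(f"Flagged as defective: {flagged.sum()} of {len(y)} parts")
```

Combined with `class_weight='balanced'`, threshold tuning is the standard way to bias a classifier toward catching defects.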
ROC Curve and Area Under Curve (AUC)
The ROC curve plots the relationship between the True Positive Rate (Recall) and the False Positive Rate at different classification thresholds.
- AUC = 1.0: Perfect model
- AUC = 0.5: No better than random guessing
- AUC > 0.9: Excellent for industrial applications
```python
from sklearn.metrics import roc_auc_score

# Need probabilities, not class labels
y_proba = forest.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"AUC: {auc:.3f}")
```
Industrial Applications
Part Classification: Good vs. Defective
A camera on the production line captures an image of every part. A classification algorithm (typically random forest or neural network) extracts features from the image and decides immediately: pass or reject.
Fault Type Identification
Instead of just "fault detected," multi-class classification identifies the type: electrical fault (current spike), mechanical fault (abnormal vibration), thermal fault (overheating). This directs the maintenance team straight to the problem.
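A minimal multi-class sketch of this idea, on hypothetical sensor readings (the current, vibration, and temperature values are invented for illustration; real fault signatures would come from historical maintenance records):

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Hypothetical readings: [motor current (A), vibration (mm/s), temperature (C)]
X = np.array([
    [42, 2.1, 65], [45, 2.3, 68],    # electrical: current spike
    [12, 9.8, 62], [11, 8.9, 60],    # mechanical: abnormal vibration
    [13, 2.2, 98], [12, 2.4, 103],   # thermal: overheating
    [12, 2.0, 63], [13, 2.2, 61],    # normal operation
])
y = np.array(["electrical", "electrical", "mechanical", "mechanical",
              "thermal", "thermal", "normal", "normal"])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# A new reading with a current spike should be routed to the electricians
print(clf.predict([[44, 2.2, 66]]))
```

scikit-learn handles string class labels directly, so the prediction is already the fault type the maintenance team needs.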
Raw Material Sorting
In recycling plants, infrared sensors with an SVM algorithm classify plastic by type (PET, HDPE, PVC) for automatic real-time sorting.
Practical Tips
- Start with random forest -- Excellent performance with minimal tuning.
- Balance the classes -- In industry, defects are rare (1-5%). Use `class_weight='balanced'` or SMOTE.
- Feature importance -- Use it to decide which sensors are worth investing in.
- Do not ignore the confusion matrix -- Overall accuracy alone is misleading with imbalanced classes.
- Normalize data -- Required for SVM and k-NN, optional for trees.
- Cross-validation -- Use 5-fold cross-validation for reliable evaluation.
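The last tip takes one line in scikit-learn. A sketch with `cross_val_score` (on synthetic data standing in for real process measurements):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-feature stand-in for real process data
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Train and evaluate on 5 different train/test splits
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds guards against a single lucky (or unlucky) train/test split.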