Classification Algorithms: Smart Automated Decisions
What Are Classification Algorithms?
Imagine a quality inspector at a plastic parts factory examining each part and deciding: good or defective. That decision relies on observations (color, dimensions, weight) and accumulated experience. Classification algorithms do the same thing -- they learn from labeled examples to categorize new data into predefined classes.
Classification is a supervised machine learning task where a model learns from labeled data and then predicts the class of new data. Classes can be binary (good/defective) or multi-class (electrical/mechanical/thermal fault).
Decision Tree
The simplest and most interpretable classification algorithm. Think of it as a series of yes/no questions leading to a final decision.
Industrial example: Classifying injection-molded parts -- is the part good?
```
Is mold temperature > 210 °C?
|-- Yes -> Is injection pressure > 80 bar?
|   |-- Yes -> Good (95% confidence)
|   |-- No  -> Defective - internal voids
|-- No -> Is cooling time > 15 seconds?
    |-- Yes -> Good (88% confidence)
    |-- No  -> Defective - surface warping
```
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Training data: [mold temp, injection pressure, cooling time]
X = np.array([
    [220, 85, 18], [200, 70, 12], [215, 90, 16],
    [195, 60, 10], [225, 88, 20], [205, 75, 14],
    [210, 82, 17], [190, 55, 9], [218, 86, 19],
    [198, 65, 11], [222, 92, 21], [208, 78, 15]
])

# Labels: 1 = good, 0 = defective
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)  # fixed seed for reproducibility

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Classify a new part
new_part = [[212, 83, 16]]
result = tree.predict(new_part)
print(f"Classification: {'Good' if result[0] == 1 else 'Defective'}")
```
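Because each split is an explicit threshold test, the learned rules can be printed directly; a minimal sketch using scikit-learn's `export_text` (the data here is a small illustrative subset, not a real process dataset):

```python
from sklearn.tree import DecisionTreeClassifier, export_text
import numpy as np

# Illustrative training data: [mold temp, injection pressure, cooling time]
X = np.array([[220, 85, 18], [200, 70, 12], [215, 90, 16],
              [195, 60, 10], [225, 88, 20], [205, 75, 14]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = good, 0 = defective

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned rules as indented if/else text
rules = export_text(tree, feature_names=["mold_temp", "pressure", "cooling_time"])
print(rules)
```

Process engineers can read this output directly, which is the main reason decision trees are popular on the shop floor.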
Random Forest
The problem with a single tree: it can overfit the training data. The solution? Build hundreds of different trees and take a majority vote.
Imagine 100 quality inspectors with different experience -- each inspects the part and votes. The final decision is the majority opinion. This is far more accurate than relying on a single inspector.
```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_train, y_train)

# Feature importance
feature_names = ["Mold Temp", "Injection Pressure", "Cooling Time"]
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"  {name}: {importance:.2%}")
```
Bonus: Random forest provides feature importance -- which factor affects quality the most. This helps process engineers focus improvement efforts.
Support Vector Machine (SVM)
SVM finds the optimal separating line (or hyperplane in higher dimensions) between classes, maximizing the margin between the closest points of each class.
Picture a 2D map: the x-axis is vibration and the y-axis is temperature. Red points (faults) and green points (normal) overlap slightly. SVM finds the line that separates them with the largest possible margin.
```python
from sklearn.svm import SVC

svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_model.fit(X_train, y_train)

prediction = svm_model.predict(new_part)
print(f"SVM classification: {'Good' if prediction[0] == 1 else 'Defective'}")
```
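Because SVM is distance-based, feature scaling matters in practice (see the comparison table below). A sketch of the usual fix, chaining a `StandardScaler` in front of the SVM with a scikit-learn pipeline (the data is a small illustrative subset):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import numpy as np

# Illustrative data: [mold temp, injection pressure, cooling time]
X_train = np.array([[220, 85, 18], [200, 70, 12], [215, 90, 16],
                    [195, 60, 10], [225, 88, 20], [205, 75, 14]])
y_train = np.array([1, 0, 1, 0, 1, 0])  # 1 = good, 0 = defective

# Scale features to zero mean / unit variance before the RBF kernel sees them
svm_scaled = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_scaled.fit(X_train, y_train)

print(svm_scaled.predict([[212, 83, 16]]))
```

Without scaling, the feature with the largest numeric range (here mold temperature) would dominate the distance calculation.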
| Kernel Type | Use Case |
|---|---|
| Linear | Data separable by a straight line |
| Polynomial (poly) | Curved decision boundaries |
| Radial Basis Function (rbf) | Most flexible -- default choice |
k-Nearest Neighbors (k-NN)
The simplest algorithm of all: to classify a new point, look at the nearest k training points and choose the most common class among them.
Imagine you are a new engineer at a plant and you see a suspicious part. You ask the 5 nearest expert colleagues -- 4 say "defective" and 1 says "good." Decision: defective.
```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(f"k-NN classification: {'Good' if knn.predict(new_part)[0] == 1 else 'Defective'}")
```
Warning: k-NN is slow with large datasets because it computes the distance to every training point at each prediction.
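The choice of k itself is a hyperparameter worth tuning rather than guessing. A sketch of doing this with grid search over odd values of k (on synthetic data standing in for real sensor measurements, with scaling included since k-NN is distance-based):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature data stands in for real measurements
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Scale first, then search odd k values with 5-fold cross-validation
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)

print("Best k:", grid.best_params_["kneighborsclassifier__n_neighbors"])
print(f"CV accuracy: {grid.best_score_:.3f}")
```

Odd values of k avoid ties in binary voting; too small a k overfits to noise, too large a k blurs class boundaries.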
Algorithm Comparison
| Criterion | Decision Tree | Random Forest | SVM | k-NN |
|---|---|---|---|---|
| Interpretability | Very high | Medium | Low | Medium |
| Training speed | Fast | Medium | Slow with large data | No training |
| Prediction speed | Very fast | Fast | Fast | Slow |
| Overfitting resistance | No | Yes | Yes | Depends on k |
| Needs normalization | No | No | Yes | Yes |
Confusion Matrix
After training a model, how do we evaluate its performance? The confusion matrix reveals error types in detail.
| | Predicted Good | Predicted Defective |
|---|---|---|
| Actual Good | TP = 85 | FN = 5 |
| Actual Defective | FP = 3 | TN = 7 |
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Of those predicted "good," how many are truly good? |
| Recall | TP / (TP + FN) | Of the truly good, how many did we catch? |
| F1-Score | 2 x (P x R) / (P + R) | Balance between precision and recall |
| Accuracy | (TP + TN) / Total | Overall correct prediction rate |
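The formulas above can be checked by hand. A quick sketch using hypothetical counts, with "good" as the positive class:

```python
# Hypothetical confusion-matrix counts ("good" is the positive class)
tp, fn, fp, tn = 85, 5, 3, 7

precision = tp / (tp + fp)                       # 85 / 88
recall = tp / (tp + fn)                          # 85 / 90
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)       # 92 / 100

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```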
```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = forest.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["Defective", "Good"]))
```
When Does Recall Matter More Than Precision?
In industry, passing a defective part to the customer (False Negative) is far worse than rejecting a good part (False Positive). Therefore we want high recall for the defective class -- detecting as many defects as possible even if we reject some good parts.
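In scikit-learn this trade-off can be steered by thresholding predicted probabilities instead of calling `predict`. A sketch on synthetic imbalanced data (the 0.2 threshold and class layout are illustrative assumptions, with class 0 playing the defective class):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: class 0 = defective (rare), class 1 = good
X, y = make_classification(n_samples=500, weights=[0.1, 0.9], random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Default rule flags a part only when P(defective) > 0.5.
# Lowering the threshold to 0.2 flags more parts: we reject more good
# parts (lower precision) but let fewer defects escape (higher recall).
p_defective = clf.predict_proba(X)[:, 0]
flagged = p_defective > 0.2

print(f"Flagged as defective: {flagged.sum()} of {len(y)} parts")
```

Combined with `class_weight='balanced'`, threshold tuning is the standard way to bias a classifier toward catching defects.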
ROC Curve and Area Under Curve (AUC)
The ROC curve plots the relationship between the True Positive Rate (Recall) and the False Positive Rate at different classification thresholds.
- AUC = 1.0: Perfect model
- AUC = 0.5: No better than random guessing
- AUC > 0.9: Excellent for industrial applications
```python
from sklearn.metrics import roc_auc_score

# Need probabilities, not class labels
y_proba = forest.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"AUC: {auc:.3f}")
```
Industrial Applications
Part Classification: Good vs. Defective
A camera on the production line captures an image of every part. A classification algorithm (typically random forest or neural network) extracts features from the image and decides immediately: pass or reject.
Fault Type Identification
Instead of just "fault detected," multi-class classification identifies the type: electrical fault (current spike), mechanical fault (abnormal vibration), thermal fault (overheating). This directs the maintenance team straight to the problem.
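A minimal multi-class sketch of this idea, on hypothetical sensor readings (the current, vibration, and temperature values are invented for illustration; real fault signatures would come from historical maintenance records):

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Hypothetical readings: [motor current (A), vibration (mm/s), temperature (C)]
X = np.array([
    [42, 2.1, 65], [45, 2.3, 68],    # electrical: current spike
    [12, 9.8, 62], [11, 8.9, 60],    # mechanical: abnormal vibration
    [13, 2.2, 98], [12, 2.4, 103],   # thermal: overheating
    [12, 2.0, 63], [13, 2.2, 61],    # normal operation
])
y = np.array(["electrical", "electrical", "mechanical", "mechanical",
              "thermal", "thermal", "normal", "normal"])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# A new reading with a current spike should be routed to the electricians
print(clf.predict([[44, 2.2, 66]]))
```

scikit-learn handles string class labels directly, so the prediction is already the fault type the maintenance team needs.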
Raw Material Sorting
In recycling plants, infrared sensors with an SVM algorithm classify plastic by type (PET, HDPE, PVC) for automatic real-time sorting.
Practical Tips
- Start with random forest -- Excellent performance with minimal tuning.
- Balance the classes -- In industry, defects are rare (1-5%). Use `class_weight='balanced'` or SMOTE.
- Feature importance -- Use it to decide which sensors are worth investing in.
- Do not ignore the confusion matrix -- Overall accuracy alone is misleading with imbalanced classes.
- Normalize data -- Required for SVM and k-NN, optional for trees.
- Cross-validation -- Use 5-fold cross-validation for reliable evaluation.
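The last tip takes one line in scikit-learn. A sketch with `cross_val_score` (on synthetic data standing in for real process measurements):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-feature stand-in for real process data
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Train and evaluate on 5 different train/test splits
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds guards against a single lucky (or unlucky) train/test split.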