AI Fundamentals

Classification Algorithms: Smart Automated Decisions

What Are Classification Algorithms?

Imagine a quality inspector at a plastic parts factory examining each part and deciding: good or defective. That decision relies on observations (color, dimensions, weight) and accumulated experience. Classification algorithms do the same thing -- they learn from labeled examples to categorize new data into predefined classes.

Classification is a supervised machine learning task where a model learns from labeled data and then predicts the class of new data. Classes can be binary (good/defective) or multi-class (electrical/mechanical/thermal fault).

Decision Tree

The simplest and most interpretable classification algorithm. Think of it as a series of yes/no questions leading to a final decision.

Industrial example: Classifying injection-molded parts -- is the part good?

Is mold temperature > 210 °C?
|-- Yes -> Is injection pressure > 80 bar?
|   |-- Yes -> Good (95% confidence)
|   |-- No  -> Defective - internal voids
|-- No  -> Is cooling time > 15 seconds?
    |-- Yes -> Good (88% confidence)
    |-- No  -> Defective - surface warping
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Training data: [mold temp, injection pressure, cooling time]
X = np.array([
    [220, 85, 18], [200, 70, 12], [215, 90, 16],
    [195, 60, 10], [225, 88, 20], [205, 75, 14],
    [210, 82, 17], [190, 55, 9],  [218, 86, 19],
    [198, 65, 11], [222, 92, 21], [208, 78, 15]
])
# Labels: 1 = good, 0 = defective
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1])

# Fix the seed so the split (and results) are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Classify a new part
new_part = [[212, 83, 16]]
result = tree.predict(new_part)
print(f"Classification: {'Good' if result[0] == 1 else 'Defective'}")

Random Forest

The problem with a single tree: it can overfit the training data. The solution? Build hundreds of different trees and take a majority vote.

Imagine 100 quality inspectors with different experience -- each inspects the part and votes. The final decision is the majority opinion. This is far more accurate than relying on a single inspector.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_train, y_train)

# Feature importance
feature_names = ["Mold Temp", "Injection Pressure", "Cooling Time"]
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"  {name}: {importance:.2%}")

Bonus: Random forest provides feature importances -- a ranking of which factors affect quality the most. This helps process engineers focus improvement efforts.

Support Vector Machine (SVM)

SVM finds the optimal separating line (or hyperplane in higher dimensions) between classes, maximizing the margin between the closest points of each class.

Picture a 2D map: the x-axis is vibration and the y-axis is temperature. Red points (faults) and green points (normal) overlap slightly. SVM finds the line that separates them with the largest possible margin.

from sklearn.svm import SVC

svm_model = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_model.fit(X_train, y_train)

prediction = svm_model.predict(new_part)
print(f"SVM classification: {'Good' if prediction[0] == 1 else 'Defective'}")

Kernel Type                    Use Case
Linear                         Data separable by a straight line
Polynomial (poly)              Curved decision boundaries
Radial Basis Function (rbf)    Most flexible -- default choice
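The kernels above can be compared empirically. A minimal sketch, using scikit-learn's synthetic two-moons toy dataset (make_moons, a stand-in for real sensor data) and scoring each kernel with 5-fold cross-validation:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic two-class data with a curved decision boundary (illustrative only)
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Score each kernel with 5-fold cross-validation and keep the mean accuracy
results = {}
for kernel in ["linear", "poly", "rbf"]:
    results[kernel] = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel}: mean accuracy = {results[kernel]:.3f}")
```

On curved data like this, rbf typically outperforms the linear kernel, which is why it is the usual default.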

k-Nearest Neighbors (k-NN)

The simplest algorithm of all: to classify a new point, look at the nearest k training points and choose the most common class among them.

Imagine you are a new engineer at a plant and you see a suspicious part. You ask the 5 nearest expert colleagues -- 4 say "defective" and 1 says "good." Decision: defective.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"k-NN classification: {'Good' if knn.predict(new_part)[0] == 1 else 'Defective'}")

Warning: k-NN is slow with large datasets because it computes the distance to every training point at each prediction.
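One mitigation: scikit-learn's KNeighborsClassifier accepts an algorithm parameter that builds a spatial index instead of brute-force scanning. A sketch, assuming a synthetic 10,000-point dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic sensor readings (illustrative data, not a real process)
rng = np.random.default_rng(0)
X_big = rng.normal(size=(10_000, 3))
y_big = (X_big[:, 0] + X_big[:, 1] > 0).astype(int)

# A k-d tree index answers each neighbor query in roughly O(log n)
# instead of scanning all 10,000 training points.
knn_fast = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn_fast.fit(X_big, y_big)
print(knn_fast.predict([[0.5, 0.5, 0.0]]))
```

Tree indexes help most in low dimensions; with many features their advantage over brute force shrinks.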

Algorithm Comparison

Criterion                Decision Tree   Random Forest   SVM                    k-NN
Interpretability         Very high       Medium          Low                    Medium
Training speed           Fast            Medium          Slow with large data   No training
Prediction speed         Very fast       Fast            Fast                   Slow
Overfitting resistance   No              Yes             Yes                    Depends on k
Needs normalization      No              No              Yes                    Yes
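The normalization row matters in practice: because SVM and k-NN are distance-based, features on larger scales dominate unless you standardize. A minimal sketch wrapping SVM in a scikit-learn Pipeline with StandardScaler, reusing a few of the molding samples from earlier as toy data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy molding data: [mold temp, injection pressure, cooling time]
X = np.array([[220, 85, 18], [200, 70, 12], [215, 90, 16],
              [195, 60, 10], [225, 88, 20], [205, 75, 14]])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = good, 0 = defective

# The scaler is fit on the training data and applied automatically
# before every predict call, so no single feature dominates the distance.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
print(model.predict([[212, 83, 16]]))
```

The pipeline also prevents a subtle bug: scaling statistics are learned from training data only, never from the test set.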

Confusion Matrix

After training a model, how do we evaluate its performance? The confusion matrix reveals error types in detail.

                    Predicted
                  Good     Defective
Actual  Good   |  TP=85  |  FN=5    |
        Defect |  FP=3   |  TN=7    |
Metric     Formula                  Meaning
Precision  TP / (TP + FP)           Of those predicted "good," how many are truly good?
Recall     TP / (TP + FN)           Of the truly good, how many did we catch?
F1-Score   2 x (P x R) / (P + R)    Balance between precision and recall
Accuracy   (TP + TN) / Total        Overall correct prediction rate

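The formulas can be checked by hand. Taking "good" as the positive class with 85 good parts correctly passed (TP), 5 good parts wrongly rejected (FN), 3 defectives wrongly passed (FP), and 7 defectives correctly caught (TN):

```python
# Worked example: positive class = "good"
TP, FN, FP, TN = 85, 5, 3, 7

precision = TP / (TP + FP)                        # 85 / 88
recall = TP / (TP + FN)                           # 85 / 90
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FN + FP + TN)        # 92 / 100

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
print(f"Accuracy:  {accuracy:.3f}")
```

Note how accuracy (0.92) hides the fact that 3 of only 10 defective parts slipped through -- exactly why the confusion matrix matters.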
from sklearn.metrics import classification_report, confusion_matrix

y_pred = forest.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
      target_names=["Defective", "Good"]))

When Does Recall Matter More Than Precision?

In industry, shipping a defective part to the customer is far worse than rejecting a good part. With "defective" treated as the positive class, a shipped defect is a false negative, so we want high recall for the defective class -- detecting as many defects as possible even if we reject some good parts along the way.
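One common lever for trading precision against recall, sketched with toy data: weight the rare defective class more heavily (class_weight='balanced') and raise the probability bar a part must clear before it ships. The 0.8 cutoff below is an illustrative choice, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy molding data: [mold temp, pressure, cooling time]; 1 = good, 0 = defective
X = np.array([[220, 85, 18], [200, 70, 12], [215, 90, 16],
              [195, 60, 10], [225, 88, 20], [205, 75, 14],
              [210, 82, 17], [190, 55, 9]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# class_weight='balanced' penalizes missed defects more when defects are rare
forest = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                random_state=42).fit(X, y)

# Instead of the default 0.5 cutoff, only pass a part when the model is
# very confident it is good -- everything else is flagged for inspection.
proba_good = forest.predict_proba([[207, 76, 14]])[0, 1]
decision = "Good" if proba_good > 0.8 else "Inspect/Reject"
print(f"P(good) = {proba_good:.2f} -> {decision}")
```

Raising the threshold increases recall on defects at the cost of rejecting more good parts -- the exact trade-off the ROC curve in the next section visualizes.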

ROC Curve and Area Under Curve (AUC)

The ROC curve plots the relationship between the True Positive Rate (Recall) and the False Positive Rate at different classification thresholds.

  • AUC = 1.0: Perfect model
  • AUC = 0.5: No better than random guessing
  • AUC > 0.9: Excellent for industrial applications
from sklearn.metrics import roc_auc_score

# Need probabilities, not class labels
y_proba = forest.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"AUC: {auc:.3f}")

Industrial Applications

Part Classification: Good vs. Defective

A camera on the production line captures an image of every part. A classification algorithm (typically random forest or neural network) extracts features from the image and decides immediately: pass or reject.

Fault Type Identification

Instead of just "fault detected," multi-class classification identifies the type: electrical fault (current spike), mechanical fault (abnormal vibration), thermal fault (overheating). This directs the maintenance team straight to the problem.

Raw Material Sorting

In recycling plants, infrared sensors with an SVM algorithm classify plastic by type (PET, HDPE, PVC) for automatic real-time sorting.

Practical Tips

  1. Start with random forest -- Excellent performance with minimal tuning.
  2. Balance the classes -- In industry, defects are rare (1-5%). Use class_weight='balanced' or SMOTE.
  3. Feature importance -- Use it to decide which sensors are worth investing in.
  4. Do not ignore the confusion matrix -- Overall accuracy alone is misleading with imbalanced classes.
  5. Normalize data -- Required for SVM and k-NN, optional for trees.
  6. Cross-validation -- Use 5-fold cross-validation for reliable evaluation.
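Tip 6 in code: a minimal 5-fold cross-validation sketch, reusing the molding dataset from earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy molding data: [mold temp, injection pressure, cooling time]
X = np.array([[220, 85, 18], [200, 70, 12], [215, 90, 16],
              [195, 60, 10], [225, 88, 20], [205, 75, 14],
              [210, 82, 17], [190, 55, 9], [218, 86, 19],
              [198, 65, 11], [222, 92, 21], [208, 78, 15]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1])

# 5-fold CV: train on 4 folds, score on the 5th, rotate, then average --
# a far more reliable estimate than a single train/test split.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
print(f"Fold accuracies: {scores}")
print(f"Mean: {scores.mean():.2f} +/- {scores.std():.2f}")
```

For classifiers, scikit-learn stratifies the folds automatically, so each fold keeps roughly the same good/defective ratio as the full dataset.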
classification decision-tree SVM random-forest confusion-matrix accuracy