Practical Statistics for Engineers: Mean, Deviation, and Distributions in Factory Data
Mean, Median, and Mode
Understanding central tendency is the first step in analyzing any industrial dataset.
import numpy as np
import pandas as pd
from scipy import stats
production = pd.Series([450, 455, 448, 460, 452, 10, 455, 449, 458, 453,
                        451, 456, 447, 459, 454, 452, 450, 457, 448, 453])
print(f"Mean: {production.mean():.1f} units/hour")
print(f"Median: {production.median():.1f} units/hour")
print(f"Mode: {production.mode().values[0]} units/hour")  # .mode() returns all tied values; take the first
Notice the value 10 -- a shutdown event. The mean drops to 430.8, but the median stays at 452.5. In industrial settings where brief outages occur, the median is often more reliable than the mean.
- Mean: When data is clean and symmetric -- average power consumption over a stable shift.
- Median: When outliers or shutdowns are present -- typical cycle time with occasional jams.
- Mode: For categorical data -- most common fault code in a maintenance log.
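A middle ground between mean and median, not used above but worth knowing, is the trimmed mean: discard a fixed fraction of extreme values from each tail before averaging. A minimal sketch on the same production data, using `scipy.stats.trim_mean`:

```python
import numpy as np
from scipy import stats

production = np.array([450, 455, 448, 460, 452, 10, 455, 449, 458, 453,
                       451, 456, 447, 459, 454, 452, 450, 457, 448, 453])

plain_mean = production.mean()               # dragged down by the shutdown reading
trimmed = stats.trim_mean(production, 0.05)  # drop 5% of values from each tail
median = np.median(production)

print(f"Mean:         {plain_mean:.1f}")
print(f"Trimmed mean: {trimmed:.1f}")
print(f"Median:       {median:.1f}")
```

Trimming one value from each end removes the 10 entirely, so the trimmed mean lands near the median while still using most of the data.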
Standard Deviation and Variance
While central tendency tells you the typical value, spread tells you how consistent your process is. In manufacturing, consistency is quality.
machine_a = np.random.normal(50.0, 0.5, 1000) # tight tolerance
machine_b = np.random.normal(50.0, 2.0, 1000) # loose tolerance
print(f"Machine A - Std: {machine_a.std():.3f}")
print(f"Machine B - Std: {machine_b.std():.3f}")
spec_lower, spec_upper = 49.0, 51.0
defect_rate_a = ((machine_a < spec_lower) | (machine_a > spec_upper)).mean()
defect_rate_b = ((machine_b < spec_lower) | (machine_b > spec_upper)).mean()
print(f"Machine A defect rate: {defect_rate_a:.2%}")
print(f"Machine B defect rate: {defect_rate_b:.2%}")
Both machines hit the target of 50.0 on average, but Machine B produces far more out-of-spec parts because of its wider spread.
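Industrial practice often condenses this spread-versus-spec comparison into process capability indices. As a sketch (these indices were not computed above): Cp compares the spec width to six standard deviations of the process, and Cpk additionally penalizes a process whose mean is off-center.

```python
import numpy as np

rng = np.random.default_rng(0)
machine_a = rng.normal(50.0, 0.5, 1000)  # tight tolerance
machine_b = rng.normal(50.0, 2.0, 1000)  # loose tolerance

def capability(samples, lsl, usl):
    """Cp = spec width / process spread; Cpk also accounts for centering."""
    mu, sigma = samples.mean(), samples.std(ddof=1)
    cp = (usl - lsl) / (6 * sigma)
    cpk = min(usl - mu, mu - lsl) / (3 * sigma)
    return cp, cpk

for name, m in [("A", machine_a), ("B", machine_b)]:
    cp, cpk = capability(m, 49.0, 51.0)
    print(f"Machine {name}: Cp = {cp:.2f}, Cpk = {cpk:.2f}")
```

A common rule of thumb treats Cp below 1.0 as incapable of meeting the spec; Machine B's wide spread drives its index far below Machine A's.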
The Normal Distribution and Its Importance in Industry
Many physical measurements follow a normal distribution (bell curve). This emerges from the Central Limit Theorem whenever many small random factors combine.
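A quick numerical sketch of that claim (synthetic uniform noise, not factory data): a single uniform disturbance is flat and clearly non-normal, but the sum of many such disturbances has the near-zero excess kurtosis of a bell curve.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

single = rng.uniform(-1, 1, 100_000)                   # flat distribution
combined = rng.uniform(-1, 1, (100_000, 30)).sum(axis=1)  # sum of 30 small factors

# Excess kurtosis is 0 for a normal distribution, -1.2 for a uniform one.
print(f"single:   excess kurtosis = {stats.kurtosis(single):.2f}")
print(f"combined: excess kurtosis = {stats.kurtosis(combined):.2f}")
```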
import matplotlib.pyplot as plt
measurements = np.random.normal(25.000, 0.015, 5000)
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(measurements, bins=60, density=True, alpha=0.7, label="Measurements")
x = np.linspace(24.93, 25.07, 200)
ax.plot(x, stats.norm.pdf(x, 25.0, 0.015), "r-", lw=2, label="Normal fit")
ax.axvline(24.97, color="orange", linestyle="--", label="Lower spec")
ax.axvline(25.03, color="orange", linestyle="--", label="Upper spec")
ax.set_xlabel("Diameter (mm)")
ax.set_title("Bearing Diameter Distribution vs Specification")
ax.legend()
plt.tight_layout()
plt.show()
Testing for Normality
stat, p_value = stats.shapiro(measurements[:500])  # Shapiro-Wilk suits moderate samples, so test a subsample
print(f"Shapiro-Wilk p-value: {p_value:.4f}")
print("Consistent with normal" if p_value > 0.05 else "Evidence against normality")
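A complementary sanity check, not used above, is the 68-95-99.7 rule: if the data is roughly normal, about 68% of values fall within one standard deviation of the mean, 95% within two, and 99.7% within three. A sketch on comparable synthetic measurements:

```python
import numpy as np

rng = np.random.default_rng(7)
measurements = rng.normal(25.000, 0.015, 5000)

mu, sigma = measurements.mean(), measurements.std()
for k, expected in [(1, 68.3), (2, 95.4), (3, 99.7)]:
    within = (np.abs(measurements - mu) < k * sigma).mean()
    print(f"within {k} sigma: {within:.1%} (theory: {expected}%)")
```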
Correlation: Is There a Relationship Between Two Variables?
In a factory, variables rarely act in isolation. Motor temperature rises with load, vibration increases with wear.
n = 500
load_pct = np.random.uniform(30, 100, n)
motor_temp = 40 + 0.35 * load_pct + np.random.normal(0, 3, n)
vibration = 1.0 + 0.02 * load_pct + np.random.normal(0, 0.5, n)
df = pd.DataFrame({"load_pct": load_pct, "motor_temp": motor_temp,
                   "vibration": vibration})
print(df.corr())
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(df["load_pct"], df["motor_temp"], alpha=0.4, s=10)
axes[0].set_xlabel("Load (%)")
axes[0].set_ylabel("Motor Temp (C)")
axes[0].set_title(f"r = {df['load_pct'].corr(df['motor_temp']):.3f}")
axes[1].scatter(df["load_pct"], df["vibration"], alpha=0.4, s=10)
axes[1].set_xlabel("Load (%)")
axes[1].set_ylabel("Vibration (mm/s)")
axes[1].set_title(f"r = {df['load_pct'].corr(df['vibration']):.3f}")
plt.tight_layout()
plt.show()
Important: Correlation does not imply causation. Two sensors may correlate simply because both respond to the same hidden factor.
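A sketch of that trap with hypothetical sensors (not the dataset above): two sensors that never influence each other correlate strongly because both respond to ambient temperature, and regressing out the hidden driver makes the correlation vanish.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
ambient = rng.normal(22, 5, n)                  # hidden common driver

sensor_1 = 2.0 * ambient + rng.normal(0, 3, n)  # responds to ambient
sensor_2 = -1.5 * ambient + rng.normal(0, 3, n) # also responds to ambient

r = pd.Series(sensor_1).corr(pd.Series(sensor_2))
print(f"sensor_1 vs sensor_2: r = {r:.3f}")

# Control for ambient by correlating the residuals of a linear fit on it.
resid_1 = sensor_1 - np.polyval(np.polyfit(ambient, sensor_1, 1), ambient)
resid_2 = sensor_2 - np.polyval(np.polyfit(ambient, sensor_2, 1), ambient)
r_partial = pd.Series(resid_1).corr(pd.Series(resid_2))
print(f"after removing ambient: r = {r_partial:.3f}")
```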
Hypothesis Testing: Is the Difference Real?
When you change a machine parameter or switch suppliers, you need to know if the observed difference is real or random variation.
before = np.random.normal(50.0, 1.0, 200)
after = np.random.normal(50.3, 0.9, 200)
t_stat, p_value = stats.ttest_ind(before, after)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
print("Significant difference" if p_value < 0.05 else "No significant difference")
Choose the right test: t-test for comparing two group means, Chi-squared for comparing proportions, ANOVA for more than two groups.
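To illustrate the proportions case, here is a sketch with hypothetical defect counts from two suppliers (the counts are invented for illustration), using `scipy.stats.chi2_contingency`:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: defective vs good parts per supplier.
#                    defective  good
observed = np.array([[18,       482],   # supplier X
                     [35,       465]])  # supplier Y

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print("Proportions differ" if p_value < 0.05 else "No evidence of a difference")
```

The test compares the observed counts against the counts expected if both suppliers shared the same defect rate.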
Practical Example: Statistical Analysis of Product Quality Data
Analyze quality data from a plastic injection molding line producing bottle caps.
np.random.seed(42)
n = 1000
shift_a_weight = np.random.normal(2.50, 0.08, n)
shift_b_weight = np.random.normal(2.53, 0.12, n)
df = pd.DataFrame({
    "weight_g": np.concatenate([shift_a_weight, shift_b_weight]),
    "shift": ["A"] * n + ["B"] * n
})
print(df.groupby("shift")["weight_g"].describe())
t_stat, p_value = stats.ttest_ind(shift_a_weight, shift_b_weight)
print(f"\nt-test p-value: {p_value:.6f}")
print("Shifts differ significantly" if p_value < 0.05 else "No significant difference")
for shift, group in df.groupby("shift"):
    out_of_spec = ((group["weight_g"] < 2.35) | (group["weight_g"] > 2.65)).mean()
    print(f"Shift {shift} out-of-spec rate: {out_of_spec:.2%}")
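With 1000 samples per shift, even small differences can come out statistically significant, so it helps to report an effect size alongside the p-value. A sketch (Cohen's d is an addition here, not part of the original analysis) on the same simulated shifts:

```python
import numpy as np

rng = np.random.default_rng(42)
shift_a = rng.normal(2.50, 0.08, 1000)
shift_b = rng.normal(2.53, 0.12, 1000)

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)

d = cohens_d(shift_a, shift_b)
print(f"Cohen's d = {d:.2f}")  # a modest standardized shift despite the tiny p-value
```

A common reading treats d around 0.2 as small and around 0.5 as medium, which puts the practical importance of a "significant" result in context.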
Summary
In this lesson you applied core statistical methods to industrial data. You learned when to use mean, median, and mode, used standard deviation to measure process consistency, verified normal distributions, calculated correlations between sensor variables, and ran hypothesis tests to determine if differences are statistically significant. These techniques form the analytical foundation for every machine learning model. In the next lesson, you will use regression to predict continuous outcomes from your data.