Probability and Reliability of Industrial Systems
What Is Industrial Reliability?
Reliability engineering uses probability and statistics to predict, prevent, and manage equipment failures in industrial systems. In modern manufacturing, maintenance has evolved from reactive "fix it when it breaks" to a data-driven science that forecasts failures and optimizes maintenance schedules.
The core question reliability engineering answers: "What is the probability that this system will perform its intended function without failure for a specified period under stated conditions?"
Fundamental Probability Distributions
Normal Distribution
The normal (Gaussian) distribution is the most widely used model for variation in natural and industrial measurements. Its bell-shaped curve is symmetric about the mean:
f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x-mu)^2 / (2*sigma^2))
Where:
- mu = mean (center of the distribution)
- sigma = standard deviation (measure of spread)
- sigma^2 = variance
The empirical rule (68-95-99.7):
- 68% of values fall within mu +/- 1*sigma
- 95% of values fall within mu +/- 2*sigma
- 99.7% of values fall within mu +/- 3*sigma
Industrial application: Dimensions of CNC-machined parts typically follow an approximately normal distribution when the process is stable. If the target diameter is 50 mm with a standard deviation of 0.02 mm, about 99.7% of parts will fall between 49.94 mm and 50.06 mm (mu +/- 3*sigma). This is the foundation of Statistical Process Control (SPC) and Six Sigma methodology.
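The empirical rule can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the 50 +/- 0.05 mm tolerance band are illustrative assumptions, not values from the text.

```python
import math

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for a normal distribution with mean mu and std dev sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def fraction_within(lower: float, upper: float, mu: float, sigma: float) -> float:
    """Expected fraction of values between lower and upper."""
    return normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)

mu, sigma = 50.0, 0.02  # mm, values from the CNC example

# Reproduce the 68-95-99.7 rule for k = 1, 2, 3 standard deviations
for k in (1, 2, 3):
    share = fraction_within(mu - k * sigma, mu + k * sigma, mu, sigma)
    print(f"within mu +/- {k}*sigma: {share:.4f}")

# Hypothetical tolerance band of 50 +/- 0.05 mm (an assumption, not from the text)
in_spec = fraction_within(49.95, 50.05, mu, sigma)
print(f"within 49.95..50.05 mm: {in_spec:.4f}")
```

The same `fraction_within` call can estimate the expected scrap rate for any tolerance band, provided the normality assumption holds for the process in question.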
Exponential Distribution
The exponential distribution models the time between random failures — failures that occur independently of equipment age:
f(t) = lambda * e^(-lambda*t)
R(t) = e^(-lambda*t)
Where:
- lambda = failure rate (failures per unit time)
- R(t) = reliability function: probability of surviving without failure until time t
- MTTF = 1/lambda = mean time to failure
Key property: Memoryless — the probability of failure in the next hour is the same regardless of how long the equipment has been running. This models the middle phase of equipment life (random failures).
Example: If a pump has failure rate lambda = 0.001 failures/hour:
- MTTF = 1/0.001 = 1000 hours
- Probability of surviving 500 hours: R(500) = e^(-0.001*500) = e^(-0.5) = 0.607, or 60.7%
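The pump example and the memoryless property can be reproduced in a few lines. This is a small sketch; the 2000-hour running time used for the conditional check is an arbitrary assumption chosen for illustration.

```python
import math

def reliability_exp(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t): probability of surviving to time t."""
    return math.exp(-failure_rate * t)

failure_rate = 0.001  # failures per hour, from the pump example
mttf = 1.0 / failure_rate
r_500 = reliability_exp(500.0, failure_rate)

# Memoryless check: probability of surviving 500 more hours,
# given the pump has already run 2000 hours (arbitrary choice)
r_conditional = (reliability_exp(2500.0, failure_rate)
                 / reliability_exp(2000.0, failure_rate))

print(f"MTTF = {mttf:.0f} h")
print(f"R(500) = {r_500:.3f}")
print(f"R(500 more | survived 2000) = {r_conditional:.3f}")
print(f"R(MTTF) = {reliability_exp(mttf, failure_rate):.3f}")
```

The last line is a useful reminder that MTTF is not a guaranteed life: under this model only about 37% of units survive to the MTTF.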
Weibull Distribution
The Weibull distribution is the most important in reliability engineering because it can model different failure patterns by varying a single parameter:
f(t) = (beta/eta) * (t/eta)^(beta-1) * e^(-(t/eta)^beta)
R(t) = e^(-(t/eta)^beta)
Where:
- beta = shape parameter (determines the failure pattern)
- eta = scale parameter (characteristic life: the time at which 63.2% of units have failed)
The shape parameter tells the whole story:
| beta Value | Failure Pattern | Industrial Meaning |
|---|---|---|
| beta < 1 | Decreasing failure rate | Infant mortality (manufacturing defects) |
| beta = 1 | Constant failure rate | Random failures (= exponential distribution) |
| beta > 1 | Increasing failure rate | Wear-out and aging |
| beta ~ 2 | Linearly increasing failure rate | Linear wear (belts, seals) |
| beta ~ 3.5 | Increasing failure rate, nearly symmetric distribution | Approximates normal distribution (bearings) |
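To see how the shape parameter drives the failure pattern, it helps to evaluate the Weibull failure rate h(t) = f(t)/R(t) = (beta/eta) * (t/eta)^(beta-1) for a few values of beta. The sketch below does this; the characteristic life of 1000 hours and the helper names are assumptions made for illustration.

```python
import math

def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Failure rate h(t) = f(t) / R(t) = (beta/eta) * (t/eta)^(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def hazard_trend(beta: float, eta: float) -> str:
    """Compare the failure rate early and late in life."""
    early = weibull_hazard(0.1 * eta, beta, eta)
    late = weibull_hazard(2.0 * eta, beta, eta)
    if math.isclose(early, late):
        return "constant"
    return "increasing" if late > early else "decreasing"

eta = 1000.0  # hypothetical characteristic life in hours
for beta in (0.5, 1.0, 2.0, 3.5):
    failed_at_eta = 1.0 - weibull_reliability(eta, beta, eta)
    print(f"beta = {beta}: {hazard_trend(beta, eta)} failure rate, "
          f"{failed_at_eta:.1%} failed by t = eta")
```

Whatever the value of beta, the fraction failed at t = eta comes out at 63.2%, which is why eta is called the characteristic life.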
The Bathtub Curve: Equipment Life Story
The bathtub curve describes how the failure rate changes over an equipment's lifetime:
Failure rate lambda(t)
|
|\                                        /
| \                                      /
|  \                                    /
|   \__________________________________/
|
+-------------------------------------------> Time
  Infant mortality    Useful life     Wear-out
  (beta < 1)          (beta = 1)      (beta > 1)
Phase 1 — Infant mortality: High but decreasing failure rate. Causes: manufacturing defects, installation errors, material flaws. Solution: Quality inspection, burn-in testing.
Phase 2 — Useful life: Low, constant failure rate. Failures are random — shocks, operator errors, unexpected conditions. Solution: Corrective maintenance and spare parts inventory.
Phase 3 — Wear-out: Increasing failure rate. Causes: mechanical wear, material fatigue, insulation degradation. Solution: Scheduled preventive maintenance or replacement before entering this phase.
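One common way to reproduce a bathtub shape numerically is to treat the three phases as independent failure modes, each with its own Weibull failure rate, and add the rates. The sketch below takes that approach; it is one possible model rather than the only one, and every parameter value in it is hypothetical.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Weibull failure rate h(t) = (beta/eta) * (t/eta)^(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

# Hypothetical (beta, eta in hours) for three independent failure modes
MODES = {
    "infant mortality": (0.5, 50_000.0),  # beta < 1: decreasing rate
    "random": (1.0, 10_000.0),            # beta = 1: constant rate
    "wear-out": (4.0, 8_000.0),           # beta > 1: increasing rate
}

def bathtub_hazard(t: float) -> float:
    """Total failure rate: the rates of independent failure modes add."""
    return sum(weibull_hazard(t, beta, eta) for beta, eta in MODES.values())

for t in (10.0, 100.0, 1_000.0, 3_000.0, 6_000.0, 9_000.0):
    print(f"t = {t:>6.0f} h   lambda(t) = {bathtub_hazard(t):.2e} per hour")
```

With these assumed parameters the printed rate starts high, flattens out in mid-life, and rises again late in life, mirroring the three phases described above.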
Key Reliability Metrics
MTBF — Mean Time Between Failures
MTBF = Total operating time / Number of failures
Example: A motor ran for 8,760 hours (one year) and failed 4 times:
MTBF = 8760/4 = 2190 hours
MTTR — Mean Time To Repair
MTTR = Total repair time / Number of repairs
Example: The four repairs took 2 + 5 + 3 + 6 = 16 hours:
MTTR = 16/4 = 4 hours
Availability
A = MTBF / (MTBF + MTTR)
For our example: A = 2190 / (2190 + 4) = 0.9982 or 99.82% — excellent.
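The three metrics follow directly from a failure log. Here is a minimal sketch using the motor example's figures; the variable names are illustrative, and the downtime figure simply restates the availability as hours per year.

```python
operating_hours = 8760.0             # one year, from the motor example
repair_hours = [2.0, 5.0, 3.0, 6.0]  # one entry per failure

failures = len(repair_hours)
mtbf = operating_hours / failures            # mean time between failures
mttr = sum(repair_hours) / failures          # mean time to repair
availability = mtbf / (mtbf + mttr)

# Expected downtime over a year at this availability
downtime_per_year = (1.0 - availability) * 8760.0

print(f"MTBF = {mtbf:.0f} h")
print(f"MTTR = {mttr:.0f} h")
print(f"Availability = {availability:.4f}")
print(f"Expected downtime = {downtime_per_year:.1f} h/year")
```

Because availability depends on both terms, halving MTTR improves it about as much as doubling MTBF, which is often the cheaper route in practice.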
Industry benchmarks:
| Equipment Type | Required Availability | Typical MTBF |
|---|---|---|
| Industrial pump | > 95% | 5,000 - 20,000 hours |
| Electric motor | > 98% | 30,000 - 100,000 hours |
| PLC / Controller | > 99.9% | 100,000+ hours |
| Safety system | > 99.99% | 1,000,000+ hours |
System Reliability for Combined Systems
Series System
If any component fails, the entire system fails — like a chain breaking at its weakest link:
R_system = R1 * R2 * R3 * ... * Rn
Example: A production line with 3 machines, each at 0.95 reliability:
R = 0.95^3 = 0.857 — total reliability dropped to 85.7%.
Lesson: As the number of series components increases, total reliability drops rapidly.
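A short sketch makes the decay visible. It assumes component failures are independent, which the product formula requires; the component counts beyond three are illustrative.

```python
import math

def series_reliability(reliabilities: list[float]) -> float:
    """Reliability of a series system with independent components."""
    return math.prod(reliabilities)

# Production line from the example: three machines at 0.95 each
print(f"3 machines: {series_reliability([0.95] * 3):.3f}")

# How reliability falls as more 0.95 components are chained (illustrative counts)
for n in (1, 3, 5, 10, 20):
    print(f"n = {n:>2}: R = {series_reliability([0.95] * n):.3f}")
```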
Parallel System (Redundancy)
The system operates as long as at least one component functions:
R_system = 1 - (1-R1) * (1-R2) * ... * (1-Rn)
Example: Two parallel pumps, each at 0.90 reliability:
R = 1 - (1-0.90)^2 = 1 - 0.01 = 0.99 — a jump from 90% to 99%.
Industrial application: This is why factories use standby pumps — not a luxury, but a mathematical necessity to achieve acceptable reliability.
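The parallel formula can also be turned around to ask how many redundant units a target reliability requires. The sketch below assumes independent failures with no common cause (shared power supply, shared piping), which real installations only approximate; the 0.995 target is a hypothetical value.

```python
import math

def parallel_reliability(reliabilities: list[float]) -> float:
    """Reliability of a parallel system with independent components."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

def units_needed(unit_reliability: float, target: float) -> int:
    """Smallest number of identical parallel units that meets the target."""
    if not (0.0 < unit_reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    n = 1
    while parallel_reliability([unit_reliability] * n) < target:
        n += 1
    return n

# Two pumps at 0.90 each, as in the example
print(f"2 pumps: {parallel_reliability([0.90, 0.90]):.2f}")

# Hypothetical target of 0.995 with the same pumps
print(f"pumps needed for 0.995: {units_needed(0.90, 0.995)}")
```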
Mixed Systems
In practice, most industrial systems are combinations of series and parallel configurations. The approach: decompose the system into series and parallel subsystems, compute each subsystem's reliability, then combine.
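The decomposition approach maps naturally onto two small helper functions that can be nested. The layout below (a redundant pump pair feeding a filter and a motor in series) and its reliability values are hypothetical, chosen only to illustrate the procedure, and independent failures are assumed throughout.

```python
import math

def series(*reliabilities: float) -> float:
    """All blocks must work: multiply reliabilities."""
    return math.prod(reliabilities)

def parallel(*reliabilities: float) -> float:
    """At least one block must work: 1 minus product of failure probabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Hypothetical line: two redundant pumps -> filter -> motor
pump_stage = parallel(0.90, 0.90)        # parallel subsystem first
line = series(pump_stage, 0.98, 0.95)    # then combine in series

print(f"pump stage: {pump_stage:.4f}")
print(f"whole line: {line:.4f}")
```

Working from the innermost subsystem outward keeps each step to one of the two basic formulas.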
Estimating Failure Parameters from Field Data
In the plant, you collect failure data and analyze it statistically:
- Data collection: Record the date, time, type, and repair duration of every failure
- Rank failure times: Sort failure times in ascending order
- Estimate Weibull parameters: Using Weibull probability paper or Maximum Likelihood Estimation (MLE)
- Predict: Use the fitted distribution to calculate reliability at any future time
Approximate Weibull parameter estimation — median rank method:
- Sort failure times: t1 <= t2 <= ... <= tn
- Compute cumulative probability using the median rank approximation: F(ti) ≈ (i - 0.3) / (n + 0.4)
- Plot ln(ln(1/(1-F))) versus ln(t): the slope is beta, and the intercept c = -beta*ln(eta) gives eta = e^(-c/beta)
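Practical example: You recorded the failure times of 10 bearings (in hours): 1200, 1500, 1800, 2000, 2100, 2400, 2600, 2800, 3100, 3500. A median rank regression on these values gives approximately beta ≈ 3.35 (wear-out) and eta ≈ 2560 hours. With these parameters R(2000) ≈ 0.65, so preventive replacement at 2000 hours pre-empts roughly two-thirds of in-service failures; a shorter interval is needed if a higher survival probability is required.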
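The median rank method can be sketched as an ordinary least-squares line fit through the points (ln t, ln(ln(1/(1-F)))). The function below is one straightforward way to do this with the standard library; the function name is an assumption. Least-squares estimates depend on the fitting method, so they can differ from values read off probability paper or obtained by MLE.

```python
import math

def fit_weibull_median_rank(failure_times: list[float]) -> tuple[float, float]:
    """Estimate (beta, eta) by regressing ln(-ln(1-F)) on ln(t)."""
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)                # median rank approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))  # same as ln(ln(1/(1-F)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx                             # slope of the fitted line
    intercept = y_mean - beta * x_mean           # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta

# Bearing failure times in hours, from the practical example
bearing_hours = [1200, 1500, 1800, 2000, 2100, 2400, 2600, 2800, 3100, 3500]
beta, eta = fit_weibull_median_rank(bearing_hours)
r_2000 = math.exp(-((2000.0 / eta) ** beta))

print(f"beta = {beta:.2f}, eta = {eta:.0f} h")
print(f"R(2000) = {r_2000:.3f}")
```

With only ten data points the estimates carry considerable uncertainty, so they are better treated as a starting point for setting a replacement interval than as exact values.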