Probability and Reliability of Industrial Systems

What Is Industrial Reliability?

Reliability engineering uses probability and statistics to predict, prevent, and manage equipment failures in industrial systems. In modern manufacturing, maintenance has evolved from reactive "fix it when it breaks" to a data-driven science that forecasts failures and optimizes maintenance schedules.

The core question reliability engineering answers: "What is the probability that this system will perform its intended function without failure for a specified period under stated conditions?"

Fundamental Probability Distributions

Normal Distribution

The normal (Gaussian) distribution is the most widely used model for measurement variation in nature and industry. Its bell-shaped curve is symmetric about the mean:

f(x) = (1 / (sigma * sqrt(2*pi))) * e^(-(x-mu)^2 / (2*sigma^2))

Where:

mu = mean (center of the distribution)
sigma = standard deviation (spread measure)
sigma^2 = variance

The empirical rule (68-95-99.7):

68% of values fall within mu +/- 1*sigma
95% of values fall within mu +/- 2*sigma
99.7% of values fall within mu +/- 3*sigma

Industrial application: Dimensions of CNC-machined parts from a stable process often follow an approximately normal distribution. If the target diameter is 50mm with a standard deviation of 0.02mm, then about 99.7% of parts will fall within mu +/- 3*sigma, that is, between 49.94mm and 50.06mm. This is the foundation of Statistical Process Control (SPC) and Six Sigma methodology.
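The empirical rule can be checked numerically. The following is a minimal sketch that uses only the standard library; the helper names normal_cdf and fraction_within are illustrative, and the values of mu and sigma are those of the CNC example.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def fraction_within(lower, upper, mu, sigma):
    """Expected fraction of values between two limits."""
    return normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)

mu, sigma = 50.0, 0.02  # target diameter and standard deviation in mm (CNC example)

for k in (1, 2, 3):
    share = fraction_within(mu - k * sigma, mu + k * sigma, mu, sigma)
    print(f"within mu +/- {k}*sigma: {share:.4f}")
```

The three printed shares should be close to 68%, 95% and 99.7%. The same fraction_within call can be used with real tolerance limits instead of multiples of sigma.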

Exponential Distribution

The exponential distribution models the time between random failures — failures that occur independently of equipment age:

f(t) = lambda * e^(-lambda*t)

R(t) = e^(-lambda*t)

Where:

lambda = failure rate (failures per unit time)
R(t) = reliability function — probability of surviving without failure until time t
MTTF = 1/lambda = mean time to failure

Key property: Memoryless — the probability of failure in the next hour is the same regardless of how long the equipment has been running. This models the middle phase of equipment life (random failures).

Example: If a pump has failure rate lambda = 0.001 failures/hour:

MTTF = 1/0.001 = 1000 hours
Probability of surviving 500 hours: R(500) = e^(-0.001*500) = e^(-0.5) = 0.607 or 60.7%
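The pump example and the memoryless property can be reproduced in a few lines. This is a sketch under the constant failure rate assumption; the function name reliability_exponential and the 2000-hour age used for the memoryless check are illustrative choices, not values from a real pump.

```python
import math

def reliability_exponential(t, failure_rate):
    """R(t) = e^(-lambda*t): probability of surviving to time t."""
    return math.exp(-failure_rate * t)

failure_rate = 0.001  # failures per hour (pump example)
mttf = 1.0 / failure_rate
r_500 = reliability_exponential(500, failure_rate)

# Memoryless check: the chance of surviving 500 more hours, given that the
# pump has already survived 2000 hours (an arbitrary age), is R(2500) / R(2000).
r_conditional = (reliability_exponential(2500, failure_rate)
                 / reliability_exponential(2000, failure_rate))

print(f"MTTF = {mttf:.0f} h")
print(f"R(500) = {r_500:.3f}")
print(f"R(500 more | survived 2000) = {r_conditional:.3f}")
```

The last two printed values agree, which is exactly what memoryless means: the age of the pump does not change its chance of surviving the next 500 hours.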

Weibull Distribution

The Weibull distribution is the most important in reliability engineering because it can model different failure patterns by varying a single parameter:

f(t) = (beta/eta) * (t/eta)^(beta-1) * e^(-(t/eta)^beta)

R(t) = e^(-(t/eta)^beta)

Where:

beta = shape parameter (determines the failure pattern)
eta = scale parameter (characteristic life — the time at which 63.2% have failed)

The shape parameter determines the failure pattern:

beta Value	Failure Pattern	Industrial Meaning
beta < 1	Decreasing failure rate	Infant mortality (manufacturing defects)
beta = 1	Constant failure rate	Random failures (= exponential distribution)
beta > 1	Increasing failure rate	Wear-out and aging
beta ~ 2	Linearly increasing failure rate (Rayleigh distribution)	Linear wear (belts, seals)
beta ~ 3.5	Steeply increasing failure rate	Approximates normal distribution (bearings)
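The effect of beta is easiest to see through the failure rate (hazard) function, h(t) = f(t) / R(t) = (beta/eta) * (t/eta)^(beta-1), which follows from the two formulas above. The sketch below evaluates it early and late in life for several shape values; eta = 1000 hours is a hypothetical value chosen only for illustration.

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) = e^(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """Failure rate h(t) = f(t) / R(t) = (beta/eta) * (t/eta)^(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 1000.0  # hypothetical characteristic life in hours

for beta in (0.5, 1.0, 2.0, 3.5):
    early = weibull_hazard(100.0, beta, eta)
    late = weibull_hazard(900.0, beta, eta)
    trend = "decreasing" if late < early else "constant" if late == early else "increasing"
    print(f"beta={beta}: h(100)={early:.5f}, h(900)={late:.5f} -> {trend}")

# At t = eta the reliability is e^(-1) for any beta, so about 63.2% have failed.
print(f"fraction failed at t = eta: {1.0 - weibull_reliability(eta, 2.0, eta):.3f}")
```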

The Bathtub Curve: Equipment Life Story

The bathtub curve describes how the failure rate changes over an equipment's lifetime:

Failure Rate lambda(t)
    |
    |\                                /
    | \                              /
    |  \                            /
    |   \__________________________/
    |
    +--------------------------------------> Time
     Infant        Useful Life      Wear-out
     Mortality     (beta = 1)       (beta > 1)
     (beta < 1)

Phase 1 — Infant mortality: High but decreasing failure rate. Causes: manufacturing defects, installation errors, material flaws. Solution: Quality inspection, burn-in testing.

Phase 2 — Useful life: Low, constant failure rate. Failures are random — shocks, operator errors, unexpected conditions. Solution: Corrective maintenance and spare parts inventory.

Phase 3 — Wear-out: Increasing failure rate. Causes: mechanical wear, material fatigue, insulation degradation. Solution: Scheduled preventive maintenance or replacement before entering this phase.

Key Reliability Metrics

MTBF — Mean Time Between Failures

MTBF = Total operating time / Number of failures

Example: A motor logged 8,760 operating hours (the equivalent of one year of continuous running) and failed 4 times: MTBF = 8760/4 = 2190 hours

MTTR — Mean Time To Repair

MTTR = Total repair time / Number of repairs

Example: The four repairs took 2 + 5 + 3 + 6 = 16 hours: MTTR = 16/4 = 4 hours

Availability

A = MTBF / (MTBF + MTTR)

For our example: A = 2190 / (2190 + 4) = 0.9982 or 99.82%, which corresponds to the 16 hours of repair downtime accumulated alongside 8,760 operating hours.
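The three metrics can be computed directly from a maintenance log. This is a minimal sketch using the motor figures from the examples; the variable names are illustrative, and a real log would usually come from a CMMS export rather than a hard-coded list.

```python
operating_hours = 8760.0              # total operating time (motor example)
repair_hours = [2.0, 5.0, 3.0, 6.0]   # duration of each of the four repairs

n_failures = len(repair_hours)
mtbf = operating_hours / n_failures    # mean time between failures
mttr = sum(repair_hours) / n_failures  # mean time to repair
availability = mtbf / (mtbf + mttr)

print(f"MTBF = {mtbf:.0f} h, MTTR = {mttr:.0f} h, A = {availability:.4f}")
```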

Indicative industry benchmarks (actual requirements vary by plant and application):

Equipment Type	Required Availability	Typical MTBF
Industrial pump	> 95%	5,000 - 20,000 hours
Electric motor	> 98%	30,000 - 100,000 hours
PLC / Controller	> 99.9%	100,000+ hours
Safety system	> 99.99%	1,000,000+ hours

System Reliability for Combined Systems

Series System

If any component fails, the entire system fails — like a chain breaking at its weakest link:

R_system = R1 * R2 * R3 * ... * Rn

Example: A production line with 3 machines, each at 0.95 reliability: R = 0.95^3 = 0.857 — total reliability dropped to 85.7%.

Lesson: As the number of series components increases, total reliability drops rapidly.
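The drop can be tabulated with a short sketch. It assumes that component failures are independent, which the product formula requires; the function name series_reliability and the component counts are illustrative.

```python
import math

def series_reliability(reliabilities):
    """Product of component reliabilities (independent failures assumed)."""
    return math.prod(reliabilities)

# Three machines at 0.95 each, as in the production line example.
print(f"3 machines: {series_reliability([0.95, 0.95, 0.95]):.3f}")

# The same component reliability with longer chains.
for n in (1, 3, 5, 10, 20):
    print(f"n={n:2d}: R_system = {series_reliability([0.95] * n):.3f}")
```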

Parallel System (Redundancy)

The system operates as long as at least one component functions:

R_system = 1 - (1-R1) * (1-R2) * ... * (1-Rn)

Example: Two parallel pumps, each at 0.90 reliability: R = 1 - (1-0.90)^2 = 1 - 0.01 = 0.99 — a jump from 90% to 99%.

Industrial application: This is why factories use standby pumps — not a luxury, but a mathematical necessity to achieve acceptable reliability.
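A related design question is how many redundant units are needed to reach a target. The sketch below assumes independent failures and identical units; units_needed is a hypothetical helper, and the 0.995 target is an arbitrary illustration.

```python
import math

def parallel_reliability(reliabilities):
    """1 minus the probability that every unit fails (independent failures assumed)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

def units_needed(unit_reliability, target):
    """Smallest number of identical parallel units that reaches the target."""
    if not (0.0 < unit_reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("unit_reliability and target must lie strictly between 0 and 1")
    n = 1
    while parallel_reliability([unit_reliability] * n) < target:
        n += 1
    return n

print(f"two pumps at 0.90: {parallel_reliability([0.90, 0.90]):.2f}")
print(f"pumps needed for 0.995: {units_needed(0.90, 0.995)}")
```

In practice, standby units often share a power supply, piping, or control system, so real redundancy gains tend to be smaller than this independent-failure estimate.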

Mixed Systems

In practice, most industrial systems are combinations of series and parallel configurations. The approach: decompose the system into series and parallel subsystems, compute each subsystem's reliability, then combine.
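The decomposition approach can be sketched by nesting two small helpers. The layout here is hypothetical: a station of two redundant pumps (0.90 each) feeding a line of three machines in series (0.95 each), with independent failures assumed throughout.

```python
import math

def series(*blocks):
    """Reliability of blocks that must all work."""
    return math.prod(blocks)

def parallel(*blocks):
    """Reliability of blocks where at least one must work."""
    return 1.0 - math.prod(1.0 - r for r in blocks)

pump_station = parallel(0.90, 0.90)      # redundant pumps
machine_line = series(0.95, 0.95, 0.95)  # three machines in a row
plant = series(pump_station, machine_line)

print(f"pump station: {pump_station:.3f}")
print(f"machine line: {machine_line:.3f}")
print(f"whole plant:  {plant:.3f}")
```

Because each helper returns a single reliability value, subsystems can be nested to any depth, which mirrors the decompose-then-combine procedure described above.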

Estimating Failure Parameters from Field Data

In the plant, you collect failure data and analyze it statistically:

Data collection: Record the date, time, type, and repair duration of every failure
Rank failure times: Sort failure times in ascending order
Estimate Weibull parameters: Using Weibull probability paper or Maximum Likelihood Estimation (MLE)
Predict: Use the fitted distribution to calculate reliability at any future time

Approximate Weibull parameter estimation — median rank method:

Sort failure times: t1 <= t2 <= ... <= tn
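Compute cumulative probability with Benard's median rank approximation: F(ti) ≈ (i - 0.3) / (n + 0.4)
Plot ln(ln(1/(1-F))) versus ln(t) and fit a straight line: the slope is beta, and because the intercept equals -beta * ln(eta), eta = e^(-intercept/beta)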

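Practical example: You recorded failures of 10 bearings (in hours): 1200, 1500, 1800, 2000, 2100, 2400, 2600, 2800, 3100, 3500. A median rank regression on these data yields approximately beta = 3.35 (wear-out) and eta = 2,560 hours; other estimation methods such as MLE give somewhat different values. With these parameters R(2000) is about 0.65, so preventive replacement at 2000 hours pre-empts roughly two thirds of the failures, while about 35% of bearings would still fail before reaching it. A shorter interval is needed if fewer unexpected failures can be tolerated.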
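The median rank steps can be sketched as follows for the bearing data. This is one possible implementation, not the only one: it fits the line by ordinary least squares of ln(ln(1/(1-F))) on ln(t), and the function name weibull_median_rank_fit is illustrative. The fitted values depend on the regression convention and can differ from values read off probability paper or obtained by MLE.

```python
import math

def weibull_median_rank_fit(failure_times):
    """Estimate (beta, eta) by least squares on the linearized Weibull plot."""
    times = sorted(failure_times)
    n = len(times)
    xs = [math.log(t) for t in times]
    ys = []
    for i in range(1, n + 1):
        f = (i - 0.3) / (n + 0.4)  # Benard's median rank approximation
        ys.append(math.log(math.log(1.0 / (1.0 - f))))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the fitted line
    intercept = y_mean - beta * x_mean    # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta

hours = [1200, 1500, 1800, 2000, 2100, 2400, 2600, 2800, 3100, 3500]
beta, eta = weibull_median_rank_fit(hours)
r_2000 = math.exp(-((2000 / eta) ** beta))  # reliability at a 2000-hour interval

print(f"beta = {beta:.2f}, eta = {eta:.0f} h, R(2000) = {r_2000:.2f}")
```

Changing the 2000 in the r_2000 line to other candidate intervals shows how the share of bearings surviving to replacement changes, which is a simple way to compare replacement policies.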