Exploring Industrial Data: Reading, Cleaning, and Understanding Sensor Data
Reading Sensor Data From CSV
Every industrial ML project starts with data, and that data usually lives in CSV files exported from SCADA systems, PLCs, or data historians.
import pandas as pd
import numpy as np
df = pd.read_csv("production_line_march.csv", parse_dates=["timestamp"])
df.set_index("timestamp", inplace=True)
print(f"Dataset shape: {df.shape}")
print(f"Time range: {df.index.min()} to {df.index.max()}")
Factory data often has quirks: semicolons as delimiters, comma decimal separators, or encoding problems from legacy systems.
df = pd.read_csv("plc_export.csv", sep=";", decimal=",", encoding="latin-1")
Exploring Data: describe and info
Before any analysis, understand the structure and quality of your dataset.
print(df.info())
print(df.describe())
The describe() output reveals minimum and maximum values that may indicate sensor malfunctions, while info() shows how many non-null entries exist per column. A column with 720 entries when others have 744 signals data loss during a specific period.
print(df["machine_state"].value_counts())
for col in df.select_dtypes(include=[np.number]).columns:
if df[col].std() == 0:
print(f"WARNING: {col} has zero variance")
Handling Missing Values
Missing data is the norm in industrial settings. Sensors go offline, communication drops, and maintenance windows create gaps.
print(df.isnull().sum())
# Forward fill: carry last known value (good for slow-changing sensors)
df["coolant_temp"] = df["coolant_temp"].ffill()
# Interpolation: estimate between known points (good for continuous signals)
df["vibration"] = df["vibration"].interpolate(method="linear")
# Drop rows only when critical columns are missing
df.dropna(subset=["motor_current", "spindle_speed"], inplace=True)
Detecting and Cleaning Outliers
A temperature sensor reading -999 or 9999 is not a real measurement -- it is an error code. Outliers like these ruin models if left uncleaned.
The IQR Method
def remove_outliers_iqr(series, factor=1.5):
q1 = series.quantile(0.25)
q3 = series.quantile(0.75)
iqr = q3 - q1
lower = q1 - factor * iqr
upper = q3 + factor * iqr
return series.clip(lower, upper)
for col in df.select_dtypes(include=[np.number]).columns:
df[col] = remove_outliers_iqr(df[col])
Z-Score Method
from scipy import stats
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
outlier_mask = (z_scores > 3).any(axis=1)
print(f"Rows with extreme outliers: {outlier_mask.sum()}")
Visualization: matplotlib and seaborn
Visualization is not decoration -- it is how you discover patterns and validate cleaning steps.
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
df["motor_temp"].plot(ax=axes[0, 0], title="Motor Temperature Over Time")
df["vibration"].hist(ax=axes[0, 1], bins=50)
axes[0, 1].set_title("Vibration Distribution")
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm",
ax=axes[1, 0], fmt=".2f")
axes[1, 0].set_title("Sensor Correlation Matrix")
df[["motor_temp", "bearing_temp", "coolant_temp"]].boxplot(ax=axes[1, 1])
axes[1, 1].set_title("Temperature Sensors Comparison")
plt.tight_layout()
plt.show()
Practical Example: Cleaning a Full Month of Production Line Data
Apply every technique to a realistic dataset from a bottling plant.
import pandas as pd
import numpy as np
np.random.seed(42)
n = 744
timestamps = pd.date_range("2025-03-01", periods=n, freq="h")
df = pd.DataFrame({
"timestamp": timestamps,
"fill_volume_ml": np.random.normal(500, 5, n),
"line_speed_bpm": np.random.normal(200, 15, n),
"pressure_bar": np.random.normal(3.2, 0.3, n),
"reject_rate_pct": np.random.exponential(2, n)
})
df.loc[100:110, "fill_volume_ml"] = np.nan
df.loc[300, "pressure_bar"] = -999.0
df.loc[500, "line_speed_bpm"] = 9999.0
df.set_index("timestamp", inplace=True)
print("=== BEFORE CLEANING ===")
print(f"Missing values:\n{df.isnull().sum()}")
df.replace([-999.0, 9999.0], np.nan, inplace=True)
df.interpolate(method="linear", inplace=True)
for col in df.columns:
df[col] = remove_outliers_iqr(df[col])
print("\n=== AFTER CLEANING ===")
print(f"Missing values:\n{df.isnull().sum()}")
df.to_csv("production_line_march_clean.csv")
print("Cleaned data saved.")
Summary
In this lesson you learned the complete data preparation pipeline for industrial sensor data. You loaded CSV files with various formatting issues, explored datasets with describe() and info(), handled missing values using forward fill and interpolation, detected and removed outliers with IQR and Z-score methods, and created visualizations to verify your cleaning steps. Clean data is the foundation of every successful ML model. In the next lesson, you will apply statistical methods to extract deeper insights from your cleaned data.