Exploring Industrial Data: Reading, Cleaning, and Understanding Sensor Data

Reading Sensor Data From CSV

Every industrial ML project starts with data, and that data usually lives in CSV files exported from SCADA systems, PLCs, or data historians.

import pandas as pd
import numpy as np

df = pd.read_csv("production_line_march.csv", parse_dates=["timestamp"])
df.set_index("timestamp", inplace=True)

print(f"Dataset shape: {df.shape}")
print(f"Time range: {df.index.min()} to {df.index.max()}")

Factory data often has quirks: semicolons as delimiters, comma decimal separators, or encoding problems from legacy systems.

df = pd.read_csv("plc_export.csv", sep=";", decimal=",", encoding="latin-1")

Exploring Data: describe and info

Before any analysis, understand the structure and quality of your dataset.

print(df.info())
print(df.describe())

The describe() output reveals minimum and maximum values that may indicate sensor malfunctions, while info() shows how many non-null entries exist per column. A column with 720 entries when others have 744 signals data loss during a specific period.

print(df["machine_state"].value_counts())

for col in df.select_dtypes(include=[np.number]).columns:
    if df[col].std() == 0:
        print(f"WARNING: {col} has zero variance")

Handling Missing Values

Missing data is the norm in industrial settings. Sensors go offline, communication drops, and maintenance windows create gaps.

print(df.isnull().sum())

# Forward fill: carry last known value (good for slow-changing sensors)
df["coolant_temp"] = df["coolant_temp"].ffill()

# Interpolation: estimate between known points (good for continuous signals)
df["vibration"] = df["vibration"].interpolate(method="linear")

# Drop rows only when critical columns are missing
df.dropna(subset=["motor_current", "spindle_speed"], inplace=True)

Detecting and Cleaning Outliers

A temperature sensor reading -999 or 9999 is not a real measurement -- it is an error code. Outliers like these ruin models if left uncleaned.

The IQR Method

def remove_outliers_iqr(series, factor=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return series.clip(lower, upper)

for col in df.select_dtypes(include=[np.number]).columns:
    df[col] = remove_outliers_iqr(df[col])

Z-Score Method

from scipy import stats

z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))
outlier_mask = (z_scores > 3).any(axis=1)
print(f"Rows with extreme outliers: {outlier_mask.sum()}")

Visualization: matplotlib and seaborn

Visualization is not decoration -- it is how you discover patterns and validate cleaning steps.

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
df["motor_temp"].plot(ax=axes[0, 0], title="Motor Temperature Over Time")
df["vibration"].hist(ax=axes[0, 1], bins=50)
axes[0, 1].set_title("Vibration Distribution")

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm",
            ax=axes[1, 0], fmt=".2f")
axes[1, 0].set_title("Sensor Correlation Matrix")

df[["motor_temp", "bearing_temp", "coolant_temp"]].boxplot(ax=axes[1, 1])
axes[1, 1].set_title("Temperature Sensors Comparison")
plt.tight_layout()
plt.show()

Practical Example: Cleaning a Full Month of Production Line Data

Apply every technique to a realistic dataset from a bottling plant.

import pandas as pd
import numpy as np

np.random.seed(42)
n = 744
timestamps = pd.date_range("2025-03-01", periods=n, freq="h")

df = pd.DataFrame({
    "timestamp": timestamps,
    "fill_volume_ml": np.random.normal(500, 5, n),
    "line_speed_bpm": np.random.normal(200, 15, n),
    "pressure_bar": np.random.normal(3.2, 0.3, n),
    "reject_rate_pct": np.random.exponential(2, n)
})
df.loc[100:110, "fill_volume_ml"] = np.nan
df.loc[300, "pressure_bar"] = -999.0
df.loc[500, "line_speed_bpm"] = 9999.0
df.set_index("timestamp", inplace=True)

print("=== BEFORE CLEANING ===")
print(f"Missing values:\n{df.isnull().sum()}")

df.replace([-999.0, 9999.0], np.nan, inplace=True)
df.interpolate(method="linear", inplace=True)
for col in df.columns:
    df[col] = remove_outliers_iqr(df[col])

print("\n=== AFTER CLEANING ===")
print(f"Missing values:\n{df.isnull().sum()}")
df.to_csv("production_line_march_clean.csv")
print("Cleaned data saved.")

Summary

In this lesson you learned the complete data preparation pipeline for industrial sensor data. You loaded CSV files with various formatting issues, explored datasets with describe() and info(), handled missing values using forward fill and interpolation, detected and removed outliers with IQR and Z-score methods, and created visualizations to verify your cleaning steps. Clean data is the foundation of every successful ML model. In the next lesson, you will apply statistical methods to extract deeper insights from your cleaned data.