Reinforcement Learning: Machines Learn by Trial and Reward
What Is Reinforcement Learning?
Imagine a robot in a factory trying to learn the best way to arrange parts on a shipping pallet. At first it places parts randomly -- some fall off, some do not balance. But with each attempt it receives a reward (successful arrangement) or penalty (part fell). After thousands of attempts, it learns an optimal strategy that nobody programmed manually.
This is Reinforcement Learning (RL): an agent interacts with an environment and learns optimal behavior through trial and error, guided by reward signals.
Core Concepts
The Five Elements of Reinforcement Learning
| Element | Definition | Industrial Example |
|---|---|---|
| Agent | The learner and decision maker | Intelligent control system |
| Environment | Everything the agent interacts with | Production line and equipment |
| State | Description of the current situation | Temperature, pressure, speed |
| Action | The decision the agent makes | Increase speed, reduce temperature |
| Reward | Signal evaluating the action | +10 for good part, -50 for breakdown |
Policy
The policy is the agent's strategy -- a function that maps each state to the appropriate action. The goal is to find the optimal policy that maximizes cumulative long-term reward.
Policy: pi(s) -> a
Meaning: in state s, take action a
Value Function
Estimates the total expected reward starting from a given state. Think of it as an experienced engineer's estimate: "If we start from this situation, how much will we gain in the long run?"
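This idea can be made concrete with the discounted return: rewards are summed with a discount factor gamma, so near-term rewards count more than distant ones. A minimal sketch (the reward sequence below is an illustrative assumption):

```python
def discounted_return(rewards, gamma=0.95):
    """Total discounted reward: r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical reward sequence from one episode
rewards = [1, 1, 10]
print(round(discounted_return(rewards), 3))  # 1 + 0.95*1 + 0.9025*10 = 10.975
```

The value function is the expected value of this quantity over all trajectories that start from a given state.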
```python
import numpy as np

# Simplified example: furnace temperature control environment
class FurnaceEnvironment:
    def __init__(self):
        self.target_temp = 200   # target: 200 C
        self.current_temp = 150  # start: 150 C
        self.max_steps = 50
        self.steps = 0

    def get_state(self):
        """State: difference from target"""
        return round(self.current_temp - self.target_temp, 1)

    def step(self, action):
        """
        Actions: 0 = reduce heating, 1 = hold, 2 = increase heating
        """
        if action == 0:
            self.current_temp -= 5
        elif action == 2:
            self.current_temp += 5
        # Add realistic noise
        self.current_temp += np.random.normal(0, 1)
        self.steps += 1
        # Calculate reward
        error = abs(self.current_temp - self.target_temp)
        if error < 2:
            reward = 10   # excellent -- very close to target
        elif error < 10:
            reward = 1    # acceptable
        else:
            reward = -5   # far off -- penalty
        # Terminate if out of control or the episode length is reached
        done = error > 50 or self.steps >= self.max_steps
        return self.get_state(), reward, done
```
Q-Learning: Model-Free Learning
Q-Learning is the most famous classical RL algorithm. It builds a table (Q-Table) storing the value of every (state, action) pair -- meaning "how valuable is taking this action in this state?"
The Q-Learning Update Rule (based on the Bellman Equation)
Q(s, a) <- Q(s, a) + alpha * [r + gamma * max Q(s', a') - Q(s, a)]
| Symbol | Meaning |
|---|---|
| alpha | Learning rate -- how much we trust new information |
| gamma | Discount factor -- importance of future rewards |
| r | Immediate reward |
| max Q(s', a') | Best expected value from the next state |
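A worked numeric example of a single update (all numbers are illustrative assumptions):

```python
alpha, gamma = 0.1, 0.95
q_sa = 0.0        # current estimate Q(s, a)
r = 10            # immediate reward
best_next = 2.0   # max Q(s', a')

td_target = r + gamma * best_next   # 10 + 0.95 * 2.0 = 11.9
td_error = td_target - q_sa         # 11.9
q_sa += alpha * td_error

print(round(q_sa, 2))  # 1.19
```

The estimate moves only a fraction (alpha) of the way toward the target, which keeps learning stable under noisy rewards.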
```python
import numpy as np

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability

    def choose_action(self, state):
        """Epsilon-greedy: explore sometimes, exploit mostly"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])  # random exploration
        return np.argmax(self.q_table[state])                # exploit best action

    def learn(self, state, action, reward, next_state):
        """Update Q-Table using the Q-learning update rule"""
        best_next = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * best_next
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error

# Example: 21 states (temp diff from -50 to +50 in steps of 5) and 3 actions
agent = QLearningAgent(n_states=21, n_actions=3)
print(f"Q-Table size: {agent.q_table.shape}")
print("Training starts with zero values and improves with iterations")
```
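Before this agent can use the furnace environment, the continuous state (a temperature difference) must be mapped to a Q-table row index. A minimal sketch, assuming the -50 to +50 range in steps of 5 mentioned above (`state_to_index` is a hypothetical helper, not a library function):

```python
def state_to_index(temp_diff, low=-50, high=50, step=5):
    """Clip the temperature difference and map it to an index 0..20."""
    clipped = max(low, min(high, temp_diff))
    return int(round((clipped - low) / step))

print(state_to_index(-50))  # 0
print(state_to_index(0))    # 10 (on target -> middle of the table)
print(state_to_index(37))   # 17 (rounded to the nearest bucket)
print(state_to_index(80))   # 20 (clipped to the top of the range)
```

Clipping keeps out-of-range readings from indexing outside the table.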
Exploration vs. Exploitation
This is the fundamental dilemma in RL: should we try new actions (explore) or stick with the best known action (exploit)?
Consider a process engineer: should they try new machine settings that might improve production (but might cause defects), or stay with the current proven settings?
The epsilon-greedy strategy: with probability epsilon we explore randomly, and with probability (1-epsilon) we exploit the best known action. Epsilon starts high (lots of exploration) and gradually decreases.
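A common decay schedule, sketched with assumed constants (start at 1.0, multiply by 0.995 per episode, floor at 0.01):

```python
epsilon = 1.0
min_epsilon, decay = 0.01, 0.995

for episode in range(1000):
    # ... run one training episode using the current epsilon ...
    epsilon = max(min_epsilon, epsilon * decay)

print(epsilon)  # 0.01 -- the floor keeps a little exploration forever
```

The floor matters: without it the agent eventually stops exploring entirely and can never adapt if the environment drifts.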
Deep Q-Networks (DQN)
When the number of states is enormous (such as a camera image or hundreds of sensor readings), building a Q-table is impossible. The solution: replace it with a neural network that approximates the Q function.
```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dqn(state_size, n_actions):
    """Deep Q-Network"""
    model = models.Sequential([
        layers.Dense(64, activation='relu', input_shape=(state_size,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(n_actions, activation='linear')  # Q-value for each action
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                  loss='mse')
    return model

# Environment with 20 sensors and 5 possible actions
dqn = build_dqn(state_size=20, n_actions=5)
dqn.summary()
```
Key DQN Techniques
| Technique | Purpose |
|---|---|
| Experience Replay | Store experiences and replay them randomly to break temporal correlation |
| Target Network | A frozen copy of the network updated periodically for training stability |
| Reward Clipping | Limit rewards to [-1, +1] for training stability |
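Experience replay can be sketched with a bounded buffer and random sampling (the capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Random sampling breaks the temporal correlation between steps."""
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer(capacity=100)
for t in range(200):          # add more experiences than the capacity
    buffer.add(t, 0, 1.0, t + 1, False)

print(len(buffer.buffer))     # 100 -- bounded by capacity
batch = buffer.sample(batch_size=8)
print(len(batch))             # 8
```

Each training step draws a fresh random batch from the buffer instead of learning from consecutive, highly correlated transitions.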
Policy Gradient
Instead of learning the Q function and then deriving the policy from it, policy gradient learns the policy directly -- the probability of each action in each state.
Key advantage: It can handle continuous actions (such as "set motor speed to 1523.7 RPM") instead of only discrete actions.
```python
# Simplified policy network example
policy_network = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(3, activation='softmax')  # probabilities for 3 actions
])

# At each step:
# 1. Network outputs action probabilities
# 2. Sample an action according to probabilities
# 3. Compute reward
# 4. Update network: increase probability of high-reward actions
```
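Step 2, the sampling, can be sketched with NumPy alone; the probability vector below stands in for a softmax output and is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical softmax output for 3 actions
probs = np.array([0.2, 0.7, 0.1])

# Sample an action according to the probabilities (not argmax!)
action = rng.choice(3, p=probs)
print(action in (0, 1, 2))  # True

# Over many samples, empirical frequencies approach the probabilities
samples = rng.choice(3, size=10000, p=probs)
print(round((samples == 1).mean(), 1))  # about 0.7
```

Sampling rather than taking the argmax is what gives policy gradient its built-in exploration.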
Approach Comparison
| Criterion | Q-Learning / DQN | Policy Gradient |
|---|---|---|
| Action type | Discrete only | Discrete and continuous |
| Stability | More stable | Less stable (high variance) |
| Sample efficiency | Higher (reuses data) | Lower (needs more data) |
| Best for | Limited action count | Continuous control |
Industrial Applications
Production Process Optimization
Consider a metal smelting furnace with dozens of parameters (temperature, oxygen pressure, feed rate). RL tries different combinations and learns optimal settings that reduce waste and improve metal quality.
```python
# Example: production process optimization environment
class ProcessOptimizer:
    def __init__(self):
        self.params = {
            "temperature": 800,  # C
            "pressure": 50,      # bar
            "feed_rate": 100     # kg/hr
        }

    def evaluate(self):
        """Evaluate product quality and efficiency"""
        # In reality: actual measurements from the production line
        quality = self._quality_model()
        energy = self._energy_model()
        waste = self._waste_model()
        # Reward: high quality + energy efficiency - waste
        reward = quality * 10 - energy * 2 - waste * 5
        return reward

    def _quality_model(self):
        """Simplified quality model"""
        temp_score = 1.0 - abs(self.params["temperature"] - 820) / 100
        return max(0, min(1, temp_score))

    def _energy_model(self):
        return self.params["temperature"] / 1000

    def _waste_model(self):
        return max(0, (self.params["feed_rate"] - 110) / 100)
```
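For intuition, the reward for the starting parameters can be computed by hand; the three simplified models are replicated inline here so the sketch runs standalone:

```python
temperature, pressure, feed_rate = 800, 50, 100

quality = max(0, min(1, 1.0 - abs(temperature - 820) / 100))  # 0.8
energy = temperature / 1000                                   # 0.8
waste = max(0, (feed_rate - 110) / 100)                       # 0.0

reward = quality * 10 - energy * 2 - waste * 5
print(round(reward, 1))  # 8.0 - 1.6 - 0.0 = 6.4
```

An RL agent would then nudge the parameters (e.g. temperature toward 820) to push this reward higher.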
Robot Path Planning
An industrial robotic arm needs to move parts from one location to another while avoiding obstacles. RL learns efficient paths that minimize movement time and energy consumption while ensuring safety.
Why RL instead of traditional programming? Because the environment changes: part locations vary, new obstacles appear, and speed requirements shift. RL adapts automatically.
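The idea can be sketched on a toy problem: a 4x4 grid with one obstacle, where tabular Q-learning (the same update rule as above) learns a collision-free route from a start cell to a goal cell. All sizes, rewards, and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# 4x4 grid, start at cell 0 (top-left), goal at 15 (bottom-right), obstacle at 5
N, GOAL, OBSTACLE = 4, 15, 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(cell, a):
    r, c = divmod(cell, N)
    nr, nc = r + ACTIONS[a][0], c + ACTIONS[a][1]
    if not (0 <= nr < N and 0 <= nc < N) or nr * N + nc == OBSTACLE:
        return cell, -1.0, False      # hitting walls/obstacle is penalized
    nxt = nr * N + nc
    if nxt == GOAL:
        return nxt, 10.0, True
    return nxt, -0.1, False           # small cost per move -> shorter paths

Q = np.zeros((N * N, 4))
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(2000):
    s, done, steps = 0, False, 0
    while not done and steps < 100:
        a = rng.integers(4) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s, steps = s2, steps + 1

# Greedy rollout: follow the learned policy from start to goal
s, path = 0, [0]
while s != GOAL and len(path) < 20:
    s, _, _ = step(s, int(np.argmax(Q[s])))
    path.append(s)
print(path[-1] == GOAL)  # True once the policy has converged
# With a converged policy the route takes 6 moves, avoiding cell 5
```

The per-move cost is what makes the agent prefer short paths; the obstacle penalty is what keeps them collision-free.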
Energy Management
A factory with diesel generators, solar panels, and battery storage. RL decides every hour: which source to use, when to charge batteries, and when to sell surplus to the grid -- minimizing cost while guaranteeing uninterrupted supply.
| Decision | Inputs | Objective |
|---|---|---|
| Run generator or not | Electricity price, current demand | Minimize cost |
| Charge/discharge battery | Charge level, weather forecast | Maximize solar usage |
| Sell to grid | Surplus, selling price | Increase revenue |
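The hourly decision boils down to a reward calculation that the agent learns to maximize. A minimal sketch with assumed prices, quantities, and penalty weight (`hourly_reward` and all numbers are illustrative):

```python
def hourly_reward(grid_kwh, grid_price, sold_kwh, sell_price, unmet_kwh=0.0):
    """Revenue from sales, minus grid purchase cost, minus a heavy
    penalty for any unmet demand (supply must never be interrupted)."""
    cost = grid_kwh * grid_price
    revenue = sold_kwh * sell_price
    penalty = unmet_kwh * 100.0  # large per-kWh penalty for outages
    return revenue - cost - penalty

# One hour: buy 20 kWh at 0.15, sell 5 kWh at 0.10, demand fully met
print(hourly_reward(20, 0.15, 5, 0.10))  # -2.5
```

The outsized outage penalty encodes the hard requirement from the table: saving money is never worth losing supply.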
Challenges of RL in Industry
- Safety: The agent cannot try dangerous actions on real equipment. Solution: train in simulation then transfer to reality.
- Slow learning: RL needs millions of trials. In industry each trial costs time and money.
- Reward design: A poorly designed reward leads to unexpected behavior. For example: rewarding "maximum output" without penalizing quality may produce defective parts quickly.
- Stability: DQN and policy gradient may not always converge -- they require careful tuning.
Practical Tips
- Start with simulation -- Do not apply RL directly to real equipment. Build a simulation model first.
- Design rewards carefully -- Spend sufficient time defining what is "good" and "bad" for your system.
- Q-Learning for beginners -- Simpler and more stable. Move to DQN when the environment complexity increases.
- Policy gradient for continuous control -- If your actions are continuous numbers (temperature, speed), use policy gradient methods.
- Monitor safety -- Set hard constraints that the agent can never violate, even during training.
- Incremental complexity -- Start with a simplified environment and add complexity gradually.