Reinforcement Learning: Machines Learn by Trial and Reward
What Is Reinforcement Learning?
Imagine a robot in a factory trying to learn the best way to arrange parts on a shipping pallet. At first it places parts randomly -- some fall off, some do not balance. But with each attempt it receives a reward (successful arrangement) or penalty (part fell). After thousands of attempts, it learns an optimal strategy that nobody programmed manually.
This is Reinforcement Learning (RL): an agent interacts with an environment and learns optimal behavior through trial and error, guided by reward signals.
Core Concepts
The Five Elements of Reinforcement Learning
| Element | Definition | Industrial Example |
|---|---|---|
| Agent | The learner and decision maker | Intelligent control system |
| Environment | Everything the agent interacts with | Production line and equipment |
| State | Description of the current situation | Temperature, pressure, speed |
| Action | The decision the agent makes | Increase speed, reduce temperature |
| Reward | Signal evaluating the action | +10 for good part, -50 for breakdown |
Policy
The policy is the agent's strategy -- a function that maps each state to the appropriate action. The goal is to find the optimal policy that maximizes cumulative long-term reward.
Policy: pi(s) -> a
Meaning: in state s, take action a
Value Function
Estimates the total expected reward starting from a given state. Think of it as an experienced engineer's estimate: "If we start from this situation, how much will we gain in the long run?"
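This idea can be made concrete with the discounted return: rewards are summed with a discount factor gamma, so near-term rewards count more than distant ones. A minimal sketch (the reward sequence below is an illustrative assumption):

```python
def discounted_return(rewards, gamma=0.95):
    """Total discounted reward: r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical reward sequence from one episode
rewards = [1, 1, 10]
print(round(discounted_return(rewards), 3))  # 1 + 0.95*1 + 0.9025*10 = 10.975
```

The value function is the expected value of this quantity over all trajectories that start from a given state.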
```python
import numpy as np

# Simplified example: furnace temperature control environment
class FurnaceEnvironment:
    def __init__(self):
        self.target_temp = 200   # target: 200 C
        self.current_temp = 150  # start: 150 C
        self.max_steps = 50
        self.steps = 0

    def get_state(self):
        """State: difference from target"""
        return round(self.current_temp - self.target_temp, 1)

    def step(self, action):
        """
        Actions: 0 = reduce heating, 1 = hold, 2 = increase heating
        """
        if action == 0:
            self.current_temp -= 5
        elif action == 2:
            self.current_temp += 5
        # Add realistic noise
        self.current_temp += np.random.normal(0, 1)
        self.steps += 1
        # Calculate reward
        error = abs(self.current_temp - self.target_temp)
        if error < 2:
            reward = 10   # excellent -- very close to target
        elif error < 10:
            reward = 1    # acceptable
        else:
            reward = -5   # far off -- penalty
        # Terminate if out of control or the episode length is reached
        done = error > 50 or self.steps >= self.max_steps
        return self.get_state(), reward, done
```
Q-Learning: Model-Free Learning
Q-Learning is the most famous classical RL algorithm. It builds a table (Q-Table) storing the value of every (state, action) pair -- meaning "how valuable is taking this action in this state?"
The Q-Learning Update Rule (based on the Bellman Equation)
Q(s, a) <- Q(s, a) + alpha * [r + gamma * max Q(s', a') - Q(s, a)]
| Symbol | Meaning |
|---|---|
| alpha | Learning rate -- how much we trust new information |
| gamma | Discount factor -- importance of future rewards |
| r | Immediate reward |
| max Q(s', a') | Best expected value from the next state |
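A worked numeric example of a single update (all numbers are illustrative assumptions):

```python
alpha, gamma = 0.1, 0.95
q_sa = 0.0        # current estimate Q(s, a)
r = 10            # immediate reward
best_next = 2.0   # max Q(s', a')

td_target = r + gamma * best_next   # 10 + 0.95 * 2.0 = 11.9
td_error = td_target - q_sa         # 11.9
q_sa += alpha * td_error

print(round(q_sa, 2))  # 1.19
```

The estimate moves only a fraction (alpha) of the way toward the target, which keeps learning stable under noisy rewards.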
```python
import numpy as np

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability

    def choose_action(self, state):
        """Epsilon-greedy: explore sometimes, exploit mostly"""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.q_table.shape[1])  # random exploration
        return np.argmax(self.q_table[state])                # exploit best action

    def learn(self, state, action, reward, next_state):
        """Update Q-Table using the Q-learning update rule"""
        best_next = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * best_next
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.alpha * td_error

# Example: 21 states (temp diff from -50 to +50 in steps of 5) and 3 actions
agent = QLearningAgent(n_states=21, n_actions=3)
print(f"Q-Table size: {agent.q_table.shape}")
print("Training starts with zero values and improves with iterations")
```
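Before this agent can use the furnace environment, the continuous state (a temperature difference) must be mapped to a Q-table row index. A minimal sketch, assuming the -50 to +50 range in steps of 5 mentioned above (`state_to_index` is a hypothetical helper, not a library function):

```python
def state_to_index(temp_diff, low=-50, high=50, step=5):
    """Clip the temperature difference and map it to an index 0..20."""
    clipped = max(low, min(high, temp_diff))
    return int(round((clipped - low) / step))

print(state_to_index(-50))  # 0
print(state_to_index(0))    # 10 (on target -> middle of the table)
print(state_to_index(37))   # 17 (rounded to the nearest bucket)
print(state_to_index(80))   # 20 (clipped to the top of the range)
```

Clipping keeps out-of-range readings from indexing outside the table.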
Exploration vs. Exploitation
This is the fundamental dilemma in RL: should we try new actions (explore) or stick with the best known action (exploit)?
Consider a process engineer: should they try new machine settings that might improve production (but might cause defects), or stay with the current proven settings?
The epsilon-greedy strategy: with probability epsilon we explore randomly, and with probability (1-epsilon) we exploit the best known action. Epsilon starts high (lots of exploration) and gradually decreases.
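A common decay schedule, sketched with assumed constants (start at 1.0, multiply by 0.995 per episode, floor at 0.01):

```python
epsilon = 1.0
min_epsilon, decay = 0.01, 0.995

for episode in range(1000):
    # ... run one training episode using the current epsilon ...
    epsilon = max(min_epsilon, epsilon * decay)

print(epsilon)  # 0.01 -- the floor keeps a little exploration forever
```

The floor matters: without it the agent eventually stops exploring entirely and can never adapt if the environment drifts.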
Deep Q-Networks (DQN)
When the number of states is enormous (such as a camera image or hundreds of sensor readings), building a Q-table is impossible. The solution: replace it with a neural network that approximates the Q function.
```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dqn(state_size, n_actions):
    """Deep Q-Network"""
    model = models.Sequential([
        layers.Dense(64, activation='relu', input_shape=(state_size,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(n_actions, activation='linear')  # Q-value for each action
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
                  loss='mse')
    return model

# Environment with 20 sensors and 5 possible actions
dqn = build_dqn(state_size=20, n_actions=5)
dqn.summary()
```
Key DQN Techniques
| Technique | Purpose |
|---|---|
| Experience Replay | Store experiences and replay them randomly to break temporal correlation |
| Target Network | A frozen copy of the network updated periodically for training stability |
| Reward Clipping | Limit rewards to [-1, +1] for training stability |
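Experience replay can be sketched with a bounded buffer and random sampling (the capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Random sampling breaks the temporal correlation between steps."""
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer(capacity=100)
for t in range(200):          # add more experiences than the capacity
    buffer.add(t, 0, 1.0, t + 1, False)

print(len(buffer.buffer))     # 100 -- bounded by capacity
batch = buffer.sample(batch_size=8)
print(len(batch))             # 8
```

Each training step draws a fresh random batch from the buffer instead of learning from consecutive, highly correlated transitions.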
Policy Gradient
Instead of learning the Q function and then deriving the policy from it, policy gradient learns the policy directly -- the probability of each action in each state.
Key advantage: It can handle continuous actions (such as "set motor speed to 1523.7 RPM") instead of only discrete actions.
```python
# Simplified policy network example
policy_network = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(3, activation='softmax')  # probabilities for 3 actions
])

# At each step:
# 1. Network outputs action probabilities
# 2. Sample an action according to probabilities
# 3. Compute reward
# 4. Update network: increase probability of high-reward actions
```
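Step 2, the sampling, can be sketched with NumPy alone; the probability vector below stands in for a softmax output and is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical softmax output for 3 actions
probs = np.array([0.2, 0.7, 0.1])

# Sample an action according to the probabilities (not argmax!)
action = rng.choice(3, p=probs)
print(action in (0, 1, 2))  # True

# Over many samples, empirical frequencies approach the probabilities
samples = rng.choice(3, size=10000, p=probs)
print(round((samples == 1).mean(), 1))  # about 0.7
```

Sampling rather than taking the argmax is what gives policy gradient its built-in exploration.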
Approach Comparison
| Criterion | Q-Learning / DQN | Policy Gradient |
|---|---|---|
| Action type | Discrete only | Discrete and continuous |
| Stability | More stable | Less stable (high variance) |
| Sample efficiency | Higher (reuses data) | Lower (needs more data) |
| Best for | Limited action count | Continuous control |
Industrial Applications
Production Process Optimization
Consider a metal smelting furnace with dozens of parameters (temperature, oxygen pressure, feed rate). RL tries different combinations and learns optimal settings that reduce waste and improve metal quality.
```python
# Example: production process optimization environment
class ProcessOptimizer:
    def __init__(self):
        self.params = {
            "temperature": 800,  # C
            "pressure": 50,      # bar
            "feed_rate": 100     # kg/hr
        }

    def evaluate(self):
        """Evaluate product quality and efficiency"""
        # In reality: actual measurements from the production line
        quality = self._quality_model()
        energy = self._energy_model()
        waste = self._waste_model()
        # Reward: high quality + energy efficiency - waste
        reward = quality * 10 - energy * 2 - waste * 5
        return reward

    def _quality_model(self):
        """Simplified quality model"""
        temp_score = 1.0 - abs(self.params["temperature"] - 820) / 100
        return max(0, min(1, temp_score))

    def _energy_model(self):
        return self.params["temperature"] / 1000

    def _waste_model(self):
        return max(0, (self.params["feed_rate"] - 110) / 100)
```
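For intuition, the reward for the starting parameters can be computed by hand; the three simplified models are replicated inline here so the sketch runs standalone:

```python
temperature, pressure, feed_rate = 800, 50, 100

quality = max(0, min(1, 1.0 - abs(temperature - 820) / 100))  # 0.8
energy = temperature / 1000                                   # 0.8
waste = max(0, (feed_rate - 110) / 100)                       # 0.0

reward = quality * 10 - energy * 2 - waste * 5
print(round(reward, 1))  # 8.0 - 1.6 - 0.0 = 6.4
```

An RL agent would then nudge the parameters (e.g. temperature toward 820) to push this reward higher.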
Robot Path Planning
An industrial robotic arm needs to move parts from one location to another while avoiding obstacles. RL learns efficient paths that minimize movement time and energy consumption while ensuring safety.
Why RL instead of traditional programming? Because the environment changes: part locations vary, new obstacles appear, and speed requirements shift. RL adapts automatically.
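The idea can be sketched on a toy problem: a 4x4 grid with one obstacle, where tabular Q-learning (the same update rule as above) learns a collision-free route from a start cell to a goal cell. All sizes, rewards, and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# 4x4 grid, start at cell 0 (top-left), goal at 15 (bottom-right), obstacle at 5
N, GOAL, OBSTACLE = 4, 15, 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(cell, a):
    r, c = divmod(cell, N)
    nr, nc = r + ACTIONS[a][0], c + ACTIONS[a][1]
    if not (0 <= nr < N and 0 <= nc < N) or nr * N + nc == OBSTACLE:
        return cell, -1.0, False      # hitting walls/obstacle is penalized
    nxt = nr * N + nc
    if nxt == GOAL:
        return nxt, 10.0, True
    return nxt, -0.1, False           # small cost per move -> shorter paths

Q = np.zeros((N * N, 4))
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(2000):
    s, done, steps = 0, False, 0
    while not done and steps < 100:
        a = rng.integers(4) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s, steps = s2, steps + 1

# Greedy rollout: follow the learned policy from start to goal
s, path = 0, [0]
while s != GOAL and len(path) < 20:
    s, _, _ = step(s, int(np.argmax(Q[s])))
    path.append(s)
print(path[-1] == GOAL)  # True once the policy has converged
# With a converged policy the route takes 6 moves, avoiding cell 5
```

The per-move cost is what makes the agent prefer short paths; the obstacle penalty is what keeps them collision-free.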
Energy Management
A factory with diesel generators, solar panels, and battery storage. RL decides every hour: which source to use, when to charge batteries, and when to sell surplus to the grid -- minimizing cost while guaranteeing uninterrupted supply.
| Decision | Inputs | Objective |
|---|---|---|
| Run generator or not | Electricity price, current demand | Minimize cost |
| Charge/discharge battery | Charge level, weather forecast | Maximize solar usage |
| Sell to grid | Surplus, selling price | Increase revenue |
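The hourly decision boils down to a reward calculation that the agent learns to maximize. A minimal sketch with assumed prices, quantities, and penalty weight (`hourly_reward` and all numbers are illustrative):

```python
def hourly_reward(grid_kwh, grid_price, sold_kwh, sell_price, unmet_kwh=0.0):
    """Revenue from sales, minus grid purchase cost, minus a heavy
    penalty for any unmet demand (supply must never be interrupted)."""
    cost = grid_kwh * grid_price
    revenue = sold_kwh * sell_price
    penalty = unmet_kwh * 100.0  # large per-kWh penalty for outages
    return revenue - cost - penalty

# One hour: buy 20 kWh at 0.15, sell 5 kWh at 0.10, demand fully met
print(hourly_reward(20, 0.15, 5, 0.10))  # -2.5
```

The outsized outage penalty encodes the hard requirement from the table: saving money is never worth losing supply.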
Challenges of RL in Industry
- Safety: The agent cannot try dangerous actions on real equipment. Solution: train in simulation then transfer to reality.
- Slow learning: RL needs millions of trials. In industry each trial costs time and money.
- Reward design: A poorly designed reward leads to unexpected behavior. For example: rewarding "maximum output" without penalizing quality may produce defective parts quickly.
- Stability: DQN and policy gradient may not always converge -- they require careful tuning.
Practical Tips
- Start with simulation -- Do not apply RL directly to real equipment. Build a simulation model first.
- Design rewards carefully -- Spend sufficient time defining what is "good" and "bad" for your system.
- Q-Learning for beginners -- Simpler and more stable. Move to DQN when the environment complexity increases.
- Policy gradient for continuous control -- If your actions are continuous numbers (temperature, speed), use policy gradient methods.
- Monitor safety -- Set hard constraints that the agent can never violate, even during training.
- Incremental complexity -- Start with a simplified environment and add complexity gradually.