Markov Decision Processes
Markov decision processes (MDPs) provide the mathematical framework for reinforcement learning (RL). They model sequential decision-making, in which the outcome of each step depends on the action taken in the current state.
Definition
A Markov Decision Process is defined by a set of states (S), a set of actions (A), transition probabilities (P), a reward function (R), and a discount factor (γ). The Markov property states that the future depends only on the current state and action, not on the history of how that state was reached.
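Written out in standard notation (a sketch of the usual textbook formulation; the symbols below are conventional and not taken from this section), the tuple and the Markov property read:

\[
\mathcal{M} = (S, A, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)
\]

\[
\Pr(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0)
  = \Pr(S_{t+1} = s' \mid S_t, A_t)
\]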
Key Concepts
State
A representation of the environment at a given time. Everything the agent needs to make a decision.
Action
What the agent can do to interact with the environment.
Reward
Immediate feedback from the environment after taking an action. The agent's goal is to maximize cumulative reward.
Policy
A mapping from states to actions. Can be deterministic or stochastic.
Value Function
Expected cumulative (discounted) reward from a state (V) or from a state-action pair (Q); see the formal definitions after this list.
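As a standard-notation sketch of the last two concepts (the return G_t and the policy symbol π are conventional notation, not defined elsewhere in this section): a stochastic policy is written π(a|s), and the value functions are expected discounted returns under that policy:

\[
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ G_t \mid S_t = s \right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ G_t \mid S_t = s, A_t = a \right]
\]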
Real-World Applications
AlphaGo
Gaming/AI: DeepMind's Go-playing AI used RL to master the game, defeating world champions.
Robotics
Robotics: Boston Dynamics uses RL to teach robots to walk, run, and recover from disturbances.
Data Center Cooling
Infrastructure: Google uses RL to optimize data center cooling systems, reducing cooling energy consumption by up to 40%.
Code Example
import numpy as np
class GridWorld:
    """
    Simple GridWorld environment
    """
    def __init__(self, size=4):
        self.size = size
        self.state = [0, 0]  # Start position
        self.goal = [size - 1, size - 1]  # Goal position
        self.actions = ['up', 'down', 'left', 'right']

    def reset(self):
        self.state = [0, 0]
        return tuple(self.state)

    def step(self, action):
        """Take action, return (next_state, reward, done)"""
        new_state = self.state.copy()
        if action == 'up' and self.state[0] > 0:
            new_state[0] -= 1
        elif action == 'down' and self.state[0] < self.size - 1:
            new_state[0] += 1
        elif action == 'left' and self.state[1] > 0:
            new_state[1] -= 1
        elif action == 'right' and self.state[1] < self.size - 1:
            new_state[1] += 1
        self.state = new_state

        # Check if goal reached
        done = self.state == self.goal
        reward = 1 if done else -0.1  # Small penalty for each step
        return tuple(self.state), reward, done
class QLearning:
    """
    Q-Learning agent
    """
    def __init__(self, states, actions, learning_rate=0.1, discount=0.99, epsilon=0.1):
        self.actions = actions
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon
        # Initialize Q-table
        self.q_table = {}
        for s in states:
            self.q_table[s] = {a: 0 for a in actions}

    def choose_action(self, state):
        """Epsilon-greedy action selection"""
        if np.random.random() < self.epsilon:
            return np.random.choice(self.actions)
        else:
            return max(self.q_table[state], key=self.q_table[state].get)

    def update(self, state, action, reward, next_state, done):
        """Q-Learning update rule"""
        current_q = self.q_table[state][action]
        if done:
            target = reward
        else:
            max_next_q = max(self.q_table[next_state].values())
            target = reward + self.gamma * max_next_q
        # Q-Learning update
        self.q_table[state][action] += self.lr * (target - current_q)
# Training
env = GridWorld(size=4)
states = [(i, j) for i in range(4) for j in range(4)]
agent = QLearning(states, env.actions)

# Train for 1000 episodes
for episode in range(1000):
    state = env.reset()
    total_reward = 0
    for _ in range(100):  # Max steps per episode
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:
            break
    if episode % 200 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward:.2f}")
# Show learned policy
print("\nLearned Policy:")
symbols = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
for i in range(4):
    row = ""
    for j in range(4):
        if (i, j) == tuple(env.goal):  # env.goal is a list, so compare as a tuple
            row += " G "
        else:
            best_action = max(agent.q_table[(i, j)], key=agent.q_table[(i, j)].get)
            row += f" {symbols[best_action]} "
    print(row)

Q-Learning learns the value of state-action pairs through trial and error. The agent explores the environment, updates its Q-table based on rewards, and gradually learns the optimal policy.
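For reference, the update implemented in QLearning.update is the standard tabular Q-Learning rule, with α corresponding to self.lr and γ to self.gamma:

\[
Q(s, a) \leftarrow Q(s, a) + \alpha \Bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Bigr]
\]

When done is true, the bracketed target reduces to r alone, exactly as in the code above.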
Practice Problems
1. Implement SARSA and compare it with Q-Learning
2. Add obstacles to the GridWorld
3. Visualize the Q-values as they change during training
Summary
MDPs and Q-Learning form the foundation of RL. Understanding these concepts is essential before moving to deep RL methods like DQN and policy gradients.