Markov Decision Processes

MDPs provide the mathematical framework for RL. They model sequential decision-making where outcomes depend on actions taken in different states.

Definition

A Markov Decision Process is defined by a set of states (S), a set of actions (A), transition probabilities (P), a reward function (R), and a discount factor (γ). The Markov property states that the future depends only on the current state and action, not on the full history of how the agent got there.
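
To make these pieces concrete, here is a minimal sketch of a tiny two-state MDP written out as explicit tables. The state and action names (s0, s1, stay, move) and the numbers are made up for illustration only.

python
import numpy as np

# Hypothetical two-state, two-action MDP written out as explicit tables.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] maps each possible next state to its transition probability.
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a]: expected immediate reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}

gamma = 0.9  # discount factor

rng = np.random.default_rng(0)

def sample_next_state(s, a):
    """Markov property in action: the next state depends only on (s, a)."""
    next_states = list(P[s][a])
    probs = [P[s][a][ns] for ns in next_states]
    return rng.choice(next_states, p=probs)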

Key Concepts

State

A representation of the environment at a given time. Everything the agent needs to make a decision.

Action

What the agent can do to interact with the environment.

Reward

Immediate feedback from the environment after taking an action. The agent's goal is to maximize cumulative reward over time.

Policy

A mapping from states to actions. Can be deterministic or stochastic.
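
As a quick illustration (the state and action names below are placeholders, not part of any particular environment), a deterministic policy can be stored as a plain dictionary, while a stochastic policy stores a probability distribution over actions for each state:

python
import numpy as np

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "move", "s1": "stay"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 0.9, "move": 0.1},
}

rng = np.random.default_rng(0)

def act(policy, state):
    """Return an action from either kind of policy."""
    choice = policy[state]
    if isinstance(choice, dict):  # stochastic: sample according to the probabilities
        return rng.choice(list(choice), p=list(choice.values()))
    return choice  # deterministic: the single mapped action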

Value Function

The expected cumulative (discounted) reward starting from a state (V) or from a state-action pair (Q).
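
To make "expected cumulative reward" concrete, here is a minimal sketch of the discounted return that V and Q average over trajectories; the reward sequences below are made up for illustration.

python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a reward sequence: r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The same rewards are worth less the later they arrive.
print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([0, 0, 10], gamma=0.9))  # 10 * 0.9^2 = 8.1

V(s) is the expectation of this quantity over trajectories that start in state s and follow the policy; Q(s, a) additionally fixes the first action to a.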

Real-World Applications

AlphaGo (Gaming/AI)

DeepMind's Go-playing AI used RL to master the game, defeating world champions.

Robotics

Boston Dynamics uses RL to teach robots to walk, run, and recover from disturbances.

Data Center Cooling (Infrastructure)

Google uses RL to optimize data center cooling systems, reducing cooling energy consumption by up to 40%.

Code Example

python
import numpy as np

class GridWorld:
    """
    Simple GridWorld environment
    """
    def __init__(self, size=4):
        self.size = size
        self.state = [0, 0]  # Start position
        self.goal = [size-1, size-1]  # Goal position
        self.actions = ['up', 'down', 'left', 'right']

    def reset(self):
        self.state = [0, 0]
        return tuple(self.state)

    def step(self, action):
        """Take action, return (next_state, reward, done)"""
        new_state = self.state.copy()

        if action == 'up' and self.state[0] > 0:
            new_state[0] -= 1
        elif action == 'down' and self.state[0] < self.size - 1:
            new_state[0] += 1
        elif action == 'left' and self.state[1] > 0:
            new_state[1] -= 1
        elif action == 'right' and self.state[1] < self.size - 1:
            new_state[1] += 1

        self.state = new_state

        # Check if goal reached
        done = self.state == self.goal
        reward = 1 if done else -0.1  # Small penalty for each step

        return tuple(self.state), reward, done

class QLearning:
    """
    Q-Learning agent
    """
    def __init__(self, states, actions, learning_rate=0.1, discount=0.99, epsilon=0.1):
        self.actions = actions
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon

        # Initialize Q-table
        self.q_table = {}
        for s in states:
            self.q_table[s] = {a: 0 for a in actions}

    def choose_action(self, state):
        """Epsilon-greedy action selection"""
        if np.random.random() < self.epsilon:
            return np.random.choice(self.actions)
        else:
            return max(self.q_table[state], key=self.q_table[state].get)

    def update(self, state, action, reward, next_state, done):
        """Q-Learning update rule"""
        current_q = self.q_table[state][action]

        if done:
            target = reward
        else:
            max_next_q = max(self.q_table[next_state].values())
            target = reward + self.gamma * max_next_q

        # Q-Learning update
        self.q_table[state][action] += self.lr * (target - current_q)

# Training
env = GridWorld(size=4)
states = [(i, j) for i in range(4) for j in range(4)]
agent = QLearning(states, env.actions)

# Train for 1000 episodes
for episode in range(1000):
    state = env.reset()
    total_reward = 0

    for _ in range(100):  # Max steps per episode
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state, done)

        total_reward += reward
        state = next_state

        if done:
            break

    if episode % 200 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward:.2f}")

# Show learned policy
print("\nLearned Policy:")
for i in range(4):
    row = ""
    for j in range(4):
        if (i, j) == env.goal:
            row += " G "
        else:
            best_action = max(agent.q_table[(i, j)], key=agent.q_table[(i, j)].get)
            symbols = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
            row += f" {symbols[best_action]} "
    print(row)

Q-Learning learns the value of state-action pairs through trial and error. The agent explores the environment, updates its Q-table based on rewards, and gradually learns the optimal policy.
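
For reference, the update implemented in QLearning.update is the standard Q-Learning rule: Q(s, a) ← Q(s, a) + α · [r + γ · max over a' of Q(s', a') − Q(s, a)], where α is the learning rate and the bracketed term is the temporal-difference error. When the episode ends, the target is just r, since no future reward follows a terminal state.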

Practice Problems

  1. Implement SARSA and compare with Q-Learning
  2. Add obstacles to the GridWorld
  3. Visualize the Q-values as they change during training

Summary

MDPs and Q-Learning form the foundation of RL. Understanding these concepts is essential before moving to deep RL methods like DQN and policy gradients.