Environment Details
Actions
- Discrete (2 actions)
- 0: Push left
- 1: Push right
Reward
- +1 for every timestep the pole stays upright
- Maximum episode length of 200 steps, so the maximum return per episode is 200
- Considered solved at an average return of 195+ over 100 consecutive episodes
Termination
- Pole angle exceeds ±12°
- Cart position exceeds ±2.4 units from center
- Episode length reaches 200 steps
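
These mechanics can be sanity-checked with a random policy. A minimal sketch, assuming the gymnasium package (the older gym API returns a 4-tuple from `step` instead of a 5-tuple):

```python
import gymnasium as gym

env = gym.make("CartPole-v0")
obs, info = env.reset(seed=0)
total_reward = 0.0
while True:
    # Random action: 0 = push left, 1 = push right
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    total_reward += reward  # +1 per surviving timestep
    if terminated or truncated:  # pole/cart limits hit, or the 200-step cap
        break
print(f"Random-policy return: {total_reward}")
env.close()
```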
Implemented Algorithms
Monte Carlo
- Model-free approach
- Learns from complete episodes
- Averages observed returns for each state-action pair (see the sketch below)
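
A minimal sketch of tabular Monte Carlo control matching these bullets. It assumes the gymnasium API, a first-visit update (the repo may use every-visit instead), and illustrative bin edges, ε, and episode count; γ = 0.99 comes from the Technical Implementation section below.

```python
from collections import defaultdict
import gymnasium as gym
import numpy as np

GAMMA, EPSILON = 0.99, 0.1  # ε value is an illustrative assumption
BINS = [np.linspace(-2.4, 2.4, 9),    # cart position
        np.linspace(-3.0, 3.0, 9),    # cart velocity (clipped range assumed)
        np.linspace(-0.21, 0.21, 9),  # pole angle (~±12° in radians)
        np.linspace(-3.0, 3.0, 9)]    # pole angular velocity

def discretize(obs):
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, BINS))

env = gym.make("CartPole-v0")
rng = np.random.default_rng(0)
Q = defaultdict(lambda: np.zeros(env.action_space.n))
counts = defaultdict(int)

for ep in range(500):  # episode budget is illustrative
    obs, _ = env.reset()
    s, done, episode = discretize(obs), False, []
    while not done:  # generate one complete episode before learning
        a = (int(rng.integers(2)) if rng.random() < EPSILON
             else int(np.argmax(Q[s])))
        obs, r, terminated, truncated, _ = env.step(a)
        episode.append((s, a, r))
        s, done = discretize(obs), terminated or truncated

    # Backward pass: accumulate the return G and average it into Q
    # at the first visit of each state-action pair.
    first = {}
    for t, (s, a, _) in enumerate(episode):
        first.setdefault((s, a), t)
    G = 0.0
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = GAMMA * G + r
        if first[(s, a)] == t:  # first-visit update only
            counts[(s, a)] += 1
            Q[s][a] += (G - Q[s][a]) / counts[(s, a)]
env.close()
```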
Q-Learning
- Off-policy TD control
- Learns optimal policy directly
- Updates Q-values with the maximum estimated next-state value (see the sketch below)
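
A minimal Q-learning sketch under the same assumptions as the Monte Carlo one (gymnasium API, illustrative bins and ε); α = 0.1 and γ = 0.99 come from the Technical Implementation section. The defining line is the bootstrap from the max over next-state Q-values.

```python
from collections import defaultdict
import gymnasium as gym
import numpy as np

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1
BINS = [np.linspace(-2.4, 2.4, 9), np.linspace(-3.0, 3.0, 9),
        np.linspace(-0.21, 0.21, 9), np.linspace(-3.0, 3.0, 9)]

def discretize(obs):
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, BINS))

env = gym.make("CartPole-v0")
rng = np.random.default_rng(0)
Q = defaultdict(lambda: np.zeros(env.action_space.n))

for ep in range(500):
    obs, _ = env.reset()
    s, done = discretize(obs), False
    while not done:
        a = (int(rng.integers(2)) if rng.random() < EPSILON
             else int(np.argmax(Q[s])))
        obs, r, terminated, truncated, _ = env.step(a)
        s2 = discretize(obs)
        # Off-policy target: bootstrap from the greedy (max) next action,
        # regardless of what the behavior policy will actually do next.
        target = r + GAMMA * np.max(Q[s2]) * (not terminated)
        Q[s][a] += ALPHA * (target - Q[s][a])
        s, done = s2, terminated or truncated
env.close()
```

Because the target always uses the greedy action, Q-learning can estimate the optimal policy even while behaving ε-greedily, which is what makes it off-policy.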
SARSA
- On-policy TD control
- Learns action-value function
- Updates using the next action actually taken by the policy (see the sketch below)
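
A minimal SARSA sketch, identical in setup to the Q-learning sketch above (same assumptions about bins, ε, and the gymnasium API), except the target bootstraps from the next action the ε-greedy policy actually selects.

```python
from collections import defaultdict
import gymnasium as gym
import numpy as np

GAMMA, ALPHA, EPSILON = 0.99, 0.1, 0.1
BINS = [np.linspace(-2.4, 2.4, 9), np.linspace(-3.0, 3.0, 9),
        np.linspace(-0.21, 0.21, 9), np.linspace(-3.0, 3.0, 9)]

def discretize(obs):
    return tuple(int(np.digitize(x, b)) for x, b in zip(obs, BINS))

def policy(Q, s, rng, eps=EPSILON):
    return (int(rng.integers(2)) if rng.random() < eps
            else int(np.argmax(Q[s])))

env = gym.make("CartPole-v0")
rng = np.random.default_rng(0)
Q = defaultdict(lambda: np.zeros(env.action_space.n))

for ep in range(500):
    obs, _ = env.reset()
    s = discretize(obs)
    a = policy(Q, s, rng)
    done = False
    while not done:
        obs, r, terminated, truncated, _ = env.step(a)
        s2 = discretize(obs)
        a2 = policy(Q, s2, rng)  # next action chosen by the same ε-greedy policy
        # On-policy target: bootstrap from Q(s', a') for the action taken.
        target = r + GAMMA * Q[s2][a2] * (not terminated)
        Q[s][a] += ALPHA * (target - Q[s][a])
        s, a, done = s2, a2, terminated or truncated
env.close()
```

The single changed line (bootstrapping from `Q[s2][a2]` instead of `max Q[s2]`) is the entire on-policy/off-policy distinction between SARSA and Q-learning.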
Technical Implementation
- State discretization (binning the continuous observations) for tabular methods
- ε-greedy policy for exploration (both sketched after this list)
- Discount factor (γ) = 0.99
- Learning rate (α) = 0.1
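
Standalone sketches of the two utilities above. The bin counts, the clipping bounds for the unbounded velocity dimensions, and the helper names are illustrative assumptions, not the repo's exact choices.

```python
import numpy as np

# Observation bounds: cart velocity and pole angular velocity are unbounded
# in the env, so they are clipped to a finite range before binning
# (a common assumption for tabular CartPole).
LOW  = np.array([-2.4, -3.0, -0.21, -3.0])
HIGH = np.array([ 2.4,  3.0,  0.21,  3.0])
N_BINS = 10  # illustrative; 9 interior edges => 10 bins per dimension

EDGES = [np.linspace(lo, hi, N_BINS - 1) for lo, hi in zip(LOW, HIGH)]

def discretize(obs):
    """Map a continuous 4-dim observation to a tuple of bin indices."""
    clipped = np.clip(obs, LOW, HIGH)
    return tuple(int(np.digitize(x, e)) for x, e in zip(clipped, EDGES))

def epsilon_greedy(q_values, epsilon, rng):
    """With probability ε take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Example: discretize a raw observation and act ε-greedily on its Q-row.
rng = np.random.default_rng(0)
state = discretize([0.1, -0.5, 0.02, 1.3])
print(state, epsilon_greedy(np.zeros(2), 0.1, rng))
```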