Q-Learning Grid World Explorer

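Note: The numbers in each cell are the learned action values $Q(s, a)$ for that state. Blue highlights the greedy action in each state, i.e. the action with the highest current $Q$-value, which is the optimal action once the values have converged.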

$$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$$
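As a minimal sketch of the update rule above, the snippet below applies one step to a dictionary-backed Q-table. The function name `q_update`, the `(row, col)` state encoding, the four action names, and the `terminal` flag (which drops the bootstrap term at the end of an episode) are all assumptions made for illustration, not details taken from the explorer itself.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # max_{a'} Q(s', a'); assumed to be 0 when s' is terminal (no future reward).
    best_next = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
    # Temporal-difference error: r + gamma * max_{a'} Q(s', a') - Q(s, a)
    td_error = r + gamma * best_next - Q[(s, a)]
    # Move Q(s, a) a fraction alpha of the way toward the target.
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]

# Hypothetical action set and states, encoded as (row, col) tuples.
ACTIONS = ("up", "down", "left", "right")
Q = defaultdict(float)              # unseen (state, action) pairs default to 0.0
Q[((0, 1), "right")] = 1.0          # pretend the neighbouring cell already has value

new_value = q_update(Q, (0, 0), "right", 0.0, (0, 1), ACTIONS, alpha=0.5, gamma=0.9)
print(new_value)  # 0.45 = 0 + 0.5 * (0 + 0.9 * 1.0 - 0)
```

With $\alpha = 0.5$ the estimate moves halfway toward the target $r + \gamma \max_{a'} Q(s', a') = 0.9$, which is why the result is 0.45.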

Episodes (number of training episodes)

Alpha ($\alpha$, learning rate)

Gamma ($\gamma$, discount factor)
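To show where the three controls enter the algorithm, here is a rough training-loop sketch. Everything about the environment is assumed for illustration: a 4x4 grid, a start in the top-left corner, a goal in the bottom-right corner with reward 1, walls that leave the agent in place, and epsilon-greedy exploration with a fixed `epsilon`. The explorer's actual layout, rewards, and exploration scheme may differ.

```python
import random
from collections import defaultdict

# Hypothetical 4x4 grid: start top-left, goal bottom-right, reward 1 at the goal.
SIZE = 4
START, GOAL = (0, 0), (SIZE - 1, SIZE - 1)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Move one cell; bumping into a wall leaves the agent in place."""
    dr, dc = MOVES[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    next_state = (row, col)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def greedy_action(Q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(Q[(state, a)] for a in MOVES)
    return rng.choice([a for a in MOVES if Q[(state, a)] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Run tabular Q-learning and return the Q-table keyed by (state, action)."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = START, False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily.
            if rng.random() < epsilon:
                action = rng.choice(list(MOVES))
            else:
                action = greedy_action(Q, state, rng)
            next_state, reward, done = step(state, action)
            # No bootstrap term once the goal (terminal state) is reached.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in MOVES)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

Q = train()
# Print max_a Q(s, a) per cell; the goal cell stays 0 because it is never updated.
for row in range(SIZE):
    print(" ".join(f"{max(Q[((row, col), a)] for a in MOVES):5.2f}" for col in range(SIZE)))
```

In this sketch, more episodes give the values more time to propagate back from the goal, a larger `alpha` makes each update move further toward its target, and a `gamma` closer to 1 makes cells far from the goal retain more of its value.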