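Note: Numbers in cells are the current estimates of $Q(s, a)$. Blue highlights the greedy action for each state, i.e. the action with the highest current $Q(s, a)$. It coincides with the optimal action only once the estimates have converged.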
Update Rule
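$$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$$
Here $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward received after taking action $a$ in state $s$, and $s'$ is the resulting next state. The bracketed term is the temporal-difference error: the gap between the bootstrapped target $r + \gamma \max_{a'} Q(s', a')$ and the current estimate.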
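The update rule can be sketched as a single tabular step in Python. This is a minimal illustration under stated assumptions, not the demo's own code. The function names `q_update` and `greedy_action` are hypothetical. The Q-table is assumed to be a dict keyed by `(state, action)` with unseen entries defaulting to 0. The `terminal` flag is also an assumption: it is common practice to drop the bootstrap term when $s'$ ends the episode, although the rule above does not spell this out.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, terminal=False):
    """One tabular Q-learning step; mutates Q in place and returns the new Q(s, a).

    Hypothetical helper: Q is assumed to be a dict keyed by (state, action).
    """
    # Bootstrapped estimate of future value: max over a' of Q(s', a').
    # Assumption: no future value is added when s' is terminal.
    best_next = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next   # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[(s, a)]    # the bracketed term in the update rule
    Q[(s, a)] += alpha * td_error       # move the estimate a fraction alpha toward the target
    return Q[(s, a)]

def greedy_action(Q, s, actions):
    """Action with the highest current Q(s, a); ties go to the first in `actions`."""
    return max(actions, key=lambda a: Q[(s, a)])

if __name__ == "__main__":
    Q = defaultdict(float)              # unseen (state, action) pairs start at 0.0
    actions = ["up", "down", "left", "right"]
    Q[(1, "right")] = 4.0               # illustrative prior estimate for the next state
    # target = 1.0 + 0.75 * 4.0 = 4.0; new Q(0, "right") = 0.0 + 0.5 * (4.0 - 0.0) = 2.0
    print(q_update(Q, 0, "right", r=1.0, s_next=1, actions=actions, alpha=0.5, gamma=0.75))
    print(greedy_action(Q, 0, actions))
```

The values $\alpha = 0.5$ and $\gamma = 0.75$ are chosen only so the arithmetic is exact. Running the script prints `2.0` and then `right`. A `defaultdict(float)` keeps the table sparse, so only visited state-action pairs take up memory.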