GridWorld RL Lab

The world (click a cell to toggle a wall, drag S or G, hover to inspect)

Press Play

The blue dot is the agent. The faint line is its current path. The gold dashed line is the shortest route. S is the start, G is the goal, grey is a wall, black is a cliff.
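One way the gold dashed "shortest route" could be computed is a breadth-first search from S to G. The sketch below is an illustration, not the lab's actual code: the grid encoding (`#` for a wall, `C` for a cliff), the 4-directional moves, and the name `shortest_route` are all assumptions.

```python
from collections import deque

def shortest_route(grid, start, goal):
    """Breadth-first search over a 4-connected grid.

    grid is a list of equal-length strings; '#' marks a wall and 'C' a cliff
    (both symbols are assumptions for this sketch). Returns the list of
    (row, col) cells from start to goal, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            nxt = (nr, nc)
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] not in "#C" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None

# A small hypothetical layout: S bottom-left, G bottom-right, cliff between them.
grid = [
    "....",
    ".##.",
    "SCCG",
]
route = shortest_route(grid, (2, 0), (2, 3))
print(route)
```

Because every move costs the same, breadth-first search returns a route with the fewest steps. Toggling a wall only requires running the search again on the edited grid.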

Learned value of each cell

The update equation will appear here as it trains.

Reward per episode

Algorithm & world

Editing the grid restarts learning, because the old values no longer apply. The Cliff layout shows the difference between Q-learning and SARSA most clearly.
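The difference between the two algorithms comes down to one term in the update. The sketch below shows the standard tabular Q-learning and SARSA updates; the function names, the action labels, and the dictionary-based table are assumptions for illustration, and the defaults for `alpha` and `gamma` mirror the slider values shown on this page.

```python
from collections import defaultdict

ACTIONS = ("up", "down", "left", "right")  # assumed action labels

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.30, gamma=0.95):
    # Off-policy: bootstrap from the best action available in s_next,
    # regardless of which action the agent will actually take there.
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.30, gamma=0.95):
    # On-policy: bootstrap from the action a_next that was actually chosen,
    # so exploratory missteps near the cliff lower the value of nearby cells.
    next_value = 0.0 if done else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])

# Hypothetical situation: from s1, "right" is safe but "down" falls off the cliff.
Q_ql = defaultdict(float)
Q_sarsa = defaultdict(float)
for table in (Q_ql, Q_sarsa):
    table[("s1", "right")] = 10.0
    table[("s1", "down")] = -100.0

# The agent moves s0 -> s1 with reward -1, then explores "down" in s1.
q_learning_update(Q_ql, "s0", "right", -1.0, "s1", done=False)
sarsa_update(Q_sarsa, "s0", "right", -1.0, "s1", "down", done=False)
print(Q_ql[("s0", "right")], Q_sarsa[("s0", "right")])
```

In this example Q-learning raises the value of the cell next to the cliff because it assumes the best follow-up action, while SARSA lowers it because the exploratory step into the cliff counts against it. That is consistent with the behaviour the Cliff layout is meant to show.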

Hyperparameters (live)

α (learning rate): 0.30

γ (discount): 0.95

ε (exploration rate): 1.00

slip noise: 0.00

speed (steps/s): 8

auto-decay ε each episode

show optimal-path overlay
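To make the ε, slip noise, and auto-decay settings concrete, here is a minimal sketch of how they might act on each step. It is an illustration under stated assumptions: the helper names, the uniform random slip, and the multiplicative decay with `rate=0.99` and `floor=0.05` are hypothetical and not taken from the lab itself.

```python
import random

ACTIONS = ("up", "down", "left", "right")  # assumed action labels

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one.

    q_values maps each action to its current estimate for the agent's cell.
    """
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])

def apply_slip(action, slip, rng):
    """With probability slip, the world replaces the intended action with a
    uniformly random one (an assumed noise model)."""
    if rng.random() < slip:
        return rng.choice(ACTIONS)
    return action

def decay_epsilon(epsilon, rate=0.99, floor=0.05):
    """Multiplicative decay applied once per episode; rate and floor are assumed."""
    return max(floor, epsilon * rate)

rng = random.Random(0)
q_here = {"up": 0.0, "down": 1.0, "left": 0.5, "right": -1.0}

# With epsilon = 0 the choice is purely greedy; with slip = 0 it is executed as chosen.
chosen = epsilon_greedy(q_here, 0.0, rng)
executed = apply_slip(chosen, 0.0, rng)

# Start fully exploratory and decay once per episode.
eps = 1.0
for _ in range(500):
    eps = decay_epsilon(eps)
print(chosen, executed, eps)
```

Keeping a floor on ε means the agent never stops exploring entirely, which matters when the grid is edited and learning restarts.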

Controls

Presets & share

Status readouts: Episode, Step, Last steps to goal, Phase, Optimal length, Episode reward.

How to try it. Press Play and let the agent wander while exploration is high. After it has reached the goal for several episodes, drag the ε slider down to about 0.05: the agent then follows its learned values almost straight to the goal and keeps repeating that path. Switch to the Cliff layout to see how Q-learning hugs the dangerous edge while SARSA takes a safer route farther from it. Click any cell to drop a wall and watch the route change as learning restarts.