# Reinforcement Learning: an Easy Introduction to Value Iteration | by Carl Bettosi | Sep, 2023

## Solving the example using Value Iteration

VI should make even more sense once we complete an example problem, so let’s get back to our golf MDP. We have formalised the problem as an MDP, but the agent doesn’t yet know the best strategy for playing golf, so let’s solve the golf MDP using VI.

We’ll start by defining our model parameters using fairly standard values:

```
γ = 0.9  // discount factor
θ = 0.01 // convergence threshold
```
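To make the walkthrough concrete, here is one way the parameters and the golf MDP itself could be encoded in Python. The nested-dict encoding and the `(next_state, probability, reward)` tuple layout are my own assumptions; the probabilities and rewards are the ones used in the calculations below.

```python
# Model parameters from the text
GAMMA = 0.9   # discount factor γ
THETA = 0.01  # convergence threshold θ

# Golf MDP as nested dicts (an assumed encoding, not from the article):
# transitions[state][action] -> list of (next_state, probability, reward)
transitions = {
    "s0": {
        "hit to green": [("s0", 0.1, 0), ("s1", 0.9, 0)],
    },
    "s1": {
        "hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
        "hit in hole": [("s1", 0.1, 0), ("s2", 0.9, 10)],
    },
    "s2": {},  # terminal state: no actions available
}
```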

We will then initialise our value table to 0 for states in S:

```
// value table
V(s0) = 0
V(s1) = 0
V(s2) = 0
```

We can now start in the outer loop:

`Δ = 0`

And three passes of the inner loop for each state in S:

```
// Bellman update rule
// V(s) ← maxₐ Σₛ′,ᵣ p(s′, r | s, a) * [r + γV(s′)]

//******************* state s0 *******************//
v = 0

// we have only looked at one action here as only one is available from s0
// we know that the others are not possible and would therefore sum to 0
V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
            T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
V(s0) = max[0.1 * (0 + 0.9 * 0) + 0.9 * (0 + 0.9 * 0)]
V(s0) = max[0] = 0

// Delta update rule
// Δ ← max(Δ, |v - V(s)|)
Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 0|] = 0

//******************* state s1 *******************//
v = 0

// there are 2 available actions here
V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
            T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
            T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
            T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
V(s1) = max[0.9 * (0 + 0.9 * 0) + 0.1 * (0 + 0.9 * 0),
            0.1 * (0 + 0.9 * 0) + 0.9 * (10 + 0.9 * 0)]
V(s1) = max[0, 9] = 9
Δ = max[Δ, |v - V(s1)|] = max[0, |0 - 9|] = 9

//******************* state s2 *******************//
// terminal state with no actions
```
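The inner-loop pass above can be sketched as a single Bellman backup function. This assumes the nested-dict encoding of the MDP shown here; the function name is illustrative.

```python
GAMMA = 0.9  # discount factor γ

# Golf MDP, assumed encoding: transitions[state][action] -> (next_state, p, r)
transitions = {
    "s0": {"hit to green": [("s0", 0.1, 0), ("s1", 0.9, 0)]},
    "s1": {
        "hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
        "hit in hole": [("s1", 0.1, 0), ("s2", 0.9, 10)],
    },
    "s2": {},  # terminal state
}

def bellman_backup(V, s):
    """V(s) <- max over actions of sum_{s'} p(s'|s,a) * (r + GAMMA * V(s'))."""
    return max(
        sum(p * (r + GAMMA * V[nxt]) for nxt, p, r in outcomes)
        for outcomes in transitions[s].values()
    )

# First sweep of the inner loop, updating values in place:
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
V["s0"] = bellman_backup(V, "s0")  # → 0.0
V["s1"] = bellman_backup(V, "s1")  # → 9.0, as in the hand calculation
```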

This gives us the following update to our value table:

```
V(s0) = 0
V(s1) = 9
V(s2) = 0
```

We don’t need to worry about s2 as this is a terminal state, meaning no actions are possible here.

We now break out of the inner loop and continue the outer loop, performing the convergence check:

`Δ < θ = 9 < 0.01 = False`

Since we have not converged, we do a second iteration of the outer loop:

`Δ = 0`

And another 3 passes of the inner loop, using the updated value table:

```
//******************* state s0 *******************//
v = 0
V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
            T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
V(s0) = max[0.1 * (0 + 0.9 * 0) + 0.9 * (0 + 0.9 * 9)]
V(s0) = max[7.29] = 7.29
Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 7.29|] = 7.29

//******************* state s1 *******************//
v = 9
V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
            T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
            T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
            T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
V(s1) = max[0.9 * (0 + 0.9 * 7.29) + 0.1 * (0 + 0.9 * 9),
            0.1 * (0 + 0.9 * 9) + 0.9 * (10 + 0.9 * 0)]
V(s1) = max[6.7149, 9.81] = 9.81
Δ = max[Δ, |v - V(s1)|] = max[7.29, |9 - 9.81|] = 7.29

//******************* state s2 *******************//
// terminal state with no actions
```

At the end of the second iteration, our values are:

```
V(s0) = 7.29
V(s1) = 9.81
V(s2) = 0
```

Check convergence once again:

`Δ < θ = 7.29 < 0.01 = False`

Still no convergence, so we repeat the same process until Δ < θ. I won’t show all the calculations; the two iterations above are enough to understand the process.

After 6 iterations, our values converge. Here is how the values and the error Δ change over each iteration:

```
Iteration   V(s0)          V(s1)          V(s2)   Δ              Converged
1           0              9              0       9              False
2           7.29           9.81           0       7.29           False
3           8.6022         9.8829         0       1.3122         False
4           8.779447       9.889461       0       0.177247       False
5           8.80061364     9.89005149     0       0.02116664     False
6           8.8029969345   9.8901046341   0       0.0023832945   True
```
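The whole procedure can be reproduced with a short script (same assumed dict encoding as before; values are updated in place in the sweep order s0, s1, matching the hand calculation):

```python
GAMMA, THETA = 0.9, 0.01  # discount factor γ, convergence threshold θ

# Golf MDP, assumed encoding: transitions[state][action] -> (next_state, p, r)
transitions = {
    "s0": {"hit to green": [("s0", 0.1, 0), ("s1", 0.9, 0)]},
    "s1": {
        "hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
        "hit in hole": [("s1", 0.1, 0), ("s2", 0.9, 10)],
    },
    "s2": {},  # terminal state
}

V = {s: 0.0 for s in transitions}  # value table initialised to 0
iterations = 0
while True:  # outer loop
    iterations += 1
    delta = 0.0  # Δ
    for s, actions in transitions.items():  # inner loop over states
        if not actions:  # terminal state: nothing to update
            continue
        v = V[s]
        V[s] = max(  # Bellman update rule
            sum(p * (r + GAMMA * V[nxt]) for nxt, p, r in outcomes)
            for outcomes in actions.values()
        )
        delta = max(delta, abs(v - V[s]))  # Delta update rule
    if delta < THETA:  # convergence check
        break

print(iterations)  # → 6
print(V)           # V(s0) ≈ 8.803, V(s1) ≈ 9.890, V(s2) = 0
```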

Now we can extract our policy:

```
// Policy extraction rule
// π(s) = argmaxₐ Σₛ′,ᵣ p(s′, r | s, a) * [r + γV(s′)]

//******************* state s0 *******************//
// we know there is only one possible action from s0, but let's just do it anyway
π(s0) = argmax[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
               T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]
π(s0) = argmax[0.1 * (0 + 0.9 * 8.8029969345) + 0.9 * (0 + 0.9 * 9.8901046341)]
π(s0) = argmax[8.80325447773]
π(s0) = hit to green

//******************* state s1 *******************//
π(s1) = argmax[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
               T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
               T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
               T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]
π(s1) = argmax[0.9 * (0 + 0.9 * 8.8029969345) + 0.1 * (0 + 0.9 * 9.8901046341),
               0.1 * (0 + 0.9 * 9.8901046341) + 0.9 * (10 + 0.9 * 0)]
π(s1) = argmax[8.02053693401, 9.89010941707]
π(s1) = hit in hole
```
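Policy extraction follows the same pattern, only taking the argmax over actions instead of the max over values. A minimal sketch, again using the assumed dict encoding and the converged values rounded from the table above:

```python
GAMMA = 0.9  # discount factor γ

# Golf MDP, assumed encoding: transitions[state][action] -> (next_state, p, r)
transitions = {
    "s0": {"hit to green": [("s0", 0.1, 0), ("s1", 0.9, 0)]},
    "s1": {
        "hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
        "hit in hole": [("s1", 0.1, 0), ("s2", 0.9, 10)],
    },
    "s2": {},  # terminal state
}

# Converged values (rounded) from the table above
V = {"s0": 8.803, "s1": 9.8901, "s2": 0.0}

def extract_policy(V):
    """pi(s) = argmax over actions of sum_{s'} p(s'|s,a) * (r + GAMMA * V(s'))."""
    policy = {}
    for s, actions in transitions.items():
        if not actions:
            policy[s] = None  # terminal state: no action
            continue
        policy[s] = max(
            actions,  # iterate over action names, compare their backed-up values
            key=lambda a: sum(p * (r + GAMMA * V[nxt]) for nxt, p, r in actions[a]),
        )
    return policy

policy = extract_policy(V)
print(policy)  # {'s0': 'hit to green', 's1': 'hit in hole', 's2': None}
```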

Our final policy is:

```
π(s0) = hit to green
π(s1) = hit in hole
π(s2) = terminal state (no action)
```

So, when our agent is in the Ball on fairway state (s0), the best action is to hit to green. This seems pretty obvious since that is the only available action. However, in s1, where there are two possible actions, our policy has learned to hit in hole. We can now give this learned policy to other agents who want to play golf!

And there you have it! We have just solved a very simple RL problem using Value Iteration.

This post originally appeared on TechToday.