This assignment involves finding an optimal policy for a given MDP using iterative techniques such as value iteration, policy iteration, and their variants. The analysis will also examine how the individual components that define an MDP affect the optimal policy.
Consider an autonomous car in the grid world environment. This car comes with a hybrid engine, allowing it to run on petrol or electricity. It is capable of moving in four directions: top, down, right, and left. Due to construction work, holes have been dug up in the streets of the grid world. Therefore, the autonomous car needs to find an optimal policy (the action it needs to take in every grid cell) to avoid falling into the holes and drive safely to its destination.
The environment is represented as a 2D grid world with discrete grid cells. Specific cells are designated as 'H,' indicating the presence of a hole. The garage is denoted by 'S,' where the vehicle is initially parked, and the destination is marked as 'G.' Cells that are neither holes nor the goal are labelled as 'F,' indicating free cells.
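One way to hold such a grid in code is as a list of strings, one per row. The sketch below is an assumption, not the provided maps: `example_map` is a hypothetical layout, and the coordinate convention (y counted from the bottom row, so that "top" increases y) is chosen to match the action definitions that follow.

```python
# Hypothetical example map; the real small_map / large_map are provided
# with the assignment. 'S' start, 'G' goal, 'H' hole, 'F' free.
example_map = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

def parse_grid(grid):
    """Map (x, y) coordinates to cell labels.

    x is the column index; y is counted from the bottom row, so the
    'top' action (y + 1) moves up the printed map.
    """
    n_rows = len(grid)
    cells = {}
    for r, row in enumerate(grid):
        for x, label in enumerate(row):
            cells[(x, n_rows - 1 - r)] = label
    return cells

cells = parse_grid(example_map)
start = next(pos for pos, label in cells.items() if label == "S")
```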
$$
\begin{aligned}
\text{right}(x, y) &= (x+1, y) \\
\text{left}(x, y) &= (x-1, y) \\
\text{top}(x, y) &= (x, y+1) \\
\text{down}(x, y) &= (x, y-1)
\end{aligned}
$$
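These four deterministic moves translate directly into a small lookup table of coordinate offsets. The function name `move` is illustrative; note that it deliberately performs no bounds checking, since how out-of-grid moves are handled is part of the transition model you define.

```python
# The four actions from the definitions above, as (dx, dy) offsets.
ACTIONS = {
    "right": (1, 0),
    "left": (-1, 0),
    "top": (0, 1),
    "down": (0, -1),
}

def move(state, action):
    """Apply an action to a state (x, y). No bounds checking here."""
    x, y = state
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy)
```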
In free cells, the agent receives `living_reward`. In terminal states, where the agent encounters a hole or reaches the goal, it receives the respective rewards, namely `hole_reward` and `goal_reward`. Formally, the reward model $R(s, a, s^\prime)$ is defined as

$$ R(s, a, s^\prime) = \begin{cases} \mathtt{living\_reward}, & \text{if } s^\prime = \text{F} \\ \mathtt{hole\_reward}, & \text{if } s^\prime = \text{H} \\ \mathtt{goal\_reward}, & \text{if } s^\prime = \text{G} \end{cases} $$
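Since the reward depends only on the label of the landing cell $s^\prime$, it can be sketched as a simple dispatch. The numeric defaults below are placeholders for illustration, not the assignment's actual parameter values.

```python
def reward(next_label, living_reward=-0.04, hole_reward=-1.0, goal_reward=1.0):
    """Reward R(s, a, s') as a function of the landing cell's label.

    Parameter values are placeholders; substitute the assignment's
    defaults. 'S' is treated like a free cell.
    """
    if next_label == "H":
        return hole_reward
    if next_label == "G":
        return goal_reward
    return living_reward
```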
Your task is to use iterative methods to solve the given MDP and find the optimal policy $\pi^*$. Your implementation should be general enough to work for any given grid. You are provided with two maps, `small_map` and `large_map`, to test your implementation. Assume the following default values for the parameters of the transition and reward models in your implementation.