This assignment involves finding an optimal policy for a given MDP using iterative techniques such as value iteration, policy iteration, and their variants. The analysis will also examine how the individual components that define an MDP affect the optimal policy.
Consider an autonomous car in the grid world environment. This car comes with a hybrid engine, allowing it to run on petrol or electricity. It is capable of moving in four directions: top, down, right, and left. Due to construction work, holes have been dug up in the streets of the grid world. Therefore, the autonomous car needs to find an optimal policy (the action it needs to take in every grid cell) to avoid falling into the holes and drive safely to its destination.
The environment is represented as a 2D grid world with discrete grid cells. Specific cells are designated as 'H,' indicating the presence of a hole. The garage is denoted by 'S,' where the vehicle is initially parked, and the destination is marked as 'G.' Cells that are neither holes nor the goal are labelled as 'F,' indicating free cells.
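One way to hold such a grid in code is as a list of strings, one per row. The sketch below is an assumption, not the provided maps: `example_map` is a hypothetical layout, and the coordinate convention (y counted from the bottom row, so that "top" increases y) is chosen to match the action definitions that follow.

```python
# Hypothetical example map; the real small_map / large_map are provided
# with the assignment. 'S' start, 'G' goal, 'H' hole, 'F' free.
example_map = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

def parse_grid(grid):
    """Map (x, y) coordinates to cell labels.

    x is the column index; y is counted from the bottom row, so the
    'top' action (y + 1) moves up the printed map.
    """
    n_rows = len(grid)
    cells = {}
    for r, row in enumerate(grid):
        for x, label in enumerate(row):
            cells[(x, n_rows - 1 - r)] = label
    return cells

cells = parse_grid(example_map)
start = next(pos for pos, label in cells.items() if label == "S")
```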
$$
\begin{aligned}
\text{right}(x, y) &= (x+1, y) \\
\text{left}(x, y) &= (x-1, y) \\
\text{top}(x, y) &= (x, y+1) \\
\text{down}(x, y) &= (x, y-1)
\end{aligned}
$$
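These four deterministic moves translate directly into a small lookup table of coordinate offsets. The function name `move` is illustrative; note that it deliberately performs no bounds checking, since how out-of-grid moves are handled is part of the transition model you define.

```python
# The four actions from the definitions above, as (dx, dy) offsets.
ACTIONS = {
    "right": (1, 0),
    "left": (-1, 0),
    "top": (0, 1),
    "down": (0, -1),
}

def move(state, action):
    """Apply an action to a state (x, y). No bounds checking here."""
    x, y = state
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy)
```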
In free cells, the agent receives `living_reward`. In terminal states, where the agent encounters a hole or reaches the goal, it receives the respective rewards, namely `hole_reward` and `goal_reward`. Formally, the reward model $R(s, a, s^\prime)$ is defined as

$$ R(s, a, s^\prime) = \begin{cases} \mathtt{living\_reward}, & \text{if } s^\prime = \text{F} \\ \mathtt{hole\_reward}, & \text{if } s^\prime = \text{H} \\ \mathtt{goal\_reward}, & \text{if } s^\prime = \text{G} \end{cases} $$
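Since the reward depends only on the label of the landing cell $s^\prime$, it can be sketched as a simple dispatch. The numeric defaults below are placeholders for illustration, not the assignment's actual parameter values.

```python
def reward(next_label, living_reward=-0.04, hole_reward=-1.0, goal_reward=1.0):
    """Reward R(s, a, s') as a function of the landing cell's label.

    Parameter values are placeholders; substitute the assignment's
    defaults. 'S' is treated like a free cell.
    """
    if next_label == "H":
        return hole_reward
    if next_label == "G":
        return goal_reward
    return living_reward
```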
Your task is to use iterative methods to solve the given MDP and find the optimal policy $\pi^*$. Your implementation should be general enough to work for any given grid. You are provided with two maps, `small_map` and `large_map`, to test your implementation. Assume the following default values for the parameters of the transition and reward models in your implementation.