Introduction

This assignment involves finding an optimal policy for a given MDP. Specifically, the implementation will focus on iterative techniques such as value iteration, policy iteration, and their variants. Additionally, the analysis will examine how the various components that define an MDP affect the optimal policy.

Domain Description

Consider an autonomous car in a grid world environment. The car has a hybrid engine, allowing it to run on petrol or electricity, and it can move in four directions: top, down, right, and left. Due to construction work, holes have been dug in the streets of the grid world. The autonomous car therefore needs to find an optimal policy (the action to take in every grid cell) that avoids falling into the holes and drives it safely to its destination.

Environment

The environment is represented as a 2D grid world with discrete grid cells. Specific cells are designated as 'H,' indicating the presence of a hole. The garage is denoted by 'S,' where the vehicle is initially parked, and the destination is marked as 'G.' Cells that are neither holes nor the goal are labelled as 'F,' indicating free cells.
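For concreteness, such a map might be encoded as a list of strings with one character per cell. The 4x4 layout below is only an illustrative assumption, not the contents of the provided small_map or large_map.

```python
# A hypothetical 4x4 grid using the cell labels above
# (the actual small_map and large_map may look different).
example_map = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]
```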

Transition Model

$$ \begin{aligned} \text{right}(x, y) &= (x+1, y) \\ \text{left}(x, y) &= (x-1, y) \\ \text{top}(x, y) &= (x, y+1) \\ \text{down}(x, y) &= (x, y-1) \end{aligned} $$
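In code, this deterministic transition model might be sketched as below. The boundary behaviour (staying in place when a move would leave the grid) is an assumption, since it is not specified above.

```python
# Offsets for the four actions, matching the transition model above.
MOVES = {
    "right": (1, 0),
    "left": (-1, 0),
    "top": (0, 1),
    "down": (0, -1),
}

def next_cell(x, y, action, width, height):
    """Apply a deterministic move; stay put if it would leave the grid (assumed)."""
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < width and 0 <= ny < height:
        return nx, ny
    return x, y
```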

Reward Model

$$ R(s, a, s^\prime) = \begin{cases} \mathtt{living\_reward}, & \text{if } s^\prime = \text{F} \\ \mathtt{hole\_reward}, & \text{if } s^\prime = \text{H} \\ \mathtt{goal\_reward}, & \text{if } s^\prime = \text{G} \end{cases} $$
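A matching reward lookup might simply inspect the label of the cell the car lands in. Treating the start cell 'S' like a free cell is an assumption here, since it is not covered by the cases above.

```python
def reward(next_label, living_reward, hole_reward, goal_reward):
    # Reward depends only on the type of cell the car lands in,
    # mirroring the three cases of R(s, a, s').
    if next_label == "H":
        return hole_reward
    if next_label == "G":
        return goal_reward
    return living_reward  # 'F' (and, by assumption, 'S') count as free cells
```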

Your Implementation

Part A: Solving for optimal policy

Your task is to use iterative methods to solve the given MDP and find the optimal policy $\pi^*$. Your implementation should be general enough to work for any given grid. You are provided with two maps, small_map and large_map, to test your implementation. Assume the following default values for the parameters of the transition and reward models in your implementation.
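As a rough starting point, a self-contained value-iteration sketch on a toy grid is given below. The map, discount factor, convergence threshold, reward values, and boundary behaviour are placeholder assumptions; they should be replaced by the provided maps and the default parameters listed next.

```python
# Minimal value-iteration sketch on a toy grid (all values are placeholders).
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
ROWS, COLS = len(GRID), len(GRID[0])
ACTIONS = {"right": (1, 0), "left": (-1, 0), "top": (0, 1), "down": (0, -1)}
GAMMA, THETA = 0.9, 1e-6                                     # assumed discount / stopping threshold
LIVING_REWARD, HOLE_REWARD, GOAL_REWARD = -0.04, -1.0, 1.0   # assumed reward values

def cell(x, y):
    # y = 0 is the bottom row, so top(x, y) = (x, y + 1) moves up, as in the transition model.
    return GRID[ROWS - 1 - y][x]

def step(x, y, action):
    # Deterministic move; staying put at the grid boundary is an assumption.
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if not (0 <= nx < COLS and 0 <= ny < ROWS):
        nx, ny = x, y
    r = {"H": HOLE_REWARD, "G": GOAL_REWARD}.get(cell(nx, ny), LIVING_REWARD)
    return (nx, ny), r

def value_iteration():
    V = {(x, y): 0.0 for y in range(ROWS) for x in range(COLS)}
    while True:
        new_V, delta = {}, 0.0
        for (x, y), v in V.items():
            if cell(x, y) in "HG":             # holes and the goal are terminal
                new_V[(x, y)] = 0.0
                continue
            new_V[(x, y)] = max(r + GAMMA * V[s2]
                                for s2, r in (step(x, y, a) for a in ACTIONS))
            delta = max(delta, abs(new_V[(x, y)] - v))
        V = new_V
        if delta < THETA:
            return V

def greedy_policy(V):
    # Extract the policy by acting greedily with respect to the converged values.
    policy = {}
    for (x, y) in V:
        if cell(x, y) in "HG":
            continue
        q_values = {}
        for a in ACTIONS:
            s2, r = step(x, y, a)
            q_values[a] = r + GAMMA * V[s2]
        policy[(x, y)] = max(q_values, key=q_values.get)
    return policy

if __name__ == "__main__":
    values = value_iteration()
    print(greedy_policy(values))
```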