Introduction

The goal of this assignment is to model and solve a practical problem using search-based techniques.

Consider a scenario where a company is digitising its handwritten paper records containing text in English. The person carrying out the task scans a paper document and then uses an OCR (Optical Character Recognition) system to extract the text. The output of the OCR system is highly noisy, and hence characters may be incorrectly extracted. For example, if the original text in the document was "i am studying ai", it may be extracted as "i an studying aj", with errors in two characters. Consequently, the digitised document contains imperfect sentences that may not be proper English.

In this assignment, we apply search-based methods for determining a possible correction for the input erroneous text.

Problem Definition

The problem takes as input a sentence: a sequence of lowercase words separated by white spaces. The words may have been erroneously extracted during the digitisation process. The output is a sentence that represents a "good" correction of the input.

Input erroneous sentence: "i an stydling artificial intelligence"
Output corrected sentence: "i am studying artificial intelligence"

Further, it has been observed from prior experience that the OCR commonly confuses certain characters with other characters that resemble them in shape, orientation, etc. For example, the character "a" is incorrectly extracted as "e", "i" or "o", and less often as "l". This information regarding the possible incorrect replacements for each character is given in a file called conf_matrix.json. The format is shown below.

a → [e, i, o]
p → [m, b]
...
...
s → [c, t, z, x]
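As a concrete illustration, the confusion data might be loaded and inverted as sketched below. This is an assumption about the file's schema (a flat JSON object mapping each true character to the list of characters the OCR may produce instead); the function names `invert_confusions` and `load_confusions` are hypothetical, not part of the starter code. Inverting the map is useful because, when correcting, we start from an *observed* character and ask which original characters could have produced it.

```python
import json
from collections import defaultdict

def invert_confusions(conf):
    """Given {true_char: [chars the OCR may output instead]}, build the
    inverse map {observed_char: {possible true chars}}: for each
    character we actually see, which characters could it have been?"""
    inverse = defaultdict(set)
    for true_ch, observed in conf.items():
        for obs in observed:
            inverse[obs].add(true_ch)
    return dict(inverse)

def load_confusions(path="conf_matrix.json"):
    # Assumed schema: a flat JSON object of character -> list of characters.
    with open(path) as f:
        return invert_confusions(json.load(f))
```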

Note that, given an erroneous string and the information regarding possible character errors, there may be a large number of possible corrections for a given input. A search-based algorithm will need to explore this space of possible corrections. The algorithm bases its decisions on a cost function inversely related to how likely (or fluent) the sentence is in the English language. The intuition is that the lower the cost, the more likely the sentence is to be error-free. For example, the sentence "i am studying ai" is a more fluent (more likely) English sentence than "i an studying aj"; therefore, the former has a lower cost and is more likely to be error-free.
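To make the search space concrete, a neighbourhood (successor) function could generate every sentence that differs from the current one in exactly one character, using the inverted confusion information. This is an illustrative sketch, not the required design; `inverse` is assumed to map an observed character to the set of characters it could originally have been.

```python
def successors(sentence, inverse):
    """Generate candidate corrections differing from `sentence` in
    exactly one character position, substituting each character with
    every original character the OCR could have confused it from."""
    out = []
    for i, ch in enumerate(sentence):
        for orig in sorted(inverse.get(ch, ())):
            out.append(sentence[:i] + orig + sentence[i + 1:])
    return out
```

Even with a handful of confusable characters per position, the number of multi-character corrections grows exponentially with sentence length, which is why an informed search over this space is needed.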

In this assignment, the cost function is provided to you. Internally, the cost function is realised by a language model, which determines the fluency of a sentence as discussed above. A detailed understanding of language models is beyond the scope of this assignment.

Formally, let $f_{cost}(\cdot)$ denote the cost function that takes an input sentence $s$ consisting of a sequence of words $\{w_0, w_1, \dots, w_{n-1}\}$ and outputs a cost. In this assignment, the cost is evaluated as $f_{cost}(s) = -\log\big(p(w_0, w_1, w_2) \cdot p(w_1, w_2, w_3) \cdots p(w_{n-3}, w_{n-2}, w_{n-1})\big)$. That is, the cost function accumulates the probabilities of three consecutive words at a time. This cost evaluation is provided in the file **language_model.py** as part of the starter code and can be called directly by your algorithm.
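The formula above can be computed as a sum of negative log trigram probabilities, since the negative log of a product is the sum of the negative logs. The sketch below is only an illustration of the formula with an injected trigram-probability function; the actual evaluation you should call lives in **language_model.py**, whose interface may differ.

```python
import math

def f_cost(sentence, trigram_prob):
    """Illustrative cost: -log of the product of trigram probabilities,
    computed as a sum of -log p(w_i, w_{i+1}, w_{i+2}) over all
    consecutive word triples. `trigram_prob` is a stand-in for the
    language model provided in language_model.py."""
    words = sentence.split()
    total = 0.0
    for i in range(len(words) - 2):
        total += -math.log(trigram_prob(words[i], words[i + 1], words[i + 2]))
    return total
```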

Finally, you can assume there is a maximum time available for your algorithm to run. This is specified via the parameter timeout.
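One common way to respect such a budget is an anytime loop that always keeps the best sentence found so far and stops when the deadline passes. The following greedy hill-climbing sketch is purely illustrative (the function name, the choice of hill climbing, and the `successors`/`cost` callables are assumptions, not the required algorithm).

```python
import time

def best_within_timeout(start_sentence, successors, cost, timeout):
    """Anytime greedy search sketch: repeatedly move to the cheapest
    improving neighbour, returning the best sentence seen when the
    timeout (in seconds) elapses or no neighbour improves the cost."""
    deadline = time.monotonic() + timeout
    best, best_cost = start_sentence, cost(start_sentence)
    improved = True
    while improved and time.monotonic() < deadline:
        improved = False
        for cand in successors(best):
            if time.monotonic() >= deadline:
                break
            c = cost(cand)
            if c < best_cost:
                best, best_cost = cand, c
                improved = True
    return best
```

Because the best-so-far sentence is always available, the function returns a valid answer no matter when the deadline fires.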

Your Implementation