Introduction

Voice-enabled assistants (e.g., Alexa, Google Home, or other voice-triggered applications such as search on a phone) rely on automatic speech recognition (ASR) to understand and respond to spoken commands. An ASR system converts the audio of a person speaking to the device into the corresponding text for further processing. However, ASR systems make mistakes, which introduce errors into the text extracted from the input audio. These errors can cause the assistant to misunderstand commands and potentially perform the wrong actions.

In this assignment we consider two main types of errors: incorrect character recognition, and missing words at the start or end of the sentence.

Your goal is to develop an agent that fixes these errors by analysing the text and improving its accuracy. The agent will use a cost function that returns a coherence score (lower is better) for any candidate text, given the audio. Your agent must use a search-based method to fix the errors in the ASR output.

Problem Statement

The task is to develop a correction agent for text transcribed by an automatic speech recognition (ASR) system. The input to the agent is a text with all words in capital letters, separated by spaces. This text may contain errors introduced by the ASR system: incorrectly recognized characters, or missing words at the start or end of the sentence, as described above. To correct these errors, the agent will utilize two key resources:

Given an erroneous input text and the information about potential errors, there can be many possible corrections. To find the best correction, a search-based algorithm must be used. This algorithm should explore different correction options and use a cost function to evaluate them. The cost function determines how well each corrected sentence matches the original audio, with lower costs indicating more coherent and accurate sentences. The goal is to find the correction with the lowest cost, improving the accuracy of the ASR system's output.
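
To make the idea of cost-guided search concrete, here is a minimal sketch of greedy hill climbing, which is only one of many possible search strategies and not necessarily the best one for this task. Everything in it is an assumption for illustration: the `substitutions` table (a hypothetical map from a possibly misrecognized character to its alternatives), the function names, and `toy_cost`, which stands in for the provided cost function by counting mismatches against a fixed target sentence.

```python
from typing import Callable, Dict, List


def neighbours(text: str, substitutions: Dict[str, List[str]]) -> List[str]:
    """Return every text obtained by replacing ONE occurrence of a key
    in `substitutions` with one of its alternatives."""
    results = []
    for wrong, options in substitutions.items():
        start = text.find(wrong)
        while start != -1:
            for option in options:
                results.append(text[:start] + option + text[start + len(wrong):])
            start = text.find(wrong, start + 1)
    return results


def hill_climb(
    text: str,
    cost: Callable[[str], float],
    substitutions: Dict[str, List[str]],
    max_steps: int = 100,
) -> str:
    """Greedy local search: move to the cheapest neighbour until no
    neighbour has a strictly lower cost than the current text."""
    best, best_cost = text, cost(text)
    for _ in range(max_steps):
        candidates = neighbours(best, substitutions)
        if not candidates:
            break
        candidate_cost, candidate = min((cost(c), c) for c in candidates)
        if candidate_cost >= best_cost:
            break  # local minimum reached
        best, best_cost = candidate, candidate_cost
    return best


def toy_cost(s: str) -> float:
    """Hypothetical stand-in for the real cost function: number of
    character mismatches against a fixed target sentence."""
    target = "I SAW A CAT"
    return sum(a != b for a, b in zip(s, target)) + abs(len(s) - len(target))


# Hypothetical substitution table: character -> possible corrections.
table = {"Z": ["S"], "K": ["C"]}
corrected = hill_climb("I ZAW A KAT", toy_cost, table)
print(corrected)
```

Greedy search like this can stall in a local minimum and evaluates the cost function once per neighbour at every step, so in practice the number of cost evaluations is likely to be the quantity to keep an eye on.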

Cost Model: In this assignment, the cost function is provided to you and is implemented using the OpenAI Whisper model. Note that a detailed understanding of the Whisper model is not required for this assignment. Whisper is an ASR model that computes the likelihood of a text $s$ for a given audio $a$. Specifically, it breaks down the text $s$ into a sequence of tokens $[t_1, t_2, \ldots, t_n]$, where each token $t_i$ consists of one or more characters. Then, Whisper computes the negative log likelihood as:

$$ L(s, a) = -\log(P_{\theta}(s \mid a)) = -\sum_{i=1}^{n} \log(P_{\theta}(t_i \mid t_1, t_2, \ldots, t_{i-1}, a)) $$

Here, $\theta$ denotes the parameters of the Whisper model, which are obtained by large-scale training. In essence, the Whisper model provides a cost function, or coherence score, that relates the audio received to any candidate text that your algorithm may consider. Formally, the cost of a text $s$ for a given audio $a$ is expressed as $f_{cost}(s) = L(s,a)$.
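
The arithmetic behind $L(s, a)$ can be illustrated with a small sketch. It assumes the per-token conditional probabilities $P_{\theta}(t_i \mid t_1, \ldots, t_{i-1}, a)$ are already available as plain numbers; the values below are made up for illustration and do not come from Whisper, and the function name is hypothetical.

```python
import math
from typing import Sequence


def negative_log_likelihood(token_probs: Sequence[float]) -> float:
    """Sum of -log(p_i) over the per-token conditional probabilities,
    i.e. the negative log of their product."""
    return -sum(math.log(p) for p in token_probs)


# Made-up conditional probabilities for a three-token text.
probs = [0.9, 0.5, 0.8]
nll = negative_log_likelihood(probs)

# The sum of logs equals the log of the product of the probabilities.
joint = 0.9 * 0.5 * 0.8
print(nll, -math.log(joint))
```

Because every probability is at most 1, each term $-\log(p_i)$ is non-negative, so a lower total cost corresponds to a text the model considers more likely for the given audio.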

Implementation Details