I found this paper on arXiv and decided to read it closely to understand exactly what was going on, from a beginner’s perspective.

This paper attempts to create a simple unsupervised approach to commonsense reasoning using neural networks and deep learning.

Prerequisites

Before presenting the approach shown in the paper, let’s understand some of the prerequisites I’ve listed here.

Language Models

A language model is a probability distribution over sequences of words. Formally, given a text corpus whose vocabulary is the set of words $ V $, a language model assigns a probability to every sentence $ x_1, x_2, …, x_n $ in $ V’ $, the set of all sentences that can be constructed from the vocabulary, such that these probabilities sum to one.

A simple example of a poor language model is one that assigns each sentence its relative frequency in the training corpus. Let $ c(x_1, … , x_n) $ be the number of times the sentence is seen in the training corpus, and $ N $ the total number of sentences in the training corpus. The probability can then be defined as below. The model is poor because any sentence that never appears in the training corpus gets a probability of exactly zero, so it cannot generalize to new sentences.

$p(x_1, …, x_n) = \frac{c(x_1, …, x_n)}{N}$
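To make the formula above concrete, here is a minimal sketch of this count-based model in Python. The tiny corpus and the name `fit_count_model` are made up for illustration; they are not from the paper.

```python
from collections import Counter

def fit_count_model(corpus):
    """Return a function p(sentence) = c(sentence) / N.

    `corpus` is a list of tokenized sentences (lists of words).
    """
    counts = Counter(tuple(sentence) for sentence in corpus)
    total = len(corpus)  # N, the number of sentences in the corpus

    def prob(sentence):
        # Counter returns 0 for sentences it has never seen.
        return counts[tuple(sentence)] / total

    return prob

# Hypothetical toy corpus, purely illustrative.
corpus = [
    ["the", "dog", "barks"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
    ["a", "dog", "sleeps"],
]
p = fit_count_model(corpus)
print(p(["the", "dog", "barks"]))  # 0.5 (seen 2 times out of 4)
print(p(["the", "cat", "barks"]))  # 0.0 (never seen, although perfectly plausible)
```

The second call shows the weakness: a perfectly reasonable sentence gets zero probability just because it was not in the corpus.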

Speech recognition is one of the key applications of language models. Speech is processed to obtain a set of candidate sentences, which are then fed to a language model to pick the most probable one. This document brilliantly explains the details of defining a language model, how Markov Models are used for fixed-length sentences, the types of language models, and more. For the explanation of the paper being considered here, just knowing the input and output of a language model should be enough.

Winograd Schema Challenge

The Winograd Schema Challenge is a multiple-choice test designed to improve on the traditional AI benchmark, the Turing Test. It employs questions of a very specific structure called Winograd Schemas, named after Terry Winograd, a professor of CS at Stanford University. Quoting Wikipedia:

> Winograd Schema questions simply require the resolution of anaphora: the machine must identify the antecedent of an ambiguous pronoun in a statement. This makes it a task of natural language processing, but Levesque argues that for Winograd Schemas, the task requires the use of knowledge and commonsense reasoning.

The Winograd Schema Challenge was proposed in part to ameliorate the problems that came to light with the nature of the programs that performed well on the Turing Test. Essentially, a Winograd Schema consists of two noun phrases of similar semantic meaning, an ambiguous pronoun that may refer to either of those noun phrases, and two word choices such that each one leads to a different interpretation of the pronoun. A question then asks what the ambiguous pronoun refers to. A machine facing such a challenge cannot rely on statistical measures alone; that is the whole point of Winograd Schemas. Moreover, unlike the Turing Test, they don’t need human judges. The only pitfall is the difficulty of developing a Winograd Schema. Here’s an example of a Winograd Schema

More examples can be found here.

Pronoun Disambiguation Challenge

As you can probably tell from the heading, this challenge is very similar to the Winograd Schema Challenge. A great collection of pronoun disambiguation problems (PDPs) can be found here.

Both of the above challenges are closely related to the general Word Sense Disambiguation problem, which aims to identify which sense of a word is used in a sentence; here, the ambiguous word is a pronoun and the task is to pick what it refers to.

The Paper’s Approach

Now that we’ve understood the above concepts, it will be much easier to understand what the authors are trying to achieve. In Related Work, the authors mention an approach by Mikolov et al. in which, by predicting adjacent words in a sentence, word vectors can be made to answer analogy questions like Man:King::Woman:?. The authors use this as inspiration to show that language models are capable of capturing common sense. Since Winograd Schemas require much more contextual information, word vectors alone won’t suffice, hence the use of language models. Researchers have previously shown that pre-trained LMs can be used as feature representations for a sentence or a paragraph to improve NLP applications such as document classification, machine translation, and question answering.

Given a Winograd Schema such as “The trophy doesn’t fit in the suitcase because it is too big”, the authors substitute the two possible candidates, “suitcase” and “trophy”, into the pronoun position. A language model is then used to score the two resulting sentences.
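The substitution step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper name `substitute`, whitespace tokenization, and including the article in each candidate (“the trophy”) are choices made here so the output reads naturally, not details taken from the paper.

```python
def substitute(tokens, k, candidate):
    """Replace the token at 0-based index k with the candidate's tokens."""
    return tokens[:k] + candidate.split() + tokens[k + 1:]

sentence = "the trophy doesn't fit in the suitcase because it is too big".split()
k = sentence.index("it")  # position of the ambiguous pronoun

for candidate in ("the trophy", "the suitcase"):
    print(" ".join(substitute(sentence, k, candidate)))
```

Each printed line is one candidate sentence that would then be handed to the language model for scoring.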

The authors use two different scores, using full and partial representations of the candidate sentences.

Suppose the sentence $ S $ of $ n $ consecutive words has the pronoun to be resolved at the $ k^{th} $ position: $ S = \{w_1, …, w_{k-1}, w_k \equiv p, w_{k+1}, …, w_n\} $. The language model used by the authors defines the probability of word $ w_t $ conditioned on the previous words $ w_1, …, w_{t - 1} $. Substituting a candidate reference $ c $ into the pronoun position $ k $ results in a new sentence $ S_{w_k \leftarrow c} $. The two scores are then computed as follows:

  • $ Score_{full}(w_{k} \leftarrow c) = P(w_1, w_2, …, w_{k-1}, c, w_{k+1}, …, w_n) $
  • $ Score_{partial}(w_{k} \leftarrow c) = P(w_{k+1}, …, w_n | w_1, w_2, …, w_{k-1}, c) $

The full score measures how probable the entire substituted sentence is, while the partial score measures how probable the rest of the sentence is given the substituted candidate, that is, how well the candidate works as an antecedent for what follows.
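Here is a rough sketch of how the two scores could be computed, working in log-probabilities. Everything about the model is an assumption made for illustration: the paper uses large recurrent language models, whereas this uses a tiny add-one-smoothed bigram model trained on a made-up three-sentence corpus, so the conditioning on all previous words collapses to conditioning on the previous word only. The names `train_bigram` and `scores` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a stand-in for the large corpora used in practice.
CORPUS = [
    "the trophy is too big",
    "the suitcase is too small",
    "the trophy does not fit in the suitcase",
]

def train_bigram(sentences):
    """Return log_prob(word, prev) = log P(word | prev) with add-one smoothing."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = ["<s>"] + s.split()
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[prev, word] += 1
    size = len(vocab) + 1  # one extra slot for unseen words

    def log_prob(word, prev):
        return math.log((bigrams[prev, word] + 1) / (unigrams[prev] + size))

    return log_prob

def scores(tokens, k, candidate, log_prob):
    """Return (log Score_full, log Score_partial) for w_k <- candidate."""
    cand = candidate.split()
    new = ["<s>"] + tokens[:k] + cand + tokens[k + 1:]
    start = 1 + k + len(cand)  # index of w_{k+1} in `new`
    # steps[i] is the log-probability of new[i + 1] given new[i].
    steps = [log_prob(w, p) for p, w in zip(new, new[1:])]
    full = sum(steps)                 # every word of the substituted sentence
    partial = sum(steps[start - 1:])  # only the words after the candidate
    return full, partial

log_prob = train_bigram(CORPUS)
sentence = "the trophy does not fit in the suitcase because it is too big".split()
k = sentence.index("it")
for candidate in ("the trophy", "the suitcase"):
    full, partial = scores(sentence, k, candidate, log_prob)
    print(f"{candidate!r}: full={full:.3f} partial={partial:.3f}")
```

The candidate with the higher score is the predicted referent. With a toy model like this the numbers mean very little; the point is only to show which words each score sums over.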

Surprisingly, the results showed that partial scoring performs better than the naive full scoring strategy. Partial scoring corrected a large portion of the wrong predictions made by full scoring.


That is all for this summary. For more details on the recurrent language models used, how much better the strategy was than existing commonsense reasoning approaches, and one more interesting inference, I encourage you to read the complete paper :)