<p>Arpit Gogia: A blog about AI and stuff</p>
<h1><a href="https://arpitgogia.com/text-classification-based-on-word-subspace-with-term-frequency">Text Classification based on Word Subspace with Term Frequency</a></h1>
<p>2018-10-04</p>
<p>My most recent read is the research paper titled <a href="http://arxiv.org/abs/1806.03125">“Text Classification based on Word Subspace with Term Frequency”</a>.</p>
<p>Bag of Words (BoW) is a very common and proven technique used in various learning and statistical models. However, BoW doesn’t capture the semantic meaning of text in its representation. To mitigate this problem, neural networks were used to learn word vectors, a technique called <code class="highlighter-rouge">word2vec</code>. Word2vec is well known for embedding semantic structure into vectors, where the angle between two vectors indicates the semantic similarity between the corresponding words. The authors propose the novel concept of a <code class="highlighter-rouge">Word Subspace</code> to measure similarity between texts and represent the intrinsic variability of features in a set of word vectors. The model is further extended to incorporate the term frequency (TF) using a TF weighted word subspace. Mathematically, a word subspace is defined as a low-dimensional linear subspace in a high-dimensional word vector space.</p>
<h3 id="bag-of-words">Bag of Words</h3>
<p>The bag of words representation comes from the hypothesis that the frequencies of words in a document can indicate the relevance of the document to a query, i.e. if documents and a query have similar frequencies for the same words, they might have a similar meaning. This representation is based on the <a href="https://en.wikipedia.org/wiki/Vector_space_model">vector space model</a>. A document <script type="math/tex">d</script> can be represented by a vector in <script type="math/tex">{\Bbb R}^n</script>, where each dimension represents a different term. A term can be a single word or an <script type="math/tex">n</script>-gram.</p>
<h3 id="term-weightage">Term Weightage</h3>
<p>Term weightage can be defined in the following ways:</p>
<ul>
<li>Binary weight: The weight is 1 if the term occurs in the document and 0 otherwise.</li>
<li>Term-frequency weight (TF): The weight is the number of times the term occurs in the document <script type="math/tex">d</script>.</li>
<li>Inverse document-frequency (IDF): The weight is the ratio of the total number of documents <script type="math/tex">|D|</script> to the number of documents that contain the term <script type="math/tex">w</script>, <script type="math/tex">|D^w|</script>.</li>
<li>Term-frequency inverse document-frequency (TF-IDF): The weight of a term <script type="math/tex">w</script> is the product of its TF and IDF. TF weights alone do not account for terms that are frequent throughout the corpus. With IDF, words that are common across all documents in <script type="math/tex">D</script> receive a smaller weight, giving more credence to rare terms in the corpus. <br /></li>
</ul>
<center>$$ TFIDF(w, d \mid D) = TF(w, d) \cdot IDF(w \mid D) $$</center>
<p>If the given corpus is very large, <script type="math/tex">\log_{10}(IDF)</script> is used instead of the raw IDF in order to dampen its effect.</p>
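<p>The weighting schemes above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and the log-dampened IDF; the function name <code class="highlighter-rouge">tf_idf</code> and the toy documents are made up for this example.</p>

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using TF * log10(|D| / |D^w|)."""
    tokenized = [doc.lower().split() for doc in docs]
    # |D^w|: number of documents that contain term w at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency in this document
        weights.append({t: tf[t] * math.log10(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the cat ran away"]
w = tf_idf(docs)
# "the" occurs in every document, so its weight is zero;
# "dog" occurs in a single document, so it gets the largest weight.
```

<p>Note how a word that appears in every document contributes nothing, which is exactly the bias that plain TF cannot express.</p>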
<h3 id="latent-semantic-analysis">Latent Semantic Analysis</h3>
<p>This is one of the existing techniques for text classification. It extends the vector space model by using singular value decomposition (SVD) to find a set of underlying latent variables that span the meaning of texts. It is built from a term-document matrix, in which each row represents a term and each column represents a document. The matrix can be built using the BoW model:</p>
<center> $$ X = [v_1, v_2, \ldots, v_{|D|}] $$ </center>
<p>where <script type="math/tex">v_i</script> is the vector representation of the <script type="math/tex">i</script>-th document obtained using the bag-of-words model.
In this method, the term-document matrix is decomposed using the singular value decomposition,</p>
<center> $$ X = U \Sigma V^T $$ </center>
<p><script type="math/tex">U</script> and <script type="math/tex">V</script> are orthogonal matrices, and <script type="math/tex">\Sigma</script> is a diagonal matrix containing the singular values of <script type="math/tex">X</script>, i.e. the square roots of the eigenvalues shared by <script type="math/tex">X^{T}X</script> and <script type="math/tex">XX^{T}</script>. LSA finds a low-rank approximation of <script type="math/tex">X</script> by keeping only the <script type="math/tex">k</script> largest singular values. Two documents are compared using the cosine distance between their respective projections.</p>
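<p>Here is a small NumPy sketch of the LSA pipeline described above: decompose a toy term-document matrix, keep the top <script type="math/tex">k</script> singular directions, and compare documents by cosine similarity in the reduced space. The matrix and the helper names are illustrative assumptions, not taken from the paper.</p>

```python
import numpy as np

def lsa_project(X, k):
    """Project the columns (documents) of a term-document matrix onto the top-k left singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]      # basis of the k-dimensional latent space
    return U_k.T @ X    # k x |D| matrix of document projections

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy term-document matrix: rows are terms, columns are documents (BoW counts).
X = np.array([
    [2, 1, 0, 0],   # "cat"
    [1, 2, 0, 0],   # "dog"
    [0, 0, 1, 2],   # "stock"
    [0, 0, 2, 1],   # "market"
], dtype=float)

Z = lsa_project(X, k=2)
sim_same_topic = cosine_similarity(Z[:, 0], Z[:, 1])   # two "pet" documents
sim_diff_topic = cosine_similarity(Z[:, 0], Z[:, 2])   # "pet" vs "finance" document
```

<p>In the latent space the two pet documents end up pointing in the same direction even though their raw counts differ, while the finance document is orthogonal to them.</p>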
<h3 id="word-subspace">Word Subspace</h3>
<p>In the authors’ formulation, words are represented as vectors in <script type="math/tex">{\Bbb R}^p</script> by using <em>word2vec</em>. Words from similar contexts are represented by vectors close to each other, while words from different contexts are represented by vectors that are far apart. Such vectors also support arithmetic operations like “king” - “man” + “woman” ≈ “queen”.
Let <script type="math/tex">D_c = \{d_i\}_{i=1}^{|D_c|}</script> be the set of documents that belong to context <script type="math/tex">c</script>. Each document <script type="math/tex">d_i</script> is represented by a set of <script type="math/tex">N_i</script> words, <script type="math/tex">d_i = \{w_k\}_{k=1}^{N_i}</script>. By considering that all words from documents of the same context belong to the same distribution, a set of words <script type="math/tex">W_c = \{w_k\}_{k=1}^{N_c}</script> with the words in context <script type="math/tex">c</script> is obtained. A set of word vectors <script type="math/tex">X_c = \{x_{c}^{k}\}_{k=1}^{N_c} \in {\Bbb R}^p</script> is then obtained using word2vec. The whole set is then modeled as a word subspace, which is a compact representation that preserves meaning. Such a word subspace is generated by applying PCA to the set of word vectors.</p>
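<p>A rough sketch of how such a word subspace could be computed is shown below. It assumes the word vectors are stacked as columns of a matrix and that PCA is applied without mean-centering (so the basis is given by the left singular vectors); random vectors stand in for real word2vec embeddings, and <code class="highlighter-rouge">word_subspace</code> is a hypothetical helper name.</p>

```python
import numpy as np

def word_subspace(X, m):
    """Return a p x m orthonormal basis for the word subspace of X (p x N word vectors)."""
    # Uncentered PCA: the left singular vectors of X are the eigenvectors of X X^T.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]

rng = np.random.default_rng(0)
X_c = rng.normal(size=(50, 200))    # stand-in for 200 word vectors in R^50
basis = word_subspace(X_c, m=5)     # 5-dimensional subspace for context c
```

<p>The subspace dimension <code class="highlighter-rouge">m</code> is a free parameter here; in practice it would be chosen per context.</p>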
<h3 id="text-classification-based-on-word-subspace">Text Classification Based on Word Subspace</h3>
<p>Given a set of training documents, also called a corpus, <script type="math/tex">D = \{d_i\}_{i=1}^{|D|}</script> with known classes <script type="math/tex">C = \{c_j\}_{j=1}^{|C|}</script>, the aim is to classify a query document <script type="math/tex">d_q</script> into one of the classes in <script type="math/tex">C</script>.</p>
<p>In the learning stage, documents belonging to the same class are grouped together (i.e. they are assumed to belong to the same context), resulting in a set of words <script type="math/tex">W_c = \{w_{c}^{k}\}_{k=1}^{N_c}</script> per class. Each set is then modeled as a word subspace <script type="math/tex">\gamma_{c}</script>. As the number of words in each class may vary widely, the dimension <script type="math/tex">m_{c}</script> of each class word subspace is not set to the same value.</p>
<p>In the classification stage, a subspace <script type="math/tex">\gamma_{q}</script> is generated for the query document <script type="math/tex">d_q</script>.</p>
<p>The query subspace is compared with each class subspace using canonical angles. I would suggest reading the part of the paper where the authors explain how canonical angles are calculated. <a href="https://en.wikipedia.org/wiki/Angles_between_flats">This Wikipedia article</a> is also pretty good.</p>
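<p>For intuition, the cosines of the canonical angles between two subspaces can be read off the singular values of the product of their orthonormal bases. The sketch below uses that standard fact; the similarity measure (mean squared cosine) and the rule of picking the most similar class are assumptions made for this example, so check the paper for the exact definition the authors use.</p>

```python
import numpy as np

def canonical_cosines(A, B):
    """Cosines of the canonical angles between the column spaces of orthonormal bases A and B."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def subspace_similarity(A, B):
    # Assumed similarity: mean squared cosine of the canonical angles.
    return float(np.mean(canonical_cosines(A, B) ** 2))

def classify(query_basis, class_bases):
    """Return the label whose subspace is most similar to the query subspace."""
    return max(class_bases, key=lambda c: subspace_similarity(query_basis, class_bases[c]))

E = np.eye(4)
class_bases = {"sports": E[:, [0, 1]], "finance": E[:, [2, 3]]}
# Query subspace that lies mostly in the span of the first two axes.
raw = np.array([[1.0, 0.1], [0.2, 1.0], [0.05, 0.0], [0.0, 0.05]])
query = np.linalg.qr(raw)[0]
label = classify(query, class_bases)
```

<p>Because only singular values are used, the two subspaces do not need to have the same dimension.</p>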
<h3 id="tf-weighted-word-subspace">TF Weighted Word Subspace</h3>
<p>As was shown with BoW features, the frequency of words is relevant information. Incorporating TF gives a TF weighted word subspace. Consider the set of word vectors <script type="math/tex">\{x_c^k\}_{k=1}^{N_c} \in {\Bbb R}^p</script>, which represents each word in context <script type="math/tex">c</script>, and the set of weights <script type="math/tex">\{w_i\}_{i=1}^{N_c}</script>, which represent the frequencies of the words in the context <script type="math/tex">c</script>.</p>
<p>The weighted matrix <script type="math/tex">\overline{X}</script> is obtained as follows:</p>
<center> $$ \overline{X} = X \Omega^{0.5} $$</center>
<p>where <script type="math/tex">X \in {\Bbb R}^{p \times N_c}</script> is the matrix whose columns are the word vectors and <script type="math/tex">\Omega</script> is a diagonal matrix containing the weights. PCA is then performed by computing the SVD of <script type="math/tex">\overline{X}</script>.</p>
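<p>The effect of the weighting is easy to see on a tiny example. The sketch below scales each word vector by the square root of its frequency before the SVD; the toy vectors, frequencies and the function name are assumptions for illustration only.</p>

```python
import numpy as np

def tf_weighted_subspace(X, tf, m):
    """Basis of the TF weighted word subspace: top-m left singular vectors of X @ Omega^{1/2}."""
    omega_sqrt = np.sqrt(np.asarray(tf, dtype=float))
    X_bar = X * omega_sqrt   # scales column k by sqrt(tf_k), same as X @ diag(omega_sqrt)
    U, _, _ = np.linalg.svd(X_bar, full_matrices=False)
    return U[:, :m]

# Three word vectors in R^2 stored as columns; the first word is far more frequent.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
weighted = tf_weighted_subspace(X, tf=[100, 1, 1], m=1)
unweighted = tf_weighted_subspace(X, tf=[1, 1, 1], m=1)
```

<p>Without weights the dominant direction follows the two words that share the second axis; with weights it follows the single frequent word on the first axis.</p>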
<hr />
<p>This was all about this paper from my side! I suggest going through the paper itself for a deeper explanation and the experimental results. The authors used the Reuters-8 dataset for experimentation, achieving higher accuracy than the standard word2vec method.</p>
<h1><a href="https://arpitgogia.com/deepwalkonline-learning-of-social-representations">DeepWalk: Online Learning of Social Representations</a></h1>
<p>2018-07-27</p>
<p>Recently I read this paper called <a href="https://arxiv.org/abs/1403.6652">DeepWalk: Online Learning of Social Representations</a>. It details a method to generate socially aware representations of nodes in a graph using deep learning. The core idea is to use random walks as a source of information by treating them as sentences. The authors define social representation as latent features of a vertex that capture neighborhood similarity and community membership. These latent representations encode social relations in a continuous vector space with a relatively small number of dimensions.</p>
<p>Random walks have been used as a similarity measure for a variety of problems in content recommendation and community detection. This was the motivation for the authors to use a stream of short random walks as the basic tool for extracting information from a network. Two other advantages apart from this are:</p>
<ul>
<li>Local exploration becomes easy to parallelize, with several random walkers running simultaneously to explore the graph.</li>
<li>Short random walks mean small changes in the graph structure can be accommodated without the need for global recomputation. The model can be iteratively updated with new random walks from the changed region in time sub-linear in the size of the entire graph.</li>
</ul>
<p>The authors emphasize that they can use techniques from natural language modeling to model community structure in networks because, when the degree distribution of a connected graph follows a power law, the frequency with which vertices appear in short random walks also follows a power law, as is the case with word frequency in natural language.</p>
<blockquote>
<p>Power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For example, consider the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.</p>
</blockquote>
<p>The goal of language modeling is to estimate the likelihood of a specific sequence of words appearing in a corpus. More formally, given a sequence of words</p>
<center>
$$ W_1^n = (w_0, w_1, ..., w_n) $$
</center>
<p><br />
where <script type="math/tex">w_i \in V</script> (<script type="math/tex">V</script> is the vocabulary), we would like to maximize <script type="math/tex">\Pr(w_n \mid w_0, w_1, \ldots, w_{n-1})</script> over the entire training corpus.</p>
<p>The direct analog is to estimate the likelihood of observing vertex <script type="math/tex">v_i</script> given all the previous vertices visited so far in the random walk.</p>
<center>
$$ Pr(v_i | (v_1, v_2, ..., v_{(i - 1)})) $$
</center>
<p><br />
The goal is to learn a latent representation, not only a probability distribution of node co-occurrences, thus giving rise to a mapping function <script type="math/tex">\phi : v \in V \to {\Bbb R}^{|V| \times d}</script>, where <script type="math/tex">|V|</script> is the number of vertices and <script type="math/tex">d</script> is the number of dimensions the latent vector is expressed in.</p>
<p>Getting to the algorithm itself, DeepWalk consists of two main components: a random walk generator and an update procedure. The random walk generator takes a graph <script type="math/tex">G</script> and uniformly samples a random vertex <script type="math/tex">v_i</script> as the root of the random walk <script type="math/tex">W_{v_i}</script>. The walk then samples uniformly from the neighbors of the last vertex visited until the stipulated length <script type="math/tex">t</script> is reached.
As each random walk is generated, the update part of the algorithm uses SkipGram to update the vertex representations.</p>
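<p>The random walk generator is simple enough to sketch directly. The version below assumes the graph is an adjacency-list dictionary and that every vertex has at least one neighbor; the function names and the toy graph are made up for illustration. The resulting walks can then be treated as sentences by a SkipGram implementation.</p>

```python
import random

def random_walk(graph, root, t, rng=random):
    """Uniform random walk containing t vertices, starting at root."""
    walk = [root]
    while len(walk) < t:
        neighbors = graph[walk[-1]]
        if not neighbors:    # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(graph, num_walks, t, seed=0):
    """Yield num_walks walks, each rooted at a uniformly sampled vertex."""
    rng = random.Random(seed)
    vertices = list(graph)
    for _ in range(num_walks):
        root = rng.choice(vertices)
        yield random_walk(graph, root, t, rng)

graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = list(generate_walks(graph, num_walks=10, t=5))
```

<p>Since each walk only touches a local part of the graph, many such generators can run in parallel, which is the first advantage listed earlier.</p>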
<blockquote>
<p>SkipGram is a language model that maximizes co-occurrence probability among the words that appear within a window, <script type="math/tex">w</script>, in a sentence.</p>
</blockquote>
<p>SkipGram relies on a recent relaxation in language modeling: instead of predicting a word from its context, the model predicts the context from a single word. This turns the problem on its head. The relaxation speeds up training because small models can be built with one vertex given at a time.</p>
<p>To decrease the computation time, hierarchical softmax is used: the vertices are assigned to the leaves of a binary tree, so predicting a vertex amounts to maximizing the probability of a specific path in the tree. If the path to vertex <script type="math/tex">u_k</script> is identified by a sequence of tree nodes (<script type="math/tex">b_0, b_1, \ldots, b_{\lceil \log |V| \rceil}</script>), then</p>
<center>
$$
Pr(u_k \mid \phi(v_j)) = \prod_{l = 1}^{\lceil \log |V| \rceil} Pr(b_l \mid \phi(v_j))
$$
</center>
<p>The model parameters are further optimised using Stochastic Gradient Descent.</p>
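<p>To make the path product concrete, here is a toy sketch of hierarchical softmax over a complete binary tree. It assumes each left/right decision at an internal node is modeled by a logistic function of the dot product between the vertex embedding and that node's parameter vector, which is the usual formulation; the heap numbering, names and random parameters are illustrative assumptions.</p>

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_probability(leaf, depth, phi_v, psi):
    """Pr(leaf | phi_v) as a product of binary decisions from the root to the leaf.

    Internal nodes use heap numbering (root = 1, children of n are 2n and 2n + 1);
    psi[n] is the parameter vector of internal node n.
    """
    prob, node = 1.0, 1
    for level in reversed(range(depth)):
        go_right = (leaf >> level) & 1
        p_right = sigmoid(sum(a * b for a, b in zip(phi_v, psi[node])))
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + go_right
    return prob

rng = random.Random(0)
depth, dim = 3, 4    # 2**3 = 8 vertices, 4-dimensional embeddings
psi = {n: [rng.uniform(-1, 1) for _ in range(dim)] for n in range(1, 2 ** depth)}
phi_v = [rng.uniform(-1, 1) for _ in range(dim)]
probs = [path_probability(u, depth, phi_v, psi) for u in range(2 ** depth)]
```

<p>Each probability needs only <code class="highlighter-rouge">depth</code> sigmoid evaluations instead of a normalization over all vertices, yet the probabilities over all leaves still sum to one.</p>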
<hr />
<p>That’s it for this post, hope the read was helpful. The authors further describe how they parallelized the algorithm, so in case you’re interested in that, I encourage you to read the paper!</p>
<h1><a href="https://arpitgogia.com/learning-how-pinterests-recommendation-engine-works">Learning how Pinterest’s recommendation engine works</a></h1>
<p>2018-06-30</p>
<p>This week I read this <a href="https://arxiv.org/pdf/1806.01973.pdf">paper</a> that details the algorithm Pinterest currently has in production for generating recommendations. They use Graph Convolutional Networks (GCNs) to accomplish this, and another challenge they overcame was scaling GCNs to production level.</p>
<p>The authors describe the algorithm through three components: efficient on-the-fly convolutions, producer-consumer minibatch construction, and efficient MapReduce inference. In this post I’m going to cover only the component concerned with the convolutions.</p>
<h3 id="graph-convolutional-networks">Graph Convolutional Networks</h3>
<p>The core idea behind GCNs is to learn how to iteratively aggregate feature information from local graph neighborhoods using neural networks. Here a single “convolution” operation transforms and aggregates feature information from a node’s one-hop graph neighborhood, and by stacking multiple such convolutions information can be propagated across far reaches of a graph. Currently, most graph neural network models have a somewhat universal architecture in common. The goal of these models is to learn a function of signals/features on a graph <script type="math/tex">G = (V, E)</script> which takes as input:</p>
<ul>
<li>A set of features <script type="math/tex">x_i</script> for each node <script type="math/tex">i</script>, i.e. an <script type="math/tex">N \times D</script> matrix <script type="math/tex">X</script>.</li>
<li>A representative description of the graph structure in matrix form; typically in the form of an adjacency matrix <script type="math/tex">A</script> (or some transformation of it).</li>
</ul>
<p>The model produces an output <script type="math/tex">Z</script>, a matrix of node-level feature vectors. Every neural network layer can then be written as a non-linear function:
<script type="math/tex">\begin{aligned}
H^{(l + 1)} = f(H^{(l)}, A)
\end{aligned}</script>
with <script type="math/tex">H^{(0)} = X</script> and <script type="math/tex">H^{(L)} = Z</script>, <script type="math/tex">L</script> being the number of layers.</p>
<p>Thus GCNs can capture both a node’s properties and the graph structure it is positioned in.
The above formulation comes from a totally awesome article on Graph Convolutional Networks, which also contains a worked example. It can be found <a href="https://tkipf.github.io/graph-convolutional-networks/">here</a>.</p>
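<p>One concrete choice for the layer function <script type="math/tex">f</script> is the propagation rule from the linked article: add self-loops, symmetrically normalize the adjacency matrix, multiply by the features and a weight matrix, and apply a ReLU. The sketch below is a NumPy illustration of that rule with random, untrained weights; the toy graph and variable names are assumptions.</p>

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))   # degrees are >= 1 because of self-loops
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

# Path graph 0 - 1 - 2 with 2 input features per node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(0)
W0, W1 = rng.normal(size=(2, 4)), rng.normal(size=(4, 3))
Z = gcn_layer(gcn_layer(X, A, W0), A, W1)   # H^(0) = X, H^(2) = Z
```

<p>Stacking two layers lets each node see information from nodes up to two hops away.</p>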
<h3 id="pinterests-recommendation-engine">Pinterest’s recommendation engine</h3>
<p>Pinterest is a content sharing and discovery platform where users interact with pins, which are visual bookmarks to online content. User-generated data sets consist of pins that the user organizes thematically.</p>
<blockquote>
<p>Altogether, the Pinterest graph contains 2 billion pins, 1 billion boards, and over 18 billion edges (i.e., memberships of pins to their corresponding boards).</p>
</blockquote>
<p>The authors’ primary mission was to generate high-quality embeddings or representations of pins that can be used for recommendation tasks.</p>
<p>The Pinterest environment is modeled as a bipartite graph consisting of nodes in two disjoint sets, <script type="math/tex">I</script> being viewed as a set of items and <script type="math/tex">C</script> as a set of user-defined contexts or collections. Each pin <script type="math/tex">u \in I</script> has associated real-valued features <script type="math/tex">x_u \in {\Bbb R}^{d}</script>, which can be metadata or content information about the pin.</p>
<p>The core idea of PinSage lies in its local convolution mechanism. The representations <script type="math/tex">z_v</script> of all nodes <script type="math/tex">v \in N(u)</script>, where <script type="math/tex">N(u)</script> is <script type="math/tex">u</script>’s neighborhood, are passed through a dense neural network, and an aggregator function is then applied to the resulting set of vectors. This aggregation step provides a vector representation <script type="math/tex">n_u</script> of <script type="math/tex">u</script>’s local neighborhood. This is then concatenated with <script type="math/tex">u</script>’s current representation <script type="math/tex">h_u</script> and transformed using another dense neural network layer. The output of the algorithm is a representation of <script type="math/tex">u</script> that incorporates information about both itself and its neighborhood.</p>
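<p>The local convolution can be sketched as a single function. This is a simplified illustration under assumptions: ReLU is used as the non-linearity, the aggregator is a weighted mean, and the parameter names (<code class="highlighter-rouge">Q</code>, <code class="highlighter-rouge">q</code>, <code class="highlighter-rouge">W</code>, <code class="highlighter-rouge">w</code>) and toy values are chosen for this example rather than taken from a reference implementation.</p>

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def convolve(h_u, neighbor_reprs, weights, Q, q, W, w):
    """One local convolution for node u.

    neighbor_reprs: (T, d) current representations of u's neighbors
    weights:        (T,) importance weights used by the aggregator
    """
    transformed = relu(neighbor_reprs @ Q.T + q)      # dense layer applied to each neighbor
    alpha = weights / weights.sum()
    n_u = alpha @ transformed                         # weighted-mean aggregation
    return relu(W @ np.concatenate([h_u, n_u]) + w)   # dense layer on [h_u ; n_u]

h_u = np.array([1.0, 0.0])
neighbors = np.array([[0.0, 2.0], [2.0, 0.0]])
weights = np.array([3.0, 1.0])
Q, q = np.eye(2), np.zeros(2)
W, w = np.hstack([np.eye(2), np.eye(2)]), np.zeros(2)
new_repr = convolve(h_u, neighbors, weights, Q, q, W, w)
```

<p>With these hand-picked parameters the neighborhood vector is <code class="highlighter-rouge">[0.5, 1.5]</code> and the new representation is simply its sum with <code class="highlighter-rouge">h_u</code>, which makes the data flow easy to follow.</p>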
<p>An important part of this algorithm is defining the “neighborhood” of a node. Previous GCNs simply used k-hop neighborhoods. In PinSage, the neighborhood of a node <script type="math/tex">u</script> is defined as the <script type="math/tex">T</script> nodes that exert the most influence on <script type="math/tex">u</script>: random walks are simulated starting from <script type="math/tex">u</script>, and the <script type="math/tex">T</script> nodes with the highest <script type="math/tex">L_1</script> normalized visit counts are used as the neighborhood of <script type="math/tex">u</script>.</p>
<blockquote>
<p>The advantages of this importance-based neighborhood definition are two-fold. First, selecting a fixed number of nodes to aggregate from allows us to control the memory footprint of the algorithm during training. Second, it allows us to take into account the importance of neighbors when aggregating the vector representations of neighbors. In particular, we implement this as a weighted-mean, with weights defined according to the L1 normalized visit counts. We refer to this new method as <em>importance pooling</em>.</p>
</blockquote>
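<p>The importance-based neighborhood can be approximated with a handful of short random walks. In the sketch below the number of walks, the walk length, the decision to exclude the start node from its own neighborhood, and all names are assumptions made for illustration.</p>

```python
import random
from collections import Counter

def importance_neighborhood(graph, u, T, num_walks=200, walk_length=3, seed=0):
    """Top-T nodes by L1-normalized visit count over short random walks from u."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        node = u
        for _step in range(walk_length):
            node = rng.choice(graph[node])
            if node != u:               # assumption: u is not its own neighbor
                visits[node] += 1
    total = sum(visits.values())
    return [(v, count / total) for v, count in visits.most_common(T)]

graph = {"u": ["a", "b", "c"], "a": ["u", "b"], "b": ["u", "a"], "c": ["u"]}
neighborhood = importance_neighborhood(graph, "u", T=2)
```

<p>The returned weights are exactly what a weighted-mean aggregator needs for importance pooling.</p>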
<p>The various convolutions are stacked, i.e. applied one after the other successively on the data.</p>
<p>Now given the generated embeddings and a query item q, recommendations are obtained using the K-nearest neighbors of the query item’s embedding.</p>
<hr />
<p>This is it for this post; I hope it gives a satisfactory explanation of PinSage and graph convolutional networks. This post is limited to the GCN part of the paper. I’ll try to read the rest of it to understand the scalability and training parts of PinSage.</p>
<h1><a href="https://arpitgogia.com/azure-notebooks-vs-google-colab-from-a-novices-perspective">Azure Notebooks vs. Google CoLab from a Novice’s perspective</a></h1>
<p>2018-06-27</p>
<p>I’ve had the opportunity of using both Google CoLab and Azure Notebooks while working on my project last semester, and I think I can safely say both of them are awesome to use. Google CoLab was seeded in 2014 and has grown ever since. Azure Notebooks is still in “Preview” mode but it is great to use even at this stage.</p>
<p>Let’s compare the two environments based on the following parameters:</p>
<ul>
<li><a href="#speed">Speed and Responsiveness</a></li>
<li><a href="#memory">Memory</a></li>
<li><a href="#file">File I/O</a></li>
<li><a href="#other">Other Features</a></li>
<li><a href="#conc">Conclusion</a></li>
</ul>
<h2 id="speed-and-functionality"><a name="speed">Speed and Functionality</a></h2>
<p>Full points to Azure Notebooks here; it feels exactly like running a Jupyter Notebook locally. Google CoLab, on the other hand, is not as responsive.</p>
<p>Otherwise, both services are pretty much the same in functionality, with code and markdown cells.</p>
<p>Azure NB has the native Jupyter UI whereas Google has “materialized” it. It’s more about personal preference :p</p>
<p><code class="highlighter-rouge">Winner: Azure NB</code></p>
<h2 id="memory-and-compute-power"><a name="memory">Memory and Compute Power</a></h2>
<p>I think this is a big difference between Google CoLab and Azure Notebooks. Google CoLab has a healthy memory limit of 20GB (the last time I tried, I think it didn’t throw an error up to 20GB). Azure Notebooks, on the other hand, has a 4GB memory limit. This is a deal breaker for someone working with large datasets.</p>
<p>Here’s where things get interesting: Google offers 12 hours of free usage of a GPU as a backend. The GPU being used currently is an NVIDIA Tesla K80. And it’s more or less free forever because you can just connect to another VM to gain 12 more hours of free access. That’s an enormous boost in performance for someone training a deep learning model.</p>
<p><code class="highlighter-rouge">Winner: Google CoLab</code></p>
<h2 id="file-io"><a name="file">File I/O</a></h2>
<p>This is also an interesting aspect. File I/O is essential when you’re dealing with datasets and CSVs and what not.</p>
<p>Google is yet to perfect the way CoLab handles files. There are a couple of ways of adding files, like using Google Drive, uploading directly to CoLab Storage, accessing a sheet from Google Sheets, or using Google Cloud Storage. The whole list can be found <a href="https://colab.research.google.com/notebooks/io.ipynb">here</a>. It’s just that you need a lot of boilerplate code for all of the above methods except the first one. On top of that, uploading files for use in the notebook (which seems the most natural) has to be done every time the notebook disconnects from the VM. Similarly, files downloaded using <code class="highlighter-rouge">wget</code> (notebooks can be made to run bash commands by prefixing the command with a <code class="highlighter-rouge">!</code>, didn’t know that :p) last only as long as you’re connected to the VM. Considering datasets can be huge, this isn’t really a sustainable solution (though this post is meant for beginners, so there is a significant chance that they’re not using massive datasets).</p>
<p>Azure Notebooks solves this problem by creating Libraries, which they’ve defined as being a collection of notebooks that are related. Libraries can also hold your data, assuming that each data file is less than 100 MB. I’ve used this before successfully with CSV, JSON and image files. Of course there are more ways to retrieve data in Azure Notebooks listed <a href="https://notebooks.azure.com/Microsoft/libraries/samples/html/Getting%20to%20your%20Data%20in%20Azure%20Notebooks.ipynb">here</a>. This is much more convenient if you are working with small datasets, essential for beginners. This I believe gives a more native feel of organising work into folders or workspaces as one does locally.</p>
<p><code class="highlighter-rouge">Winner: Azure NB</code></p>
<h2 id="other-features"><a name="other">Other Features</a></h2>
<p>Some other features that I’ve noticed:</p>
<ul>
<li>Azure Notebooks support not just Python, but also F# and R languages.</li>
<li>Did I mention Google offers free GPU compute using a Tesla K80 GPU :p ?</li>
<li>Both CoLab and Azure Notebooks have cloud sharing functionality. CoLab is backed by Google Drive whereas Azure NB has its Git-ish version of sharing through cloning.</li>
<li>You can specify shell scripts that run to setup your environment, or even specify config in YAML files. And then you’re given an integrated bash terminal. I mean, so much for customization :p. Though both the services support installing of Python modules using pip directly in the notebook.</li>
</ul>
<h2 id="conclusion"><a name="conc">Conclusion</a></h2>
<p>So in conclusion, I think it comes down to each aspect. If you want to do intensive computation, go for Google CoLab but if you’re looking for a simple cloud hosted notebook where you can mess around with classifiers, or linear regression or rudimentary neural networks, then go for Azure NB.</p>
<hr />
<p><strong>That’s it for this one, let me know if I’ve got something wrong above or there are some new features that I haven’t mentioned here!</strong></p>
<h1><a href="https://arpitgogia.com/summary-of-a-simple-method-for-commonsense-reasoning">Summary of “A Simple Method for Commonsense Reasoning”</a></h1>
<p>2018-06-17</p>
<p>Found this paper on <a href="https://arxiv.org/abs/1806.02847">arXiv</a> and thought of giving it a read to understand exactly what was going on, from a beginner’s perspective.</p>
<p>This paper attempts to create a simple unsupervised approach to commonsense reasoning using neural networks and deep learning.</p>
<h2 id="prerequisites">Prerequisites</h2>
<p>Before presenting the approach shown in the paper, let’s understand some of the prerequisites I’ve listed here.</p>
<ul>
<li><a href="#lm">Language Models</a></li>
<li><a href="#ws">Winograd Schema Challenge</a></li>
<li><a href="#pd">Pronoun Disambiguation Challenge</a></li>
</ul>
<h3 id="-language-models"><a name="lm"> Language Models</a></h3>
<p>A language model is basically a probability distribution over sequences of words. Formally, given a text corpus whose vocabulary is the set of words <script type="math/tex">V</script>, a language model assigns a probability to every sentence <script type="math/tex">x_1, x_2, \ldots, x_n</script> in the set <script type="math/tex">V'</script> of all sentences that can be constructed from the vocabulary.</p>
<p>A simple example of a poor language model is one that estimates the probability of a sentence purely from how often that exact sentence occurs in the training corpus, which means any sentence not seen during training gets probability zero. Let <script type="math/tex">c(x_1, \ldots, x_n)</script> be the number of times the sentence is seen in the training corpus, and <script type="math/tex">N</script> the total number of sentences in the training corpus. The probability can then be defined as:</p>
<center>
$$ p(x_1, ..., x_n) = \frac{c(x_1, ..., x_n)}{N} $$
</center>
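<p>This count-based estimate takes only a few lines to write down. The sketch below assumes whitespace tokenization and a made-up four-sentence corpus; <code class="highlighter-rouge">sentence_lm</code> is a hypothetical helper name.</p>

```python
from collections import Counter

def sentence_lm(corpus):
    """Return a function p(sentence) = c(sentence) / N estimated from a list of sentences."""
    counts = Counter(tuple(s.lower().split()) for s in corpus)
    n = len(corpus)
    return lambda sentence: counts[tuple(sentence.lower().split())] / n

corpus = ["the dog barks", "the cat meows", "the dog barks", "the dog sleeps"]
p = sentence_lm(corpus)
p_seen = p("the dog barks")     # seen twice out of four sentences
p_unseen = p("the cat barks")   # perfectly sensible sentence, never seen
```

<p>The unseen sentence gets probability zero even though every word in it is known, which is why practical language models factor the probability over words instead of whole sentences.</p>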
<p>Speech recognition is one of the key applications of language models. Verbal speech is processed to obtain a set of candidate sentences, which are then fed to a language model to get the most probable sentence. <a href="http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf">This document</a> brilliantly explains the details of defining a language model, how Markov models are used for fixed-length sentences, the types of language models, etc. For the explanation of the paper being considered here, just knowing the input and output of a language model should be enough.</p>
<h3 id="-winograd-schema-challenge"><a name="ws"> Winograd Schema Challenge</a></h3>
<p>Designed to be an improvement over the traditional AI benchmark, the <a href="https://plato.stanford.edu/entries/turing-test/">Turing Test</a>, it is a multiple choice test that employs questions of a very specific structure, called the Winograd Schema, named after Terry Winograd, a professor of CS at Stanford University. Quoting Wikipedia,</p>
<blockquote>
<p>Winograd Schema questions simply require the resolution of anaphora: the machine must identify the antecedent of an ambiguous pronoun in a statement. This makes it a task of natural language processing, but Levesque argues that for Winograd Schemas, the task requires the use of knowledge and commonsense reasoning.</p>
</blockquote>
<p>The Winograd Schema Challenge was proposed in part to ameliorate the problems that came to light with the nature of the programs that performed well on the Turing Test. Essentially, a Winograd Schema consists of two noun phrases of similar semantic meaning, an ambiguous pronoun that may refer to either of the above noun phrases, and two word choices such that each one results in a different interpretation of the pronoun. A question then asks for the identity of the ambiguous pronoun. A machine facing such a challenge cannot rely just on statistical measures; that is the whole point of Winograd Schemas. Moreover, they don’t need human judges, as opposed to a Turing Test. The only pitfall is the difficulty in developing a Winograd Schema.</p>
<p>Examples of a Winograd Schema can be found <a href="https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WSCExample.xml">here</a>.</p>
<h3 id="pronoun-disambiguation-challenge"><a name="pd">Pronoun Disambiguation Challenge</a></h3>
<p>As you probably understand from the heading, this challenge is very similar to the Winograd Schema Challenge. A great collection of PDPs can be found <a href="http://commonsensereasoning.org/disambiguation.html">here</a>.</p>
<p><em>Both of the above challenges are a part of the general Word Sense Disambiguation problem, which aims at identifying which sense of a word is used in a sentence.</em></p>
<h2 id="the-papers-approach">The Paper’s Approach</h2>
<p>Now that we’ve understood the above concepts, it will be much easier to understand what the authors are trying to achieve.
In <strong>Related Work</strong> the authors mention an approach by Mikolov et al., wherein, by predicting adjacent words in a sentence, word vectors can be made to answer analogy questions like <code class="highlighter-rouge">Man:King::Woman:?</code>. The authors use this as an inspiration to show that language models are capable of capturing common sense. Since Winograd Schemas require much more contextual information, word vectors alone won’t suffice, hence the use of language models. Previously, researchers have shown that pre-trained LMs can be used as feature representations for a sentence or a paragraph to improve NLP applications such as document classification, machine translation, question answering, etc.</p>
<p>Given a Winograd Schema, <strong>The trophy doesn’t fit in the suitcase because it is too big</strong>, the authors substitute the two possible candidates <strong>suitcase</strong> and <strong>trophy</strong> into the pronoun position. A Language Model is then used to score the two substitutions.</p>
<p>The authors use two different scores, using full and partial representations of the candidate sentences.</p>
<p>Suppose the sentence <script type="math/tex">S</script> of <script type="math/tex">n</script> consecutive words has its pronoun to be resolved at the <script type="math/tex">k^{th}</script> position: <script type="math/tex">S = \{w_1, ..., w_{k-1}, w_k \equiv p, w_{k+1}, ..., w_n\}</script>. The language model used by the authors defines the probability of word <script type="math/tex">w_t</script> conditioned on the previous words <script type="math/tex">w_1, ..., w_{t - 1}</script>. Substituting a candidate reference <script type="math/tex">c</script> into the pronoun position <script type="math/tex">k</script> results in a new sentence <script type="math/tex">S_{w_{k \leftarrow c}}</script>. The two scores are then computed as follows:</p>
<ul>
<li>
<script type="math/tex; mode=display">Score_{full}(w_{k} \leftarrow c) = P(w_1, w_2, ..., w_{k-1}, c, w_{k+1}, ..., w_n)</script>
</li>
<li>
<script type="math/tex; mode=display">Score_{partial}(w_{k} \leftarrow c) = P(w_{k+1}, ..., w_n | w_1, w_2, ..., w_{k-1}, c)</script>
</li>
</ul>
<p>The full score measures how probable the whole sentence is after the substitution, while the partial score measures how likely the rest of the sentence is given the substituted phrase as its antecedent.</p>
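<p>One way to see how the two scores relate is to apply the chain rule of probability to the full score. This decomposition is my own reading aid, not a formula quoted from the paper:</p>
<script type="math/tex; mode=display">Score_{full}(w_{k} \leftarrow c) = P(w_1, ..., w_{k-1}, c) \cdot Score_{partial}(w_{k} \leftarrow c)</script>
<p>In other words, the partial score is the full score with the probability of the prefix ending in the candidate divided out, so a candidate is not favoured merely because it is a likely continuation of the words before it.</p>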
<p>Surprisingly, the results showed that partial scoring performs better than the naive full scoring strategy. Partial scoring corrected a large portion of the wrong predictions made by full scoring.</p>
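<p>The substitution-and-scoring procedure can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tiny corpus, the <code class="highlighter-rouge">BigramLM</code> class and the <code class="highlighter-rouge">scores</code> helper are hypothetical stand-ins, and an add-one-smoothed bigram model replaces the recurrent language models the authors actually use. Scores are kept in log space, so products of probabilities become sums.</p>

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram model (a stand-in for a real recurrent LM)."""

    def __init__(self, corpus):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split()
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1

    def log_prob(self, word, prev):
        """log P(word | prev); the extra +1 reserves mass for unseen words."""
        num = self.bigram_counts[(prev, word)] + 1
        den = self.context_counts[prev] + len(self.vocab) + 1
        return math.log(num / den)


def scores(lm, words, k, candidate):
    """Return (full, partial) log-scores after placing `candidate` at index k."""
    cand = candidate.split()
    tokens = ["<s>"] + words[:k] + cand + words[k + 1:]
    # One log-probability per token, each conditioned on the previous token.
    steps = [lm.log_prob(w, p) for p, w in zip(tokens, tokens[1:])]
    after = k + len(cand)  # steps[after:] belong to the words after the candidate
    return sum(steps), sum(steps[after:])


corpus = ["the trophy is too big", "the suitcase is too small", "the trophy is big"]
lm = BigramLM(corpus)
sentence = "the trophy does not fit in the suitcase because it is too big".split()
k = sentence.index("it")

for cand in ("the trophy", "the suitcase"):
    full, partial = scores(lm, sentence, k, cand)
    print(f"{cand}: full={full:.3f} partial={partial:.3f}")
```

<p>The candidate with the higher score is taken as the referent of the pronoun. A bigram model only looks one word back, so it cannot capture the long-range context that makes real Winograd Schemas hard; the sketch is only meant to show where the two scores come from.</p>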
<hr />
<p>That is all for this summary. For more details on the recurrent language models used, how much better the strategy was than current commonsense reasoning approaches, and one more interesting inference, I encourage you to read the complete paper :)</p>Found this paper on arXiv and thought of giving it a read to understand exactly what was going on, from a beginner’s perspective.Editor Wars!2018-05-14T00:00:00+05:302018-05-14T00:00:00+05:30https://arpitgogia.com/editor-wars<p>Before you read any further, visit the Wikipedia page on the official definition of <a href="https://en.wikipedia.org/wiki/Editor_war">editor war</a> (I had absolutely zero idea this was a legit thing :p)</p>
<p><img src="https://i.redditmedia.com/1zWeLdE2ckLbcXChms5bXcU59Z2axH0iFitOiG-AiWQ.jpg?s=27cbe572e3e4a72e6b690186e3b24092" /></p>
<p>Choosing the perfect editor for programming can be such an intensive task. How do you zero in on one perfect editor for all the languages you work with?</p>
<p>I have fiddled around with editors for quite some time now, and somehow it doesn’t get monotonous. But it is sensible to make a decision and stick to it! So here’s my experience with Visual Studio Code etc. and why I keep coming back to Vim, particularly the awesome Vim distribution called <a href="https://spacevim.org/">Space Vim</a>.</p>
<h2 id="visual-studio-code-and-the-works">Visual Studio Code and the works</h2>
<p>There is absolutely no doubt in my mind that Visual Studio Code is a phenomenal editor. It is quick, responsive, completely extensible, and has immense developer support. Name any language and there’s all-round support for it, either natively or through plugins. I’m sure you have heard of the cross-platform mobile application framework <code class="highlighter-rouge">Flutter</code>; well, Visual Studio Code is officially fully equipped to handle the Dart language, and even cooler things like running, debugging and hot reloading applications. I primarily use Python and I’ve had a good experience with that too.</p>
<p>Some strong competitors to this are GitHub’s Atom, Sublime Text and Adobe’s very own Brackets. I haven’t used Brackets at all, so no comments on that from my side. Sublime Text has been the strongest competitor and a developer favorite for a long time, and rightfully so: it is faster and lighter on both the CPU and the memory. But I think this is where Microsoft has done an impeccable job of increasing their user base, through consistent updates and increasingly functional plugins like Live Code Sharing. Sublime Text and Atom still don’t even have properly integrated terminal support. Another point that’s crucial, at least for me as a developer, is the visual appearance. Full points to Atom and Visual Studio Code for this. They seem to be more consistent than Sublime Text in their appearance.</p>
<p>Now here’s the gripe with Atom and VS Code. They’re built on this little platform called <code class="highlighter-rouge">Electron</code>. To be fair, the name is quite ironic. Electron is neither light on the CPU nor easy on the memory. On average you can observe close to a couple of hundred megabytes of RAM consumption. For someone with a conservative attitude towards hardware resources, that’s a tad high. Sublime is much better with memory management. One thing I’ve seen that both of them have in common: the Python plugin I used would, over time, occupy a whole lot of memory, forcing me to close and restart the editor. This is probably just a bug, but it was definitely a deal breaker.</p>
<h2 id="vim--space-vim">Vim & Space Vim</h2>
<p><img src="https://images.duckduckgo.com/iu/?u=http%3A%2F%2Fwww.catonmat.net%2Fimages%2Fviral-4-times-1-week%2Fvim-programming-joke.png&f=1" /></p>
<p>I’ve been on and off Vim a whole lot of times. Something or the other would just not work, and that would make me go “Ah, damn it, I’ll just go back to what I was using before”. Even then, the whole concept of Vim (and Emacs) seems to be more developer friendly than anything else. It is difficult to set up Vim as a fully functional IDE with good autocomplete, a folder explorer, etc., but once you do it and get used to the shortcuts, it’s coding nirvana! A little while ago, while scouring GitHub for open-sourced Vim configurations, I discovered <a href="http://spacemacs.org">SpaceMacs</a>. It’s a supercharged Emacs distribution which is supposedly easier to customise than standard Emacs. According to their tagline, “The best editor is neither Emacs nor Vim, it’s Emacs and Vim!”. Since I had never used Emacs, I was lucky enough to land at <a href="https://spacevim.org">SpaceVim</a>, which is a supercharged Vim distribution. As a formal analogy, if SpaceMacs is the DC version of Vim, SpaceVim is the Marvel version of Vim. The way it is customisable using <code class="highlighter-rouge">layers</code> is so good: all you have to do to get support for a language is enable the required layer. The rest of the customisations are similar to how you would do them in a standard <code class="highlighter-rouge">.vimrc</code> file. As for the shortcuts, they are already set up and documented, though of course they can be customised. It’s the closest I’ve come to using Vim as a functional IDE. Now this is the good part: with all of these customisations, Vim still consumes a little less than 20 MB of RAM while working on a Python file. Oh, and this same configuration works with both <code class="highlighter-rouge">vim</code> and <code class="highlighter-rouge">gvim</code>. If you are working on a server and you think you’ll survive by using Nano, you’re completely mistaken, my friend. Vim is your saviour.</p>
<p>One suggestion I’ve been given a lot of times is to install a “Vim mode” plugin in something like VS Code. All I have to say to that is that I’m yet to find a plugin which does this completely bug free.</p>
<p>There are other Vim distributions too that you can check out, like VimR, spf13-vim and others. That’s it for now. Yes, I’m a moderate Vim fan, but I’m sure I haven’t reached the stage shown below :p</p>
<p><img src="https://i.redditmedia.com/oeaQ7xcLfvuja1NhhQxnN9PDuZ5u0vm2XHI6HdfZUI0.jpg?s=2421f2b5219391248f9e3e57d81b323c" /></p>
<p>PS: This page was edited using Visual Studio Code 😂</p>Setting up a basic static website2018-03-11T00:00:00+05:302018-03-11T00:00:00+05:30https://arpitgogia.com/setting-up-a-basic-static-website<p>Personal websites can be a great place to showcase your content or projects, or just to make a good first impression. This post is an attempt to reduce the burden of setting one up, especially for novices.</p>
<h3 id="what-youll-need-and-why">What you’ll need and Why</h3>
<ol>
<li><a href="https://hexo.io">Hexo</a></li>
<li><a href="http://firebase.google.com">Firebase</a></li>
</ol>
<p>Hexo is a static site generator built using JavaScript. There’s no specific reason for me to use it here; I just found the right template through Hexo :p. There are others that you can try, like Jekyll (I have seen this being used very often), Hugo etc. Here’s an exhaustive <a href="https://www.staticgen.com">list</a>.</p>
<p>The reason I’m using a static site generator here is to simplify the process of creating the HTML pages themselves. This way you don’t have to write the CSS and the JS code needed for proper detailed formatting. Static site generators allow you to create websites using just the information that matters the most: your content.</p>
<h3 id="choosing-a-theme">Choosing a theme</h3>
<p>Hexo has a big theme <a href="https://hexo.io/themes/index.html">library</a>. There are templates for any and every requirement. The theme that is powering this website is <a href="https://github.com/probberechts/hexo-theme-cactus">Cactus</a>. As you can see, it is very minimalistic, which is what I was going for.</p>
<h3 id="customising">Customising</h3>
<p>Here comes the interesting part. The basic part of a Hexo site is the <code class="highlighter-rouge">_config.yml</code> file. <a href="http://yaml.org">YAML (YAML Ain’t Markup Language)</a> is a human-readable data serialization language that aims to be “minimal” and is easy to read thanks to its obvious semantics. It is commonly used for configuration files, but can be used in many applications where data is being stored or transmitted. Here’s a <a href="https://hexo.io/docs/configuration.html">list</a> of the settings that Hexo provides. There’s also effortless support for integrating comments using Disqus and analytics using Google Analytics.</p>
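<p>As a rough illustration, a few of the top-level settings in Hexo’s main configuration file (<code class="highlighter-rouge">_config.yml</code> in a default install) might look like the fragment below. All values are placeholders, and the theme name assumes the Cactus theme was placed in a folder called <code class="highlighter-rouge">cactus</code> under <code class="highlighter-rouge">themes/</code>; check the settings list linked above for the full set of options.</p>

```yaml
# _config.yml (placeholder values, replace with your own)
title: My Site
author: Your Name
language: en
url: https://example.com

# Must match the name of the theme's folder under themes/ (assumed here)
theme: cactus
```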
<p>At this point if you’re happy with the look of your website you can go ahead to the Deployment section and deploy it to a hosting service of your choice. Or you can dig deeper and begin tweaking the CSS till you’re satisfied with the look.</p>
<p>You can test your website by typing <code class="highlighter-rouge">hexo generate && hexo server</code> in a terminal in the directory where your <code class="highlighter-rouge">_config.yml</code> is placed.</p>
<h3 id="deployment">Deployment</h3>
<p>I chose Firebase for deployment because it is relatively simple to use and I am a tad bit biased towards Google Products :p. Hexo’s <a href="https://hexo.io/docs/deployment.html">documentation</a> mentions quite a few ways to host your site.</p>
<p>Moving on, assuming you have the Firebase CLI installed and are <a href="https://firebase.google.com/docs/cli/">logged in</a>, go ahead and initialise your project using <code class="highlighter-rouge">firebase init</code> in the same directory as your <code class="highlighter-rouge">_config.yml</code>. Now just one more command, <code class="highlighter-rouge">hexo generate && firebase deploy</code>, and there you go, your website is deployed! If you want to mimic the Firebase environment for local testing, you can also run <code class="highlighter-rouge">hexo generate && firebase serve</code>. This is a bit different from <code class="highlighter-rouge">hexo server</code> because, as I said, it mimics a Firebase server locally.
<br /></p>
<p>There you go, you now have a full-fledged website.</p>
<p>If you have any doubts or suggestions regarding the instructions above or this site in general, hit me up on <a href="https://twitter.com/arpit_gogia">Twitter</a>.</p>