# Text Classification based on Word Subspace with Term Frequency

- 5 mins

My most recent read is the research paper titled “Text Classification based on Word Subspace with Term Frequency”.

Bag of Words (BoW) is a very common and proven technique used in various learning and statistical models. However, BoW doesn't capture the semantic meaning of text in its representation. To mitigate this problem, neural networks were used to learn word vectors, a technique called word2vec. Word2vec is well known for embedding semantic structure into vectors, where the angle between vectors indicates the semantic similarity between words. The authors propose the novel concept of a word subspace to measure similarity between texts and represent the intrinsic variability of features in a set of word vectors. The model is further extended to incorporate term frequency (TF) using a TF weighted word subspace. Mathematically, a word subspace is defined as a low-dimensional linear subspace in a high-dimensional word vector space.

### Bag of Words

The bag of words representation comes from the hypothesis that frequencies of words in a document can indicate the relevance of the document to a query, i.e., if documents and a query have similar frequencies for the same words, they might have a similar meaning. This representation is based on the vector space model. A document $d$ can be represented by a vector in ${\Bbb R}^n$, where each dimension represents a different term. A term can mean a single word or an $n$-gram.
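As a quick illustration, here is a minimal BoW sketch in plain Python. It assumes whitespace tokenization and raw term counts (single words as terms, no $n$-grams); the function name and toy documents are my own:

```python
from collections import Counter

def bag_of_words(documents):
    """Build BoW vectors: one dimension per vocabulary term,
    entries are raw counts of that term in each document."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

docs = ["the cat sat", "the dog sat on the mat"]
vocab, vecs = bag_of_words(docs)
# vocab: ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# vecs:  [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

Each document becomes a point in ${\Bbb R}^{|V|}$; documents with similar word frequencies end up close together.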

### Term Weightage

Term weightage can be defined in the following ways:

- Binary weight: if the term occurs in the document, its weight is 1.
- Term-frequency weight (TF): the weight of a term is the number of times it occurs in the document $d$.
- Inverse document-frequency weight (IDF): the weight of a term $w$ is the ratio of the total number of documents $|D|$ to the number of documents that contain the term, $D^w$.
- Term-frequency inverse document-frequency (TF-IDF): the weight of a term $w$ is the product of its TF and IDF. TF weights do not capture the bias towards specific terms in the corpus. By using IDF, words that are more common across all documents in $D$ receive a smaller weight, giving more credence to rare terms in the corpus.
$$TFIDF(w, d \mid D) = TF(w, d) \times IDF(w, D)$$

If the given corpus is very large, $\log_{10}(IDF)$ is used in order to dampen its effect.
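The TF-IDF definition above can be sketched in a few lines of Python. This is a toy version with whitespace tokenization, using the dampened $\log_{10}$ form of the IDF; the function name and example corpus are my own:

```python
import math

def tf_idf(documents):
    """Compute TF-IDF weights per (term, document) pair.
    IDF uses the dampened log10(|D| / D_w) form."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # D_w: number of documents containing term w
    doc_freq = {}
    for tokens in tokenized:
        for w in set(tokens):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    weights = []
    for tokens in tokenized:
        tf = {w: tokens.count(w) for w in set(tokens)}
        weights.append({w: tf[w] * math.log10(n_docs / doc_freq[w]) for w in tf})
    return weights

docs = ["apple banana apple", "banana cherry"]
w = tf_idf(docs)
# "banana" appears in every document, so its IDF (and weight) is 0;
# "apple" is rarer, so it keeps a positive weight: 2 * log10(2) ≈ 0.602
```

Note how the corpus-wide term "banana" is zeroed out, which is exactly the bias correction IDF provides.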

### Latent Semantic Analysis

This is one of the existing techniques available for text classification. It extends the vector space model by using singular value decomposition (SVD) to find a set of underlying latent variables that span the meaning of texts. It is built from a term-document matrix, each row of which represents a term, and each column represents a document. It can be built using the BoW model

$$X = [v_1, v_2, ..., v_{|D|}]$$

where $v_i$ is the vector representation obtained using the bag-of-words model. In this method, the term-document matrix is decomposed using the singular value decomposition,

$$X = U \Sigma V^T$$

$U$ and $V$ are orthogonal matrices, and $\Sigma$ is a diagonal matrix containing the singular values of $X$, i.e., the square roots of the eigenvalues of $X^{T}X$ and $XX^{T}$. LSA finds a low-rank approximation of $X$ by selecting only the $k$ largest singular values. Comparison between two documents is done using the cosine distance between their respective projections.
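A minimal NumPy sketch of this pipeline, assuming a toy term-document matrix (the matrix values and function names are my own, not from the paper):

```python
import numpy as np

def lsa_projections(X, k):
    """Truncated LSA: decompose X (terms x documents) = U Sigma V^T,
    keep the k largest singular values, and return the document
    coordinates Sigma_k V_k^T in the k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k]  # shape (k, |D|)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy term-document matrix: rows are terms, columns are documents;
# documents 0 and 2 share terms, document 1 uses a different term
X = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.]])
P = lsa_projections(X, k=2)
sim = cosine(P[:, 0], P[:, 2])  # high: docs 0 and 2 are related
```

Documents 0 and 2 collapse onto the same latent direction, so their cosine similarity in the reduced space is close to 1.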

### Word Subspace

In the authors' formulation, words are represented as vectors in ${\Bbb R}^p$ using word2vec. Words from similar contexts are represented by vectors close to each other, while words from different contexts are represented by vectors far apart. Arithmetic relations also hold, e.g., “king” - “man” + “woman” ≈ “queen”. Let $D_c = \{d_i\}_{i=1}^{|D_c|}$ be the set of documents in context $c$. Each document $d_i$ is represented by a set of $N_i$ words, $d_i = \{w_k\}_{k=1}^{N_i}$. By considering that all words from documents of the same context belong to the same distribution, a set of words $W_c = \{w_k\}_{k=1}^{N_c}$ with the words in context $c$ is obtained. A set of word vectors $X_c = \{x_{c}^{k}\}_{k=1}^{N_c} \subset {\Bbb R}^p$ is then obtained using word2vec. The whole set is then modeled as a word subspace, which is a compact representation that preserves meaning. Such a word subspace is generated by applying PCA to the set of word vectors.
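A sketch of the subspace construction step, assuming the word2vec vectors are already available (random vectors stand in for them here). I use the SVD of the raw data matrix, which corresponds to PCA without mean subtraction, a common choice in subspace methods; check the paper for the authors' exact PCA formulation:

```python
import numpy as np

def word_subspace(word_vectors, m):
    """Model a set of word vectors as an m-dimensional subspace:
    the basis is the m leading left singular vectors of the
    p x N_c matrix whose columns are the word vectors."""
    X = np.asarray(word_vectors, dtype=float).T  # shape (p, N_c)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]  # orthonormal basis, shape (p, m)

# stand-in for word2vec vectors of one context: N_c = 5 words in R^4
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 4))
basis = word_subspace(vectors, m=2)  # basis.T @ basis == identity
```

The returned basis is orthonormal, so projecting onto the subspace and comparing subspaces (next section) both reduce to simple matrix products.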

### Text Classification Based on Word Subspace

Given a set of training documents, also called a corpus, $D = \{d_i\}_{i=1}^{|D|}$ with known classes $C = \{c_j\}_{j=1}^{|C|}$, the aim is to classify a query document $d_q$ into one of the classes in $C$.

In the learning stage, the words of all documents belonging to the same class (i.e., assumed to belong to the same context) are gathered, resulting in a set of words $W_c = \{w_{c}^{k}\}_{k=1}^{N_c}$ per class. Each set is then modeled as a word subspace $\gamma_{c}$. As the number of words in each class may vary largely, the dimension $m_{c}$ of each class word subspace is not set to the same value.

In the classification stage, a subspace $\gamma_{q}$ is generated for the query document $d_q$ in the same way.

Comparison between the two subspaces is done using canonical angles. I would suggest reading the paper's explanation of how the canonical angles are calculated. This Wikipedia article is also pretty good.
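For intuition, here is a small sketch of the canonical-angle comparison: given orthonormal bases of the two subspaces, the singular values of their cross-product matrix are the cosines of the canonical angles. Averaging the squared cosines is one common way to turn the angles into a single similarity score; the paper's exact combination of angles may differ, so treat this as an assumption:

```python
import numpy as np

def canonical_angle_similarity(U1, U2):
    """Similarity between subspaces with orthonormal bases U1, U2:
    the singular values of U1^T U2 are cos(theta_i) for the
    canonical angles theta_i; return the mean squared cosine."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.mean(s ** 2))

# two 2-dimensional subspaces of R^4
U1 = np.array([[1., 0.], [0., 1.], [0., 0.], [0., 0.]])
U2 = np.array([[0., 0.], [0., 0.], [1., 0.], [0., 1.]])
same = canonical_angle_similarity(U1, U1)  # identical subspaces -> 1.0
orth = canonical_angle_similarity(U1, U2)  # orthogonal subspaces -> 0.0
```

Classification then amounts to assigning $d_q$ to the class $c$ whose subspace $\gamma_c$ has the highest similarity to $\gamma_q$.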

### TF Weighted Word Subspace

As was shown with BoW features, the frequency of words is relevant information. Incorporating TF gives a TF weighted word subspace. Consider the set of word vectors $\{x_c^k\}_{k=1}^{N_c} \subset {\Bbb R}^p$, which represents each word in context $c$, and the set of weights $\{w_i\}_{i=1}^{N_c}$, which represent the frequencies of the words in the context $c$.

The weighted matrix $\overline{X}$ is obtained as follows:

$$\overline{X} = X \Omega^{0.5}$$

where $X \in {\Bbb R}^{p \times N_c}$ and $\Omega$ is a diagonal matrix containing the weights. PCA is then performed by solving the SVD of $\overline{X}$.
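This weighting is a one-line change to the earlier subspace construction: scale each word vector by the square root of its frequency before the SVD. As before, random vectors stand in for real word2vec embeddings, and the function name is my own:

```python
import numpy as np

def tf_weighted_subspace(word_vectors, frequencies, m):
    """TF weighted word subspace: form Xbar = X @ Omega^{1/2},
    where Omega holds the term frequencies on its diagonal, then
    take the m leading left singular vectors of Xbar."""
    X = np.asarray(word_vectors, dtype=float).T               # (p, N_c)
    omega_sqrt = np.diag(np.sqrt(np.asarray(frequencies, dtype=float)))
    Xbar = X @ omega_sqrt
    U, s, _ = np.linalg.svd(Xbar, full_matrices=False)
    return U[:, :m]  # orthonormal basis, shape (p, m)

rng = np.random.default_rng(1)
vectors = rng.normal(size=(6, 4))  # N_c = 6 words in R^4
freqs = [5, 3, 1, 1, 2, 4]         # term frequencies in the context
basis = tf_weighted_subspace(vectors, freqs, m=2)
```

The square root appears because the SVD of $\overline{X}$ diagonalizes $\overline{X}\,\overline{X}^T = X \Omega X^T$, so frequent words contribute to the subspace in proportion to their full frequency.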

That was all about this paper from my side! I suggest going through the paper itself for a deeper explanation and the experimental results. The authors used the Reuters-8 dataset for experimentation, achieving higher accuracy than the standard word2vec method.