My most recent read is the research paper titled “Text Classification based on Word Subspace with Term Frequency”.
Bag of Words (BoW) is a very common and proven technique used in various learning and statistical models. However, BoW doesn’t capture the semantic meaning of text in its representation. To mitigate this problem, neural networks were used to learn word vectors, a technique called word2vec. Word2vec is well known for embedding semantic structure into vectors, where the angle between vectors indicates the semantic similarity between words. The authors propose the novel concept of a word subspace to measure similarity between texts and to represent the intrinsic variability of features in a set of word vectors. The model is further extended to incorporate term frequency (TF) using a TF weighted word subspace. Mathematically, a word subspace is defined as a low-dimensional linear subspace in a high-dimensional word vector space.
The bag of words representation comes from the hypothesis that the frequencies of words in a document can indicate the relevance of the document to a query, i.e. if a document and a query have similar frequencies for the same words, they might have a similar meaning. This representation is based on the vector space model. A document is represented by a real-valued vector in which each dimension corresponds to a different term. A term can be a single word or an n-gram.
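To make the vector space idea concrete, here is a minimal sketch in plain Python. The toy documents, the query, and the helper names `bag_of_words` and `dot` are made up for illustration and are not taken from the paper.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Return the term-frequency vector of `tokens` over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def dot(u, v):
    """Plain dot product, used here as a crude relevance score."""
    return sum(a * b for a, b in zip(u, v))

# Toy documents and query, invented for illustration.
docs = [
    "the cat sat on the mat".split(),
    "stocks fell as markets closed".split(),
]
query = "the cat on the mat".split()

# One dimension per distinct term in the corpus.
vocabulary = sorted({term for doc in docs for term in doc})
doc_vectors = [bag_of_words(doc, vocabulary) for doc in docs]
query_vector = bag_of_words(query, vocabulary)

# The query shares many terms with the first document and none with the second.
scores = [dot(query_vector, v) for v in doc_vectors]
```

Here single words are used as terms. Using n-grams would only change how the token lists are built.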
Term weights can be defined in several ways, including:
Inverse document frequency (IDF): the weight is defined by the ratio of the total number of documents, $$ | D | $$, to the number of documents that contain the term $$ w $$, $$ | D^w | $$.
If the given corpus is very large, $$ \log ( | D | / | D^w | ) $$ is used in order to dampen its effect.
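The IDF weight is easy to compute by hand. The sketch below assumes documents are given as sets of terms. The function name `idf`, its `dampen` flag, and the toy corpus are my own choices for illustration.

```python
import math

def idf(term, documents, dampen=True):
    """Inverse document frequency of `term`.

    documents: list of sets of terms.
    dampen:    if True, return log(|D| / |D^w|); otherwise the raw ratio.
    """
    n_docs = len(documents)
    n_with_term = sum(1 for doc in documents if term in doc)
    if n_with_term == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    ratio = n_docs / n_with_term
    return math.log(ratio) if dampen else ratio

# Toy corpus, invented for illustration.
documents = [
    {"cat", "mat"},
    {"cat", "dog"},
    {"stocks", "markets"},
    {"cat", "stocks"},
]

raw_dog = idf("dog", documents, dampen=False)  # rare term, large weight
raw_cat = idf("cat", documents, dampen=False)  # common term, small weight
log_stocks = idf("stocks", documents)          # dampened weight
```

A term that appears in few documents gets a larger weight than one that appears almost everywhere, and the logarithm keeps the weight from growing too quickly on large corpora.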
Latent semantic analysis (LSA) is one of the existing techniques available for text classification. It extends the vector space model by using singular value decomposition (SVD) to find a set of underlying latent variables that span the meaning of texts. It is built from a term-document matrix, in which each row represents a term and each column represents a document. This matrix can be built using the BoW model,
$$ X = [ v_1, v_2, \ldots, v_{|D|} ] $$
where $$ v_i $$ is the vector representation of the $$ i $$-th document obtained using the bag-of-words model. In this method, the term-document matrix is decomposed using the singular value decomposition,
$$ X = U \Sigma V^\top $$
where $$ U $$ and $$ V $$ are orthogonal matrices and $$ \Sigma $$ is a diagonal matrix containing the square roots of the eigenvalues of $$ X^\top X $$ and $$ X X^\top $$. LSA finds a low-rank approximation of $$ X $$ by keeping only the largest singular values. Two documents are compared using the cosine distance between their respective projections.
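The LSA steps can be sketched with NumPy. The term-document matrix below is a toy example I made up, and the variable names (`X`, `U`, `s`, `Vt`, `k`) are my own. The comparison uses cosine similarity between the projected documents.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents), invented
# for illustration. Documents 0-1 talk about pets, documents 2-3 about finance.
X = np.array([
    [2, 1, 0, 0],   # "cat"
    [1, 2, 0, 0],   # "pet"
    [0, 0, 1, 2],   # "stock"
    [0, 0, 2, 1],   # "market"
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values for the low-rank approximation.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Coordinates of each document in the k-dimensional latent space.
doc_latent = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape (n_docs, k)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_topic = cosine_similarity(doc_latent[0], doc_latent[1])
different_topic = cosine_similarity(doc_latent[0], doc_latent[2])
```

In this toy case the two pet documents collapse onto the same latent direction even though their raw term counts differ, while a pet document and a finance document end up orthogonal.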
In the authors’ formulation, words are represented as real-valued vectors obtained with word2vec. Words from similar contexts are represented by vectors close to each other, while words from different contexts are represented by vectors that are far apart. Arithmetic operations on these vectors are also meaningful, for example “king” - “man” + “woman” yields a vector close to “queen”. Consider the set of documents that belong to a given context. Each document is represented by a set of words. By assuming that all words from documents of the same context belong to the same distribution, a single set containing the words of that context is obtained. A set of word vectors is then obtained from these words using word2vec. The whole set is then modeled as a word subspace, which is a compact representation that preserves meaning. Such a word subspace is generated by applying PCA to the set of word vectors.
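Here is a rough sketch of building a word subspace. It assumes PCA is applied without mean-centering (i.e. to the autocorrelation matrix), as is common in subspace methods; please check the paper for the exact procedure. Random vectors stand in for real word2vec embeddings, and the names `word_subspace`, `X_c` and `Y_c` are my own.

```python
import numpy as np

def word_subspace(word_vectors, dim):
    """Orthonormal basis of the `dim`-dimensional word subspace.

    word_vectors: array of shape (p, n_words), one column per word vector.
    Returns an array of shape (p, dim) with orthonormal columns.

    Assumption: uncentered PCA. The left singular vectors of the data
    matrix are the eigenvectors of its autocorrelation matrix X @ X.T.
    """
    U, _, _ = np.linalg.svd(word_vectors, full_matrices=False)
    return U[:, :dim]

# Stand-in for word2vec output: 200 "word vectors" in 50 dimensions that
# lie close to a 3-dimensional subspace, plus a little noise.
rng = np.random.default_rng(0)
p, n_words = 50, 200
latent_basis = rng.normal(size=(p, 3))
X_c = latent_basis @ rng.normal(size=(3, n_words)) + 0.01 * rng.normal(size=(p, n_words))

Y_c = word_subspace(X_c, dim=3)

# Fraction of the data that is lost when projecting onto the subspace.
residual = np.linalg.norm(X_c - Y_c @ (Y_c.T @ X_c)) / np.linalg.norm(X_c)
```

The point of the compact representation is visible here: 200 vectors are summarized by 3 basis vectors with almost no loss.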
Given a set of training documents, also called a corpus, with known classes, the aim is to classify a query document into one of those classes.
In the learning stage, the words of all documents belonging to the same class are pooled (i.e. the documents are assumed to belong to the same context), resulting in one set of words per class. Each set is then modeled as a word subspace. As the number of words in each class may vary widely, the dimension of each class word subspace is not set to the same value.
In the classification stage, the query document is likewise modeled as a word subspace.
The query subspace is compared with each class subspace using canonical angles. I would suggest reading the part of the paper where the authors explain how canonical angles are calculated. This Wikipedia article is also pretty good.
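For intuition, here is a small sketch of comparing subspaces with canonical angles and picking the closest class. It relies on the standard result that the singular values of the product of two orthonormal bases are the cosines of the canonical angles. Using the mean of the squared cosines as the similarity score is my assumption, and the function names and toy subspaces are made up for illustration.

```python
import numpy as np

def canonical_cosines(Y1, Y2):
    """Cosines of the canonical angles between two subspaces.

    Y1, Y2: arrays with orthonormal columns spanning each subspace.
    """
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)   # guard against rounding slightly above 1

def subspace_similarity(Y1, Y2):
    """Mean of the squared canonical cosines (assumed similarity score)."""
    return float(np.mean(canonical_cosines(Y1, Y2) ** 2))

def classify(Y_query, class_subspaces):
    """Return the label of the class subspace most similar to the query."""
    return max(class_subspaces,
               key=lambda label: subspace_similarity(Y_query, class_subspaces[label]))

# Toy subspaces in a 4-dimensional space, invented for illustration.
e = np.eye(4)
class_subspaces = {
    "a": e[:, [0, 1]],   # span{e1, e2}
    "b": e[:, [2, 3]],   # span{e3, e4}
}
# Query subspace: span{e1, (e2 + e3) / sqrt(2)}, which leans towards class "a".
Y_query = np.column_stack([e[:, 0], (e[:, 1] + e[:, 2]) / np.sqrt(2)])

sim_a = subspace_similarity(Y_query, class_subspaces["a"])
sim_b = subspace_similarity(Y_query, class_subspaces["b"])
predicted = classify(Y_query, class_subspaces)
```

Note that the two subspaces being compared do not need to have the same dimension, since the SVD simply returns as many cosines as the smaller of the two dimensions.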
As was shown with BoW features, the frequency of words is relevant information. Incorporating TF gives a TF weighted word subspace. Consider the set of word vectors that represent the words in a context, and the set of weights that represent the frequencies of those words in the context.
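The weighted matrix is obtained as follows:
$$ \tilde{X} = X \Omega^{1/2} $$
where the columns of $$ X $$ are the word vectors and $$ \Omega $$ is a diagonal matrix containing the weights. PCA is then performed by solving the SVD of $$ \tilde{X} $$.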
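A minimal sketch of the TF weighted version follows. It assumes the weighted matrix is formed by scaling each word vector by the square root of its term frequency, so that the autocorrelation matrix counts each word as many times as it occurs. Treat that scaling as my assumption and check it against the paper. The function name and the toy data are invented.

```python
import numpy as np

def tf_weighted_word_subspace(word_vectors, term_frequencies, dim):
    """Orthonormal basis of a TF weighted word subspace.

    word_vectors:     array of shape (p, n_words), one column per distinct word.
    term_frequencies: array of shape (n_words,), non-negative weights.
    """
    X = np.asarray(word_vectors, dtype=float)
    w = np.asarray(term_frequencies, dtype=float)
    if X.shape[1] != w.shape[0]:
        raise ValueError("need exactly one weight per word vector")
    # Assumed weighting: scale column k by sqrt(w[k]), i.e. X @ diag(sqrt(w)).
    X_weighted = X * np.sqrt(w)
    U, _, _ = np.linalg.svd(X_weighted, full_matrices=False)
    return U[:, :dim]

# Two orthogonal toy "word vectors" in 3 dimensions, invented for illustration.
X_toy = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])

# The second word occurs 9 times, the first only once.
Y_freq = tf_weighted_word_subspace(X_toy, [1, 9], dim=1)

# Swapping the frequencies swaps the dominant direction.
Y_swapped = tf_weighted_word_subspace(X_toy, [9, 1], dim=1)
```

With this scaling, a word that occurs nine times contributes to the autocorrelation matrix exactly as if its vector had been repeated nine times in an unweighted set, which is why the frequent word dominates the one-dimensional subspace.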
This was all about this paper from my side! I suggest going through the paper itself for a deeper explanation and the experimental results. The authors used the Reuters-8 database for their experiments, achieving higher accuracy than the standard word2vec method.