Recently I read the paper DeepWalk: Online Learning of Social Representations. It describes a method for generating socially aware representations of the nodes in a graph using techniques borrowed from deep learning on language. The core idea is to use random walks as the source of information about the graph, treating each walk as if it were a sentence. The authors define a social representation as the latent features of a vertex that capture neighborhood similarity and community membership. These latent representations encode social relations in a continuous vector space with a relatively small number of dimensions.

Random walks have been used as a similarity measure for a variety of problems in content recommendation and community detection. This was the motivation for the authors to use a stream of short random walks as the basic tool for extracting information from a network. Besides this connection, short random walks offer two further advantages:

  • Local exploration is easy to parallelize: several random walkers can explore different parts of the graph simultaneously.
  • Relying on short walks means small changes to the graph structure can be accommodated without global recomputation; the model can be updated iteratively with new random walks from the changed region, in time sub-linear in the size of the entire graph.

The authors emphasize that the reason techniques built for modelling natural language transfer to modelling community structure in networks is that, when the degree distribution of a connected graph follows a power law (i.e. the graph is scale-free), the frequency with which vertices appear in short random walks also follows a power law, just as word frequency does in natural language.

A power law is a functional relationship between two quantities in which a relative change in one quantity produces a proportional relative change in the other, independent of the initial size of those quantities: one quantity varies as a power of the other. For example, consider the area of a square as a function of the length of its side: if the length is doubled, the area is multiplied by a factor of four.
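To make the scaling property explicit, write the relationship as $ f(x) = c\,x^k $ (a small worked example, not from the paper):

```latex
% A power law f(x) = c x^k is scale invariant: multiplying the input by a
% factor a multiplies the output by a^k, no matter what x you start from.
\[ f(ax) = c\,(ax)^k = a^k \cdot c\,x^k = a^k f(x) \]
% The square is the k = 2 case: doubling the side quadruples the area.
\[ A(2s) = (2s)^2 = 4s^2 = 4\,A(s) \]
```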

The goal of language modeling is to estimate the likelihood of a specific sequence of words appearing in a corpus. More formally, given a sequence of words

$ W_1^n = (w_0, w_1, \ldots, w_n) $

where $ w_i \in V $ ($ V $ is the vocabulary), we would like to maximize $ Pr(w_n | w_0, w_1, \ldots, w_{n-1}) $ over the entire training corpus.
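Equivalently, the chain rule factorizes the likelihood of the whole sequence into exactly these conditional terms, which is what gets maximized over the training corpus:

```latex
% Chain-rule factorization of the sequence likelihood a language model maximizes.
\[ Pr(w_0, w_1, \ldots, w_n) = Pr(w_0) \prod_{i = 1}^{n} Pr(w_i \mid w_0, w_1, \ldots, w_{i - 1}) \]
```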

The direct analog is to estimate the likelihood of observing vertex $ v_i $ given all the previous vertices visited so far in the random walk.

$ Pr(v_i | v_1, v_2, \ldots, v_{i-1}) $

The goal is to learn a latent representation, not only a probability distribution of node co-occurrences, which gives rise to a mapping function $ \phi : v \in V \to {\Bbb R}^{|V| \times d} $, where $ |V| $ is the number of vertices and $ d $ is the number of dimensions of each latent vector.
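In practice $ \phi $ is just a free parameter matrix with one $ d $-dimensional row per vertex, looked up by vertex index and adjusted during training. A minimal sketch of that idea (the variable names and the random initialization are my own assumptions, not from the paper):

```python
import numpy as np

num_vertices = 1000  # |V|, the number of vertices in the graph
dim = 64             # d, the number of latent dimensions

# phi holds one d-dimensional representation per vertex; this matrix is what
# the algorithm learns. The initialization scheme here is an assumption.
phi = (np.random.rand(num_vertices, dim) - 0.5) / dim

def representation(v):
    """Look up the latent vector of vertex v (vertices indexed 0..|V|-1)."""
    return phi[v]
```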

Getting to the algorithm itself, DeepWalk consists of two main components: a random walk generator and an update procedure. The random walk generator takes a graph $ G $ and samples a vertex $ v_i $ uniformly at random as the root of a random walk $ W_{v_i} $. The walk then repeatedly samples uniformly from the neighbors of the last vertex visited until the stipulated length $ t $ is reached. As each random walk is generated, the update part of the algorithm uses SkipGram to update the learned representations.
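Below is a minimal sketch of such a random walk generator, assuming the graph is given as an adjacency list (a dict mapping each vertex to a list of its neighbors); the function and parameter names are mine, not the paper's:

```python
import random

def random_walk(graph, root, walk_length):
    """Uniform random walk of at most `walk_length` vertices rooted at `root`."""
    walk = [root]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:              # dead end: cut the walk short
            break
        walk.append(random.choice(neighbors))
    return walk

def generate_walks(graph, walks_per_vertex, walk_length):
    """Yield a stream of walks, rooting one walk at every vertex per pass."""
    vertices = list(graph)
    for _ in range(walks_per_vertex):
        random.shuffle(vertices)       # visit the roots in a random order each pass
        for v in vertices:
            yield random_walk(graph, v, walk_length)
```

Rooting one walk at every vertex on each shuffled pass keeps coverage of the graph even; repeatedly sampling roots uniformly at random, as described above, is equivalent in spirit.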

SkipGram is a language model that maximizes the co-occurrence probability among the words that appear within a window $ w $ of each other in a sentence. It is a recent relaxation of the language modelling problem: instead of predicting a word from its context, it predicts the context from a single word, effectively turning the problem on its head. This relaxation is quite useful for speeding up training time, since only one vertex is presented at a time and the resulting models are small.
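To make that concrete, here is a sketch of how the (target, context) training pairs are read off a single walk, with the walk treated exactly like a sentence and vertices like words (a simplification that leaves out the hierarchical softmax discussed next):

```python
def skipgram_pairs(walk, window):
    """Yield (target, context) pairs of vertices at most `window` steps apart."""
    for i, target in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, walk[j]
```

Since each walk is just a sequence of tokens, the walks can also be handed directly, as lists of string node ids, to an off-the-shelf SkipGram implementation such as gensim's `Word2Vec`.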

To decrease the computation time, hierarchical softmax is used: the vertices are assigned to the leaves of a binary tree, and predicting a vertex then amounts to maximizing the probability of the specific path through the tree that leads to it. If the path to vertex $ u_k $ is identified by a sequence of tree nodes $ (b_0, b_1, \ldots, b_{\lceil \log|V| \rceil}) $, where $ b_0 $ is the root and $ b_{\lceil \log|V| \rceil} = u_k $, then

$ Pr(u_k | \phi(v_j)) = \prod_{l = 1}^{\lceil \log|V| \rceil} Pr(b_l | \phi(v_j)) $
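Each factor $ Pr(b_l | \phi(v_j)) $ is a binary left-or-right decision at an internal tree node, typically modelled with a logistic (sigmoid) unit whose parameters live at that node. A rough sketch under those assumptions (how the tree is built and how its parameters are stored are details I am filling in, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_probability(phi_vj, path_nodes, directions, node_params):
    """Pr(u_k | phi(v_j)) as a product of binary decisions along the tree path.

    phi_vj      -- the d-dimensional representation of the input vertex v_j
    path_nodes  -- the internal tree nodes on the path from the root to leaf u_k
    directions  -- +1 or -1 for the branch taken at each internal node
    node_params -- dict mapping each internal node to its own d-dimensional vector
    """
    prob = 1.0
    for node, sign in zip(path_nodes, directions):
        prob *= sigmoid(sign * np.dot(node_params[node], phi_vj))
    return prob
```

This reduces the cost of evaluating one prediction from $ O(|V|) $ to $ O(\log|V|) $, which is the whole point of the hierarchy.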

The model parameters are further optimised using Stochastic Gradient Descent.
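A minimal sketch of one such step, continuing the hierarchical softmax sketch above: the loss for a (target, context) pair is the negative log of the path probability, and both the vertex representation and the tree-node parameters move a small step against its gradient (the default learning rate here is an arbitrary placeholder):

```python
import numpy as np

def sigmoid(x):                      # same helper as in the previous sketch
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(phi_vj, path_nodes, directions, node_params, lr=0.025):
    """One SGD step on -log Pr(u_k | phi(v_j)) for a single (target, context) pair.

    Updates phi_vj and the entries of node_params in place.
    """
    grad_phi = np.zeros_like(phi_vj)
    for node, sign in zip(path_nodes, directions):
        psi = node_params[node]
        # For the loss -log sigmoid(sign * psi.phi), the common factor in both
        # gradients is (1 - sigmoid(sign * psi.phi)) * sign.
        g = (1.0 - sigmoid(sign * np.dot(psi, phi_vj))) * sign
        grad_phi -= g * psi                        # accumulate d(loss)/d(phi)
        node_params[node] = psi + lr * g * phi_vj  # descend on the node's parameters
    phi_vj -= lr * grad_phi                        # descend on the representation
```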


That’s it for this post; I hope the read was helpful. The authors also describe how they parallelize the algorithm, so if you’re interested in that, I encourage you to read the paper!