This week I read the paper that details the algorithm Pinterest currently has in production for generating recommendations. They use Graph Convolutional Networks (GCNs) to accomplish this, but another challenge they overcame was making GCNs scale to production level.

The authors describe the algorithm through three components: efficient on-the-fly convolutions, producer-consumer minibatch construction, and efficient MapReduce inference. In this post I’m going to cover only the component concerned with the convolutions.

Graph Convolutional Networks

The core idea behind GCNs is to learn how to iteratively aggregate feature information from local graph neighborhoods using neural networks. Here a single “convolution” operation transforms and aggregates feature information from a node’s one-hop graph neighborhood, and by stacking multiple such convolutions information can be propagated across far reaches of a graph. Currently, most graph neural network models share a somewhat universal architecture. The goal of these models is to learn a function of signals/features on a graph $ G = (V, E) $ which takes as input:

* A set of features $ x_i $ for each node $ i $, i.e. an $ N \times D $ matrix $ X $.
* A representative description of the graph structure in matrix form, typically an adjacency matrix $ A $ (or some transformation of it).

The model produces an output $ Z $, a matrix of node-level feature vectors (embeddings). Every neural network layer can then be written as a non-linear function: $ H^{(l + 1)} = f(H^{(l)}, A) $ with $ H^{(0)} = X $ and $ H^{(L)} = Z $, $ L $ being the number of layers.
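To make the propagation rule concrete, here is a minimal NumPy sketch of one such layer, using the common choice $ f(H, A) = \sigma(\hat{A} H W) $ with a symmetrically normalized adjacency matrix. The toy graph, dimensions, and weights are made up for illustration; this is not the paper's exact layer.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One propagation step H_next = f(H, A) = ReLU(A_hat @ H @ W),
    where A_hat is the symmetrically normalized adjacency with self-loops
    (a common choice for f; the formulation above leaves f generic)."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # normalize so feature scales stay stable
    return np.maximum(A_norm @ H @ W, 0.0)       # ReLU non-linearity

# Toy example: 4 nodes, 3 input features, 2 output features (all made up).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)        # adjacency matrix
X = np.random.rand(4, 3)                         # N x D feature matrix
W = np.random.rand(3, 2)                         # learnable weights
Z = gcn_layer(X, A, W)                           # N x 2 output embeddings
```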

Thus GCNs can capture both a node’s properties and the graph structure it is positioned in. A totally awesome article on Graph Convolutional Networks, which is the source of the above formulation and contains a worked example, can be found here.

Pinterest’s recommendation engine

Pinterest is a content sharing and discovery platform where users interact with pins, which are visual bookmarks to online content. Users organize these pins thematically into collections (boards).

> Altogether, the Pinterest graph contains 2 billion pins, 1 billion boards, and over 18 billion edges (i.e., memberships of pins to their corresponding boards).

The authors’ primary mission was to generate high-quality embeddings, or representations, of pins that can be used for recommendation tasks.

The Pinterest environment is modeled as a bipartite graph consisting of nodes in two disjoint sets, $ I $ being viewed as a set of items (pins) and $ C $ as a set of user-defined contexts or collections (boards). Each pin $ u \in I $ has associated real-valued features $ x_u \in {\Bbb R}^{d} $, which can be metadata or content information about the pin.

The core idea of PinSage lies in its local convolution mechanism. The representations $ z_v, \forall v \in N(u) $, where $ N(u) $ is $u$’s neighborhood, are passed through a dense neural network, and an aggregator function is then applied to the resulting set of vectors. This aggregation step provides a vector representation $ n_u $ of $u$’s local neighborhood, which is then concatenated with $u$’s current representation $ h_u $ and transformed using another dense neural network layer. The output of the algorithm is a representation of $u$ that incorporates both information about the node itself and its neighborhood.
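A minimal sketch of this local convolution step, with function and parameter names of my own choosing rather than the paper's. The aggregator is passed in as an argument so the importance-pooling variant described below can be plugged in.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def convolve(h_u, neighbor_hs, neighbor_weights, Q, q, W, w, aggregate):
    """Sketch of the local convolution described above (names are mine).

    h_u              : current representation of node u
    neighbor_hs      : (T, d) matrix of representations z_v for v in N(u)
    neighbor_weights : importance weights for each neighbor (see below)
    Q, q / W, w      : weights and biases of the two dense layers
    aggregate        : aggregator over the transformed neighbor vectors
    """
    transformed = relu(neighbor_hs @ Q + q)           # dense layer on each neighbor
    n_u = aggregate(transformed, neighbor_weights)    # neighborhood vector n_u
    combined = np.concatenate([h_u, n_u])             # concat with u's own representation
    h_new = relu(combined @ W + w)                    # second dense layer
    return h_new / np.linalg.norm(h_new)              # the paper also normalizes the output
```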

An important part of this algorithm is defining the “neighborhood” of a node. Previous GCNs have simply used k-hop neighborhoods. In PinSage, the neighborhood of a node $u$ is defined as the $ T $ nodes that exert the most influence on it: the top $ T $ nodes with the highest $ L_1 $-normalized visit counts from random walks starting at node $ u $.
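A rough sketch of how such an importance-based neighborhood could be computed, assuming the graph is given as an adjacency list. The walk count and walk length here are arbitrary placeholders, not the paper's settings, and the exact walk simulation differs from the paper's.

```python
import random
from collections import Counter

def importance_neighborhood(adj, u, T, num_walks=100, walk_length=5):
    """Approximate the T most 'important' neighbors of u via short random
    walks started at u, together with L1-normalized visit counts.
    (Sketch only; parameters and walk details are illustrative.)"""
    visits = Counter()
    for _ in range(num_walks):
        node = u
        for _ in range(walk_length):
            neighbors = adj[node]
            if not neighbors:
                break
            node = random.choice(neighbors)
            if node != u:
                visits[node] += 1
    top = visits.most_common(T)                       # T most visited nodes
    total = sum(count for _, count in top) or 1
    nodes = [n for n, _ in top]
    weights = [count / total for _, count in top]     # L1-normalized visit counts
    return nodes, weights
```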

The advantages of this importance-based neighborhood definition are two-fold. First, selecting a fixed number of nodes to aggregate from allows us to control the memory footprint of the algorithm during training. Second, it allows us to take into account the importance of neighbors when aggregating the vector representations of neighbors. In particular, we implement this as a weighted-mean, with weights defined according to the $ L_1 $-normalized visit counts. We refer to this new method as importance pooling.
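A minimal sketch of that importance-pooling aggregator, which could be passed as the `aggregate` argument of the `convolve` sketch above (again, names are mine):

```python
import numpy as np

def importance_pooling(transformed_neighbors, weights):
    """Weighted mean of the transformed neighbor vectors, with weights given
    by the L1-normalized random-walk visit counts (importance pooling).
    Since the weights already sum to 1, the weighted sum is the weighted mean."""
    weights = np.asarray(weights)[:, None]            # shape (T, 1) for broadcasting
    return (weights * transformed_neighbors).sum(axis=0)
```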

These convolutions are stacked, i.e. applied successively on the data, so that each additional layer draws in information from a wider region of the graph.
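In terms of the sketches above, the stacking might look roughly like this naive recursive version, reusing `convolve`, `importance_neighborhood`, and `importance_pooling` from earlier. It is purely illustrative: the real system works on minibatches, pre-samples the multi-hop neighborhoods, and learns the per-layer parameters.

```python
import numpy as np

def embed(u, features, adj, layer_params, T=50):
    """Representation of node u after stacking len(layer_params) convolutions.
    Layer k uses the layer-(k-1) representations of u and its neighbors,
    computed recursively. Assumes every visited node has at least one neighbor."""
    def h(node, layer):
        if layer == 0:
            return features[node]                     # layer 0 = raw node features
        Q, q, W, w = layer_params[layer - 1]          # parameters of this layer
        neighbors, weights = importance_neighborhood(adj, node, T)
        neighbor_hs = np.stack([h(v, layer - 1) for v in neighbors])
        return convolve(h(node, layer - 1), neighbor_hs, weights,
                        Q, q, W, w, importance_pooling)
    return h(u, len(layer_params))
```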

Now, given the generated embeddings and a query item $ q $, recommendations are obtained by taking the K nearest neighbors of the query item’s embedding.
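As a sketch, with brute-force cosine similarity standing in for the approximate nearest-neighbor index a production system would use (function and variable names are placeholders):

```python
import numpy as np

def recommend(query_embedding, item_embeddings, k=10):
    """Return indices of the K items whose embeddings are closest to the query.
    Note: if the query item itself is in item_embeddings, it will appear at the
    top of the ranking and should be filtered out."""
    norms = np.linalg.norm(item_embeddings, axis=1) * np.linalg.norm(query_embedding)
    scores = item_embeddings @ query_embedding / norms   # cosine similarity
    return np.argsort(-scores)[:k]                       # indices of top-k items
```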


That’s it for this post; I hope it gives a satisfactory explanation of PinSage and graph convolutional networks. This post is limited to the GCN part of the paper. I’ll try to read the rest of it to understand the scalability and training aspects of PinSage.