14.3. Matrix Factorization

The first version of the matrix factorization model was proposed by Simon Funk in a famous blog post in which he described the idea of factorizing the interaction matrix. It then became widely known due to the Netflix contest. In 2006, Netflix, a media-streaming and video-rental company, announced a contest to improve its recommender system: any team that could improve on the Netflix baseline, Cinematch, by 10 percent would win a one-million-USD prize. The contest attracted a lot of attention and fueled a surge of interest in recommender system research. The grand prize was eventually won by BellKor’s Pragmatic Chaos, a combined team of BellKor, Pragmatic Theory, and BigChaos (you do not need to worry about these algorithms now). Although it was an ensemble method (a combination of many algorithms) that finally achieved the best score, matrix factorization played a critical role in the final blend. The technical report of the Netflix Grand Prize solution provides a detailed introduction to the employed models. In this section, we will dive into the details of the matrix factorization model and its implementation.

14.3.1. The Matrix Factorization Model

Matrix factorization (Koren et al., 2009) is a class of collaborative filtering models. Simply put, the model factorizes the user-item interaction matrix (e.g., the rating matrix) into the product of two lower-rank matrices.

Let \(\mathbf{R} \in \mathbb{R}^{m \times n}\) denote the interaction matrix with \(m\) users and \(n\) items, where the values of \(\mathbf{R}\) represent explicit ratings. It will be factorized into a user latent matrix \(\mathbf{P} \in \mathbb{R}^{m \times k}\) and an item latent matrix \(\mathbf{Q} \in \mathbb{R}^{n \times k}\), where \(k \ll m, n\) is the latent factor size. For a given item \(i\), the elements of \(\mathbf{Q}_i\) measure the extent to which the item possesses characteristics such as the genres and languages of a movie. For a given user \(u\), the elements of \(\mathbf{P}_u\) measure the extent of the user’s interest in the corresponding characteristics of items. These factors might measure obvious dimensions, as in those examples, or they might be completely uninterpretable. The predicted ratings can be estimated by

(14.3.1)\[\hat{\mathbf{R}} = \mathbf{PQ}^\top\]

where \(\hat{\mathbf{R}}\in \mathbb{R}^{m \times n}\) is the predicted rating matrix, which has the same shape as \(\mathbf{R}\). One major problem of this prediction rule is that user and item biases cannot be modeled. For example, some users tend to give higher ratings, and some items always receive lower ratings due to poorer quality. Such biases are commonplace in real-world applications. To capture them, user-specific and item-specific bias terms are introduced. Specifically, the predicted rating that user \(u\) gives to item \(i\) is calculated by

(14.3.2)\[\hat{\mathbf{R}}_{ui} = \mathbf{P}_u\mathbf{Q}^\top_i + b_u + b_i\]

Then, we train the matrix factorization model by minimizing the mean squared error between the predicted rating scores and the real rating scores. The objective function is defined as follows:

(14.3.3)\[\underset{\mathbf{P}, \mathbf{Q}, b}{\mathrm{argmin}} \sum_{(u, i) \in \mathcal{K}} \| \mathbf{R}_{ui} - \hat{\mathbf{R}}_{ui} \|^2 + \lambda (\| \mathbf{P} \|^2_F + \| \mathbf{Q} \|^2_F + b_u^2 + b_i^2 )\]

where \(\lambda\) denotes the regularization rate. The regularization term \(\lambda (\| \mathbf{P} \|^2_F + \| \mathbf{Q} \|^2_F + b_u^2 + b_i^2 )\) is used to avoid overfitting by penalizing the magnitudes of the parameters. The \((u, i)\) pairs for which \(\mathbf{R}_{ui}\) is known are stored in the set \(\mathcal{K}=\{(u, i) \mid \mathbf{R}_{ui} \text{ is known}\}\). The model parameters can be learned with an optimization algorithm such as SGD or Adam.
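For intuition, consider minimizing (14.3.3) with plain SGD, one observed pair \((u, i)\) at a time. Writing the prediction error as \(e_{ui} = \mathbf{R}_{ui} - \hat{\mathbf{R}}_{ui}\) and absorbing constant factors into the learning rate \(\eta\), the per-example updates take the classic form (a sketch; \(\eta\) and \(e_{ui}\) are introduced here only for this derivation):

\[\begin{aligned} \mathbf{P}_u &\leftarrow \mathbf{P}_u + \eta\,(e_{ui}\mathbf{Q}_i - \lambda \mathbf{P}_u), & \mathbf{Q}_i &\leftarrow \mathbf{Q}_i + \eta\,(e_{ui}\mathbf{P}_u - \lambda \mathbf{Q}_i),\\ b_u &\leftarrow b_u + \eta\,(e_{ui} - \lambda b_u), & b_i &\leftarrow b_i + \eta\,(e_{ui} - \lambda b_i).\end{aligned}\]

In the implementation below we do not code these updates by hand; Gluon’s autograd and optimizer take care of them.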

An intuitive illustration of the matrix factorization model is shown below:

Fig. 14.3.1 Illustration of the matrix factorization model

In the rest of this section, we will explain the implementation of matrix factorization and train the model on the MovieLens dataset.

import d2l
from mxnet import autograd, init, gluon, np, npx
from mxnet.gluon import nn
import mxnet as mx
npx.set_np()

14.3.2. Model Implementation

First, we implement the matrix factorization model described above. The user and item latent factors can be created with nn.Embedding. The input_dim is the number of users/items and the output_dim is the dimension of the latent factors \(k\). We can also use nn.Embedding to create the user/item biases by setting the output_dim to one. In the forward function, user and item IDs are used to look up the embeddings.

class MF(nn.Block):
    def __init__(self, num_factors, num_users, num_items, **kwargs):
        super(MF, self).__init__(**kwargs)
        # Latent factor matrices P (users) and Q (items)
        self.P = nn.Embedding(input_dim=num_users, output_dim=num_factors)
        self.Q = nn.Embedding(input_dim=num_items, output_dim=num_factors)
        # Scalar bias terms b_u and b_i
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)

    def forward(self, user_id, item_id):
        # Look up the latent factors and biases for the given IDs
        P_u = self.P(user_id)
        Q_i = self.Q(item_id)
        b_u = self.user_bias(user_id)
        b_i = self.item_bias(item_id)
        # R_hat_ui = <P_u, Q_i> + b_u + b_i, as in (14.3.2)
        outputs = (P_u * Q_i).sum(axis=1) + np.squeeze(b_u) + np.squeeze(b_i)
        return outputs.flatten()
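As a quick sanity check (a minimal sketch; the sizes and IDs below are arbitrary), we can verify that the model returns one scalar prediction per (user, item) pair:

toy_net = MF(num_factors=10, num_users=5, num_items=7)
toy_net.initialize(init=mx.init.Normal(0.01))
users = np.array([0, 1, 2], dtype='int')
items = np.array([3, 4, 5], dtype='int')
print(toy_net(users, items).shape)  # (3,): one predicted rating per pair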

14.3.3. Evaluation Measures

We then implement the RMSE (root-mean-square error) measure, which is commonly used to measure the differences between rating scores predicted by the model and the actually observed ratings (ground truth). RMSE is defined as:

(14.3.4)\[\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|}\sum_{(u,i) \in \mathcal{T}}(\mathbf{R}_{ui} -\hat{\mathbf{R}}_{ui})^2}\]

where \(\mathcal{T}\) is the set consisting of pairs of users and items that you want to evaluate on. \(|\mathcal{T}|\) is the size of this set. We can use the RMSE function provided by mx.metric.

def evaluator(net, test_iter, ctx):
    rmse = mx.metric.RMSE()  # Accumulates the RMSE over all batches
    for idx, (users, items, ratings) in enumerate(test_iter):
        # Distribute each batch across the available devices
        u = gluon.utils.split_and_load(users, ctx, even_split=False)
        i = gluon.utils.split_and_load(items, ctx, even_split=False)
        r_ui = gluon.utils.split_and_load(ratings, ctx, even_split=False)
        r_hat = [net(u, i) for u, i in zip(u, i)]
        rmse.update(labels=r_ui, preds=r_hat)
    # mx.metric.RMSE accumulates across updates, so the final value is the
    # RMSE over the whole test set
    return float(rmse.get()[1])
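To see exactly what mx.metric.RMSE computes, here is a toy example (the numbers are made up). Both predictions are off by \(0.5\), so the metric returns \(\sqrt{(0.5^2 + 0.5^2)/2} = 0.5\):

toy_rmse = mx.metric.RMSE()
toy_rmse.update(labels=[np.array([3.0, 4.0])], preds=[np.array([2.5, 4.5])])
print(toy_rmse.get())  # ('rmse', 0.5)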

14.3.4. Training and Evaluating the Model

In the training function, we adopt the \(L_2\) loss with weight decay. The weight decay mechanism has the same effect as the \(L_2\) regularization in (14.3.3), as the following one-step derivation shows.
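Take one SGD step on an \(L_2\)-regularized loss \(\ell(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2\) with learning rate \(\eta\):

\[\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla_{\mathbf{w}}\Big(\ell(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2\Big) = (1 - \eta\lambda)\,\mathbf{w} - \eta\,\nabla_{\mathbf{w}}\ell(\mathbf{w}),\]

that is, the weights decay by a factor of \(1 - \eta\lambda\) before the ordinary gradient step. (Adaptive optimizers such as Adam rescale the gradient, so there the two are only approximately equivalent.)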

# Saved in the d2l package for later use
def train_recsys_rating(net, train_iter, test_iter, loss, trainer, num_epochs,
                        ctx_list=d2l.try_all_gpus(), evaluator=None,
                        **kwargs):
    num_batches, timer = len(train_iter), d2l.Timer()
    animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs], ylim=[0, 2],
                            legend=['train loss', 'test RMSE'])
    for epoch in range(num_epochs):
        metric, l = d2l.Accumulator(3), 0.
        for i, values in enumerate(train_iter):
            timer.start()
            input_data = []
            values = values if isinstance(values, list) else [values]
            # Distribute each field of the batch across the devices
            for v in values:
                input_data.append(gluon.utils.split_and_load(v, ctx_list))
            # The last field is the label (rating); the rest are features
            train_feat = input_data[0:-1] if len(values) > 1 else input_data
            train_label = input_data[-1]
            with autograd.record():
                preds = [net(*t) for t in zip(*train_feat)]
                ls = [loss(p, s) for p, s in zip(preds, train_label)]
            [l.backward() for l in ls]
            l += sum([l.asnumpy() for l in ls]).mean() / len(ctx_list)
            trainer.step(values[0].shape[0])
            metric.add(l, values[0].shape[0], values[0].size)
            timer.stop()
        if len(kwargs) > 0:  # It will be used in the section on AutoRec
            test_rmse = evaluator(net, test_iter, kwargs['inter_mat'],
                                  ctx_list)
        else:
            test_rmse = evaluator(net, test_iter, ctx_list)
        train_l = l / (i + 1)
        animator.add(epoch + 1, (train_l, test_rmse))
    print('train loss %.3f, test RMSE %.3f'
          % (metric[0] / metric[1], test_rmse))
    print('%.1f examples/sec on %s'
          % (metric[2] * num_epochs / timer.sum(), ctx_list))

Finally, let’s put all things together and train the model. Here, we set the latent factor dimension to 50.

ctx = d2l.try_all_gpus()
num_users, num_items, train_iter, test_iter = d2l.split_and_load_ml100k(
    test_ratio=0.1, batch_size=128)
net = MF(50, num_users, num_items)
net.initialize(ctx=ctx, force_reinit=True, init=mx.init.Normal(0.01))
lr, num_epochs, wd, optimizer = 0.001, 25, 1e-5, 'adam'
loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {"learning_rate": lr, 'wd': wd})
train_recsys_rating(net, train_iter, test_iter, loss, trainer, num_epochs,
                    ctx, evaluator)
train loss 0.587, test RMSE 1.061
20458.2 examples/sec on [gpu(0), gpu(1)]

Below, we use the trained model to predict the rating that a user (ID 20) might give to an item (ID 30).

scores = net(np.array([20], dtype='int', ctx=d2l.try_gpu()),
             np.array([30], dtype='int', ctx=d2l.try_gpu()))
print(scores)
[3.007846] @gpu(0)
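As a hand check (a minimal sketch that reads the learned parameters directly), the same score can be recomputed from (14.3.2):

u, i = 20, 30
p_u = net.P.weight.data(ctx[0])[u]
q_i = net.Q.weight.data(ctx[0])[i]
b_u = net.user_bias.weight.data(ctx[0])[u]
b_i = net.item_bias.weight.data(ctx[0])[i]
# <P_u, Q_i> + b_u + b_i, as in (14.3.2); should match the score above
print((p_u * q_i).sum() + b_u.squeeze() + b_i.squeeze())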

14.3.5. Summary

  • The matrix factorization model is widely used in recommender systems. It can be used to predict ratings that a user might give to an item.
  • We can implement and train matrix factorization for recommender systems.

14.3.6. Exercises

  • Vary the size of latent factors. How does the size of latent factors influence the model performance?
  • Try different optimizers, learning rates, and weight decay rates.
  • Check the predicted rating scores of other users for a specific movie.

14.3.7. References

  • Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.