Probabilistic Matrix Factorization

This paper1 presents the Probabilistic Matrix Factorization (PMF) model, which scales linearly with the number of observations and performs well on large, sparse, and imbalanced datasets. Before going into the details of this model, let’s first discuss why it is needed at all.

Recommendation systems enhance user experiences by suggesting relevant items based on preferences. They primarily use two traditional methods: collaborative filtering, which recommends items liked by other users with similar tastes, and content-based filtering, which recommends items similar to those a user has liked before. The most popular approach to collaborative filtering is based on low-dimensional factor models.

Many of these low-dimensional models are probabilistic factor-based models: they attempt to explain the observations through underlying hidden (latent) factors that represent latent characteristics of users and items. The problem with these models is that exact inference is intractable, meaning that determining the distribution of the hidden factors given the observed data is computationally impractical. This necessitates the use of approximations to estimate the posterior distribution of the hidden factors from the observed data. Singular Value Decomposition (SVD) is a low-rank approximation algorithm based on minimizing sum-squared distance: it finds the matrix \(\hat R=U^TV\) of a given rank that minimizes the sum-squared distance to the target matrix \(R\).

Collaborative filtering algorithms need to scale linearly with the number of observations and perform well on very sparse and imbalanced datasets. Most of the aforementioned methods prove infeasible on both counts. Firstly, except for the matrix-factorization-based methods, none scale well to large datasets. Secondly, when most of the rating entries are missing, they produce inaccurate predictions. This is where PMF comes into the picture: it solves both problems, scaling linearly with the number of observations and performing well on very sparse and imbalanced datasets.

PMF algorithm

Suppose we have \(M\) movies and \(N\) users, and let \(R_{ij}\) denote the rating of user \(i\) for movie \(j\) (for simplicity, rating values are assumed to be integers in \([1, K]\)).

$$U \in R^{D\times N}$$ (latent user matrix)
$$V \in R^{D\times M}$$ (latent movie matrix)


We assume a probabilistic linear model with Gaussian observation noise: the ratings are linear in the latent features, but the observed values deviate from this linear relationship due to random noise. The conditional distribution over the observed ratings is defined as:

$$ p(R|U, V, \sigma ^2)=\prod_{i = 1}^{N}\prod_{j = 1}^{M}\left[ N(R_{ij}|U^T_iV_j, \sigma ^2)\right]^{I_{ij}} $$

where,
\(\hspace{20pt} N(x|\mu, \sigma ^2)\) denotes the probability density function of the Gaussian distribution with mean \(\mu\) and variance \(\sigma ^2\)
\(\hspace{20pt}I_{ij}=1\), if user \(i\) rated movie \(j\)
\(\hspace{33pt}=0\), otherwise
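As a concrete illustration, here is a minimal NumPy sketch (not from the paper) of this log-likelihood; `R`, `I`, `U`, `V`, and `sigma2` are assumed arrays and values matching the notation above.

```python
import numpy as np

def log_likelihood(R, I, U, V, sigma2):
    """Sum of ln N(R_ij | U_i^T V_j, sigma2) over the observed entries (I_ij = 1)."""
    pred = U.T @ V                        # N x M matrix of predicted means U_i^T V_j
    n_obs = I.sum()                       # number of observed ratings
    sq_err = (I * (R - pred) ** 2).sum()  # squared error on observed entries only
    return -0.5 * sq_err / sigma2 - 0.5 * n_obs * np.log(2 * np.pi * sigma2)
```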

We also place zero-mean spherical Gaussian priors on the user and movie feature vectors:

\(\hspace{60pt} p(U|\sigma ^2_{U})=\prod_{i=1}^{N} N(U_i|0, \sigma ^2_{U}I), \quad p(V|\sigma ^2_{V})=\prod_{j=1}^{M} N(V_j|0, \sigma ^2_{V}I)\)
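Viewed generatively, the priors and the likelihood together say: draw each latent vector from a zero-mean spherical Gaussian, then add Gaussian noise to \(U_i^TV_j\) for every observed entry. A small sketch of this generative view; the dimensions, variances, and variable names below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 5, 4, 3                         # users, movies, latent dimensionality (illustrative)
sigma2, sigma2_U, sigma2_V = 0.25, 1.0, 1.0

U = rng.normal(0.0, np.sqrt(sigma2_U), size=(D, N))   # columns U_i ~ N(0, sigma2_U * I)
V = rng.normal(0.0, np.sqrt(sigma2_V), size=(D, M))   # columns V_j ~ N(0, sigma2_V * I)
I = rng.integers(0, 2, size=(N, M))                   # observation mask I_ij
R = (U.T @ V + rng.normal(0.0, np.sqrt(sigma2), size=(N, M))) * I  # noisy observed ratings
```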

Using Bayes’ rule, the posterior distribution over the latent vectors \(U_i\) and \(V_j\), given the observed ratings \(R\), is proportional to the product of the likelihood of the observed data and the priors. Taking the log of the posterior distribution over the user and movie features, we get:

$$\ln p(U, V|R, \sigma ^2, \sigma ^2_V, \sigma ^2_U) = - \frac{1}{2\sigma ^2}\sum_{i = 1}^{N}\sum_{j = 1}^{M}I_{ij}\left(R_{ij}-U_i^TV_j\right)^2 - \frac{1}{2\sigma ^2_U}\sum_{i = 1}^{N}U_i^TU_i - \frac{1}{2\sigma ^2_V}\sum_{j = 1}^{M}V_j^TV_j - \frac{1}{2}\left( \left( \sum_{i = 1}^{N}\sum_{j = 1}^{M}I_{ij}\right )\ln \sigma ^2 + ND \ln \sigma ^2_U + MD \ln \sigma ^2_V \right) + C$$

where \(C\) is a constant that does not depend on the parameters.

Maximizing the log-posterior over the movie and user features with the hyperparameters (observation noise variance \(\sigma^2\) and prior variances \(\sigma^2_U, \sigma^2_V\)) kept fixed is equivalent to minimizing the sum-of-squared-errors objective function with quadratic regularization terms:

\(\hspace{40pt} E = \frac{1}{2}\sum_{i = 1}^{N}\sum_{j = 1}^{M}I_{ij}(R_{ij}-U_i^TV_j)^2 + \frac{\lambda _U}{2}\sum_{i = 1}^{N} ||U_i||^2_{Fro} + \frac{\lambda _V}{2}\sum_{j = 1}^{M} ||V_j||^2_{Fro}\)

\(\hspace{150pt}\) where, \(\lambda _U=\frac{\sigma ^2}{\sigma ^2_U}, \lambda _V=\frac{\sigma ^2}{\sigma ^2_V}\)
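A sketch of this objective in NumPy, under the same assumed arrays as before (`R`, `I`, `U`, `V`), with `lambda_U` and `lambda_V` as the regularization weights defined above:

```python
import numpy as np

def objective(R, I, U, V, lambda_U, lambda_V):
    """Regularized sum-of-squared-errors E for given latent matrices."""
    pred = U.T @ V                            # predicted ratings U_i^T V_j
    fit = 0.5 * (I * (R - pred) ** 2).sum()   # data-fit term over observed entries
    reg = 0.5 * lambda_U * (U ** 2).sum() + 0.5 * lambda_V * (V ** 2).sum()  # quadratic regularizers
    return fit + reg
```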

A local minimum of this objective function can be found by performing gradient descent in \(U\) and \(V\).
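For illustration, here is a minimal batch gradient-descent sketch. The gradients follow directly from \(E\); the initialization scale, learning rate, and iteration count are arbitrary assumptions, not values from the paper.

```python
import numpy as np

def fit_pmf(R, I, D=10, lambda_U=0.1, lambda_V=0.1, lr=0.005, n_iters=500, seed=0):
    """Batch gradient descent on E; returns the latent user and movie matrices."""
    rng = np.random.default_rng(seed)
    N, M = R.shape
    U = 0.1 * rng.standard_normal((D, N))     # latent user matrix, small random init
    V = 0.1 * rng.standard_normal((D, M))     # latent movie matrix, small random init
    for _ in range(n_iters):
        err = I * (R - U.T @ V)               # residuals on observed entries only
        grad_U = -V @ err.T + lambda_U * U    # dE/dU, shape D x N
        grad_V = -U @ err + lambda_V * V      # dE/dV, shape D x M
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V
```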

If all the ratings have been observed, the parameters (e.g., the user and movie latent features) can be estimated with higher confidence and potentially lower variance.

The PMF model extends traditional SVD by incorporating probabilistic elements (Gaussian noise in the observed ratings, Gaussian priors on the latent features). This probabilistic framework accounts for uncertainty in both the observations (ratings) and the parameter estimates (latent features), offering a more flexible and potentially more robust approach than deterministic SVD.

In a typical real-world scenario, not all user-item ratings are available. However, even if, hypothetically, all ratings were observed, the PMF model would still need to balance fitting the observed data (through the likelihood term) against maintaining reasonable model complexity (through the prior regularization terms).

If the prior variances are made very large, the influence of the priors is effectively weakened, making them non-informative: the latent features are less regularized and the model focuses more on fitting the observed data. In the limit, as the prior variances go to infinity, the regularization effect vanishes and the PMF model essentially reduces to performing an SVD.
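Concretely, since \(\lambda _U=\frac{\sigma ^2}{\sigma ^2_U}\) and \(\lambda _V=\frac{\sigma ^2}{\sigma ^2_V}\), letting \(\sigma ^2_U, \sigma ^2_V \to \infty\) drives both regularization weights to zero, leaving only the data-fit term:

$$\lim_{\sigma ^2_U,\, \sigma ^2_V \to \infty} E = \frac{1}{2}\sum_{i = 1}^{N}\sum_{j = 1}^{M}I_{ij}\left(R_{ij}-U_i^TV_j\right)^2$$

which is exactly the unregularized sum-squared error over the observed entries that SVD minimizes.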

  1. R. Salakhutdinov and A. Mnih, Probabilistic Matrix Factorization, NIPS 2007