MAY 04, 2023

Recommender System Series Part 2

DevOps & Continuous Engineering

DevOps

Technical Blog

Artificial Intelligence & Machine Learning

Cloud & SaaS Enablement

Cloud Enablement

Daniel Pinheiro Franco

In the previous post, we introduced the concepts of Recommender Systems and Content-based Filtering. In this post, we will delve deeper into Collaborative Filtering methods with a focus on Neighborhood-based methods. To make it more interesting, this post also comes with an accompanying Jupyter Notebook that showcases practical examples of these methods using real-world data from Amazon Reviews.

Neighborhood-based Collaborative Filtering

Remember the definition from our previous post:

Collaborative Filtering (CF) methods collect preferences in the form of ratings or signals from many users (hence the name) and then recommend items to a user based on item interactions that people have with similar tastes as this user had in the past. In other words, these methods assume that if person X likes a subset of items that person Y likes, then X is more likely to have the same opinion as Y for a given item compared to a random person that may or may not have the same preferences.

The main idea with neighborhood-based methods is to leverage either user-user similarity or item-item similarity to make recommendations. These methods assume that similar users tend to have similar behaviors when rating items. We can also expand this assumption to items as well: similar items tend to receive similar ratings from the same user.

In these methods, the interactions between users and items are generally represented by a user-item matrix, where each row represents a user and each column represents an item, while the cells represent the interaction between the two, which, in most cases, are the item ratings made by users. In this context, we can define two types of neighborhood-based methods:

User-based Collaborative Filtering: Ratings given by users like a user U are used to make recommendations. More specifically, to predict U's rating for a given item I, we calculate the weighted average of the rating r of k similar users (neighbors) to U, where the weights are determined by the similarity between U and each of the similar users.
Item-based Collaborative Filtering: Ratings of a group of similar items are used to make recommendations for item I. Similarly, to predict I's rating given by a user U, we calculate the weighted average of the rating r of k similar items (neighbors) to I, where the weights are determined by the similarity between I and each of the similar items.

Comparison between User-based and Item-based Methods

The difference is subtle, but user-based collaborative filtering predicts a user’s rating by using the ratings of neighboring users, while item-based collaborative filtering leverages the user's ratings on neighboring items, which allows for more consistent predictions because it follows the rating behaviors of that user. In the former case, the similarity is calculated between the rows of the user-item matrix, while the latter looks at similarities between the columns of the matrix.

These approaches also differ in the way they solve problems. It is common to use the item neighborhood to recommend a list of top k items to a user. On the other hand, it is interesting to retrieve the top k users from a segment to target them for marketing campaigns.

To understand the reasoning behind a recommendation, item-based methods provide better explanations than user-based methods. This is because item-based recommendations can use the item neighborhood to explain the results in the form of "you bought this, so these are the recommended items". The item neighborhood can also be useful for suggesting product bundles to maximize sales. On the other hand, user-based methods’ recommendations usually cannot be explained directly because neighbor users are anonymized for privacy reasons.

Additionally, item-based methods may only recommend items very similar to what the user already liked, whereas user-based methods often recommend a more diverse set of items. This can encourage users to try new items and potentially keep their engagement and interest.

Another significant difference between these approaches is related to ratings. Calculating the similarity between users to predict ratings may be misleading because users may rate items in a different manner. When you present a range of values to the user, he/she might interpret them differently. For instance, in a 5-star rating system, a user may rate an item as 3 because it does what it is expected to do and nothing more, while others might use 3 to rate an item that barely works. Some users rate items highly and others rate items less favorably. To address this issue, the ratings should be mean centered by the user, meaning the user’s mean rating is subtracted from their raw rating, and the target user’s mean rating is added to the calculation, as in the example below:

Neighborhood Models in Practice

Let’s say we have the following small sample of a user-item matrix, where items are from a digital commerce store. Notice there are missing ratings, which means users typically do not rate all products.

Note: Product images extracted from Amazon Marketplace

To show how the algorithm works in practice, let’s assume we have built an item-based model. Note that the steps of the algorithm would be analogous to the user-based model, except for the perspective changes and focus on similarities between rows (users).

Remember that neighborhood CF algorithms rely on the ratings and similarity between items/users, so the first step is to define which similarity metric to use. One of the most common choices is the Pearson similarity, which measures how correlated a pair of vectors are. The range of values scales from -1 to 1, where those values indicate negative and positive correlations, respectively, and 0 indicates no correlation between vectors. This is the Pearson similarity equation for item-based models:

During this first phase, it’s usual to precompute the similarity matrix beforehand to obtain a good performance during inference time. In the case of item-based models, an item-item similarity matrix is built by applying the similarity metric between all pairs of items. Since the matrix is sparse, we only consider the set of mutually rated pairs of items during the similarity computation. For instance, the similarity between items from columns 1 and 4 of the image above will be computed as the similarity between vectors [4,3,5] and [5,3,4]. It’s possible that a pair of items may show no co-ratings by users due to the sparsity of the matrix, resulting in an empty set. In that case, a value of 0 similarity is assigned for that pair. To improve computational efficiency, it is common to consider only the k nearest neighbors of an item during inference time.

Let’s say we want to predict how Madison rated the Animal Farm book and we defined k=2 as the number of nearest neighbors to consider during calculation. To simplify the example, we will only manually calculate the similarities between the target item and items from columns 2 and 4 because they are the nearest neighbors for this item. When calculating the mean rating during similarity computation, we will consider only the set of ratings that are mutually exclusive between items.

The image below shows how the neighborhood is formed. The circle in red is the value we’re trying to predict. The squares in green are ratings from Madison that are going to be used to infer the rating for the target item. The other two ratings marked with an X are not considered because k=2. The rectangles in orange show a set of mutually exclusive ratings between the target item and the item from column 2, while the rectangles in blue show the same, but for the common ratings between the target item and item from column 4.

These are the common set of ratings between the target item (item 3) and the first neighbor (item 2): [4,3,3] and [4,4,3]. The first step is to calculate the mean in each set:

The Pearson similarity formula centers the ratings by their mean, so we can transform this vector and then plug the results into the equation:

To simplify the calculations, we separate the numerator and denominator:

Then finally compute the similarity between items 3 and 2.

The same calculation is done for the similarity between items 3 and 4:

Next, we calculate the mean for each item, considering all the item’s ratings:

Then, we can plug in the values we found together with Madison’s ratings for Items 2 and 4 (1 and 3, respectively) in the equation below:

So,

Since ratings are discrete numbers, we round this value to 2. It’s important to note that in a real-world setting, it’s often recommended to use neighborhood methods only when k is above a certain threshold because, when the number of neighbors is small, the predictions are usually not precise. An alternative would be to use Content-based filtering when we do not have enough data about the user-item relationship.

Online and Offline Phases

Neighborhood-based methods separate the computations into two phases: offline, where the model is fitted; and online, where inferences are made. In the offline phase, the user-user (or item-item) similarity values are precomputed, and the k most similar users or items are predetermined. These pre-computed values are leveraged to make fast predictions during the online phase.

Although good online performance is a major benefit of these methods, there is also a known disadvantage: the similarity matrix can get huge depending on the number of users/items in the system, causing the offline phase to not scale well. In addition to that, as the methods do not traditionally adapt to change, they need to update the precomputed similarities and nearest neighbors to account for new users and items, which makes the retraining process even more challenging.

The Cold-Start Problem

As previously stated, the effectiveness of Neighborhood-based CF algorithms depends on user-item interaction data. However, the user-item matrix is often sparse, with many missing values, as most users only interact with a small fraction of the items. This sparsity can lead to inaccurate recommendations due to the small neighborhood size.

This challenge is known as the cold-start problem, which is more apparent when new users or items enter the system. In this setting, the algorithm does not have enough data to form the nearest neighbors, so, the system cannot make useful recommendations.

Another important property of the user-item matrix is that the distribution of ratings among items displays a long-tail pattern. This means that only a small subset of items receives a significant number of ratings and are considered popular, while most items receive few ratings or no ratings at all. As a result, it is difficult to make precise predictions about items in the long tail using these methods, and that can be a problem because items that are less rated may provide large profit margins. This is something that is explored by Chris Anderson in his book, “The long tail”. This problem can also result in a lack of diversity in recommendations because the algorithm will usually only recommend popular items.

To address these limitations to some extent, alternative algorithms can be used, such as matrix factorization and hybrid algorithms, which combine CF with Content-based Filtering methods. The next blog posts of this series will explore these topics in greater detail.

Conclusion

In this blog post, we discussed Collaborative Filtering, focusing on Neighborhood-based methods. We have seen the differences between user-based and item-based models, and how the algorithm works in practice. Some of these methods' strengths and weaknesses were addressed too. Remember to check out the Jupyter Notebook to see a Python implementation that displays the concepts explored in this post.

References

This blog post is mostly inspired by Chapter 2 of the book [Aggarwal, C. C. (2016). Recommender Systems: The Textbook]. The Surprise Library Documentation (https://surprise.readthedocs.io/en/stable/) was also a great reference for mathematical equations related to the topic. The book [Anderson, C. (2006). The long tail: Why the future of business is selling less or more] is also an interesting read for understanding the potential profitability of selling items in the long tail.

Acknowledgment

This piece was written by Daniel Pinheiro Franco, Innovation Expert at Encora’s Data Science & Engineering Technology Practices group. Thanks to João Caleffi and Caio Dadauto for their reviews and insights.

About Encora

Fast-growing tech companies partner with Encora to outsource product development and drive growth. Contact us to learn more about our software engineering capabilities.