Key concepts related to recommendation systems

Before showing you how to build a recommendation engine in R, I need to get you up to speed on the concepts behind how recommendation engines work.

What’s a recommendation engine?

In case you’re totally new to marketing data science, let me illustrate the recommendation engine concept a little before proceeding. You know how, when you buy something on Amazon, you see related products under a heading like ‘People who purchased this item also purchased…’? Those recommendations are made automatically by a decision engine that sits on the back end of the platform. Today I am going to show you how to build an engine that functions in a similar way.

Before getting into details about how recommendation engines work, let’s take a step back and refresh our memories about what, exactly, a recommendation engine is. In essence, a recommendation engine is an automated decision engine that evaluates similarities between people (i.e., “users”) and/or items in order to make recommendations about what items go well together. Although the underlying methods behind recommendation engines can be used for a variety of applications, the most common application is in ecommerce. In this application, the recommendation engine identifies items that users have a high propensity to consume, and recommends those items to only the most appropriate users.

Now, with respect to marketing science, recommendation systems have been a breathtaking disruption to traditional cross-selling strategies. They’ve allowed us to drive conversion rates up by automating the identification and recommendation of related products. In ecommerce this represents a true win-win, where buyers are satisfied because they get an ideal combination of products, and sellers are happy because they enjoy more sales and a higher ROI.

The go-to use case: Netflix movie recommendations

The go-to use case for recommendation engines is the Netflix recommender. In fact, Netflix runs many layers of recommendations, each operating according to its own unique set of instructions, if you will. But Netflix really broke ground with the Netflix Prize, an open competition that it launched in 2006 and awarded in 2009.

In this competition, participants were basically tasked with predicting the ratings users would give to films, based on those users’ previous ratings of films they’d already seen. These predictions can then be used as a basis for making recommendations to those users, provided the predictions are accurate enough. In that competition, the winning team developed a recommendation algorithm that performed 10% better than Netflix’s existing algorithm. In return, the team bagged $1 million in cash, just to give you a general idea of the algorithm’s worth to Netflix.

First, you’ve got to understand collaborative filtering

Many recommendation engines utilize collaborative filtering. As the name suggests, collaborative filtering is a method that uses data from other people (or “users” on the platform) to make its predictions. Collaborative filtering can work a few different ways. Here’s one: a collaborative filtering algorithm could ‘filter’ the purchases that users made in the past to generate, and then recommend, a list of items that go well together. In this example, items that do not occur together frequently enough in past purchasing data would be dropped from the list, and the recommendation engine would make recommendations from a final set of items that have a strong history of being purchased together.
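To make that ‘filtering’ idea concrete, here’s a minimal sketch in base R. The orders, the item names, the min_support threshold and the recommend_with() helper are all made up for illustration; a production engine would be far more involved.

```r
# Toy purchase history: each element is one (hypothetical) order.
baskets <- list(
  c("tent", "sleeping bag", "stove"),
  c("tent", "sleeping bag"),
  c("tent", "stove"),
  c("sleeping bag", "lantern"),
  c("tent", "sleeping bag", "lantern")
)

items <- sort(unique(unlist(baskets)))

# Binary order-by-item matrix: 1 if the order contains the item.
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items

# Item-by-item co-occurrence counts (how often two items share an order).
cooccur <- crossprod(incidence)
diag(cooccur) <- 0

# The 'filter' step: drop pairs that co-occur fewer than min_support times.
min_support <- 2
cooccur[cooccur < min_support] <- 0

# Recommend the items most often bought together with a given item.
recommend_with <- function(item, n = 2) {
  scores <- sort(cooccur[item, ], decreasing = TRUE)
  names(scores[scores > 0])[seq_len(min(n, sum(scores > 0)))]
}

recommend_with("tent")
```

In this toy data the lantern was bought with the tent only once, so it falls below the threshold and never gets recommended alongside it.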

Recommendation algorithms generally make recommendations based on one of two types of collaborative filtering: user-based collaborative filtering or item-based collaborative filtering. I’ll define these in terms of movie recommendation systems, like those used by Netflix.

User-based collaborative filtering systems: A user-based recommendation engine recommends movies based on what other users with similar profiles have watched and liked in the past. As an example of a user-based recommender, imagine there is a movie lover who watches movies regularly, every Friday evening. He’s an unmarried man who’s also a working professional. A user-based recommender could look up movie recommendations based on what other unmarried, professional men who watch movies regularly have liked.
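Here’s a rough sketch of the user-based idea in base R. One assumption to flag: rather than demographics, it measures how “similar” two users are from the ratings they’ve given, which is how user-based collaborative filtering is typically implemented. The users, ratings and helper functions are invented for illustration.

```r
# Toy user-by-movie rating matrix (NA = not rated). All values are made up.
toy_ratings <- matrix(
  c(5, 4, NA, 1,
    4, 5,  5, 1,
    1, 2,  1, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ann", "bob", "cat"), c("m1", "m2", "m3", "m4"))
)

# Cosine similarity between two users over the movies both have rated.
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Predict a user's rating for a movie as the similarity-weighted average
# of the ratings given by the other users who have rated it.
predict_rating <- function(user, movie) {
  others <- setdiff(rownames(toy_ratings), user)
  others <- others[!is.na(toy_ratings[others, movie])]
  sims <- sapply(others, function(o) cosine_sim(toy_ratings[user, ], toy_ratings[o, ]))
  sum(sims * toy_ratings[others, movie]) / sum(sims)
}

predict_rating("ann", "m3")
```

Ann’s ratings look much more like Bob’s than Cat’s, so Bob’s five-star rating for m3 counts for more in the prediction than Cat’s one-star rating.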

Item-based collaborative filtering systems: An item-based recommender would make recommendations based on similarities between movies; in other words, it would recommend movies that are similar to ones that a user already likes. As an example of this, imagine you watch the movie ‘Kung Fu Panda’ and you like it, so you give it five stars. An item-based collaborative filtering system would then look into similar movies from the same genre (perhaps animated, fighting, comedy or based on a similar storyline) and then recommend similar movies to you based on the preference you indicated when you gave ‘Kung Fu Panda’ five stars. In fact, an item-based collaborative filtering system can make recommendations based on any variety of common elements, such as movies about pandas, movies from the same producers, directors, etc. For this example, it’s most likely that the primary suggestions will include ‘Kung Fu Panda 2’ and ‘Kung Fu Panda 3’, followed by other similar movies.
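Below is a small base R sketch of item-to-item similarity. One caveat: in a collaborative filtering setup, “similar” is usually computed from how users rated the movies rather than from genre labels, and that’s the assumption this sketch makes. The ratings, the placeholder titles ‘Movie C’ and ‘Movie D’, and the similar_to() helper are all hypothetical.

```r
# Toy user-by-movie ratings (NA = not rated). All values are made up.
toy_ratings <- matrix(
  c(5,  5, 1, NA,
    4,  5, 2,  1,
    1,  2, 5,  4,
    5, NA, 1,  2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4),
                  c("Kung Fu Panda", "Kung Fu Panda 2", "Movie C", "Movie D"))
)

# Cosine similarity between two movies over the users who rated both.
item_cosine <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  sum(x[both] * y[both]) / (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

# Full movie-by-movie similarity matrix.
toy_movies <- colnames(toy_ratings)
item_sim <- sapply(toy_movies, function(i)
  sapply(toy_movies, function(j) item_cosine(toy_ratings[, i], toy_ratings[, j])))

# The n movies whose rating pattern is closest to the given movie.
similar_to <- function(movie, n = 1) {
  sims <- item_sim[movie, setdiff(toy_movies, movie)]
  names(sort(sims, decreasing = TRUE))[seq_len(n)]
}

similar_to("Kung Fu Panda")
```

Because the same users who loved ‘Kung Fu Panda’ also rated the sequel highly, the sequel comes out as its nearest neighbour without the code knowing anything about genres.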

If only life were that simple

OK, so now that you understand the basics of collaborative filtering algorithms, we need to add a little complexity to the discussion. Don’t worry, I’m going to be showing you how to build a recommendation engine in R very soon!!

If you think about it, it makes absolute sense that a person needs to do more than just watch a movie in order for that movie to qualify for being recommendable to other users. After all, the user may have seen the movie and hated it. In this case, recommending the movie to similar users is a really bad idea.

Instead of just looking at how many times a movie was viewed, we’ve actually got to take into account the rating that each of the users gave the movie (aka “movie ratings data”). Doing this allows us to identify movies that similar users have enjoyed, in order to filter our movie recommendations accordingly. Now the recommendations will only include movies that are rated highly by other similar users.
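A quick sketch of why view counts alone can mislead. The viewing log, the rating threshold of 4 and the minimum of two views are all arbitrary assumptions chosen for illustration.

```r
# Hypothetical viewing log: one row per view, with the rating the viewer gave.
views <- data.frame(
  movie  = c("A", "A", "A", "A", "B", "B", "C"),
  rating = c(2, 1, 2, 1, 5, 4, 5)
)

# Ranking by view count alone puts movie A first...
view_counts <- sort(table(views$movie), decreasing = TRUE)

# ...but the mean rating tells a different story.
mean_ratings <- tapply(views$rating, views$movie, mean)

# Keep only movies that are rated highly AND have been seen by enough users.
min_rating <- 4
min_views <- 2
counts <- table(views$movie)
keep <- names(mean_ratings)[
  as.vector(mean_ratings) >= min_rating &
    as.vector(counts[names(mean_ratings)]) >= min_views
]

keep
```

Movie A is the most viewed but poorly rated, and movie C has a great rating from a single viewer, so only movie B survives both filters.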

Real-life recommenders that are in production on ecommerce platforms are usually quite complex. They almost always hybridize the two collaborative techniques we’ve discussed. These recommendation engines may, for example, suggest a movie based on what other users with similar profiles have enjoyed, and then further order the recommendations based on how similar those movies are to the movie you watched last. My point here is that all recommendation engines really have their own utility in different situations, so decisions about the best logic to use require solid reasoning on behalf of the data scientist or machine learning engineer who’s developing it.

Where machine learning fits in

Both recommendation methods we’ve discussed can use clustering as the backbone, although there are other machine learning algorithms that may be better suited for the job, depending on your project requirements. Clustering algorithms allow you to group users and items based on similarity, so these are an easy fit. Another way to make recommendations, though, might be to focus on what’s dissimilar between users and/or items. Suffice it to say, the machine learning algorithms you choose depend heavily on your project specifics.
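As a rough illustration of the clustering route, here’s a sketch that groups users with hierarchical clustering from base R’s stats package. The ratings are made up, and the choice of hclust() with average linkage and two clusters is just one reasonable assumption among many.

```r
# Toy user-by-movie ratings with no missing values, to keep the sketch simple.
toy_ratings <- matrix(
  c(5, 4, 1, 1,
    4, 5, 2, 1,
    1, 1, 5, 4,
    2, 1, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), paste0("movie", 1:4))
)

# Hierarchical clustering on the Euclidean distances between users' rating rows.
user_tree <- hclust(dist(toy_ratings), method = "average")

# Cut the tree into two groups of like-minded users.
clusters <- cutree(user_tree, k = 2)
clusters

# Average rating per movie within each cluster: a simple basis for
# recommending what a user's own cluster tends to like.
cluster_means <- apply(toy_ratings, 2, function(col) tapply(col, clusters, mean))
cluster_means
```

Users 1 and 2 end up in one group and users 3 and 4 in the other, and each group’s average ratings point to a different pair of movies.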

Don’t forget about the content-based recommenders

There’s another type of recommendation system we have yet to discuss: content-based recommendation systems. Content-based recommenders are an alternative approach that you can use when there is not much data available. Since the speed of the algorithm used in content-based recommenders depends on the dataset’s size, these methods aren’t appropriate for large datasets.

One big advantage of content-based recommenders is that you can use them to start recommending new items that have not yet accrued user ratings (addressing what’s known as the “cold start” problem). This is particularly helpful for getting new products in front of your user base, so they can gain traction more quickly.
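Here’s a minimal sketch of how a content-based recommender can score a brand-new item. The genre flags, the movie names and the single user’s ratings are all hypothetical, and building the user profile as a rating-weighted average of genre vectors is one simple choice among several.

```r
# Hypothetical genre flags for each movie (1 = the movie has that genre).
genres <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    0, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("old_A", "old_B", "old_C", "new_D"),
                  c("animation", "comedy", "drama"))
)

# One user's ratings for the movies they have seen. new_D has no ratings at all.
user_ratings <- c(old_A = 5, old_B = 4, old_C = 1)

# User profile: rating-weighted average of the genre vectors of rated movies.
profile <- colSums(genres[names(user_ratings), ] * user_ratings) / sum(user_ratings)

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Score the unrated newcomer against the profile, and compare it with the
# drama the user disliked.
score_new <- cosine(profile, genres["new_D", ])
score_drama <- cosine(profile, genres["old_C", ])
c(new_D = score_new, old_C = score_drama)
```

Nobody has rated new_D yet, but because its genres match what this user has rated highly, it still gets a high score.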

That said, collaborative filtering systems have many advantages over content-based recommenders. These advantages include:

They can handle huge, high-dimensional datasets.
They can suggest niche items (items popular among only a specific segment of users).
They can suggest items that may be from a completely different product category altogether.
Based on the type of data you have, a collaborative filtering system can suggest items purchased by similar users solely depending upon their ratings for these items.

Alright, this concludes our discussion on recommendation engine concepts. Now I’m going to show you how to build a recommendation engine in R.

How to build a recommendation engine in R

Phew, that was a lot! But if you’ve made it this far then you should be ready to begin looking at how to build a recommendation engine in R.

The coding demonstration

In the following demo, we’ll use the famous MovieLens dataset that’s been made available by GroupLens Research. The full dataset consists of 20,000,000 ratings of about 27,000 movies, made by 138,000 users. The data can be downloaded from the website here: https://grouplens.org/datasets/movielens/.

This dataset is quite large, about 190 MB. Luckily, the website also hosts smaller versions of the MovieLens data, with 100,000 ratings, 1 million ratings and 10 million ratings. Let’s keep it simple by using the 100,000-ratings data, which is only 1 MB. With the download, you get a zipped file containing a readme and the movies data, with separate links, tags and ratings files.

Here is the link to the dataset used in the demo: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

So, how to build a recommendation engine in R… Starting with the reading step, let’s read in all our datasets and build a ratings matrix:

R

## Demo: How to build a recommendation engine in R
## setwd("C:/Users/User/Desktop/Data-Mania Blog Coding Demos/Recommendation Engine in R")

# Read all the datasets
movies = read.csv("movies.csv")
links = read.csv("links.csv")
ratings = read.csv("ratings.csv")
tags = read.csv("tags.csv")

# Import the reshape2 library. Uncomment the following lines if the packages are not already installed
# install.packages("reshape2", dependencies=TRUE)
# install.packages("stringi", dependencies=TRUE)
library(stringi)
library(reshape2)

# Create ratings matrix with rows as users and columns as movies. We don't need timestamp
ratingmat = dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)

# We can now remove user ids
ratingmat = as.matrix(ratingmat[,-1])
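If you’d like to see what this long-to-wide reshape produces without loading the real files, here’s a tiny sketch on made-up data. Note that it swaps in base R’s tapply() so that it runs with no extra packages; it is not the dcast() call from the demo, and the toy_long data frame is hypothetical.

```r
# A made-up 'long' ratings table: one row per (user, movie) rating.
toy_long <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# Pivot to a user-by-movie matrix. Cells with no rating come out as NA.
toy_wide <- with(toy_long, tapply(rating, list(userId, movieId), mean))
toy_wide

# Share of user/movie combinations that have no rating.
sparsity <- mean(is.na(toy_wide))
sparsity
```

Even in this tiny example, four of the nine cells are empty; with real data, where each user rates only a sliver of the catalogue, the share of empty cells is far higher.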

The recommendation package in R we’ll use is recommenderlab. It provides us with a User-Based Collaborative Filtering (UBCF) model. For similarity among user ratings, we have a choice to calculate similarity according to the following methods:

Jaccard similarity
Cosine similarity
Pearson similarity

In this example, we’ll use the cosine similarity metric.
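To get a feel for how these three measures differ, here’s a hand-rolled sketch in base R on two made-up rating vectors. It computes cosine and Pearson over the co-rated movies only, which is an assumption on my part; recommenderlab’s own handling of missing values may differ in detail.

```r
# Two users' ratings for the same five movies (NA = not rated). Made-up values.
u <- c(5, 3, NA, 4, 1)
v <- c(4, NA, 2, 5, 2)

# Jaccard: overlap in WHICH movies were rated, ignoring the rating values.
jaccard <- sum(!is.na(u) & !is.na(v)) / sum(!is.na(u) | !is.na(v))

# Cosine and Pearson: computed here on the movies both users rated.
both <- !is.na(u) & !is.na(v)
cosine <- sum(u[both] * v[both]) / (sqrt(sum(u[both]^2)) * sqrt(sum(v[both]^2)))
pearson <- cor(u[both], v[both])

round(c(jaccard = jaccard, cosine = cosine, pearson = pearson), 3)
```

Pearson subtracts each user’s average before comparing, so it is less affected than cosine when one user simply rates everything higher than the other.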

R

# Uncomment the following line if the package is not installed
# install.packages("recommenderlab", dependencies=TRUE)
library(recommenderlab)

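First, we want to reduce the size of our ratings matrix to make computation faster. On my machine, ratingmat takes up about 46.9 MB. This size is due to the large number of missing values (NAs) in the matrix: most users have rated only a small fraction of the movies, so it’s a “sparse matrix” that’s being stored in a wasteful dense format. Let’s convert it to recommenderlab’s realRatingMatrix class, which stores only the ratings that actually exist.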

R

# Convert the ratings matrix to a realRatingMatrix, which stores it in a compact sparse format
ratingmat = as(ratingmat, "realRatingMatrix")
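To see why storing only the observed ratings saves so much memory, here’s a base R sketch on simulated data. It illustrates the general idea with a plain (user, movie, rating) table; it is not a description of how realRatingMatrix is implemented internally, and the matrix dimensions and fill rate are arbitrary assumptions.

```r
set.seed(1)  # reproducible toy data

n_users <- 500
n_movies <- 2000

# A mostly empty user-by-movie matrix stored the 'wasteful' way.
dense <- matrix(NA_real_, n_users, n_movies)
filled <- sample(length(dense), 10000)  # 1% of the cells get a rating
dense[filled] <- sample(1:5, length(filled), replace = TRUE)

# Triplet form keeps only the observed cells: (user, movie, rating).
idx <- which(!is.na(dense), arr.ind = TRUE)
triplets <- data.frame(user = idx[, "row"], movie = idx[, "col"], rating = dense[idx])

format(object.size(dense), units = "MB")
format(object.size(triplets), units = "MB")
```

The dense matrix pays for every one of its million cells, empty or not, while the triplet table only pays for the 10,000 ratings that exist.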

This step immediately reduced the size of the matrix to 1.7 MB on my machine, which is much, much smaller. Now let’s normalize the matrix so that differences in how generously individual users rate don’t bias our recommendations.

R

# Normalize the ratings matrix
ratingmat = normalize(ratingmat)
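Here’s a small base R sketch of what normalizing by centering does. It assumes the normalization subtracts each user’s mean rating, which as far as I know is recommenderlab’s default behaviour; check ?normalize to confirm. The two users and their ratings are made up.

```r
# Two made-up users: one rates generously, the other harshly (NA = not rated).
toy_ratings <- matrix(
  c(5, 4, NA,  5,
    2, 1,  3, NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("generous", "harsh"), paste0("movie", 1:4))
)

# Each user's average rating over the movies they actually rated.
user_means <- rowMeans(toy_ratings, na.rm = TRUE)

# Subtract each user's own mean from that user's ratings.
centered <- sweep(toy_ratings, 1, user_means)
centered
```

After centering, a 3 from the harsh user counts as a strong positive signal (+1), while a 4 from the generous user is slightly below that user’s own average.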

The Recommender() function in the recommenderlab package is the underlying recommendation model we’re using here.

You may want to use the help function for this recommender to learn more about it. To do this, just enter the ‘?Recommender’ command in R.

R

# Create Recommender Model. The parameters are UBCF and Cosine similarity. We take 10 nearest neighbours
rec_mod = Recommender(ratingmat, method = "UBCF", param=list(method="Cosine",nn=10))
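To give a sense of what the nn parameter controls, here’s a simplified sketch of a nearest-neighbour prediction in base R. The similarity scores, ratings and user labels are invented, and the similarity-weighted average is one common way to combine neighbours’ ratings; I’m not claiming it matches recommenderlab’s internals step for step.

```r
# Made-up similarity scores between one target user and six other users.
sims <- c(u2 = 0.91, u3 = 0.15, u4 = 0.78, u5 = 0.40, u6 = 0.88, u7 = 0.05)

# Their made-up ratings for a movie the target user has not seen (NA = unrated).
their_ratings <- c(u2 = 5, u3 = 1, u4 = 4, u5 = NA, u6 = 4, u7 = 2)

# Keep only the nn most similar users.
nn <- 3
neighbours <- names(sort(sims, decreasing = TRUE))[seq_len(nn)]
neighbours

# Similarity-weighted average over the neighbours who rated the movie.
rated <- neighbours[!is.na(their_ratings[neighbours])]
prediction <- sum(sims[rated] * their_ratings[rated]) / sum(sims[rated])
prediction
```

A smaller nn listens only to the closest matches, while a larger nn smooths the prediction over more users at the risk of including weaker matches.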

Now that we’ve built our model, let’s make some predictions.

Starting with the first user:

R

# Obtain top 5 recommendations for 1st user entry in dataset
Top_5_pred = predict(rec_mod, ratingmat[1], n=5)

At this point, we’ve created recommendations for the first user, but we can’t see them. That’s annoying.

To see the predictions our model made, we’ll convert them to a list and print them out:

R

# Convert the recommendations to a list
Top_5_List = as(Top_5_pred, "list")
Top_5_List

"47"   "893"  "1769" "2567" "3423"

As you can see, we get movie recommendations… but alas, they’re in movieId number format.

Let’s take a look at the movie names that correspond to these numbers. We’ll do this by using the movies dataset, which maps movie ids to movie titles.

R

# Uncomment the following line if the package is not installed
# install.packages("dplyr")
library(dplyr)

# We convert the list to a dataframe and change the column name to movieId
Top_5_df = data.frame(Top_5_List)
colnames(Top_5_df) = "movieId"

# Since movieId is of type integer in the movies data, we typecast the ids in our recommendations as well.
# Going through as.character() works whether the column is a factor (R < 4.0) or character (R >= 4.0)
Top_5_df$movieId = as.numeric(as.character(Top_5_df$movieId))

# Merge the movie ids with names to get titles and genres
names = left_join(Top_5_df, movies, by="movieId")

# Print the titles and genres
names

  movieId                           title                genres
1      47     Seven (a.k.a. Se7en) (1995)      Mystery|Thriller
2     893             Mother Night (1996)                 Drama
3    1769 Replacement Killers, The (1998) Action|Crime|Thriller
4    2567                     EDtv (1999)                Comedy
5    3423              School Daze (1988)                 Drama

Based on similarity between users, for the first user, our model initially recommends the above movies.

In our results, you can see:

The year that the movie was released.
The movie genres.
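One detail from the id-to-title step is worth a closer look: converting id strings to numbers. The sketch below uses the five ids from the demo to show how a factor column can quietly reorder them. Which behaviour you hit depends on your R version, since data.frame() stopped creating factors by default in R 4.0.0.

```r
ids <- c("47", "893", "1769", "2567", "3423")

# As a factor (what data.frame() produced by default before R 4.0.0).
ids_factor <- factor(ids)

# levels() returns the distinct values sorted as text, so the order changes.
as.numeric(levels(ids_factor))

# Going through as.character() keeps the original order...
as.numeric(as.character(ids_factor))

# ...and also works when the column is already character.
as.numeric(as.character(ids))
```

Sorted as text, “1769” comes before “47”, so the levels() route lists the recommendations in a different order than the model returned them.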

With further data processing and filtering, we could probably improve the relevance of the recommendations, so that years and genres are even more similar. Congratulations, though!! You now know the basics of how to build a recommendation engine in R.

TECHCEPTRON

Saturday, July 28, 2018