1 billion dollars per year. That’s how much Netflix’s Chief Product Officer Neil Hunt estimates the company saves per year thanks to their global recommendation system. No wonder you’ve found yourself searching for how to build a recommendation engine in R! They’re valuable commodities!

From tech giants like Netflix to Amazon to YouTube, enterprises all over the world are recognizing the importance of recommendation engines in order to keep their customer base engaged and their conversions high. And they’re looking for data professionals like you to build them.

Here at Data Mania, we help give data pros a leg up by helping them make (and save) money for the corporations they serve, so they can advance their data career and get the promotion and raise they deserve. We were tired of seeing so many data scientists launch initiatives that flop, and having their superiors fail to see the ROI their projects could bring in.

While you may be a pro at learning all the different algorithms and technologies, finding a clear, consistent way to combine them to increase business profits is where the magic happens (and where most people struggle!). That’s why my team and I embarked on a 6-month AI research crusade to study the most successful, reproducible data projects across the market.

The end product of that expedition? A collection of winning data-strategy case collections to kick off your next profitable data project. Get it inside here ⬇️.

Let’s get started today with helping you build a recommendation engine in R that’s sure to help your company’s customer retention and profitability. First I’m going to provide you a conceptual overview of the topic, and then I’ll show you the exact steps on how to build a recommendation engine in R. 

Make sure to stick around to the end for another show-stopper of a recommendation on how to get ahead FAST in your data science career.

Ready? Let’s get started.

 

KEY CONCEPTS RELATED TO RECOMMENDATION SYSTEMS

Before showing you how to build a recommendation engine in R, let’s get you up-to-speed on the concepts behind how recommendation engines work.

What’s a recommendation engine?

[disclaim]If you’re a developer who’s here just to see how the code works… I understand! CLICK HERE TO SKIP THE INTRO and go straight to where I show you the code for how to build a recommendation engine in R.[/disclaim]

In case you’re a total newbie to marketing data science, let’s get a little clearer on the concepts of recommendation engines and how they’re used. Let’s take Amazon as an example. Every time you go to buy something on Amazon, under the product you’ll see the heading ‘People Who Purchased This Item Also Purchased…’ (or something along those lines) with a selection of products underneath. Those recommendations are made automatically by a decision engine that sits on the backend of the platform. Today, you’re going to learn the exact steps to build an engine that functions the same way.

Before getting into the nitty-gritty about how recommendation engines work, let’s first take a step back and refresh our memory about what exactly a recommendation engine is. In essence, a recommendation engine is an automated decision engine that evaluates similarities between people (i.e., “users”) and/or items in order to make recommendations about what items go well together. The underlying methods behind recommendation engines can be used for a variety of applications, but the most common application is e-commerce. In this application, the recommendation engine identifies items that have a high propensity for user consumption, and recommends those items to only the most appropriate users.

When it comes to marketing science, recommendation systems have been a breathtaking disruption to traditional cross-selling strategies. They’ve allowed us to drive conversion rates up significantly by automating the identification and recommendation of related products. In e-commerce this represents a true win-win, where buyers are satisfied because they get an ideal combination of products, and sellers are happy because they enjoy more sales and a higher ROI. What’s not to love?! 😉

If you’re looking for more ways to increase sales conversions in e-commerce using data initiatives, check out our “Marketing Improvements” Route within our Winning With Data Case Collections, which takes an insider’s look into the marketing success of businesses like The North Face, Caesars Entertainment, and Godiva.

The go-to case study: Netflix movie recommendations

The go-to case study for recommendation engines is the Netflix recommender that I mentioned above. In fact, Netflix runs many layers of recommendations, each operating according to its own unique set of instructions. But Netflix really broke ground with its recommendations through the Netflix Prize, an open competition it launched in 2006 and awarded in 2009. In the competition, participants were asked to predict the ratings users would give to films, using the ratings those users had already given to other films. The grand prize would go to the first team whose predictions beat Netflix’s existing algorithm by 10%. In the end, a team did exactly that, bagging $1 million in cash (not bad, right?! 💥)… just to give you a general idea of how much these algorithms are worth to Netflix.

First things first: understanding collaborative filtering

Many recommendation engines use collaborative filtering. Like the name suggests, collaborative filtering uses data from other people (or “users” on the platform) to make its predictions. Collaborative filtering can work a few different ways. One possible way to use a collaborative filtering algorithm could be to ‘filter’ similar purchases users made in the past to generate and then recommend a list of items that go well together in combination. In this example, items that are not frequently purchased together would be excluded from the list, and the engine would make recommendations from a final set of items that have a history of being purchased together.
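To make the “purchased together” idea a bit more concrete, here’s a minimal sketch in base R. The purchase data, the item names and the also_bought() helper are all made up for illustration; a real engine would work from your actual order history and would likely use a more refined score than raw counts.

```r
# Toy purchase history (hypothetical data): rows are users, columns are items,
# and a 1 means the user bought the item.
purchases <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 1, 0,
    1, 1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4),
                  c("tent", "sleeping_bag", "lantern", "novel"))
)

# Item-by-item co-purchase counts: entry [i, j] is the number of users
# who bought both item i and item j.
co_purchase <- crossprod(purchases)
diag(co_purchase) <- 0  # an item should not be recommended alongside itself

# Hypothetical helper: items bought together with `item` at least
# `min_count` times, most frequent first. Rarely co-purchased items drop out.
also_bought <- function(item, min_count = 2) {
  counts <- sort(co_purchase[item, ], decreasing = TRUE)
  names(counts[counts >= min_count])
}

also_bought("tent")
```

Here the min_count threshold plays the role of the ‘filter’: pairs that are only occasionally bought together never make it into the recommendation list.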

2 Types of Collaborative Filtering Algorithms – User-based collaborative filtering and Item-Based Collaborative Filtering.

I’ll define these in terms of movie recommendation systems, using Netflix again as our trusty example.

A Screen Grab From My LinkedIn Learning Course:

Building a Recommendation System with Python

  • User-based collaborative filtering systems: A user-based recommendation engine recommends movies based on what other users with similar profiles have watched and liked in the past. As an example of a user-based recommender, imagine there’s a big movie buff who loves watching movies regularly, usually every Friday evening. He’s an unmarried man and a working professional. A user-based recommender would go in and look up movie recommendations based on what other unmarried, professional men who watch movies regularly have liked.
  • Item-based collaborative filtering systems: An item-based recommender would make recommendations based on similarities between movies; in other words, it would recommend movies that are similar to ones a user already likes. Say you watched the movie ‘Kung Fu Panda’ and you liked it so much you gave it five stars. An item-based collaborative filtering system would then look into similar movies from the same genre (perhaps animated, fighting, comedy or films with a similar storyline) and then recommend similar movies based on the preference you displayed when giving ‘Kung Fu Panda’ five stars.

In fact, item-based collaborative filtering systems can even make recommendations based on any variety of common elements, like movies about pandas, movies from the same producers, directors, etc…the possibilities are truly endless! In the case of Kung Fu Panda, it’s most likely that ‘Kung Fu Panda 2’ and ‘Kung Fu Panda 3’ will be suggested to the user, followed by other cases.
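Here’s a rough sketch of the difference between the two approaches when similarity is computed purely from ratings. The ratings, the user names and the cosine_sim() helper are hypothetical; the only point is that user-based filtering compares rows (users) of the ratings matrix, while item-based filtering compares its columns (movies).

```r
# Hypothetical ratings matrix: rows are users, columns are movies.
toy_ratings <- matrix(
  c(5, 4, 1,
    4, 5, 1,
    1, 2, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ana", "ben", "cy"),
                  c("Kung Fu Panda", "Kung Fu Panda 2", "Se7en"))
)

# Cosine similarity between every pair of rows of a matrix.
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

# User-based view: how alike are the users' rating patterns?
user_sim <- cosine_sim(toy_ratings)

# Item-based view: transpose first, so movies become the rows being compared.
item_sim <- cosine_sim(t(toy_ratings))

round(user_sim, 2)
round(item_sim, 2)
```

In this toy data, ana and ben come out as the most similar pair of users, and the two Kung Fu Panda films come out as the most similar pair of movies, which is the pattern each flavor of recommender would lean on.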

If only life were that simple

Now that you understand the basics about collaborative filtering algorithms, let’s go ahead and add a little complexity to the discussion. Don’t worry, I’m going to be showing you how to build a recommendation engine in R very soon!

If you really think about it, a user should have to do more than simply watch a movie in order for the film to qualify as being recommendable to other users. After all, the user may have seen the movie and absolutely hated it! If that were the case, recommending the movie to similar users could be a potentially terrible idea.

Instead of just looking at how many times a movie was viewed, we’ve actually got to take into account the rating each user gave the movie (aka “movie ratings data”). By doing this, we can see what movies similar users have enjoyed and use that data to filter our movie recommendations accordingly. Now, the recommendations will only include the movies that are rated highly by other similar users.
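As a small illustration of why views alone can mislead, here’s a sketch with made-up ratings from a handful of similar users. The data and the cut-off of 4 stars are assumptions chosen only for the example.

```r
# Hypothetical ratings that users similar to our target user gave to three movies.
neighbour_ratings <- data.frame(
  movie  = c("A", "A", "A", "B", "B", "C"),
  rating = c(1, 2, 1, 5, 4, 5)
)

# Counting views alone makes movie A look like the best candidate...
views <- table(neighbour_ratings$movie)
most_viewed <- names(which.max(views))

# ...but its average rating shows that the similar users disliked it.
avg_rating <- tapply(neighbour_ratings$rating, neighbour_ratings$movie, mean)

# Keep only the movies that similar users rated highly (4 stars or more on average).
recommended <- names(avg_rating[avg_rating >= 4])

most_viewed
recommended
```

Movie A is the most watched, yet it never reaches the recommendation list, because the users who watched it rated it poorly.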

Real-life recommenders that are in production on e-commerce platforms are usually quite complex. They almost always hybridize the two collaborative techniques we’ve discussed above. These recommendation engines may, for example, suggest a movie based on what other users with similar profiles have enjoyed, and then further order the recommendations based on how similar those movies are to the movie you last watched. My point here is that recommendation engines each have their own utility in different situations, so deciding on the best logic to use requires data scientists and machine learning engineers alike to use solid reasoning and sound strategy when planning initiatives.

The reality is, good implementation and coding skills alone simply aren’t enough to win with data these days. With 85% of data initiatives failing (according to Gartner), it’s time to step up your game and become a data leader to make sure your projects aren’t among those. How? Using data strategy! As a data educator who’s trained over 1 million workers on data science (through prestigious partnerships with LinkedIn Learning and Wiley & Sons Publishers), I’m ready to tell you what you need to know to reach that next rung up the data career ladder. Get started learning need-to-know data strategy skills here.

Where machine learning fits in

Both recommendation methods we discussed above (user-based collaborative filtering systems and item-based collaborative filtering systems) can use clustering as the backbone, although there are other machine learning algorithms that may be better suited for the job depending on your project requirements.

Clustering algorithms allow you to group users and items based on similarity, so these are an easy fit when building a recommendation engine. Another way to make recommendations might be to focus on what’s dissimilar between users and/or items. Needless to say, the machine learning algorithms you choose largely depend on the specifics of your unique project.
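To give you a feel for the clustering idea, here’s a minimal sketch that groups users by their rating patterns with hierarchical clustering from base R. The ratings are invented, and picking two clusters is an assumption that only makes sense for this tiny example.

```r
# Hypothetical ratings: two users who love action films, two who love dramas.
cluster_ratings <- matrix(
  c(5, 5, 1, 1,
    4, 5, 2, 1,
    1, 1, 5, 4,
    2, 1, 5, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4),
                  c("action1", "action2", "drama1", "drama2"))
)

# Group users whose rating vectors are close together (Euclidean distance).
user_clusters <- cutree(hclust(dist(cluster_ratings)), k = 2)

# The same call on the transposed matrix groups similar movies instead.
movie_clusters <- cutree(hclust(dist(t(cluster_ratings))), k = 2)

user_clusters
movie_clusters
```

Once users (or movies) are grouped like this, a simple recommender could suggest what is popular inside a user’s own cluster.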

Don’t forget about the content-based recommenders

Wait! There’s one more type of recommendation system we haven’t gotten around to yet – content-based recommendation systems. Content-based recommenders are an alternative approach you can use when you don’t have a ton of data available. They tend to slow down as the dataset grows, however, which makes them a poor fit for large datasets.

One advantage to content-based recommenders? You can use them to start recommending newer items that still don’t have user ratings (fixing what’s known as the “cold start” problem). This is helpful for getting new products out in front of your user base, so they can quickly begin to gain traction.
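Here’s a small sketch of how a content-based recommender can surface an item that nobody has rated yet. The movies, the genre flags and the cosine() helper are all hypothetical; the key point is that similarity is computed from item attributes rather than from user ratings.

```r
# Hypothetical genre flags per movie (1 = the movie has that genre).
# "New Release" has no user ratings yet, but it does have genre metadata.
genres <- matrix(
  c(1, 1, 1, 0,
    1, 1, 0, 0,
    0, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Kung Fu Panda", "New Release", "Courtroom Drama"),
                  c("animation", "comedy", "action", "drama"))
)

# Cosine similarity between two feature vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Score every movie against one the user already liked.
liked <- "Kung Fu Panda"
scores <- apply(genres, 1, cosine, b = genres[liked, ])
scores <- scores[names(scores) != liked]  # don't recommend the liked movie itself

best_match <- names(which.max(scores))
best_match
```

The unrated new release wins here purely because its genres overlap with a movie the user liked, which is how content-based methods sidestep the cold start problem.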

That being said, collaborative filtering systems still have a lot of advantages over content-based recommenders. These advantages include:

  • They can handle huge, high-dimensional datasets.
  • They can suggest niche items (items popular among only a specific segment of users).
  • They can suggest items which may be from a completely different product category altogether.
  • Based on the type of data you have, a collaborative filtering system can suggest items purchased by similar users, solely depending upon their ratings for these items.

By now, you should have a good grasp on recommendation engine concepts. It’s the time you’ve all been waiting for – I’m now going to show you how to build a recommendation engine in R!

HOW TO BUILD A RECOMMENDATION ENGINE IN R

Phew, that was a lot! But if you’ve made it this far then you should be ready to begin looking at how to build a recommendation engine in R.

The coding demonstration

In the following demo, we’ll use the famous MovieLens dataset that’s been made available by GroupLens Research. The full dataset consists of about 20 million ratings of roughly 27,000 movies by about 138,000 users. The data can be downloaded from the website here: https://grouplens.org/datasets/movielens/.

This dataset is fairly large, about 190 MB. Luckily, the website also hosts smaller versions of the MovieLens data, with 100,000 ratings, 1 million ratings, and 10 million ratings. Let’s keep it simple by using the 100,000 ratings data, which is only 1 MB. With the download, you get a zipped file containing a readme plus separate movies, links, tags and ratings files. Here is the link to the dataset used in the demo: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

So, how do you build a recommendation engine in R? Starting with the reading step, let’s read in all our datasets and build a ratings matrix:
##Demo: How to build a recommendation engine in R
## setwd("C:/Users/User/Desktop/Data-Mania Blog Coding Demos/Recommendation Engine in R")
#Read all the datasets
movies=read.csv("movies.csv")
links=read.csv("links.csv")
ratings=read.csv("ratings.csv")
tags=read.csv("tags.csv")
#Load the reshape2 library. Uncomment the install.packages() lines if the packages are not already installed
#install.packages("reshape2", dependencies=TRUE)
#install.packages("stringi", dependencies=TRUE)
library(stringi)
library(reshape2)

#Create ratings matrix with rows as users and columns as movies. We don't need timestamp
ratingmat = dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)

#We can now remove user ids
ratingmat = as.matrix(ratingmat[,-1])
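If it helps to see what that reshaping step is doing, here’s a tiny sketch that uses base R’s tapply() instead of dcast() on a few made-up ratings. The data frame below is hypothetical and only mimics the userId, movieId and rating columns of ratings.csv.

```r
# Hypothetical "long" ratings, one row per (user, movie) pair.
long_ratings <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 10, 20, 30),
  rating  = c(4, 5, 3, 2, 5)
)

# Pivot to a "wide" matrix: one row per user, one column per movie.
# Movies a user never rated show up as NA.
wide_ratings <- tapply(long_ratings$rating,
                       list(long_ratings$userId, long_ratings$movieId),
                       mean)

wide_ratings
```

Notice how many cells are NA even in this tiny example; with thousands of movies, almost every cell of the real ratings matrix is empty.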
The recommendation package in R we’ll use is recommenderlab. It provides us a User Based Collaborative Filtering (UBCF) model. For similarity among user ratings, we have a choice to calculate similarity according to the following methods:
  • Jaccard similarity
  • Cosine similarity
  • Pearson similarity
In this example, we’ll use the cosine similarity metric.
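To show how these three measures differ, here’s a quick base R sketch on two made-up users. The ratings are hypothetical, and I’m computing cosine and Pearson only over the movies both users rated, which is one common convention; recommenderlab’s internal handling of missing ratings may differ in the details.

```r
# Hypothetical ratings from two users; NA means "not rated".
u <- c(m1 = 5, m2 = 3, m3 = NA, m4 = 4)
v <- c(m1 = 4, m2 = NA, m3 = 2, m4 = 5)

# Jaccard: overlap in WHICH movies were rated, ignoring the rating values.
rated_u <- names(u)[!is.na(u)]
rated_v <- names(v)[!is.na(v)]
jaccard <- length(intersect(rated_u, rated_v)) / length(union(rated_u, rated_v))

# Cosine and Pearson: compare rating VALUES on the co-rated movies.
both <- !is.na(u) & !is.na(v)
cosine  <- sum(u[both] * v[both]) / (sqrt(sum(u[both]^2)) * sqrt(sum(v[both]^2)))
pearson <- cor(u[both], v[both])

c(jaccard = jaccard, cosine = cosine, pearson = pearson)
```

Cosine calls these two users nearly identical because both give high ratings, while Pearson comes out at -1 because their preferences on the two shared movies run in opposite directions. That’s a good reminder that the choice of metric can change who counts as a “similar” user.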
#Uncomment the following line if the package is not installed
#install.packages("recommenderlab", dependencies=TRUE)
library(recommenderlab)
First, we want to reduce the size of our ratings matrix to make computation faster. On my machine, ratingmat takes up about 46.9 MB. This size is due to the large number of missing values (NAs) in the matrix: most users haven’t rated most movies, so it’s a “sparse matrix” that’s being stored as if it were dense. Let’s convert it into a sparse format that stores only the ratings that actually exist.
#Convert the ratings matrix to a realRatingMatrix, which stores only the observed ratings
ratingmat = as(ratingmat, "realRatingMatrix")
This step immediately reduced the size of the matrix to 1.7 MB on my machine, which is much, much smaller. Now let’s normalize the matrix, centering each user’s ratings around their own average, so that generous and harsh raters don’t bias our recommendations.
#Normalize the ratings matrix
ratingmat = normalize(ratingmat) 
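If you’re curious what centering looks like, here’s a base R sketch on a made-up matrix. It assumes the default behavior of normalize() is to subtract each user’s mean rating (row centering); check ?normalize in recommenderlab to confirm the options available in your version.

```r
# Hypothetical ratings: one generous rater and one harsh rater.
raw <- matrix(
  c(5, 4, NA,
    2, NA, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("generous", "harsh"), c("m1", "m2", "m3"))
)

# Subtract each user's own average rating, ignoring missing values.
user_means <- rowMeans(raw, na.rm = TRUE)
centered <- raw - user_means

centered
```

After centering, a 5 from the generous rater and a 2 from the harsh rater both become +0.5, meaning “a bit better than this user’s usual rating”, so the two users can be compared on the same scale.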
The Recommender() function in the recommenderlab package is the underlying recommendation model we’re using here. [warning]You may want to use the help function for this recommender to learn more about it. To do this, just enter the ‘?Recommender’ command in R.[/warning]
#Create Recommender Model. The parameters are UBCF and Cosine similarity. We take 10 nearest neighbours
rec_mod = Recommender(ratingmat, method = "UBCF", param=list(method="Cosine",nn=10)) 
Now that we’ve built our model, let’s make some predictions. Starting with the first user:
#Obtain top 5 recommendations for 1st user entry in dataset
Top_5_pred = predict(rec_mod, ratingmat[1], n=5)
At this point, we’ve created recommendations for the first user, but we can’t see them. That’s annoying. To see the predictions our model made, we’ll convert them to a list and print them out:
#Convert the recommendations to a list
Top_5_List = as(Top_5_pred, "list")
Top_5_List
"47"   "893"  "1769" "2567" "3423" 
As you can see, we get movie recommendations… but alas, they’re in movieId number format. Let’s take a look at the movie names that correspond to these numbers. We’ll do this by using the movies dataset, which maps movie ids to movie titles.
#Uncomment the following line if the package is not installed
#install.packages("dplyr")
library(dplyr)

#Convert the recommended ids to a dataframe with a movieId column.
#movieId is an integer in the movies data, so we typecast the ids (returned as text) to match
Top_5_df=data.frame(movieId=as.integer(Top_5_List[[1]]))

#Merge the movie ids with names to get titles and genres
top_5_titles=left_join(Top_5_df, movies, by="movieId")

#Print the titles and genres
top_5_titles
  movieId                           title                genres
1      47     Seven (a.k.a. Se7en) (1995)      Mystery|Thriller
2     893             Mother Night (1996)                 Drama
3    1769 Replacement Killers, The (1998) Action|Crime|Thriller
4    2567                     EDtv (1999)                Comedy
5    3423              School Daze (1988)                 Drama
Based on similarity between users, for the first user, our model initially recommends the above movies. In our results, you can see:
  • The year that the movie was released.
  • The movie genres.
With further data processing and filtering, we could probably improve the relevancy of the recommendations, so that years and genres are even more similar. Congratulations, though!! You now know the basics of how to build a recommendation engine in R.

This is just the tip of the proverbial iceberg…

Netflix uses more than 27,000 genres to classify its movies. It suggests movies based on user similarities and on movie classifications. Tons of other features (like year, age, and user demographic) are used when making recommendations, so this tutorial really was just the tip of the iceberg when it comes to building functional recommendation systems. Take your knowledge deeper with my LinkedIn Learning course on building recommendation systems.

Or maybe you’re ready to go BEYOND learning new languages and implementation techniques. While it’s normal to think learning new data implementation skills is what will help you advance your career and get promoted, it’ll actually just keep you at your current level doing more of the same.

If you’re ready to skyrocket your data career and earn more recognition, responsibility and a heftier salary, it’s time you transition from data science to data strategy. Take my client’s word for it 👇🏻👇🏻
"I'VE USED [LILLIAN'S DATA STRATEGY SUPPORT] TO GET A PROMOTION AT WORK, DO MY WORK MUCH MORE EFFICIENTLY AND EFFECTIVELY. THIS HELPED INCREASE ADOPTION OF A TOOL I CREATED BY 30% INTERNALLY WITHIN THE TEAM WE PROVIDE INSIGHTS."
Derek Naminda
Marketing Analytics Manager, Cisco

 

Winning With Data: Case Collection was created for the DATA PROFESSIONAL who wants to up-level their career & their company’s results with effective data strategy. Get the exact use cases you need to start planning your company’s next profitable data project 👇🏻

If you enjoyed learning how to build a recommendation engine in R, why not share it with your colleagues so they can benefit from it too? Tag me on LinkedIn in your share, and I will reshare you!

 

Lillian Pierson, P.E.

Lillian Pierson is a CEO & data leader that supports data professionals to evolve into world-class leaders & entrepreneurs. To date, she’s helped educate over 1.3 million data professionals on AI and data science. Lillian has authored 6 data books with Wiley & Sons Publishers as well as 8 data courses with LinkedIn Learning. She’s supported a wide variety of organizations across the globe, from the United Nations and National Geographic, to Ericsson and Saudi Aramco, and everything in between. She is a licensed Professional Engineer, in good standing. She’s been a technical consultant since 2007 and a data business mentor since 2018. She occasionally volunteers her expertise in global summits and forums on data privacy and ethics.

This Post Has 17 Comments

  1. Yusuf

    So Cool.
    Thanks Lilian

  2. Abel

    Lillian,

    I enjoy learning from you and thanks for sharing.

    Thanks,

    Abel

  3. Kevin Burke

    Hello Lillian,
    I enjoy learning about artificial intelligence and this was really cool. That being said, I also read your article on how to get experience when you first start with your career and I haven’t had any luck. I went on freelancer.com and there were only three projects that were already bid on and I am not currently working for a company so I am having a hard time getting experience. I am not sure what to do because I have sent resumes to all companies around my area and I can’t move to a different state or different part of Georgia, so it is really difficult. Any help would be greatly appreciated.
    Thanks,
    Kevin Burke

    1. Hi Kevin – Please keep looking. There are tons of DS jobs on Upwork. That or create your own project, or go to Kaggle, etc.

  4. rahul

    why is used?

    1. Hi Rahul, How are you? Are you asking me to explain more why recommender systems are useful?

  5. Josephine WilesWarner

    Hi Lillian,

    Thanks for this course. I am just learning R coding. I will keep this and come back to it. I am happy you have taken the time to share this.

  6. Michael

    Hi Lillian,

    Thank you so much for this. I would like to try and create something like this. However, most use cases, theories and how-to’s are based on recommenders that use ratings. I’m looking for a hotel recommender, so taking into account the views and (single) purchase of an hotel. Basicly translating into ‘this is the most booked hotel for people that have also seen this hotel or have similar behavior as yours’. Is this possible?

  7. Ghena

    Hi Lillian,

    first of all, thank you for this information, I want to ask you how I can create a test and train set for this code & how to evaluate the accuracy

    Thank you again .

      1. Ghena

        very helpful, so can you tell me how I can create a train and test set to the code that uses ( UBCF) model, because Im trying to write a code using your article .

  8. Hello

    Greetings from India.

    When trying the above exercise, I got stuck as below

    ratingmat = as(ratingmat, “realRatingMatrix”)

    Error in validObject(.Object) : invalid class “dgTMatrix” object: all column indices (slot ‘j’) must be between 0 and ncol-1 in a TsparseMatrix

    Unable to proceed. Please help me correct the error

    Thanks

    SHRINIVAS
