NLP Based Recommender System Without User Preferences
March 23, 2019
Recommender systems (RS) have become a fundamental tool for helping users make informed decisions, especially in the era of big data, in which customers must choose from a vast number of products and services. Many RS models and techniques have been proposed, and most have achieved great success. Among them, content-based RS and collaborative-filtering RS are two representative approaches whose efficacy has been demonstrated by both the research and industry communities.
We will be building a content-based recommender system based on a natural language processing approach. This approach is very useful when we are dealing with products described by a title or a description (text data in general), such as news articles, jobs, and books.
We will be using a dataset of projects from some freelancing websites, parsed from RSS feeds for educational purposes only.
We will try to recommend interesting projects to freelancers based on the projects they have completed previously.
```python
import pandas as pd

projects = pd.read_csv("projects.csv")
projects_done = pd.read_csv("projects_done_by_users_history.csv")
projects.head()
```
The projects-done history table comprises two columns: user_id and project_id.
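Since the CSV files themselves are not included in the article, here is a minimal sketch of what the two tables might look like. The column names (Title, Description, Category, SubCategory, user_id, project_id) come from the article; the rows are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for projects.csv and projects_done_by_users_history.csv
projects = pd.DataFrame({
    "Title": ["Build a landing page", "Scrape product data", "Logo design"],
    "Description": ["Responsive HTML/CSS page", "Python web scraper", "Minimal vector logo"],
    "Category": ["Web", "Data", "Design"],
    "SubCategory": ["Frontend", "Scraping", "Branding"],
})
projects_done = pd.DataFrame({
    "user_id": [20, 20, 7],
    "project_id": [0, 1, 2],
})

print(projects_done.columns.tolist())  # ['user_id', 'project_id']
```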
Now let’s prepare the data and clean it.
```python
projects = projects.drop_duplicates(subset="Title", keep="first")
projects["Category"] = projects["Category"].fillna("")
projects["Description"] = projects["Description"].fillna("")
projects["SubCategory"] = projects["SubCategory"].fillna("")
projects["Title"] = projects["Title"].fillna("")
```
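To see what this cleaning step does, here is a quick sketch on a made-up frame with a duplicated title and a missing description:

```python
import numpy as np
import pandas as pd

# Made-up frame: one duplicated title, one missing description
df = pd.DataFrame({
    "Title": ["Logo design", "Logo design", "Web scraper"],
    "Description": ["Minimal vector logo", "Minimal vector logo", np.nan],
})

# Keep only the first occurrence of each title
df = df.drop_duplicates(subset="Title", keep="first")

# Replace missing text with an empty string so later string
# concatenation won't propagate NaN
df["Description"] = df["Description"].fillna("")

print(df)
```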
Let's combine these columns into a single text field for each project.
```python
# Join with spaces so words from different columns don't run together
projects["text"] = (projects["Category"] + " " + projects["Description"] + " "
                    + projects["SubCategory"] + " " + projects["Title"])
```
Now we will compute the TF-IDF (term frequency-inverse document frequency) matrix.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(projects['text'])
```
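To get a feel for what TF-IDF weighting produces, here is a small sketch on a made-up three-document corpus, using the same analyzer, n-gram range, and stop-word settings as above (the corpus itself is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up three-document corpus for illustration
corpus = [
    "python web scraping project",
    "python data analysis project",
    "logo design branding",
]

tf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), stop_words="english")
tfidf_toy = tf.fit_transform(corpus)

# One row per document, one column per unigram/bigram in the vocabulary
print(tfidf_toy.shape)

# A term shared by several documents ("python") gets a lower weight than
# a term unique to a single document ("logo")
vocab = tf.vocabulary_
print(tfidf_toy[0, vocab["python"]] < tfidf_toy[2, vocab["logo"]])  # True
```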
It's time to generate the cosine similarity matrix. Since TfidfVectorizer L2-normalizes its rows by default, linear_kernel (a plain dot product) gives the same result as cosine_similarity, just faster.
```python
from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
```
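The equivalence between the dot product and cosine similarity on L2-normalized TF-IDF vectors can be checked directly on a few made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

docs = ["python web project", "python data project", "logo design"]
m = TfidfVectorizer().fit_transform(docs)

# TfidfVectorizer uses norm="l2" by default, so every row is a unit
# vector and the dot product equals the cosine similarity exactly
assert np.allclose(linear_kernel(m, m), cosine_similarity(m))

# Each document is perfectly similar to itself
assert np.allclose(np.diag(linear_kernel(m, m)), 1.0)
```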
Now the setup is done and our freelancers are free for open projects, so let's recommend some interesting ones for them.
The getUserDone function returns a list containing the IDs of all projects completed by a given user.
get_recommendations returns the projects most similar to those a given user has completed, sorted in descending order of similarity.

```python
titles = projects['Title']
indices = pd.Series(projects.index, index=projects['Title'])
```
```python
def getUserDone(user_id):
    # All project IDs this user has already completed
    projects_done_by_user = projects_done[projects_done["user_id"] == user_id]
    return projects_done_by_user["project_id"].values.tolist()


def get_recommendations(user_id):
    done_ids = getUserDone(user_id)
    sim_scores = []
    for idx in done_ids:
        # Similarity of every project to this completed project
        sim_scores += list(enumerate(cosine_sim[idx]))
    # Sort by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    project_indices = [i[0] for i in sim_scores]
    # Skip the user's own completed projects, which rank first with similarity 1
    return titles.iloc[project_indices][len(done_ids):]
```
Let's recommend the top 25 projects that the user with ID 20 may be interested in.
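Since the article's dataset is not included, here is a condensed, self-contained rerun of the whole pipeline on made-up data, ending with that call. All project titles and the done-history rows are invented; only the technique mirrors the article:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up stand-in data; the article's CSV files are not included
projects = pd.DataFrame({
    "Title": [
        "Python web scraping script",
        "Python data pipeline",
        "Logo design",
        "Web scraping for real estate",
        "Brand identity design",
    ],
})
projects["text"] = projects["Title"]
projects_done = pd.DataFrame({"user_id": [20], "project_id": [0]})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(projects["text"])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
titles = projects["Title"]


def getUserDone(user_id):
    return projects_done[projects_done["user_id"] == user_id]["project_id"].tolist()


def get_recommendations(user_id):
    done_ids = getUserDone(user_id)
    sim_scores = []
    for idx in done_ids:
        sim_scores += list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    project_indices = [i[0] for i in sim_scores]
    # Skip the user's own completed projects, which rank first
    return titles.iloc[project_indices][len(done_ids):]


recs = get_recommendations(20)[:25]
# The project sharing the most terms with the user's completed one
# ("Web scraping for real estate") comes first
print(recs.tolist())
```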
In this article, we built a content-based recommender using an NLP approach, which is very useful for products featured by text data.
Without needing explicit user preferences, we can generate quality recommendations just from the projects users have completed.
This approach also has its drawbacks, which is why it is often recommended to use multiple approaches and combine them for more accurate results.