In this post I am going to introduce you a way of using a graph database like OrientDB for building a recommendation system through a simple use case scenario.
Every day people take decisions based on suggestion coming from the most varied sources: books, magazines, web sites, travel guides, radio, television, papers, advertisement and so on. Such advices are considered the more trustworthy the better is the level of the awareness of the source by the person.
Recommender Systems (RS) were introduced for identifying the items that may be considered interesting from a specific user in domains that extend from musical tracks to movies, from points of interests to articles to buy online.
Xiao & Benbasat give the following definition in the paper “E-commerce product recommendation agents: Use, characteristics, and impact, MIS Quarterly 31 (2007)”
“RS are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly.
They have potential to support and improve the quality of the decisions that consumers make while searching for and selecting products online.”
A lot of firms are currently implementing a recommender system in their services. Amazon, Spotify, Netflix, Ebay, just to name a few, gained their popularity
thanks to them.
Collaborative Filtering: A Recommender System Technique
Various state of art techniques have been proposed for RS. Here I will introduce the collaborative filtering, the process of looking for information or pattern through methods that involve the collaboration among multiple agents, data sources, viewpoints and so on. The aim is to provide recommendations or to predict the unknown interests of a user using the known preferences of a group of people. The core assumption of this approach is the following one: if the person A has the same opinion of the person B on a determined question, then is more likely that A has the most similar idea of B about a different problem than another unknown individual.
Let’s consider this scenario. The user Alessio has recently visited some places in his city and rated each one of his experiences in a 1 (the lowest) to 5 range.
The friends of Alessio did the same. Can a system infer his preferences and recommend him places that he didn’t visit yet and he may consider interesting?
You can solve this question using a collaborative filtering approach based on a graph model.
You can identify two type of nodes, Person and Place, and two kind of relationships, FOAF (Friend Of A Friend) and Visited, as in the Figure. Visited
has an attribute named rating.
We will use OrientDB, “the first Multi-Model Open Source NoSQL DBMS that brings together the power of graphs and the flexibility of documents into one scalable high-performance operational database”.
The version used is 2.0.8. Recommendation engines are general use case scenarios for graph databases. We will interact with the db via OrientStudio tool and query it using the OrientDB Sql Dialect.
Firstly we create the vertex classes Person and Place, then the edges classes FOAF and Visited.
CREATE CLASS Person EXTENDS V CREATE CLASS Place EXTENDS V Create class FOAF extends E Create class Visited extends E
Now we populate the db as the Figure.
The syntax is pretty straightforward for users who knows Sql. In particular you can create edges using subqueries or directly rid, if you already know them.
The person vertexes
INSERT INTO Person (id, name) VALUES (1, 'Alessio'), (2, 'Alice'), (3, 'Bob')
The place vertexes
INSERT INTO Place (id, name) VALUES (4, 'Colosseum'), (5, 'Circo Massimo'), (6, 'Trondheim'), (7, 'Cinque Terre')
create edge FOAF from (select from Person where name = 'Pippo') to (select from Person where name = 'Alice') create edge FOAF from #12:0 to #12:2 create edge Visited from (select from person where name = 'Alessio') to (select from Place where name = 'Colosseum') set rating = '5' create edge Visited from (select from person where name = 'Alessio') to (select from Place where name = 'Circo Massimo') set rating = '4' create edge Visited from (select from person where name = 'Alice') to (select from Place where name = 'Circo Massimo') set rating = '3' create edge Visited from (select from person where name = 'Alice') to (select from Place where name = 'Trondheim') set rating = '5'
edge for Bob using rid
create edge Visited from #12:2 to #13:0 set rating = '4' create edge Visited from #12:2 to #13:1 set rating = '3' create edge Visited from #12:2 to #13:2 set rating = '5' create edge Visited from #12:2 to #13:3 set rating = '4' create edge Visited from #12:2 to #13:3 set rating = '4'
The database has now been populated. It’s time to query it and unleash the power of collaborative filtering! The approach proposed here is the following one:
let’s consider we want to provide recommendations to the user Alessio about places he may like. Among all of his friends, we firstly look to the most similar user to Alessio, that is the one who has mostly visited the same places as Alessio. The system will thereby suggest him places visited by the most similar friend but not already visited by Alessio, sorted by ratings.
How can you translate this algorithm into an OrientDB query? Let’s break it down into smaller parts and go on step by step. For the sake of simplicity, consider we already know the rid of the user Alessio, id est #12:0 (it is easy to retrieve it anyways).
1) Find the most similar user. The similarity is the total number of the incoming edges to the places visited. We will proceeding incrementally:
Select the places visited from Alessio
Select the person nodes that have a relationship with the places visited by Alessio
SELECT expand(out('Visited').in('Visited')) FROM #12:0
Select the person nodes that have a Visited relationship with the places visited by Alessio and a FOAF relationship with the node Alessio. Count the occurrences as the friend similarity score, grouping by @rid and ordering by the similarity.
SELECT count(@rid) as friend_similarity, @rid as friend_id, name FROM (SELECT expand(in('Visited')) FROM (SELECT expand(out('Visited')) FROM #12:0)) WHERE @rid <> #12:0 AND @rid in (SELECT out('FOAF') FROM #12:0) GROUP BY @rid ORDER BY friend_similarity DESC
So putting it all together, the most similar user can be retrieved this way
SELECT expand(friend_id) FROM (SELECT count(@rid) as friend_similarity, @rid as friend_id, name FROM (SELECT expand(in('Visited')) FROM (SELECT expand(out('Visited')) FROM #12:0)) WHERE @rid <> #12:0 AND @rid in (SELECT out('FOAF') from #12:0) GROUP BY @rid ORDER BY friend_similarity DESC LIMIT 1)
2) Recommend the user the places visited by the most similar friend and not by him, sorted by rating. The previous query is bind to the $most_similar_friend variable in the LET block.
SELECT in.name, rating, $most_similar_friend.name FROM (SELECT expand(inE('Visited')) FROM (SELECT expand(@rid) FROM (SELECT expand(out('Visited')) FROM (select expand($most_similar_friend) LET $most_similar_friend = (SELECT expand(friend_id) FROM (SELECT count(@rid) as friend_similarity, @rid as friend_id, name FROM (SELECT expand(out('Visited').in('Visited')) FROM #12:0) WHERE @rid <> #12:0 AND @rid in (SELECT out('FOAF') from #12:0) GROUP BY @rid ORDER BY friend_similarity DESC LIMIT 1)))) WHERE @rid not in (select out('Visited') from #12:0))) WHERE out in $most_similar_friend ORDER BY rating DESC
We can use query against a bigger dataset: the italian cities visited by 100 friends. The graph is composed by 100 Person nodes, 8092 Place nodes (the ), 98 FOAF edges and 23917Visited edges (avg of 239 places visited by a single person ). Note that it has been used a 0-500 rating to have a better perception of the row ordering.
The top 3 most similar friends are
and recommended places generated by the system
Conclusion and future works
The aim of this post is to show how to use a graph db as OrientDB to train a recommendation system scenario with just a few lines of query. This collaborative filtering approach can be for sure refined and optimized. For example you can consider the top N similar friends and not only the first one, ordering the places aggregating by the sum of the ranking attribute in the incoming links. In other blog posts I will probably explore other graph databases and techniques.