
Review of the new book: Practical Data Analysis

Tuesday, November 26, 2013



Hi all,

I was invited by the Packtpub team to review the book Practical Data Analysis by Hector Cuesta.  I have been reading it over the last two weeks and I really enjoyed the topics covered by the book.




The book goes through the hot topics of data science by presenting several practical examples of data exploration, analysis and even some machine learning techniques. Python as the main platform for the sample code was a perfect choice, in my opinion, since Python has been becoming more and more popular in the scientific community.

The book brings examples from several fields of study, such as stock prices, sentiment analysis on social networks, biology modelling scenarios, social graphs, MapReduce, text classification, data visualisation, etc.  Many libraries and tools are presented, including Numpy, Scipy, PIL, Pandas, NLTK, IPython, Wakari (I really liked the dedicated chapter on this excellent on-line scientific Python environment), etc.  It also covers NoSQL databases such as MongoDB and visualisation libraries like D3.js.

I believe the biggest value proposition of this book is that it brings together in one place several tools and shows how they can be applied to data science. Many of the tools mentioned lack further examples or documentation, so this book can assist any data scientist with that task.

However, the reader must not expect to learn machine learning and data science theory from this book.  The theory is scarce, and I believe that was not the author's main goal. For anyone looking to learn data science from scratch, this is not the right book. But for anyone who wants an extra resource for sample code and inspiration, it will be a great pick!

The source code is available on GitHub, but it is better explained inside the book, including illustrations. To sum up, I have to congratulate Hector for his effort in writing this book. The scientific community, including the Python group, will really enjoy it! I missed more material about installing the scientific software stack, since for beginners it can be really painful.  But overall, it is well written and focused on practical problems! A guide for any scientist.

For me the best chapters were Chapter 6: Simulation of Stock Prices (the visualisation using D3.js was great) and the last chapter, 14, about On-line Data Analysis with IPython and Wakari. It's the first time I have seen Wakari covered in a book! Everyone who works with scientific Python today should give this on-line tool a try some day! It's awesome!

Congratulations to PacktPub and Hector for the book!

Regards,

Marcel Caraciolo

Non-Personalized Recommender systems with Pandas and Python

Tuesday, October 22, 2013


Hi all,

At the last PythonBrasil I gave a tutorial about Python and data analysis focused on recommender systems, the main topic I've been studying over the last years. There is a popular Python package among statisticians and data scientists called Pandas. I had watched several talks and keynotes about it, but I hadn't given it a try. The tutorial gave me this chance, and afterwards both the audience and I felt quite excited about the potential and power that this library gives us.

This post starts a series of articles that I will write about recommender systems and even the introduction for the new-old refreshed library that I am working on:  Crab,  a python library for building recommender systems. :)

This post starts with the first topic of the theme, non-personalized recommender systems, and gives several examples with the Python package Pandas.  In the future I will also post an alternative version of this post referencing Crab, showing how the same ideas work with it.

But first let's introduce what Pandas is.

Introduction to Pandas


Pandas is a data analysis library for Python that is great for data preparation, joining and ultimately generating well-formed, tabular data that's easy to use in a variety of visualization tools or (as we will see here) machine learning applications. For further introduction about pandas, check this website or this notebook.
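Just to illustrate the kind of preparation and joining pandas makes easy, here is a tiny sketch (the file names and columns below are made up for the example, not part of the tutorial):

import pandas as pd

# hypothetical input files
ratings = pd.read_csv('ratings.csv')   # columns: course_id, user, rating
courses = pd.read_csv('courses.csv')   # columns: course_id, title

# join the two tables and compute the average rating per course title
merged = ratings.merge(courses, on='course_id')
print(merged.groupby('title')['rating'].mean())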

Non-personalized Recommenders


Non-personalized recommenders can  recommend items to consumers based on what other consumers have said about the items on average. That is, the recommendations are independent of the customer,  so each customer gets the same recommendation.  For example, if you go to amazon.com as an anonymous user it shows items that are currently viewed by other members.

Generally the recommendations come in two flavours: predictions or recommendations. Predictions are simple statements presented in the form of scores, stars or counts.  Recommendations, on the other hand, are generally a simple list of items shown without any number associated with them.

Let's go through an example:

Simple Prediction using Average

The score, on a scale of 1 to 5, for the book Programming Collective Intelligence was 4.5 stars out of 5.
This is an example of a simple prediction. It displays a simple average of other customers' reviews of the book.
The math behind it is quite simple:

Score = (65 * 5 + 18 * 4 + 7 * 3 + 4 * 2 + 2 * 1) / (65 + 18 + 7 + 4 + 2)
Score = 428 / 96
Score = 4.46 ≈ 4.5 out of 5 stars
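The same arithmetic in a few lines of Python, using the star counts from the example above:

# 65 five-star reviews, 18 four-star, 7 three-star, 4 two-star, 2 one-star
stars = {5: 65, 4: 18, 3: 7, 2: 4, 1: 2}

score = sum(rating * count for rating, count in stars.items())  # 428
reviews = sum(stars.values())                                   # 96
print(score / float(reviews))                                   # ~4.46, shown as 4.5 stars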

On the same page it also displays information about the other books which customers bought after buying Programming Collective Intelligence: a list of recommended books presented to anyone who visits the product's page. This is an example of a recommendation.




But how did Amazon come up with those recommendations ? There are several techniques that could be applied to provide them. One is association rule mining, a data mining technique that generates a set of rules and combinations of items that were bought together. Or it could be a simple measure based on the proportion of people who bought X and Y over the people who bought X. Let's explain using some maths:




Let X be the number of customers who purchased the book Programming Collective Intelligence. Let Y be the other books they purchased. You need to compute the ratio given below for each book and sort them in descending order.  Finally, pick the top K books and show them as related. :D

Score(X, Y) =  Total Customers who purchased X and Y / Total Customers who purchased X


Using this simple score function for all the books you will get:


Python for Data Analysis          100%
Startup Playbook                  100%
MongoDB Definitive Guide            0%
Machine Learning for Hackers        0%


As we imagined, the book Python for Data Analysis makes perfect sense. But why did the book Startup Playbook come to the top when it has also been purchased by customers who have not purchased Programming Collective Intelligence?  This is a famous trick in e-commerce applications called the banana trap.  Let's explain: in a grocery store most customers will buy bananas. If someone buys a razor and a banana, you cannot tell that the purchase of the razor influenced the purchase of the banana.  Hence we need to adjust the math to handle this case as well. Here is the modified version:

Score(X, Y) =  (Total Customers who purchased X and Y / Total Customers who purchased X) / 
         (Total Customers who did not purchase X but got Y / Total Customers who did not purchase X)

Substituting the numbers we get:

Python for Data Analysis = (2 / 2) / (1 / 3) = 1 / (1/3) = 3

Startup Playbook = (2 / 2) / (3 / 3) = 1

The denominator acts as a normalizer and you can see that Python for Data Analysis clearly stands out.  Interesting, isn't it?
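To make both formulas concrete, here is a small sketch with pandas over a toy purchase table that reproduces the numbers above (the data is illustrative, not the real Amazon data):

import pandas as pd

# toy purchase table: one row per (customer, book) purchase
purchases = pd.DataFrame([
    ('u1', 'Programming Collective Intelligence'),
    ('u1', 'Python for Data Analysis'),
    ('u1', 'Startup Playbook'),
    ('u2', 'Programming Collective Intelligence'),
    ('u2', 'Python for Data Analysis'),
    ('u2', 'Startup Playbook'),
    ('u3', 'Python for Data Analysis'),
    ('u3', 'Startup Playbook'),
    ('u4', 'Startup Playbook'),
    ('u5', 'Startup Playbook'),
], columns=['customer', 'book'])

x = 'Programming Collective Intelligence'
x_buyers = set(purchases[purchases.book == x].customer)
others = set(purchases.customer) - x_buyers

def scores(y):
    # naive score: who bought X and Y / who bought X
    # adjusted score: naive score normalized by how popular Y is among non-X buyers
    y_buyers = set(purchases[purchases.book == y].customer)
    naive = len(y_buyers & x_buyers) / float(len(x_buyers))
    adjusted = naive / (len(y_buyers & others) / float(len(others)))
    return naive, adjusted

for book in ['Python for Data Analysis', 'Startup Playbook']:
    print(book, scores(book))
# Python for Data Analysis -> (1.0, 3.0)
# Startup Playbook         -> (1.0, 1.0)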

In the next article I will work more with non-personalized recommenders, presenting some ranking algorithms that I developed for Atepassar.com for ranking professors. :)

Examples with real dataset (let's play with CourseTalk dataset)

To present non-personalized recommenders let's play with some data. I decided to crawl data from Course Talk, a popular review and ranking site for MOOCs.  It is an aggregator of several MOOCs where people can rate the courses and write reviews.  The dataset is a mirror from 10/11/2013 and it is only used here for study purposes.



Let's use Pandas to read all the data, show what we can do with Python and present a list of top courses ranked by some non-personalized metrics :)
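The dataset layout isn't shown here, so the snippet below is only a sketch (with a current pandas release) of the kind of non-personalized ranking the notebook builds, assuming the crawl produced a CSV with one row per review containing the course title and a 1-5 rating:

import pandas as pd

reviews = pd.read_csv('coursetalk_reviews.csv')   # assumed columns: course, rating

stats = reviews.groupby('course')['rating'].agg(['mean', 'count'])

# rank only courses with a minimum number of reviews, so a course with a
# single 5-star review does not top the list
top = stats[stats['count'] >= 20].sort_values('mean', ascending=False)
print(top.head(10))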

Update: for better readability I hosted all the code in an IPython Notebook, available at the following link via nbviewer.

The dataset and source code will be provided at Crab's GitHub; the idea is to build on those notebooks towards a future book about recommender systems :)

I hope you enjoyed this article, and stay tuned for the next one about another type of non-personalized recommender: ranking algorithms for vote up/vote down systems!

Special thanks to Diego Manillof for his tutorial :)

Cheers,

Marcel Caraciolo

Ruby in the world of Recommendations (also machine learning, statistics and visualizations)

Tuesday, September 17, 2013

Hello everyone!

I am back with lots of news and articles! I've been quite busy, but I have returned. In this post I'd like to share the presentation I gave at the Frevo On'Rails Pernambuco Ruby Meeting in Recife, PE. My Ruby developer colleagues around Recife invited me to give a lecture there.  I was quite excited about the invitation and instantly accepted.

I decided to research scientific computing with Ruby and recommender libraries written in Ruby.  Unfortunately the ecosystem for scientific computing in Ruby is still at a very early stage.  I found several libraries, but most of them had been abandoned by their developers or committers.  But I didn't give up and decided to refine my searches. I found a spectacular and promising project on scientific computing with Ruby called SciRuby. Its goal is to port several matrix and vector representations to Ruby using C and C++ as the backend. It reminded me a lot of the early days of NumPy and SciPy :)

About recommenders, I didn't find any work as deep as Mahout, but I found a library called Recommendable that uses memory-based collaborative filtering.  I really liked the design of the library and the developer's workaround of doing the linear algebra operations with Redis instead of Ruby :D

All those considerations and more insights are in my slides, feel free to share :)



I hope you enjoyed it, and even though I love Python, I really like programming in other languages :)


Regards,

Marcel

Slides and Video from my talk about Big Data with Python in Portuguese

Thursday, July 4, 2013




Hi all,

It has been a while since my last technical post, but it's for a great cause! I am currently writing a book about recommender systems and it is taking some dedicated time! But great posts are coming to the blog.

I'd like to publish an on-line talk that I gave on June 18 at #MutiraoPython, a great initiative that I started with my startup PyCursos, an on-line school for teaching Python and its applications. It's like a Coursera-style MOOC for programming in Portuguese!  This talk was part of a series of keynotes that happen every week on-line, for free, using Hangouts on Air!



I gave a lecture of about two hours on Big Data with Python and presented some tools used for data analysis. I know I could have explored Pandas, IPython and Scikit-learn in more depth; however, I decided to focus on the Hadoop architecture and the MapReduce paradigm, with some code examples in Python.

It's in Portuguese! But all the content is available for free! I hope you enjoy it!

Video and code







Review of the book Learning IPython for Interactive Computing and Data Visualization

Monday, June 17, 2013



Hi all,

I was invited to review a copy of the recently released book "Learning IPython for Interactive Computing and Data Visualization" by Cyrille Rossant.  The book focuses on one of the best tools for working with Python, the interactive shell IPython.  By the way, it was about time the tool received a dedicated book.


Learning IPython for Interactive Computing and Data Visualization

IPython is covered throughout the six chapters using several basic examples related to scientific computing, along with other Python tools such as Matplotlib, NumPy, Pandas, etc.  The first chapters explore the IPython basics, such as installation and the basic commands to get familiar with the tool.

The next chapters introduce NumPy and Pandas basics with the IPython shell active. Don't expect advanced examples with those tools; the idea is a simple demonstration of what we can do inside IPython.

There is a chapter discussing data visualization with graphs and plotting in the IPython Notebook. However, I missed more details about the IPython Notebook; it lacks deeper examples related to the topic.

I really liked chapter 5, which showed some basics of MPI (Message Passing Interface), although the topic wasn't explored in depth. Still, the introduction shows great potential and will be useful for more advanced books about IPython.

The last chapter shows how to create plugins for IPython, for instance a simple extension that introduces a new cell magic (write C++ code directly in the cell, and it will be automatically compiled and executed).
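I won't reproduce the book's C++ extension here, but just to give an idea of the mechanism, this is a minimal sketch of how a custom cell magic is registered inside an IPython session (the %%shout magic is a toy example of mine, not from the book):

from IPython.core.magic import register_cell_magic

@register_cell_magic
def shout(line, cell):
    # a toy cell magic: %%shout simply upper-cases the contents of the cell
    print(cell.upper())

The book's extension registers its magic in the same spirit, but compiles and runs the C++ source placed in the cell body.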

My conclusion about the book is that it achieves its expected goal: a technical introduction to IPython. If you want a book to explore scientific computing or advanced IPython topics, this is not that book yet. I can say that this book is a starting point for many more topics about the use of IPython: MPI, the IPython Notebook, etc.  I recommend the book as a reference to start exploring IPython! :D


Regards,

Marcel

Review of the book Learning Scipy for Numerical and Scientific Computing

Monday, April 15, 2013


Hi all,

I've finished reading the book "Learning Scipy for Numerical and Scientific Computing".  This book joins the scientific Python series that PacktPub is bringing to Python developers! Congratulations!  As the title informs, it covers SciPy, NumPy and Matplotlib.  I only missed some further information about IPython, but that wasn't the goal of the book, so it works well even with it left out.





It covers several important topics that are not commonly covered, especially with several snippets illustrating special functions from the SciPy library. For developers it will be another great reference book to complement the native docs that come with the library.  I enjoyed that the author focused more on the numerical analysis functions, which are among the most used in the library.

The book also brings chapters on more specific applications: signal processing, data mining and computational geometry.  There is an extra chapter about integration with other languages, but I found it not dense enough to explain those integrations. I really missed more getting-started examples showing how to install f2py or how to use SciPy with C/C++.

Overall, Learning Scipy for Numerical and Scientific Computing is a good book on SciPy, covering lots of mathematics with examples in Python. The book has a good size and helps scientists and scientific developers (by the way, non-developers will face some difficulties due to the heavy math that comes with the examples) get a good overview of the library before exploring the reference material.


Thanks Kenny for the invitation to review this book, and congratulations to Francisco for bringing one more technical book for scientific python computing to the series!

Regards,

Marcel Caraciolo


Slides for Scientific Computing Meeting: Benchy and GeoMapper Visualization

Sunday, April 7, 2013



Hi all,

Yesterday the XXVI local meeting of the Python Users Group at Pernambuco (PUG-PE) took place.  On the occasion I had the opportunity to present two talks about scientific computing with Python.


The first one was about Benchy, the lightweight framework for benchmark analysis of Python scripts, which I developed over about one week to help me check the performance of several algorithms I wrote in Python.  I covered the framework in my last post, which can be found here.


Here are the slides for the presentation:




The second talk was about a new type of visualization that I developed for social network analysis, used to check the degree of connections between the users of the social network, presented on a map using their geolocation data.

The result was a set of beautiful plots using this new type of visualization.  Amazing!

The slides are available here:



I hope you enjoy the slides; for any further information feel free to comment!

Regards,

Marcel Caraciolo

Performing runtime benchmarks with Python Monitoring Tool Benchy

Friday, March 22, 2013


Hi all,

In the last weeks I've been working on a little project that I developed called benchy.  The goal of benchy is to answer some simple questions: which code is faster? Which algorithm consumes more memory?  I know that there are several tools suitable for this task, but I would like to create performance reports by myself using Python.

Why did I create it?  At the beginning of the year I decided to rewrite all the code in Crab, a Python framework for building recommender systems.  One of the main components that required some refactoring was the pairwise metrics such as cosine, Pearson, Euclidean, etc.  I needed to test the performance of several versions of the code for those functions. But doing this manually? It's boring. That's why benchy was born!


What can benchy do ?

Benchy is a lightweight Python library for running performance benchmarks over alternative versions of code.  How can we use it ?

Let's see the cosine function, a popular pairwise function for comparing the similarity between two vectors and matrices in recommender systems.
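For reference, the plain NumPy version of the cosine distance between two vectors (the same baseline that appears in the report at the end of this post) looks like this:

import math
import numpy

def cosine_distances(X, Y):
    # cosine distance = 1 - cos(angle between X and Y)
    return 1. - numpy.dot(X, Y) / (math.sqrt(numpy.dot(X, X)) *
                                   math.sqrt(numpy.dot(Y, Y)))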




Let's define the benchmarks to test:
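The definitions were embedded as an image in the original post; here is a sketch reconstructed from the setup and statement blocks reported further down. The benchy.api import path and the Benchmark(statement, setup, name=...) signature follow the vbench-style API that inspired benchy, so treat them as an assumption rather than the library's documented interface:

from benchy.api import Benchmark  # assumed import path

common_setup = """
import numpy
X = numpy.random.uniform(1, 5, (1000,))
"""

statement = "cosine_distances(X, X)"

# scipy.spatial version (setup taken from the report below)
setup_scipy = common_setup + """
import scipy.spatial.distance as ssd
X = X.reshape(-1, 1)
def cosine_distances(X, Y):
    return 1. - ssd.cdist(X, Y, 'cosine')
"""

# plain numpy version (setup taken from the report below)
setup_numpy = common_setup + """
import math
def cosine_distances(X, Y):
    return 1. - numpy.dot(X, Y) / (math.sqrt(numpy.dot(X, X)) *
                                   math.sqrt(numpy.dot(Y, Y)))
"""

scipy_bench = Benchmark(statement, setup_scipy, name='scipy.spatial 0.8.0')
numpy_bench = Benchmark(statement, setup_numpy, name='numpy')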



With all benchmarks created, we could test a simple benchmark by calling the method run:
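Continuing the sketch above, running a single benchmark would look roughly like this (the printed dict is illustrative, shaped after the description in the next paragraph):

results = scipy_bench.run()
print(results)
# something like:
# {'runtime': {'repeat': 3, 'timing': 18.36, 'loops': 10, 'units': 'ms'},
#  'memory':  {'repeat': 3, 'usage': ..., 'units': 'MB'}}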


The dict associated with the key 'memory' holds the memory performance results: the number of repeats of the statement and the average memory usage, with its units. The key 'runtime' holds the timing results: the number of repeats, the number of loops and the average time per execution, with its units.

Do you want to see a more presentable output? That is possible by calling the method to_rst with the results as a parameter:


Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

import scipy.spatial.distance as ssd
X = X.reshape(-1,1)
def cosine_distances(X, Y):
    return 1. - ssd.cdist(X, Y, 'cosine')
Benchmark statement
cosine_distances(X, X)
name                 repeat  timing  loops  units
scipy.spatial 0.8.0  3       18.36   10     ms


Now let's check which one is faster and which one consumes less memory. Let's create a BenchmarkSuite, which is simply a container for benchmarks:

Finally, let's run all the benchmarks together with the BenchmarkRunner. This class loads all the benchmarks from the suite, runs each individual analysis and prints out interesting reports:
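Continuing the same sketch (the class and method names are the ones mentioned in this post; the exact constructor arguments and the suite's add method are assumptions, so check the project README):

from benchy.api import BenchmarkSuite, BenchmarkRunner

suite = BenchmarkSuite()
suite.add(scipy_bench)   # the suite simply collects the benchmarks defined earlier
suite.add(numpy_bench)   # (the sklearn and nltk versions would be added the same way)

runner = BenchmarkRunner(suite, tmp_dir='.', name='cosine benchmarks')
results = runner.run()

runner.plot_relative(results)   # timings relative to the fastest benchmark
runner.plot_absolute(results)   # absolute timings plus memory consumption
print(runner.to_rst(results))   # the full report, like the one reproduced at the end of this post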



Next, we will plot the relative timings. It is important to measure how much faster the other benchmarks are compared to the reference one. Just call the method plot_relative:




As you can see in the graph above, the scipy.spatial.distance function is 2129x slower and the sklearn approach is 19x slower. The best one is the numpy approach. Let's see the absolute timings by calling the method plot_absolute:



You may notice, besides the bars representing the timings, the line plot representing the memory consumption of each statement. The one that consumes the least memory is the nltk.cluster approach!

Finally, benchy also provides a full report for all benchmarks by calling the method to_rst:




Performance Benchmarks

These historical benchmark graphs were produced with benchy.
Produced on a machine with
  • Intel Core i5 950 processor
  • Mac Os 10.6
  • Python 2.6.5 64-bit
  • NumPy 1.6.1

scipy.spatial 0.8.0

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

import scipy.spatial.distance as ssd
X = X.reshape(-1,1)
def cosine_distances(X, Y):
    return 1. - ssd.cdist(X, Y, 'cosine')
Benchmark statement
cosine_distances(X, X)
name                 repeat  timing  loops  units
scipy.spatial 0.8.0  3       19.19   10     ms

sklearn 0.13.1

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

from sklearn.metrics.pairwise import cosine_similarity as cosine_distances
Benchmark statement
cosine_distances(X, X)
name            repeat  timing  loops  units
sklearn 0.13.1  3       0.1812  1000   ms

nltk.cluster

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

from nltk import cluster
def cosine_distances(X, Y):
    return 1. - cluster.util.cosine_distance(X, Y)
Benchmark statement
cosine_distances(X, X)
name          repeat  timing   loops  units
nltk.cluster  3       0.01024  1e+04  ms

numpy

Benchmark setup
import numpy
X = numpy.random.uniform(1,5,(1000,))

import numpy, math
def cosine_distances(X, Y):
    return 1. -  numpy.dot(X, Y) / (math.sqrt(numpy.dot(X, X)) *
                                     math.sqrt(numpy.dot(Y, Y)))
Benchmark statement
cosine_distances(X, X)
name   repeat  timing    loops  units
numpy  3       0.009339  1e+05  ms

Final Results

name                 repeat  timing    loops  units  timeBaselines
scipy.spatial 0.8.0  3       19.19     10     ms     2055
sklearn 0.13.1       3       0.1812    1000   ms     19.41
nltk.cluster         3       0.01024   1e+04  ms     1.097
numpy                3       0.009339  1e+05  ms     1

Final code!

I might say this micro-project is still a prototype; however, I tried to build it to be easily extensible. I have several ideas to extend it, but feel free to fork it and send suggestions and bug fixes.  This project was inspired by the open-source project vbench, a framework for performance benchmarks over your source repository's history. I recommend it!

For me, benchy will assist in testing several alternative pairwise functions in Crab. :)  Soon I will publish the performance results that we got with the pairwise functions we built for Crab :)

I hope you enjoyed,

Regards,

Marcel Caraciolo

Graph Based Recommendations using "How-To" Guides Dataset

Friday, March 1, 2013


Hi all,

In this post I'd like to introduce another approach for recommender engines using graph concepts to recommend novel and interesting items. I will build a graph-based how-to tutorials recommender engine using the data available on the website SnapGuide (By the way I am a huge fan and user of this tutorials website), the graph database Neo4J and the graph traversal language Gremlin.

What is SnapGuide ?

Snapguide is a web service for anyone who wants to create and share step-by-step "how-to guides".  It is available on the web and as an iOS app. There you can find several tutorials with easy visual instructions for a wide array of topics including cooking, gardening, crafts, projects, fashion tips and more.  It is free, and anyone is invited to submit guides in order to share their passions and expertise with the community.  I extracted the corpus of tutorial likes from their website for research purposes only. Several users may like a tutorial, and this signal can be quite useful for recommending similar tutorials based on what other users liked.  Unfortunately I can't provide the dataset for download, but you can follow the code below with your own dataset.

Snapguide 



Getting Started with Neo4J


To create and explore your own graph with Neo4J you would normally use Java/Groovy.  Instead I found Bulbflow, an open-source Python ORM for graph databases that supports pluggable backends using the Blueprints standards.  In this post I used it to connect to Neo4j servers.  The snippet below is a simple example of Bulbflow in action, creating some vertices and edges.


>>> from people import Person, Knows
>>> from bulbs.neo4jserver import Graph
>>> g = Graph()
>>> g.add_proxy("people", Person)
>>> g.add_proxy("knows", Knows)
>>> james = g.people.create(name="James")
>>> julie = g.people.create(name="Julie")
>>> g.knows.create(james, julie)

Generating our tutorials Graph


I decided to define my graph schema in order to map the raw data into a property graph, so that the traversals required to get recommendations of which tutorials to check could be as natural as possible.


SnapGuide Graph Schema


The data will be inserted into the graph database Neo4J. The code below creates a new Neo4J graph with the whole dataset.

#-*- coding: utf-8 -*-
from bulbs.neo4jserver import Graph
from nltk.tag.hunpos import HunposTagger
from nltk.tokenize import word_tokenize

ht = HunposTagger('en_wsj.model')

likes = open('likes.csv')
tutorials = open('tutorials.csv')
users = open('users.csv')
g = Graph()
def filter_nouns(words):
    return [word.lower() for word, cat in words if cat in ['NN', 'NNP', 'NNPS']]

# Loading tutorials and categories
for tutorial in tutorials:
    tutorial = tutorial.strip()
    try:
        # the likes count is renamed so it does not shadow the likes file handle
        ID, title, n_likes, category = tutorial.split(';')
    except ValueError:
        try:
            ID, title, category = tutorial.split(';')
        except ValueError:
            t = tutorial.split(';')
            ID, title, category = t[0], t[1].replace('&Yuml', ''), t[-1]

    tut = g.vertices.create(type='Tutorial', tutorialId=int(ID), title=title)
    # keywords: the nouns from the title plus the category itself
    keywords = filter_nouns(ht.tag(word_tokenize(title)))
    keywords.append(category)

    for keyword in keywords:
        resp = g.vertices.index.lookup(category=keyword)
        if resp is None:
            ct = g.vertices.create(type='Category', category=keyword)
        else:
            ct = resp.next()
        g.edges.create(tut, 'hasCategory', ct)

# Loading the user dataset
for user in users:
    user = user.strip()
    username = user.split(';')[0]
    user = g.vertices.create(type='User', userId=username)
#Loading the likes dataset.
for like in likes:
    like = like.strip()
    item_id, user_id = like.split(';')
    p = g.vertices.index.lookup(tutorialId=int(item_id))
    q =  g.vertices.index.lookup(userId=user_id)
    g.edges.create(q.next(), 'liked', p.next())
There are three input files: tutorials.csv, users.csv and likes.csv. The file tutorials.csv contains the list of tutorials; each row has three columns: tutorialId, title and category. The file users.csv contains the list of users; each row has the columns userId and user name.  Finally, likes.csv contains the tutorials that each user marked as interesting; each row of the raw file has a tutorialId and a userId.

Given that there are more than 1 million likes, it will take some time to process all the data. An important note before going on: don't forget to create the vertex indexes; if you forget, your queries will take ages to process.


  1. //These indexes are a must, otherwise querying the graph database will take so looong
  2. g.createKeyIndex('userId',Vertex.class)
  3. g.createKeyIndex('tutorialId',Vertex.class)
  4. g.createKeyIndex('category',Vertex.class)
  5. g.createKeyIndex('title',Vertex.class)


Before moving on to recommender algorithms, let's make sure the graph is ok.

For instance,  what is the distribution of keywords amongst the tutorials repository ?

  1. //Distribution frequency of Categories
    def dist_categories(){
      m = [:]
     g.V.filter{it.getProperty('type')=='Tutorial'}.out('hasCategory').category.groupCount(m).iterate() 
    return m.sort{-it.value}
    }
>>> script = g.scripts.get('dist_categories')
>>> categories = g.gremlin.execute(script, params=None)
>>> sorted(categories.content.items(), key=lambda keyword: -keyword[1])[:10]
[(u'food', 4537), (u'make', 3840), (u'arts-crafts', 1609), (u'cook', 1362), (u'desserts', 1247), (u'beauty', 1108), (u'technology', 943), (u'drinks', 587), (u'home', 508), (u'chicken', 452)]

What about the average number of likes per tutorial ?

  1. //Get the average number of likes per tutorial
    def avg_likes(){
        return g.V.filter{it.getProperty('type')=='Tutorial'}.transform{it.in('liked').count()}.mean()
    }

>>> script = g.scripts.get('avg_likes')
>>> likes = g.gremlin.command(script, params=None)
>>> likes
111.089116326

Traversing the Tutorials Graph

Now that the data is represented as a graph, let's make some queries. Behind the scenes what we do are traversals.  In recommender systems there are two general types of recommendation approaches: collaborative filtering and content-based.

In collaborative filtering, the liking behavior of users is correlated in order to recommend the favorites of one user to another; in this case we look for similar users.

I like the tutorials Amanda preferred, what other tutorials does Amanda like that I haven't seen ?

The content-based strategy, on the other hand, is based on the features of a recommendable item: the attributes are analyzed in order to find other items with analogous features.

I  like food tutorials, what other food tutorials are there ?


Making Recommendations

Let's begin with collaborative filtering.  I will use some complex traversal queries on our graph.  Let's start with the tutorial "How to Make Sous Vide Chicken at Home".  Yes, I love chicken! :)

Great dish by the way!
How many users liked Make Sous Vide Chicken at Home ?
  1. //Count the users who liked a tutorial
    def n_users_liked(tutorial){
       v = g.V.filter{it.getProperty('title') == tutorial}
       return v.inE('liked').outV.count()
    }
>>> tuts = g.vertices.index.lookup(title='Make Sous Vide Chicken at Home')
>>> tut = tuts.next()
>>> tut.title 
Make Sous Vide Chicken at Home
>>> tut.tutorialId
11890
>>> tut.type  
Tutorial
>>> script = g.scripts.get('n_users_liked')
>>> users_liked = g.gremlin.command(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> users_liked
1000
This traversal doesn't give us much useful information on its own, but we can now put collaborative filtering into action with an extended query:

Which users liked Make Sous Vide Chicken at Home, and what other tutorials did they like most in common?


  1. //Get the users who liked the tutorial and what other tutorials did they like too ?
    def similar_tutorials(tutorial){
    v = g.V.filter{it.getProperty('title') == tutorial}
    return v.inE('liked').outV.outE('liked').inV.title[0..4]
    }


>>>> script = g.scripts.get('similar_tutorials')
>>>> similar_tutorials = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> similar_tutorials.content
[u'Make Potato Latkes', u'Make Beeswax and Honey Lip Balm', u'Make Sous Vide Chicken at Home', u'Cook the Perfect & Simple Chicken Ramen Soup', u'Make a Simple (But Authentic) Paella on Your BBQ']

What does the query above express ?

It filters all users that liked the tutorial (inE('liked')), finds out what else they liked (outV.outE('liked')), and fetches the titles of those tutorials (inV.title). It returns the first five items ([0..4]).

In recommendations we have to find the most commonly purchased or liked items.  Using Gremlin, we can build a simple collaborative filtering algorithm by chaining several steps together.

  1. //Get similar tutorials
    def topMatches(tutorial){
        m = [:]
    v = g.V.filter{it.getProperty('title') == tutorial}
    v.inE('liked').outV.outE('liked').inV.title.groupCount(m).iterate()
        return m.sort{-it.value}[0..9]

    }


>>> script = g.scripts.get('topMatches')
>>> topMatches = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> sorted(topMatches.content.items(), key=lambda keyword: -keyword[1])[:10]
{u'Make Cake Pops!!': 75, u'Make Sous Vide Chicken at Home': 1000, u'Make Potato Latkes': 124, u'Make Incredible Beef Jerky at Home Easily!': 131, u'Cook the Perfect & Simple Chicken Ramen Soup': 96, u'Make Mint Juleps': 74, u"Solve a 3x3 Rubik's Cube": 89, u'Cook Lamb Shanks Moroccan Style': 74, u'Make Beeswax and Honey Lip Balm': 75, u'Make an Aerium': 74}

This traversal will return a list of tutorials.  But you may notice that if you get all matches, there are many duplicates. This happens because the people who like How to Make Sous Vide Chicken at Home also like many of the same other tutorials.  This is exactly the similarity between users that collaborative filtering algorithms exploit.


How many of the tutorials highly correlated with How to Make Sous Vide Chicken at Home are unique ?

  1. //Get the number of unique similar tutorials
    def n_similar_unique_tutorials(tutorial){
    v = g.V.filter{it.title == tutorial}
    return v.inE('liked').outV.outE('liked').inV.dedup.count()
    }

    //Get the number of similar tutorials
    def n_similar_tutorials(tutorial){
    v = g.V.filter{it.getProperty('title') == tutorial}
    return v.inE('liked').outV.outE('liked').inV.count()
    }

>>> script = g.scripts.get('n_similar_tutorials')
>>> similar_tutorials = g.gremlin.command(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> similar_tutorials
37323
>>> script = g.scripts.get('n_similar_unique_tutorials')
>>> similar_tutorials = g.gremlin.command(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> similar_tutorials
8766

There are 37323 paths from Make Sous Vide Chicken at Home to other tutorials, and only 8766 of those tutorials are unique. We can use these duplications to build a ranking mechanism for recommendations.

Which tutorials are most highly co-rated with How to Make Sous Vide Chicken ?


>>> script = g.scripts.get('topMatches')
>>> topMatches = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> sorted(topMatches.content.items(), key=lambda keyword: -keyword[1])[:10]
[(u'Make Sous Vide Chicken at Home', 1000), (u'Make Incredible Beef Jerky at Home Easily!', 131), (u'Make Potato Latkes', 124), (u'Cook the Perfect & Simple Chicken Ramen Soup', 96), (u"Solve a 3x3 Rubik's Cube", 89), (u'Make Cake Pops!!', 75), (u'Make Beeswax and Honey Lip Balm', 75), (u'Make Mint Juleps', 74), (u'Cook Lamb Shanks Moroccan Style', 74), (u'Make an Aerium', 74)]

So we have the top similar tutorials. It says that people who like Make Sous Vide Chicken at Home also like Make Sous Vide Chicken at Home, oops! Let's remove these reflexive paths by filtering out Sous Vide Chicken itself.

  1. //Get similar tutorials
    def topUniqueMatches(tutorial){
        m = [:]
        v = g.V.filter{it.getProperty('title') == tutorial}
        possible_tutorials = v.inE('liked').outV.outE('liked').inV
        possible_tutorials.hasNot('title',tutorial).title.groupCount(m).iterate()
        return m.sort{-it.value}[0..9]
    }




>>>> script = g.scripts.get('topUniqueMatches')
>>>> topMatches = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> topMatches.content
[(u'Make Incredible Beef Jerky at Home Easily!', 131), (u'Make Potato Latkes', 124), (u'Cook the Perfect & Simple Chicken Ramen Soup', 96), (u"Solve a 3x3 Rubik's Cube", 89), (u'Make Cake Pops!!', 75), (u'Make Beeswax and Honey Lip Balm', 75), (u'Make Mint Juleps', 74), (u'Cook Lamb Shanks Moroccan Style', 74), (u'Make an Aerium', 74), (u'Make a Leather iPhone Flip Wallet', 73)]

The recommendation above starts from a particular tutorial (i.e. Make Sous Vide Chicken), not from a particular user. This collaborative filtering method is called item-based filtering.   

Given a tutorial that a user likes: who else likes this tutorial, and from those users, what other tutorials do they like that are not already liked by the initial user?

And the recommendations for a particular user ?  That is where user-based filtering comes in.


Which tutorials liked by similar users are recommended for a given user ?


  1. def userRecommendations(user){
      m = [:]
      x = [] as Set
      v = g.V.filter{it.getProperty('userId') == user}
      v.out('liked').aggregate(x).in('liked').dedup.out('liked').except(x).title.groupCount(m).iterate()
      return m.sort{-it.value}[0..9]
    }
>>> script = g.scripts.get('userRecommendations')
>>>> recommendations = g.gremlin.execute(script, params={'user': 'emma-rushin'})
>>> recommendations.content
[(u'Create a Real Fisheye Picture With Your iPhone', 1156), (u'Make a DIY Galaxy Print Tshirt', 933), (u'Make a Macro Lens for Free!', 932), (u'Make Glass Marble Magnets With Any Image', 932), (u'Make DIY Nail Decals', 932), (u'Make a Five Strand Braid', 929), (u'Create a Pendant Lamp From Coffee Filters', 928), (u'Make Avocado Toast', 926), (u'Make Instagram Magnets for Less Than $10', 923), (u'Make a Recycled Magazine Tree (Christmas Tree)', 923)]

Emma Rushin will really like art and crafts suggestions! :D

OK, we have interesting recommendations, but if I want to make other styles of chicken, like Chicken Ramen Soup, for my dinner, I probably do not want a tutorial on How to Solve a 3x3 Rubik's Cube.  To adapt to this situation, it is possible to mix collaborative filtering and content-based recommendation into a single traversal, so it would recommend similar chicken and food tutorials based on shared keywords.
Now let's play with content-based recommendation! 
Which tutorials are most highly correlated with Sous Vide Chicken that share the same category of food?

  1. //Top recommendations mixing content + collaborative sharing all categories.
    def topRecommendations(tutorial){
      m = [:]
      x = [] as Set
     v = g.V.filter{it.getProperty('title') == tutorial}
     tuts =v.out('hasCategory').aggregate(x).back(2).inE('liked').outV.outE('liked').inV
    tuts.hasNot('title',tutorial).out('hasCategory').retain(x).back(2).title.groupCount(m).iterate()
      return m.sort{-it.value}[0..9]
    }
>>>> script = g.scripts.get('topRecommendations')
>>>> recommendations = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> recommendations.content
[(u'Make Incredible Beef Jerky at Home Easily!', 131), (u'Make Potato Latkes', 124), (u'Cook the Perfect & Simple Chicken Ramen Soup', 96), (u'Make Cake Pops!!', 75), (u'Make Beeswax and Honey Lip Balm', 75), (u'Make Mint Juleps', 74), (u'Cook Lamb Shanks Moroccan Style', 74), (u'Cook an Egg in a Basket', 72), (u'Make Banana Fritters', 72), (u'Prepare Chicken With Peppers and Gorgonzola Cheese', 71)]

This ranking makes sense, but it still has a flaw: a tutorial like Make Mint Juleps may not be interesting for me. How about only considering those tutorials that share the keyword 'chicken' with Sous Vide Chicken ?

Which tutorials are most highly co-rated with Sous Vide Chicken and share the keyword 'chicken' with it?
  1. //Top recommendations mixing content + collaborative sharing the chicken category.
    def topRecommendations(tutorial){
     m = [:]
     v = g.V.filter{it.getProperty('title') == tutorial}

     v.inE('liked').outV.outE('liked').inV.hasNot('title',tutorial).out('hasCategory').
     has('category' ,'chicken').back(2).title.groupCount(m).iterate()

     return m.sort{-it.value}[0..9]
    }

>>>> script = g.scripts.get('topRecommendations')
>>>> recommendations = g.gremlin.execute(script, params={'tutorial': 'Make Sous Vide Chicken at Home'})
>>> recommendations.content
{u'Make a Whole Chicken With Veggies in the Crockpot': 28, u'Bake Crispy Chicken With Doritos': 30, u'Cook Chicken Rollatini With Zucchini & Mozzarella': 28, u'Make Beer Can Chicken': 23, u'Roast a Chicken': 54, u'Cook the Perfect & Simple Chicken Ramen Soup': 96, u'Pesto Chicken Roll-Ups Recipe': 31, u'Cook Chicken in Roasting Bag': 23, u'Make Chicken Enchiladas': 29, u'Prepare Chicken With Peppers and Gorgonzola Cheese': 71}


Conclusions
In this post I presented one strategy for recommending items using graph concepts. What I explored here is the flexibility of the property graph data structure and the notion of derived and inferred relationships. This strategy could be further explored to use other features available in your dataset (I am sure SnapGuide has richer information to use, such as age, sex and the category taxonomy).  I am working on a book about recommender systems and I will explain graph-based recommendations in more detail, so stay tuned to my blog!

The performance ?  OK, I didn't run tests to compare with the current solutions available today.  What I can say is that Neo4J can theoretically hold billions of entities (vertices + edges) and Gremlin makes advanced queries possible. I will perform some tests, but based on what I studied, runtimes vary depending on the complexity of the graph structure.

I also would like to thank Marko Rodriguez for his help in the Gremlin-Users community and for the post that inspired me to take a further look into Neo4J + Gremlin! It amazed me! :)

Regards,

Marcel Caraciolo