OK, forget about this question. It was a nasty, one-character typo in my own code (sorting by rating instead of item at one point).

Best,
Friso
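(For the archives: the slip was of this shape when building the canonical (item, item) key. The names below are illustrative only, not the exact notebook code.)

    from operator import itemgetter

    one_users_ratings = [("item_a", 5.0), ("item_b", 1.0)]  # (item, rating) pairs for one user

    # Buggy: key=itemgetter(1) sorts by rating, so this user emits the key
    # ("item_b", "item_a"), while a user who happened to rate "item_a" lower
    # would emit ("item_a", "item_b") -- reduceByKey then sees two different keys.
    buggy_key = tuple(item for item, _ in sorted(one_users_ratings, key=itemgetter(1)))

    # Fixed: key=itemgetter(0) sorts by item, so the key is always ("item_a", "item_b").
    fixed_key = tuple(item for item, _ in sorted(one_users_ratings, key=itemgetter(0)))

With the pair ordered by item, every user emits the same key for the same two items, which is what reduceByKey relies on.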
On Tue, Mar 25, 2014 at 1:53 PM, Friso van Vollenhoven <[email protected]> wrote:
> Hi,
>
> I have an example where I use a tuple of (int, int) in Python as the key for
> an RDD. When I do a reduceByKey(...), sometimes the tuples turn up with the
> two ints reversed in order (which is problematic, as the ordering is part of
> the key).
>
> Here is an IPython notebook that has some code and demonstrates the issue:
> http://nbviewer.ipython.org/urls/dl.dropboxusercontent.com/u/5812021/test-on-cluster.ipynb?create=1
>
> Here's the long story... I am doing collaborative filtering using cosine
> similarity on Spark using Python. Not because I need it, but it seemed like
> an appropriately simple yet useful exercise to get started with Spark.
>
> I am seeing a difference in outcomes between running locally and running on
> a cluster.
>
> My approach is this:
>
>   - given a dataset containing tuples of (user, item, rating)
>   - group by user
>   - for each user, flatMap into tuples of ((item, item), (rating, rating)),
>     one for each combination of two items that that user has seen
>   - then map into a structure containing ((item, item), (rating * rating,
>     left rating ^ 2, right rating ^ 2, 1))
>   - then reduceByKey, summing up the values in the tuples column-wise; this
>     gives a tuple of (sum of rating products, sum of left rating squares,
>     sum of right rating squares, co-occurrence count), which can be used to
>     calculate the cosine similarity
>   - map into ((item, item), (similarity, count))
>   - this should result in a dataset that looks like: ((item, item),
>     (similarity, count))
>
> (In the notebook I leave out the final step of converting the sums into a
> cosine similarity.)
>
> When I do this for an artificial dataset of 100000 users who each rate the
> same 70 items (so 7M ratings in total, where the user/item matrix is dense),
> I would expect the resulting dataset to have 2415 tuples (== 70 * 69 / 2,
> the number of co-occurrences that exist amongst the 70 items that each user
> has rated), and I would expect the co-occurrence count for each (item, item)
> pair to be 100000, as there are 100K users.
>
> When I run my code locally, the above assumptions work out, but when I run
> on a small cluster (4 workers, on AWS), the numbers are way off. This
> happens because of the reversed tuples.
>
> Where am I going wrong?
>
> Thanks for any pointers, cheers,
> Friso
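For completeness, here is a rough PySpark sketch of the pipeline described in the quoted message, with the (item, item) key canonicalised by item (i.e. with the typo fixed). The names sc and ratings and the helper functions are illustrative; the actual code is in the linked notebook.

    from itertools import combinations

    # Assumes sc is a SparkContext and ratings is an RDD of (user, item, rating) tuples.

    def item_pairs(user_ratings):
        # user_ratings: one user's (item, rating) pairs. sorted() orders by item,
        # so the (item, item) key always comes out in the same order.
        for (i1, r1), (i2, r2) in combinations(sorted(user_ratings), 2):
            yield ((i1, i2), (r1 * r2, r1 ** 2, r2 ** 2, 1))

    def add_columns(a, b):
        # Column-wise sum of the 4-tuples.
        return tuple(x + y for x, y in zip(a, b))

    def cosine(sums):
        dot, left_sq, right_sq, count = sums
        return (dot / ((left_sq ** 0.5) * (right_sq ** 0.5)), count)

    similarities = (ratings
        .map(lambda t: (t[0], (t[1], t[2])))    # (user, (item, rating))
        .groupByKey()                           # all of a user's (item, rating) pairs
        .flatMap(lambda kv: item_pairs(kv[1]))  # ((item, item), (r1*r2, r1^2, r2^2, 1))
        .reduceByKey(add_columns)               # column-wise sums per (item, item)
        .mapValues(cosine))                     # ((item, item), (similarity, count))

On the artificial dataset described above (100000 users all rating the same 70 items), similarities.count() should come out to 2415 and every co-occurrence count should be 100000.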
