I will add Dice, overlap, and Jaccard similarity in a future PR, probably still for 1.2.
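As background, on binary data these three measures reduce to simple set operations over the row indices at which each column is non-zero. A minimal Scala sketch of the definitions (not the future PR's implementation; colA and colB are hypothetical index sets):

  // Jaccard: |A ∩ B| / |A ∪ B|
  def jaccard(colA: Set[Long], colB: Set[Long]): Double = {
    val inter = (colA intersect colB).size.toDouble
    inter / (colA.size + colB.size - inter)
  }

  // Dice: 2|A ∩ B| / (|A| + |B|)
  def dice(colA: Set[Long], colB: Set[Long]): Double = {
    val inter = (colA intersect colB).size.toDouble
    2.0 * inter / (colA.size + colB.size)
  }

  // Overlap: |A ∩ B| / min(|A|, |B|)
  def overlap(colA: Set[Long], colB: Set[Long]): Double = {
    val inter = (colA intersect colB).size.toDouble
    inter / math.min(colA.size, colB.size)
  }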
On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Awesome...Let me try it out...
>
> Any plans to add other similarity measures in the future (Jaccard is something that will be useful)? I guess it makes sense to add some similarity measures to MLlib...
>
> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.
>>
>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> I looked at the code: similarColumns(Double.posInf) is generating the brute force...
>>>
>>> Basically, will dimsum with gamma as PositiveInfinity produce exactly the same result as doing a cartesian product of RDD[(product, vector)] and computing the similarities, or will there be some approximation?
>>>
>>> Sorry, I have not read your paper yet. Will read it over the weekend.
>>>
>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> For 60M x 10K, brute force and dimsum thresholding should be fine.
>>>>
>>>> For 60M x 10M, brute force probably won't work, depending on the cluster's power; dimsum thresholding should work with an appropriate threshold.
>>>>
>>>> Dimensionality reduction should help, and how effective it is will depend on your application and domain; it's worth trying if the direct computation doesn't work.
>>>>
>>>> You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.
>>>>
>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50, and then running all-pairs similarity...
>>>>>
>>>>> Did you also try similar ideas and see positive results?
>>>>>
>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>
>>>>>> Ok...just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, with say a billion data points...
>>>>>>
>>>>>> I have another version that's around 60M x ~10K...
>>>>>>
>>>>>> I guess for the second one both all-pairs and dimsum will run fine...
>>>>>>
>>>>>> But for tall and wide, what do you suggest? Can dimsum handle it?
>>>>>>
>>>>>> I might need Jaccard as well...can I plug that into the PR?
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>
>>>>>>> You might want to wait until Wednesday, since the interface will be changing in that PR before then (probably over the weekend), so that you don't have to redo your code. Your call if you need it within the week.
>>>>>>>
>>>>>>> Reza
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ohh cool....all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset...
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Deb
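For reference, the thresholded all-pairs computation discussed above looks roughly like the sketch below against the RowMatrix API. This is an illustration, not the PR's code; the method is shown under the columnSimilarities name it later shipped with in MLlib, whereas the PR snapshot above still calls it similarColumns.

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  // rows: one (sparse) vector per row; similarities are computed between columns.
  def allPairs(rows: RDD[Vector]) = {
    val mat = new RowMatrix(rows)

    // Brute force: exact cosine similarity for every column pair.
    val exact = mat.columnSimilarities()

    // DIMSUM sampling: pairs with cosine similarity at or above the
    // threshold are estimated accurately; lower-similarity pairs may
    // be dropped, in exchange for far less shuffle.
    val approx = mat.columnSimilarities(threshold = 0.1)

    (exact, approx)
  }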
>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Deb,
>>>>>>>>>
>>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>
>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Reza
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Reza,
>>>>>>>>>>
>>>>>>>>>> Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark?
>>>>>>>>>>
>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>
>>>>>>>>>> I am adding cosine similarity computation, but I do want to compute all-pairs similarities...
>>>>>>>>>>
>>>>>>>>>> Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think joining and grouping by (product, product) will be a big issue for me...
>>>>>>>>>>
>>>>>>>>>> Does it make sense to add all-pairs similarities alongside the dimsum-based similarity?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>> Deb
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>
>>>>>>>>>>> There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>
>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Reza
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes, and every node has 8G of memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the recommender-systems literature to see what else is out there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million users.
>>>>>>>>>>>>>> I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
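Andrew's naive cartesian approach above, as a rough Spark sketch: the cosine helper and the (userId, attributes) layout are assumptions for illustration, and as noted it is only feasible well below the 1M-user scale.

  import org.apache.spark.SparkContext._ // pair-RDD operations (groupByKey)
  import org.apache.spark.rdd.RDD

  // Cosine similarity between two dense attribute vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // For each user, the k most similar other users by cosine score.
  def topKSimilar(users: RDD[(Long, Array[Double])], k: Int): RDD[(Long, Seq[(Double, Long)])] =
    users.cartesian(users)                                 // N^2 pairs: 1M users => ~1T scores
      .filter { case ((idA, _), (idB, _)) => idA != idB }  // drop self-pairs
      .map { case ((idA, vA), (idB, vB)) => (idA, (cosine(vA, vB), idB)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._1).take(k))            // keep the k highest scores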