Hi Xiangrui,

For tall skinny matrices, if I can pass a similarityMeasure to
computeGramianMatrix, I could re-use the SVD's computeGramianMatrix for
similarity computation as well...
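The idea above, reusing the Gramian for similarity, can be sketched locally. This is an illustrative sketch, not MLlib code: the names below (`GramianCosine`, `cosineFromGramian`) are made up, and the real `RowMatrix.computeGramianMatrix` returns a local `Matrix` rather than nested arrays. It just shows that cosine similarity is a normalization of the Gramian's entries by its diagonal:

```scala
object GramianCosine {
  // A tall-skinny matrix represented by its columns (dense here for clarity;
  // the real matrix would be row-distributed and sparse).
  type Cols = Array[Array[Double]]

  // Gramian G = A^T A: entry (i, j) is the dot product of columns i and j.
  def gramian(cols: Cols): Cols =
    cols.map(ci => cols.map(cj => ci.zip(cj).map { case (a, b) => a * b }.sum))

  // Cosine similarity falls out of the Gramian for free:
  // cos(i, j) = G(i, j) / sqrt(G(i, i) * G(j, j)).
  def cosineFromGramian(g: Cols): Cols = {
    val norms = g.indices.map(i => math.sqrt(g(i)(i)))
    g.indices.map(i =>
      g.indices.map(j => g(i)(j) / (norms(i) * norms(j))).toArray
    ).toArray
  }
}
```

A `similarityMeasure` hook would amount to swapping the normalization step while keeping the same A^T A pass.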
Do you recommend using this approach for tall skinny matrices, or just
using the dimsum routines? Right now RowMatrix does not have a
loadRowMatrix function like the one available in LabeledPoint...should I
add one? I want to export the matrix out from my stable code and then
test dimsum...

Thanks.
Deb

On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:

> I will add dice, overlap, and jaccard similarity in a future PR, probably
> still for 1.2
>
>
> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Awesome...Let me try it out...
>>
>> Any plans of putting other similarity measures in future (jaccard is
>> something that will be useful)? I guess it makes sense to add some
>> similarity measures in mllib...
>>
>>
>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it
>>> into the usual brute-force algorithm for cosine similarity; there is no
>>> sampling. This is by design.
>>>
>>>
>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> I looked at the code: similarColumns(Double.posInf) is generating the
>>>> brute force...
>>>>
>>>> Basically, will dimsum with gamma as PositiveInfinity produce exactly
>>>> the same result as doing cartesian products of RDD[(product, vector)]
>>>> and computing similarities, or will there be some approximation?
>>>>
>>>> Sorry, I have not read your paper yet. Will read it over the weekend.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>
>>>>> For 60M x 10K, brute force and dimsum thresholding should be fine.
>>>>>
>>>>> For 60M x 10M, brute force probably won't work depending on the
>>>>> cluster's power, and dimsum thresholding should work with an
>>>>> appropriate threshold.
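A small sanity check of the claim that gamma = PositiveInfinity disables sampling. The names here are illustrative, not the PR's API; the keep probability follows the min(1, sqrt(gamma) / ||c||) form described in the DIMSUM paper, as I read it:

```scala
object DimsumSanityCheck {
  // Exact (brute-force) cosine similarity between two columns.
  def cosine(x: Array[Double], y: Array[Double]): Double = {
    val dot = x.zip(y).map { case (a, b) => a * b }.sum
    val norm = (v: Array[Double]) => math.sqrt(v.map(e => e * e).sum)
    dot / (norm(x) * norm(y))
  }

  // DIMSUM keeps a row's contribution to a column c with probability
  // min(1, sqrt(gamma) / ||c||); with gamma = +infinity this is always 1,
  // so nothing is sampled away and the result matches brute force exactly.
  def keepProbability(gamma: Double, colNorm: Double): Double =
    math.min(1.0, math.sqrt(gamma) / colNorm)
}
```

So the "approximation" question has a clean answer: the sampling probability clamps to 1 for every column, and the estimator degenerates to the exact cartesian computation.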
>>>>>
>>>>> Dimensionality reduction should help; how effective it is will depend
>>>>> on your application and domain. It's worth trying if the direct
>>>>> computation doesn't work.
>>>>>
>>>>> You can also try running KMeans clustering (perhaps after
>>>>> dimensionality reduction) if your goal is to find batches of similar
>>>>> points instead of all pairs above a threshold.
>>>>>
>>>>>
>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am considering
>>>>>> running a matrix factorization to reduce the dimension to say
>>>>>> ~60M x 50 and then run all-pair similarity...
>>>>>>
>>>>>> Did you also try similar ideas and see positive results?
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das
>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Ok...just to make sure: I have RowMatrix[SparseVector] where rows
>>>>>>> are ~60M and columns are ~10M, say with a billion data points...
>>>>>>>
>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>
>>>>>>> I guess for the second one both all-pair and dimsum will run fine...
>>>>>>>
>>>>>>> But for tall and wide, what do you suggest? Can dimsum handle it?
>>>>>>>
>>>>>>> I might need jaccard as well...can I plug that into the PR?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You might want to wait until Wednesday, since the interface will be
>>>>>>>> changing in that PR before Wednesday, probably over the weekend, so
>>>>>>>> that you don't have to redo your code. Your call if you need it
>>>>>>>> before a week.
>>>>>>>> Reza
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das
>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ohh cool....all-pairs brute force is also part of this PR?
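The factorize-then-compare idea can be sketched as follows. Everything here is hypothetical (`LowRankSimilarity` and `project` are made-up names, and the factor vectors would come from something like ALS); it just shows that once sparse 10M-dim rows are projected onto ~50 latent factors, pairwise cosine becomes cheap dense arithmetic:

```scala
object LowRankSimilarity {
  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  def cosine(x: Array[Double], y: Array[Double]): Double =
    dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

  // Project a sparse high-dimensional row onto k latent factors; each
  // factor is itself a sparse vector over the original column space.
  def project(row: Map[Int, Double], factors: Array[Map[Int, Double]]): Array[Double] =
    factors.map(f => row.map { case (i, v) => v * f.getOrElse(i, 0.0) }.sum)
}
```

Whether the similarities in the latent space are faithful enough is the application-dependent part Reza flags above.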
>>>>>>>>> Let me pull it in and test on our dataset...
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>> Deb
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Deb,
>>>>>>>>>>
>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in
>>>>>>>>>> this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>>
>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Reza
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das
>>>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>
>>>>>>>>>>> Have you compared with the brute-force algorithm for similarity
>>>>>>>>>>> computation, with something like the following in Spark?
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>
>>>>>>>>>>> I am adding cosine similarity computation, but I do want to
>>>>>>>>>>> compute all-pair similarities...
>>>>>>>>>>>
>>>>>>>>>>> Note that the data is sparse for me (the data that goes to
>>>>>>>>>>> matrix factorization), so I don't think joining and group-by on
>>>>>>>>>>> (product, product) will be a big issue for me...
>>>>>>>>>>>
>>>>>>>>>>> Does it make sense to add all-pair similarities as well,
>>>>>>>>>>> alongside the dimsum-based similarity?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Deb
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh
>>>>>>>>>>> <r...@databricks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>
>>>>>>>>>>>> There is a PR currently in progress to allow this, via the
>>>>>>>>>>>> sampling scheme described in this paper:
>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>
>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though
>>>>>>>>>>>> it will need refactoring given the recent changes to the matrix
>>>>>>>>>>>> interface in MLlib. You may implement the sampling scheme for
>>>>>>>>>>>> your own app, since it's not much code.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Reza
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li
>>>>>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8
>>>>>>>>>>>>> nodes and every node has 8G memory. The program just stopped
>>>>>>>>>>>>> at a stage for several hours without any further information.
>>>>>>>>>>>>> Maybe I need to find a more efficient way.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash
>>>>>>>>>>>>> <and...@andrewash.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that with
>>>>>>>>>>>>>> itself. Run the similarity score on every pair (1M * 1M =>
>>>>>>>>>>>>>> 1T scores), map to (user, (score, otherUser)), and take the
>>>>>>>>>>>>>> .top(k) for each user.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T
>>>>>>>>>>>>>> pairs though, so it might be worth looking at the literature
>>>>>>>>>>>>>> for recommender systems to see what else is out there.
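Andrew's naive recipe, sketched locally with plain Scala collections. In Spark the scoring step would be a `cartesian` followed by a per-user top-k; `NaiveTopK` and `topK` are made-up names for this sketch:

```scala
object NaiveTopK {
  def cosine(x: Array[Double], y: Array[Double]): Double = {
    def dot(a: Array[Double], b: Array[Double]) =
      a.zip(b).map { case (u, v) => u * v }.sum
    dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
  }

  // Score every user pair, then keep the k most similar other users per
  // user. This is the O(n^2) step that blows up at 1M users (1T pairs).
  def topK(users: Map[String, Array[Double]], k: Int): Map[String, Seq[(String, Double)]] =
    users.map { case (u, vec) =>
      val scored = (users - u).toSeq
        .map { case (other, ovec) => (other, cosine(vec, ovec)) }
        .sortBy(-_._2)
        .take(k)
      u -> scored
    }
}
```

The quadratic pair count, not the per-pair cost, is what makes this stall at Xiaoli's scale; that is exactly the blow-up DIMSUM's sampling is designed to avoid.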
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li
>>>>>>>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one
>>>>>>>>>>>>>>> million users, and I need to compute the similarity between
>>>>>>>>>>>>>>> each pair of users using some user attributes. For each
>>>>>>>>>>>>>>> user, I need to get the top k most similar users. What is
>>>>>>>>>>>>>>> the best way to implement this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
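For set-valued user attributes, one of the measures mentioned upthread, Jaccard similarity, is a natural per-pair score. A minimal sketch over sets of attribute ids (`JaccardSimilarity` is an illustrative name, not an MLlib API):

```scala
object JaccardSimilarity {
  // Jaccard over the sets of nonzero indices of two sparse binary
  // vectors: |A intersect B| / |A union B|.
  def jaccard(a: Set[Int], b: Set[Int]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0
    else (a intersect b).size.toDouble / unionSize
  }
}
```

Any such pairwise function could be plugged into the scoring step of the naive approach, or, per Reza's note, dice/overlap/jaccard may land in MLlib itself in a later PR.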