Cool...can I add loadRowMatrix in your PR ? Thanks. Deb
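For concreteness, a loader along these lines might look like the sketch below. The name loadRowMatrix and the dense, whitespace-separated one-row-per-line text format are assumptions for illustration; MLlib has no such helper at this point in the thread.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Hypothetical loader: one row per line, entries separated by whitespace.
    // Both the name and the text format are assumptions, not an MLlib API.
    def loadRowMatrix(sc: SparkContext, path: String): RowMatrix = {
      val rows = sc.textFile(path).map { line =>
        Vectors.dense(line.trim.split("\\s+").map(_.toDouble))
      }
      new RowMatrix(rows)
    }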
On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

Did you mean to message me instead of Xiangrui?

For TS matrices, dimsum with PositiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible and you'll want to start sampling with dimsum.

It would be helpful to have a loadRowMatrix function, I would use it.

Best,
Reza

On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Xiangrui,

For tall skinny matrices, if I can pass a similarityMeasure to computeGramian, I could re-use the SVD's computeGramian for similarity computation as well...

Do you recommend using this approach for tall skinny matrices, or just using dimsum's routines ?

Right now RowMatrix does not have a loadRowMatrix function like the one available for LabeledPoint...should I add one ? I want to export the matrix out from my stable code and then test dimsum...

Thanks.
Deb

On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:

I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2.

On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Awesome...let me try it out...

Any plans of adding other similarity measures in the future (jaccard is something that will be useful) ? I guess it makes sense to add some similarity measures to mllib...

On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:

Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.

On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:

I looked at the code: similarColumns(Double.posInf) is generating the brute force...

Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation ?

Sorry, I have not read your paper yet. Will read it over the weekend.

On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:

For 60M x 10K, brute force and dimsum thresholding should be fine.

For 60M x 10M, brute force probably won't work depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold.

Dimensionality reduction should help, and how effective it is will depend on your application and domain; it's worth trying if the direct computation doesn't work.

You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.

On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Also for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50 and then run all-pair similarity...

Did you also try similar ideas and see positive results ?
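As an aside on the tall-skinny case Reza describes above: RowMatrix.computeGramianMatrix() already returns G = A^T A, so exact cosine similarity between columns can be read off the Gramian with a local normalization. A minimal sketch, assuming numCols is small enough for G to fit on the driver (the helper name is made up, and all-zero columns would need a guard):

    import org.apache.spark.mllib.linalg.Matrix
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Exact cosine similarity of columns i and j from the Gramian G = A^T A:
    //   cos(i, j) = G(i, j) / (sqrt(G(i, i)) * sqrt(G(j, j)))
    // Only sensible for tall-skinny A, since G is numCols x numCols and lives
    // on the driver. The helper name is made up for illustration.
    def cosineFromGramian(mat: RowMatrix): Array[Array[Double]] = {
      val g: Matrix = mat.computeGramianMatrix()
      val n = g.numCols
      val norms = Array.tabulate(n)(i => math.sqrt(g(i, i)))
      Array.tabulate(n, n)((i, j) => g(i, j) / (norms(i) * norms(j)))
    }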
On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Ok...just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, with a billion data points...

I have another version that's around 60M x ~10K...

I guess for the second one both all-pair and dimsum will run fine...

But for tall and wide, what do you suggest ? Can dimsum handle it ?

I might need jaccard as well...can I plug that into the PR ?

On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:

You might want to wait until Wednesday, since the interface in that PR will be changing before then, probably over the weekend, so that you don't have to redo your code. Your call if you need it within the week.
Reza

On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Ohh cool...all-pairs brute force is also part of this PR ? Let me pull it in and test on our dataset...

Thanks.
Deb

On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778

Your question wasn't entirely clear - does this answer it?

Best,
Reza

On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Reza,

Have you compared with a brute-force algorithm for similarity computation, something like the following in Spark ?

https://github.com/echen/scaldingale

I am adding cosine similarity computation, but I do want to compute all-pair similarities...

Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think joining and grouping by (product, product) will be a big issue for me...

Does it make sense to add all-pair similarities as well, alongside the dimsum-based similarity ?

Thanks.
Deb
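The scaldingale approach linked above is essentially a self-join on users followed by a reduce on item pairs. A rough Spark rendition, purely as a sketch: the (user, (item, rating)) input layout and the allPairsCosine name are assumptions, and collecting the per-item norms to the driver only works for a modest number of items.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Sketch of the join-and-group brute force: self-join ratings on user to
    // enumerate co-occurring item pairs, sum rating products to get the dot
    // product of each item pair, then normalize by the item norms.
    // Input layout (user, (item, rating)) and the function name are assumptions.
    def allPairsCosine(ratings: RDD[(Long, (Int, Double))]): RDD[((Int, Int), Double)] = {
      // Per-item norms, collected to the driver; fine for a modest item count,
      // otherwise broadcast them or join them back in.
      val norms = ratings
        .map { case (_, (item, r)) => (item, r * r) }
        .reduceByKey(_ + _)
        .mapValues(math.sqrt)
        .collectAsMap()

      ratings.join(ratings)                               // item pairs sharing a user
        .filter { case (_, ((i, _), (j, _))) => i < j }   // keep each unordered pair once
        .map { case (_, ((i, ri), (j, rj))) => ((i, j), ri * rj) }
        .reduceByKey(_ + _)                               // dot product per item pair
        .map { case ((i, j), dot) => ((i, j), dot / (norms(i) * norms(j))) }
    }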
On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Xiaoli,

There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.

Best,
Reza

On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi Andrew,

Thanks for your suggestion. I have tried the method. I used 8 nodes and every node has 8G memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.

On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:

The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.

I doubt that you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the recommender-systems literature to see what else is out there.

On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi all,

I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?

Thanks.
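For reference, the naive cartesian-plus-top-k recipe Andrew outlines might be sketched in Spark as follows. The cosine and topKSimilar names are made up for illustration, and, as noted above, at 1M users this generates on the order of 1T pairs, so it is a baseline rather than something to run at that scale.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Sketch of the naive all-pairs top-k: cartesian the user vectors with
    // themselves, score every pair, and keep the k highest-scoring neighbors
    // per user. Helper names are made up; at 1M users this is ~1T pairs.
    def cosine(a: Array[Double], b: Array[Double]): Double = {
      val dot = a.zip(b).map { case (x, y) => x * y }.sum
      val na = math.sqrt(a.map(x => x * x).sum)
      val nb = math.sqrt(b.map(x => x * x).sum)
      if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
    }

    def topKSimilar(users: RDD[(Long, Array[Double])], k: Int): RDD[(Long, Seq[(Double, Long)])] = {
      users.cartesian(users)
        .filter { case ((u, _), (v, _)) => u != v }        // drop self-pairs
        .map { case ((u, ua), (v, va)) => (u, (cosine(ua, va), v)) }
        .groupByKey()                                      // all scores per user
        .mapValues(_.toSeq.sortBy(-_._1).take(k))          // keep the k best
    }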