Better to do it in a PR of your own; it's not sufficiently related to dimsum.
On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Cool... can I add loadRowMatrix in your PR?

Thanks.
Deb

On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

Did you mean to message me instead of Xiangrui?

For TS matrices, dimsum with gamma set to PositiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible and you'll want to start sampling with dimsum.

It would be helpful to have a loadRowMatrix function; I would use it.

Best,
Reza

On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Xiangrui,

For tall skinny matrices, if I could pass a similarityMeasure to computeGramian, I could reuse the SVD's computeGramian for similarity computation as well...

Do you recommend that approach for tall skinny matrices, or just using dimsum's routines?

Right now RowMatrix does not have a loadRowMatrix function like the one available for LabeledPoint... should I add one? I want to export the matrix from my stable code and then test dimsum...

Thanks.
Deb

On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:

I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2.

On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Awesome... let me try it out...

Any plans to add other similarity measures in the future (jaccard is something that will be useful)? I guess it makes sense to add some similarity measures to mllib...
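The Gramian route Reza mentions for tall-and-skinny matrices works because G = AᵀA already contains every pairwise column dot product, so cosine similarities fall out by normalising with the column norms sitting on G's diagonal: cos(i, j) = G(i, j) / sqrt(G(i, i) * G(j, j)). A minimal plain-Scala sketch of that identity (not the MLlib API; gramian and cosineFromGramian are illustrative names):

```scala
// Gramian of a small dense matrix; rows are data points, columns are features.
// g(i)(j) accumulates the dot product of column i and column j.
def gramian(rows: Array[Array[Double]]): Array[Array[Double]] = {
  val n = rows.head.length
  val g = Array.fill(n, n)(0.0)
  for (row <- rows; i <- 0 until n; j <- 0 until n)
    g(i)(j) += row(i) * row(j)
  g
}

// Cosine similarity of columns i and j read straight off the Gramian:
// cos(i, j) = G(i, j) / sqrt(G(i, i) * G(j, j)).
def cosineFromGramian(g: Array[Array[Double]], i: Int, j: Int): Double =
  g(i)(j) / math.sqrt(g(i)(i) * g(j)(j))
```

This is why, for TS matrices, computing the Gramian and computing exact cosine similarities cost essentially the same.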
On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:

Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.

On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:

I looked at the code: similarColumns(Double.posInf) is generating the brute force...

Basically, will dimsum with gamma as PositiveInfinity produce exactly the same result as taking the cartesian product of RDD[(product, vector)] and computing similarities, or will there be some approximation?

Sorry, I have not read your paper yet. Will read it over the weekend.

On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:

For 60M x 10K, brute force and dimsum thresholding should be fine.

For 60M x 10M, brute force probably won't work depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold.

Dimensionality reduction should help, and how effective it is will depend on your application and domain; it's worth trying if the direct computation doesn't work.

You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.

On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Also, for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to say ~60M x 50 and then running all-pair similarity...
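The sampling behind dimsum can be pictured per row: each row's contribution to the (i, j) dot product is kept with a probability that grows with gamma and shrinks with the column norms, and kept values are rescaled so the estimate stays unbiased. The sketch below collapses the paper's scheme into a single keep probability and is not the MLlib implementation; it only illustrates why gamma = Double.PositiveInfinity degenerates to exact brute force, as stated above:

```scala
import scala.util.Random

// Simplified DIMSUM-style contribution of one row to the (i, j) similarity
// estimate: keep the product ai * aj with probability
// p = min(1, gamma / (normI * normJ)) and rescale by 1/p for unbiasedness.
// With gamma = Double.PositiveInfinity, p = 1, every contribution is kept
// unscaled, and the sum over rows is the exact brute-force dot product.
def sampledContribution(ai: Double, aj: Double,
                        normI: Double, normJ: Double,
                        gamma: Double, rng: Random): Double = {
  val p = math.min(1.0, gamma / (normI * normJ))
  if (rng.nextDouble() < p) (ai * aj) / p else 0.0
}
```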
Did you also try similar ideas and see positive results?

On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:

OK... just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, say with a billion data points...

I have another version that's around 60M x ~10K...

I guess for the second one both all-pairs and dimsum will run fine...

But for tall and wide, what do you suggest? Can dimsum handle it?

I might need jaccard as well... can I plug that into the PR?

On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:

You might want to wait until Wednesday, since the interface in that PR will be changing before Wednesday (probably over the weekend), so that you don't have to redo your code. Your call if you need it before then.
Reza

On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Ohh cool... all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset...

Thanks.
Deb

On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778

Your question wasn't entirely clear; does this answer it?
Best,
Reza

On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Reza,

Have you compared with a brute-force algorithm for similarity computation, something like the following in Spark?

https://github.com/echen/scaldingale

I am adding cosine similarity computation, but I do want to compute all-pair similarities...

Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think a join and group-by on (product, product) will be a big issue for me...

Does it make sense to add all-pair similarities as well, alongside dimsum-based similarity?

Thanks.
Deb

On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Xiaoli,

There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.
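The scaldingale-style brute force linked above boils down to: flatten each sparse vector into (item, feature, value) entries, co-group on the shared feature, emit one product per co-occurring pair, and sum by (item, item) key. That is exactly the join and group-by on (product, product) mentioned in the thread, and it stays cheap when the data is sparse. A plain-collections sketch of the same shape (on Spark this would be an RDD join; allPairDotProducts is an illustrative name):

```scala
// Sparse input as (itemId, featureId, value) triples. Grouping by featureId
// plays the role of the join on the shared coordinate; each group emits one
// contribution per co-occurring item pair, and summing by (item, item) key
// yields every pairwise dot product -- the brute-force all-pairs computation.
def allPairDotProducts(entries: Seq[(String, Int, Double)]): Map[(String, String), Double] =
  entries.groupBy(_._2)                 // co-group entries sharing a feature
    .values
    .flatMap { hits =>
      for ((a, _, va) <- hits; (b, _, vb) <- hits if a < b)
        yield ((a, b), va * vb)         // one contribution per co-occurrence
    }
    .groupBy(_._1)
    .map { case (pair, vs) => pair -> vs.map(_._2).sum }
```

Only features where two items co-occur ever generate work, which is why sparsity keeps the group-by tractable.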
Best,
Reza

On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi Andrew,

Thanks for your suggestion. I have tried the method. I used 8 nodes and every node has 8G memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.

On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:

The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.

I doubt you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the recommender-systems literature to see what else is out there.

On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi all,

I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
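Andrew's recipe (score every pair, key by user, keep each user's k best) can be sketched on plain collections; on Spark the same shape is a cartesian product, a map to (user, (score, otherUser)), and a per-key top-k. cosine and topKSimilar are illustrative names, and this is the O(n^2) naive version, so it shows the shape rather than something that scales to 1M users:

```scala
// Naive top-k most similar users: compute cosine similarity for every pair,
// then for each user sort its candidates by score and keep the best k.
def topKSimilar(users: Map[String, Array[Double]], k: Int): Map[String, Seq[(Double, String)]] = {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }
  users.map { case (u, v) =>
    // "cartesian product" step: score u against every other user
    val scored = for ((o, w) <- users.toSeq if o != u) yield (cosine(v, w), o)
    u -> scored.sortBy(-_._1).take(k)   // per-user top-k by score
  }
}
```

The 1T-scores warning above applies unchanged: with 1M users this loop does ~10^12 cosine evaluations, which is why sampling or thresholding (dimsum) is the practical route.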
Thanks.