Hi Reza,

In similarColumns, it seems that with cosine similarity I also need other numbers, such as intersection, Jaccard, and other measures... Right now I modified the code to generate Jaccard, but I had to run it twice due to the design of RowMatrix / CoordinateMatrix... I feel we should modify RowMatrix and CoordinateMatrix to be templated on the value type... Are you considering this in your design?

Thanks.
Deb

On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:

Better to do it in a PR of your own; it's not sufficiently related to dimsum.

On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Cool... can I add loadRowMatrix in your PR?

Thanks.
Deb

On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

Did you mean to message me instead of Xiangrui?

For TS matrices, dimsum with PositiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible and you'll want to start sampling with dimsum.

It would be helpful to have a loadRowMatrix function; I would use it.

Best,
Reza

On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Xiangrui,

For tall skinny matrices, if I can pass a similarityMeasure to computeGrammian, I could re-use the SVD's computeGrammian for the similarity computation as well...

Do you recommend using this approach for tall skinny matrices, or just using dimsum's routines?

Right now RowMatrix does not have a loadRowMatrix function like the one available for LabeledPoint... should I add one? I want to export the matrix out from my stable code and then test dimsum...

Thanks.
Deb

On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:

I will add dice, overlap, and Jaccard similarity in a future PR, probably still for 1.2.

On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Awesome... let me try it out...

Any plans for other similarity measures in the future (Jaccard is something that will be useful)? I guess it makes sense to add some similarity measures to MLlib...

On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:

Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.

On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:

I looked at the code: similarColumns(Double.posInf) is generating the brute force...

Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation?

Sorry, I have not read your paper yet. I will read it over the weekend.

On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:

For 60M x 10K, brute force and dimsum thresholding should be fine.

For 60M x 10M, brute force probably won't work depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold.
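[Aside: Reza's point above, that gamma = PositiveInfinity makes dimsum exact, follows from the sampling rule in the dimsum paper: each per-row contribution is kept with a probability capped at 1, so an infinite gamma keeps everything. Below is a toy single-machine sketch of that idea in plain Python. It is an illustrative importance-sampling variant, not the Spark implementation or the paper's exact estimator; the function name and data layout are made up.]

```python
import math
import random

def dimsum_cosine(rows, gamma, seed=0):
    """DIMSUM-style sampled column cosine similarities over dense rows.

    Each row's entry pair (i, j) is kept with probability
    min(1, gamma / (||c_i|| * ||c_j||)) and scaled by 1/p to stay unbiased,
    so gamma = math.inf keeps every pair and yields the exact cosine."""
    rng = random.Random(seed)
    n = len(rows[0])
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = {}
    for r in rows:
        for i in range(n):
            if r[i] == 0:
                continue
            for j in range(i + 1, n):
                if r[j] == 0:
                    continue
                p = min(1.0, gamma / (norms[i] * norms[j]))
                if rng.random() < p:
                    contrib = r[i] * r[j] / (p * norms[i] * norms[j])
                    sims[(i, j)] = sims.get((i, j), 0.0) + contrib
    return sims

rows = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
exact = dimsum_cosine(rows, math.inf)  # no sampling: exact cosine similarities
```

With gamma = math.inf, p is always 1, so the loop degenerates to the brute-force sum, which is the behavior discussed above.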
Dimensionality reduction should help; how effective it is will depend on your application and domain, and it's worth trying if the direct computation doesn't work.

You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.

On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Also, for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50, and then running all-pair similarity...

Did you also try similar ideas and see positive results?

On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:

OK... just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are 10M, say, with a billion data points...

I have another version that's around 60M x ~10K...

I guess for the second one both all-pairs and dimsum will run fine...

But for tall and wide, what do you suggest? Can dimsum handle it?

I might need Jaccard as well... can I plug that into the PR?

On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:

You might want to wait until Wednesday, since the interface will be changing in that PR before Wednesday (probably over the weekend), so that you don't have to redo your code. Your call if you need it sooner.

Reza

On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Ohh cool... all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset...

Thanks.
Deb

On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778

Your question wasn't entirely clear - does this answer it?

Best,
Reza

On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Reza,

Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark?

https://github.com/echen/scaldingale

I am adding cosine similarity computation, but I do want to compute all-pair similarities...

Note that the data is sparse for me (it is the data that goes into matrix factorization), so I don't think joining and group-by on (product, product) will be a big issue for me...

Does it make sense to add all-pair similarities as well, alongside dimsum-based similarity?

Thanks.
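[Aside: the intersection, Jaccard, and cosine numbers requested in this thread can all be derived from the same per-column-pair counts when the data is 0/1, which is why computing them in one pass (rather than running twice) is attractive. A minimal single-machine sketch in plain Python, not the RowMatrix API; the data layout and function name are made up for illustration.]

```python
from itertools import combinations
from math import sqrt

def column_similarities(columns):
    """One pass over sparse 0/1 columns ({name: set of row indices}):
    for every column pair, return (intersection, jaccard, cosine).
    For binary data the dot product equals the intersection size,
    so cosine falls out of the same counts that Jaccard needs."""
    out = {}
    for (a, rows_a), (b, rows_b) in combinations(columns.items(), 2):
        inter = len(rows_a & rows_b)  # co-occurrence count
        union = len(rows_a | rows_b)
        jaccard = inter / union if union else 0.0
        cosine = inter / sqrt(len(rows_a) * len(rows_b)) if rows_a and rows_b else 0.0
        out[(a, b)] = (inter, jaccard, cosine)
    return out

cols = {"x": {0, 1, 2}, "y": {1, 2, 3}, "z": {4}}
print(column_similarities(cols)[("x", "y")])  # intersection 2, jaccard 0.5, cosine ~0.667
```

The same co-occurrence counts would also cover dice and overlap, which is consistent with adding them together in one PR.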
Deb

On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Xiaoli,

There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.

Best,
Reza

On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi Andrew,

Thanks for your suggestion. I have tried the method. I used 8 nodes, and every node has 8G of memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.

On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:

The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.

I doubt that you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the literature on recommender systems to see what else is out there.

On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi all,

I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?

Thanks.
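[Aside: Andrew's naive recipe above, score every pair and keep the top k per user, can be sketched on a single machine as follows. This is plain Python rather than Spark RDDs; the cosine choice and data layout are illustrative.]

```python
import heapq
from math import sqrt

def top_k_similar(users, k):
    """users: {user_id: feature vector}. For each user, score every other
    user by cosine similarity and keep the top-k (score, otherUser) pairs,
    mirroring the cartesian-then-top(k) recipe on a single machine."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sqrt(sum(a * a for a in u))
        nv = sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return {
        uid: heapq.nlargest(k, [(cosine(vec, users[oid]), oid)
                                for oid in users if oid != uid])
        for uid, vec in users.items()
    }

users = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
print(top_k_similar(users, 1)["a"])  # "b" is a's nearest neighbour (score ~0.707)
```

The scoring loop is quadratic in the number of users, which is exactly the 1M * 1M => 1T blow-up Andrew warns about; the dimsum thresholding discussed earlier in the thread exists to avoid materializing those pairs.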