Hi Reza,

Have you tested whether different runs of the algorithm produce different
similarities (basically, whether the algorithm is deterministic)?

This number does not look like a monoid aggregation:

iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))

I am noticing some weird behavior, as different runs are changing the
results... Also, can columnMagnitudes produce non-deterministic results?

Thanks.
Deb

On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Deb,
>
> I am not templating RowMatrix/CoordinateMatrix since that would be a big
> deviation from the PR. We can add jaccard and other similarity measures
> in later PRs.
>
> In the meantime, you can un-normalize the cosine similarities to get the
> dot product, and then compute the other similarity measures from the dot
> product.
>
> Best,
> Reza
>
> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Hi Reza,
>>
>> In similarColumns, it seems that with cosine similarity I also need
>> other numbers such as intersection, jaccard and other measures...
>>
>> Right now I modified the code to generate jaccard, but I had to run it
>> twice due to the design of RowMatrix / CoordinateMatrix... I feel we
>> should modify RowMatrix and CoordinateMatrix to be templated on the
>> value...
>>
>> Are you considering this in your design?
>>
>> Thanks.
>> Deb
>>
>> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Better to do it in a PR of your own; it's not sufficiently related to
>>> dimsum.
>>>
>>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> Cool... can I add loadRowMatrix in your PR?
>>>>
>>>> Thanks.
>>>> Deb
>>>>
>>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>>
>>>>> Hi Deb,
>>>>>
>>>>> Did you mean to message me instead of Xiangrui?
>>>>>
>>>>> For TS (tall-skinny) matrices, dimsum with PositiveInfinity and
>>>>> computeGramian have the same cost, so you can do either one.
>>>>> For dense matrices with, say, 1M columns this won't be computationally
>>>>> feasible and you'll want to start sampling with dimsum.
>>>>>
>>>>> It would be helpful to have a loadRowMatrix function; I would use it.
>>>>>
>>>>> Best,
>>>>> Reza
>>>>>
>>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das
>>>>> <debasish.da...@gmail.com> wrote:
>>>>>
>>>>>> Hi Xiangrui,
>>>>>>
>>>>>> For tall skinny matrices, if I can pass a similarityMeasure to
>>>>>> computeGrammian, I could re-use the SVD's computeGrammian for
>>>>>> similarity computation as well...
>>>>>>
>>>>>> Do you recommend using this approach for tall skinny matrices, or
>>>>>> just using the dimsum routines?
>>>>>>
>>>>>> Right now RowMatrix does not have a loadRowMatrix function like the
>>>>>> one available in LabeledPoint... should I add one? I want to export
>>>>>> the matrix out from my stable code and then test dimsum...
>>>>>>
>>>>>> Thanks.
>>>>>> Deb
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I will add dice, overlap, and jaccard similarity in a future PR,
>>>>>>> probably still for 1.2.
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das
>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Awesome... let me try it out...
>>>>>>>>
>>>>>>>> Any plans of putting in other similarity measures in the future
>>>>>>>> (jaccard is something that will be useful)? I guess it makes sense
>>>>>>>> to add some similarity measures to mllib...
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, you're right: calling dimsum with gamma as PositiveInfinity
>>>>>>>>> turns it into the usual brute-force algorithm for cosine
>>>>>>>>> similarity; there is no sampling. This is by design.
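[Editor's note: as a sanity check of what "gamma as PositiveInfinity" computes, here is a minimal plain-Python sketch of the brute-force column cosine similarity that dimsum reduces to when nothing is sampled. The matrix values are made up for illustration, and this is not the MLlib API; the real computation runs distributed over an RDD of rows.]

```python
import math

# Toy tall-skinny matrix as a list of rows (3 columns); values are made up.
rows = [
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [2.0, 0.0, 1.0],
]
n_cols = 3

# Column magnitudes, playing the role of colMags in the thread's expression.
col_mags = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n_cols)]

# With gamma = PositiveInfinity no pair is sampled away: every co-occurring
# (i, j) entry in each row contributes iVal * jVal, giving exact dot products.
dots = {}
for r in rows:
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            if r[i] != 0.0 and r[j] != 0.0:
                dots[(i, j)] = dots.get((i, j), 0.0) + r[i] * r[j]

# Normalize by the column magnitudes to get cosine similarities.
cosine = {p: d / (col_mags[p[0]] * col_mags[p[1]]) for p, d in dots.items()}
```

With a finite gamma, dimsum instead samples the per-row contributions, which is why different runs can produce slightly different similarities.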
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das
>>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I looked at the code: similarColumns(Double.posInf) is generating
>>>>>>>>>> the brute force...
>>>>>>>>>>
>>>>>>>>>> Basically, will dimsum with gamma as PositiveInfinity produce
>>>>>>>>>> exactly the same result as doing cartesian products of
>>>>>>>>>> RDD[(product, vector)] and computing similarities, or will there
>>>>>>>>>> be some approximation?
>>>>>>>>>>
>>>>>>>>>> Sorry, I have not read your paper yet. I will read it over the
>>>>>>>>>> weekend.
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> For 60M x 10K, brute force and dimsum thresholding should be
>>>>>>>>>>> fine.
>>>>>>>>>>>
>>>>>>>>>>> For 60M x 10M, brute force probably won't work, depending on the
>>>>>>>>>>> cluster's power, and dimsum thresholding should work with an
>>>>>>>>>>> appropriate threshold.
>>>>>>>>>>>
>>>>>>>>>>> Dimensionality reduction should help, and how effective it is
>>>>>>>>>>> will depend on your application and domain; it's worth trying if
>>>>>>>>>>> the direct computation doesn't work.
>>>>>>>>>>>
>>>>>>>>>>> You can also try running KMeans clustering (perhaps after
>>>>>>>>>>> dimensionality reduction) if your goal is to find batches of
>>>>>>>>>>> similar points instead of all pairs above a threshold.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das
>>>>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am
>>>>>>>>>>>> considering running a matrix factorization to reduce the
>>>>>>>>>>>> dimension to, say, ~60M x 50 and then running all-pair
>>>>>>>>>>>> similarity...
>>>>>>>>>>>>
>>>>>>>>>>>> Did you also try similar ideas and see positive results?
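[Editor's note: Reza's tip earlier in the thread — un-normalize the cosine similarities to recover the dot product, then compute other measures from it — can be sketched in plain Python. For 0/1 columns the dot product is the intersection size and the squared column magnitude is the number of nonzeros, which gives jaccard directly. The column values below are made up; this is illustrative only, not the MLlib API.]

```python
import math

# Two toy 0/1 columns as sparse {row: value} maps (values are made up).
col_i = {0: 1.0, 2: 1.0, 5: 1.0}
col_j = {0: 1.0, 5: 1.0, 7: 1.0}

def dot(a, b):
    return sum(v * b[k] for k, v in a.items() if k in b)

def magnitude(a):
    return math.sqrt(sum(v * v for v in a.values()))

# Cosine similarity, as the all-pairs computation would report it.
cos = dot(col_i, col_j) / (magnitude(col_i) * magnitude(col_j))

# Un-normalize: multiply the magnitudes back in to recover the dot product.
d = cos * magnitude(col_i) * magnitude(col_j)

# For 0/1 vectors: |A ∩ B| = dot product, |A| = squared magnitude, so
# jaccard = |A ∩ B| / |A ∪ B| = d / (|A| + |B| - d).
jaccard = d / (magnitude(col_i) ** 2 + magnitude(col_j) ** 2 - d)
```

Dice and overlap follow from the same three quantities (dot product and the two magnitudes), which is why a single pass suffices once the dot products are recovered.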
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das
>>>>>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ok... just to make sure: I have a RowMatrix[SparseVector]
>>>>>>>>>>>>> where rows are ~60M and columns are ~10M, with, say, a
>>>>>>>>>>>>> billion data points...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess for the second one both all-pair and dimsum will run
>>>>>>>>>>>>> fine...
>>>>>>>>>>>>>
>>>>>>>>>>>>> But for tall and wide, what do you suggest? Can dimsum handle
>>>>>>>>>>>>> it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I might need jaccard as well... can I plug that into the PR?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You might want to wait until Wednesday, since the interface
>>>>>>>>>>>>>> will be changing in that PR before Wednesday (probably over
>>>>>>>>>>>>>> the weekend), so that you don't have to redo your code. Your
>>>>>>>>>>>>>> call if you need it before then.
>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das
>>>>>>>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ohh cool... all-pairs brute force is also part of this PR?
>>>>>>>>>>>>>>> Let me pull it in and test it on our dataset...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via
>>>>>>>>>>>>>>>> dimsum in this PR:
>>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das
>>>>>>>>>>>>>>>> <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Have you compared with the brute-force algorithm for
>>>>>>>>>>>>>>>>> similarity computation, with something like the following
>>>>>>>>>>>>>>>>> in Spark?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am adding cosine similarity computation, but I do want
>>>>>>>>>>>>>>>>> to compute all-pair similarities...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes to
>>>>>>>>>>>>>>>>> matrix factorization), so I don't think joining and
>>>>>>>>>>>>>>>>> group-by on (product, product) will be a big issue for
>>>>>>>>>>>>>>>>> me...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Does it make sense to add all-pair similarities as well,
>>>>>>>>>>>>>>>>> alongside dimsum-based similarity?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via
>>>>>>>>>>>>>>>>>> the sampling scheme described in this paper:
>>>>>>>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336,
>>>>>>>>>>>>>>>>>> though it will need refactoring given the recent changes
>>>>>>>>>>>>>>>>>> to the matrix interface in MLlib. You may implement the
>>>>>>>>>>>>>>>>>> sampling scheme for your own app, since it's not much
>>>>>>>>>>>>>>>>>> code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li
>>>>>>>>>>>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I
>>>>>>>>>>>>>>>>>>> used 8 nodes, and every node has 8G of memory. The
>>>>>>>>>>>>>>>>>>> program just stopped at a stage for several hours
>>>>>>>>>>>>>>>>>>> without any further information. Maybe I need to find a
>>>>>>>>>>>>>>>>>>> more efficient way.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash
>>>>>>>>>>>>>>>>>>> <and...@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that
>>>>>>>>>>>>>>>>>>>> with itself. Run the similarity score on every pair
>>>>>>>>>>>>>>>>>>>> (1M * 1M => 1T scores), map to (user, (score,
>>>>>>>>>>>>>>>>>>>> otherUser)) and take the .top(k) for each user.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with
>>>>>>>>>>>>>>>>>>>> the 1T pairs though, so it might be worth looking at
>>>>>>>>>>>>>>>>>>>> the literature on recommender systems to see what else
>>>>>>>>>>>>>>>>>>>> is out there.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li
>>>>>>>>>>>>>>>>>>>> <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have
>>>>>>>>>>>>>>>>>>>>> one million users. I need to compute the similarity
>>>>>>>>>>>>>>>>>>>>> between each pair of users using some of the users'
>>>>>>>>>>>>>>>>>>>>> attributes.
>>>>>>>>>>>>>>>>>>>>> For each user, I need to get the top k most similar
>>>>>>>>>>>>>>>>>>>>> users. What is the best way to implement this?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks.
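[Editor's note: the naive approach Andrew describes in this thread — score every pair, then keep the top k per user — can be shown locally in plain Python, here with jaccard on attribute sets. The users and attributes below are made up for illustration; an RDD version would use cartesian plus a per-key top-k aggregation instead.]

```python
import heapq
from itertools import combinations

# Made-up users with attribute sets; similarity here is jaccard.
users = {
    "u1": {"a", "b", "c"},
    "u2": {"b", "c"},
    "u3": {"c", "d"},
    "u4": {"x", "y"},
}

def jaccard(a, b):
    inter = len(a & b)
    return inter / len(a | b) if inter else 0.0

# Score every pair once and record it for both users (the local analogue of
# cartesian + similarity score), then keep the top k per user.
k = 2
scores = {u: [] for u in users}
for u, v in combinations(users, 2):
    s = jaccard(users[u], users[v])
    scores[u].append((s, v))
    scores[v].append((s, u))

top_k = {u: heapq.nlargest(k, pairs) for u, pairs in scores.items()}
```

The quadratic pair loop is exactly what makes the 1M-user case (1T scores) infeasible, which is what the sampling in the dimsum paper is designed to avoid.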