Hi Deb,

I am currently seeding the algorithm to be pseudo-random; this is an issue being addressed in the PR. If you pull the current version it will be deterministic, but potentially not pseudo-random. The PR will be updated today.

Best,
Reza
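(Aside, since determinism vs. pseudo-randomness keeps coming up in this thread: the usual way to get both is to derive each row's or partition's seed deterministically from a fixed base seed. The following is only a generic plain-Python sketch of that idea, not the PR's actual code; `sample_row`, `base_seed`, and `row_index` are names I made up for illustration.)

```python
import random

def sample_row(row, row_index, base_seed=42, p=0.5):
    """Sample entries of one row, seeding the RNG from (base_seed, row_index).

    Every run produces the same sample (deterministic), while different
    rows still draw from different pseudo-random streams.
    """
    rng = random.Random(base_seed * 1_000_003 + row_index)  # per-row seed
    return [x for x in row if rng.random() < p]

# Two runs over the same data give identical samples:
row = [1.0, 2.0, 3.0, 4.0, 5.0]
assert sample_row(row, 7) == sample_row(row, 7)
```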
On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> Hi Reza,
>
> Have you tested whether different runs of the algorithm produce different similarities (basically, whether the algorithm is deterministic)?
>
> This number does not look like a Monoid aggregation... iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))
>
> I am noticing some weird behavior, as different runs are changing the results...
>
> Also, can columnMagnitudes produce non-deterministic results?
>
> Thanks.
> Deb
>
> On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Hi Deb,
>>
>> I am not templating RowMatrix/CoordinateMatrix, since that would be a big deviation from the PR. We can add Jaccard and other similarity measures in later PRs.
>>
>> In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product.
>>
>> Best,
>> Reza
>>
>> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> Hi Reza,
>>>
>>> In similarColumns, it seems that with cosine similarity I also need other numbers such as intersection, Jaccard, and other measures...
>>>
>>> Right now I have modified the code to generate Jaccard, but I had to run it twice due to the design of RowMatrix / CoordinateMatrix... I feel we should modify RowMatrix and CoordinateMatrix to be templated on the value...
>>>
>>> Are you considering this in your design?
>>>
>>> Thanks.
>>> Deb
>>>
>>> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Better to do it in a PR of your own; it's not sufficiently related to DIMSUM.
>>>>
>>>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Cool... can I add loadRowMatrix in your PR?
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>
>>>>>> Hi Deb,
>>>>>>
>>>>>> Did you mean to message me instead of Xiangrui?
>>>>>>
>>>>>> For TS matrices, DIMSUM with positiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible, and you'll want to start sampling with DIMSUM.
>>>>>>
>>>>>> It would be helpful to have a loadRowMatrix function; I would use it.
>>>>>>
>>>>>> Best,
>>>>>> Reza
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Xiangrui,
>>>>>>>
>>>>>>> For tall skinny matrices, if I could pass a similarityMeasure to computeGramian, I could re-use the SVD's computeGramian for similarity computation as well...
>>>>>>>
>>>>>>> Do you recommend using this approach for tall skinny matrices, or just using DIMSUM's routines?
>>>>>>>
>>>>>>> Right now RowMatrix does not have a loadRowMatrix function like the one available in LabeledPoint... should I add one? I want to export the matrix out from my stable code and then test DIMSUM...
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Deb
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>
>>>>>>>> I will add Dice, overlap, and Jaccard similarity in a future PR, probably still for 1.2.
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Awesome... let me try it out...
>>>>>>>>>
>>>>>>>>> Any plans of putting in other similarity measures in the future (Jaccard is something that will be useful)? I guess it makes sense to add some similarity measures in MLlib...
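(Editor's note on the un-normalization trick discussed above: once you have the dot products and the column magnitudes, Jaccard (for 0/1 data), Dice, and overlap all fall out with no extra pass over the data. A plain-Python sketch of the algebra follows; this is not the MLlib API, and the function names are made up for illustration.)

```python
import math

def unnormalize(cosine, mag_i, mag_j):
    """Recover the dot product from a cosine similarity and column magnitudes."""
    return cosine * mag_i * mag_j

def jaccard(dot, mag_i, mag_j):
    """Jaccard for 0/1 columns: |A & B| / |A | B|, since for a binary
    column dot = |A & B| and mag**2 = |A|."""
    return dot / (mag_i**2 + mag_j**2 - dot)

def dice(dot, mag_i, mag_j):
    return 2.0 * dot / (mag_i**2 + mag_j**2)

def overlap(dot, mag_i, mag_j):
    return dot / min(mag_i**2, mag_j**2)

# Two binary columns: A = {0,1,2}, B = {1,2,3,4} -> |A & B| = 2, |A | B| = 5
a = [1, 1, 1, 0, 0]
b = [0, 1, 1, 1, 1]
dot = sum(x * y for x, y in zip(a, b))                 # 2
mag_a, mag_b = math.sqrt(sum(a)), math.sqrt(sum(b))
cos = dot / (mag_a * mag_b)
assert abs(unnormalize(cos, mag_a, mag_b) - dot) < 1e-12
assert abs(jaccard(dot, mag_a, mag_b) - 2 / 5) < 1e-12
```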
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, you're right: calling DIMSUM with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I looked at the code: similarColumns(Double.posInf) is generating the brute force...
>>>>>>>>>>>
>>>>>>>>>>> Basically, will DIMSUM with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation?
>>>>>>>>>>>
>>>>>>>>>>> Sorry, I have not read your paper yet. Will read it over the weekend.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For 60M x 10K, brute force and DIMSUM thresholding should be fine.
>>>>>>>>>>>>
>>>>>>>>>>>> For 60M x 10M, brute force probably won't work depending on the cluster's power, and DIMSUM thresholding should work with an appropriate threshold.
>>>>>>>>>>>>
>>>>>>>>>>>> Dimensionality reduction should help, and how effective it is will depend on your application and domain; it's worth trying if the direct computation doesn't work.
>>>>>>>>>>>>
>>>>>>>>>>>> You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.
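(Editor's note: to make the "gamma as PositiveInfinity means no sampling" point concrete, here is a small single-threaded plain-Python sketch of the DIMSUM estimator, built from the per-row formula quoted earlier in the thread, iVal * jVal / (min(sg, colMags(i)) * min(sg, colMags(j))) with sg = sqrt(gamma). It is a simplification for illustration, not the PR's implementation.)

```python
import math
import random
from collections import defaultdict

def dimsum(rows, gamma, seed=0):
    """Estimate all-pairs column cosine similarities via DIMSUM sampling.

    Each co-occurrence (i, j) within a row is kept with probability
    min(1, sg/||c_i||) * min(1, sg/||c_j||), where sg = sqrt(gamma), and
    contributes row[i] * row[j] / (min(sg, ||c_i||) * min(sg, ||c_j||)).
    """
    rng = random.Random(seed)
    n = len(rows[0])
    col_mags = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sg = math.sqrt(gamma)
    sims = defaultdict(float)
    for row in rows:
        for i in range(n):
            for j in range(i + 1, n):
                p = min(1.0, sg / col_mags[i]) * min(1.0, sg / col_mags[j])
                if rng.random() < p:
                    sims[(i, j)] += row[i] * row[j] / (
                        min(sg, col_mags[i]) * min(sg, col_mags[j]))
    return dict(sims)

rows = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]
# With gamma = +inf every pair is kept with probability 1 and the sum
# telescopes to the exact cosine similarity, i.e. brute force:
exact = dimsum(rows, float("inf"))
```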
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50, and then running all-pairs similarity...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you also try similar ideas and see positive results?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK... just to make sure: I have RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, with say a billion data points...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess for the second one both all-pairs and DIMSUM will run fine...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But for tall and wide, what do you suggest? Can DIMSUM handle it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I might need Jaccard as well... can I plug that into the PR?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You might want to wait until Wednesday, since the interface will be changing in that PR before Wednesday, probably over the weekend, so that you don't have to redo your code. Your call if you need it sooner than that.
>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ohh cool... all-pairs brute force is also part of this PR?
>>>>>>>>>>>>>>>> Let me pull it in and test on our dataset...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via DIMSUM in this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark? https://github.com/echen/scaldingale
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am adding cosine similarity computation, but I do want to compute all-pairs similarities...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think joining and group-by on (product, product) will be a big issue for me...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Does it make sense to add all-pairs similarities as well, alongside the DIMSUM-based similarity?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes, and every node has 8G memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find out a more efficient way.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the literature on recommender systems to see what else is out there.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some user attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks.
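(Editor's note: the naive cartesian approach Andrew describes - score every pair, map to (user, (score, otherUser)), take the top k per user - can be sketched in a few lines of plain Python. This shows only the shape of the computation, not Spark code, and the names are made up for illustration.)

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def top_k_similar(users, k):
    """Brute force: score every (user, otherUser) pair, keep the top k per user."""
    out = {}
    for name, vec in users.items():
        scores = [(cosine(vec, other_vec), other)
                  for other, other_vec in users.items() if other != name]
        out[name] = heapq.nlargest(k, scores)  # (score, otherUser), best first
    return out

users = {
    "a": [1.0, 0.0, 1.0],
    "b": [1.0, 0.0, 0.9],
    "c": [0.0, 1.0, 0.0],
}
top = top_k_similar(users, k=1)
assert top["a"][0][1] == "b"   # b is the most similar user to a
```

The per-user heap keeps memory at O(k) per user, but the pair count is still quadratic, which is exactly why this stalls at 1M users (1T pairs) and why sampling schemes like DIMSUM exist.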