Hi Reza,

Have you tested whether different runs of the algorithm produce different
similarities (basically, whether the algorithm is deterministic)?

This number does not look like a Monoid aggregation:
iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))
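
To make the question concrete, here is how I read the per-entry term (a
sketch only; the names iVal, jVal, magI/magJ, and sg mirror the snippet
above and are not the PR's actual code):

// Sketch: the contribution of one co-occurring entry pair to the
// (i, j) similarity. Summing Doubles is itself a monoid, but each
// term is pre-scaled by min(sg, colMag), which is what prompts the
// question about the aggregation as a whole.
def contribution(iVal: Double, jVal: Double,
                 magI: Double, magJ: Double, sg: Double): Double =
  iVal * jVal / (math.min(sg, magI) * math.min(sg, magJ))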

I am noticing some odd behavior: different runs are changing the
results...

Also, can columnMagnitudes produce non-deterministic results?

Thanks.

Deb

On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Deb,
>
> I am not templating RowMatrix/CoordinateMatrix since that would be a big
> deviation from the PR. We can add jaccard and other similarity measures in
> later PRs.
>
> In the meantime, you can un-normalize the cosine similarities to get the
> dot product, and then compute the other similarity measures from the dot
> product.
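>
> A rough sketch of that un-normalization (assuming the column magnitudes
> are kept around; for 0/1 data the squared magnitude is the set size, so
> jaccard falls out of the dot product):
>
> // Un-normalize a cosine entry back to a dot product.
> def dotFromCosine(cos: Double, magI: Double, magJ: Double): Double =
>   cos * magI * magJ
>
> // For binary (0/1) columns: dot = |A intersect B| and mag^2 = |A|,
> // so this is |A intersect B| / |A union B|.
> def jaccardFromDot(dot: Double, magI: Double, magJ: Double): Double =
>   dot / (magI * magI + magJ * magJ - dot)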
>
> Best,
> Reza
>
>
> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Hi Reza,
>>
>> In similarColumns, it seems that besides cosine similarity I also need
>> other numbers such as the intersection size, jaccard, and other
>> measures...
>>
>> Right now I have modified the code to generate jaccard, but I had to
>> run it twice due to the design of RowMatrix / CoordinateMatrix...I feel
>> we should modify RowMatrix and CoordinateMatrix to be templated on the
>> value type...
>>
>> Are you considering this in your design?
>>
>> Thanks.
>> Deb
>>
>>
>> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Better to do it in a PR of your own; it's not sufficiently related to
>>> dimsum.
>>>
>>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> Cool...can I add loadRowMatrix in your PR?
>>>>
>>>> Thanks.
>>>> Deb
>>>>
>>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>>
>>>>> Hi Deb,
>>>>>
>>>>> Did you mean to message me instead of Xiangrui?
>>>>>
>>>>> For tall-and-skinny (TS) matrices, dimsum with gamma set to
>>>>> PositiveInfinity and computeGramian have the same cost, so you can do
>>>>> either one. For dense matrices with, say, 1M columns this won't be
>>>>> computationally feasible, and you'll want to start sampling with
>>>>> dimsum.
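>>>>>
>>>>> Concretely, both of these are exact on a tall-and-skinny RowMatrix
>>>>> (a sketch; similarColumns(gamma) follows the naming in this thread
>>>>> and the PR, and may still change):
>>>>>
>>>>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>>>>
>>>>> def exactCosines(mat: RowMatrix) = {
>>>>>   // Gramian: all column dot products; divide by the column
>>>>>   // magnitudes to get cosines.
>>>>>   val gram = mat.computeGramianMatrix()
>>>>>   // Dimsum with gamma = +inf samples nothing, so it is exact too.
>>>>>   val sims = mat.similarColumns(Double.PositiveInfinity)
>>>>>   (gram, sims)
>>>>> }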
>>>>>
>>>>> It would be helpful to have a loadRowMatrix function; I would use it.
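>>>>>
>>>>> (A hypothetical loader, assuming one space-separated dense row per
>>>>> line; the name and the file format are illustrative only:)
>>>>>
>>>>> import org.apache.spark.SparkContext
>>>>> import org.apache.spark.mllib.linalg.Vectors
>>>>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>>>>
>>>>> // Parse each line into a dense vector and wrap the RDD.
>>>>> def loadRowMatrix(sc: SparkContext, path: String): RowMatrix =
>>>>>   new RowMatrix(sc.textFile(path).map { line =>
>>>>>     Vectors.dense(line.split(' ').map(_.toDouble))
>>>>>   })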
>>>>>
>>>>> Best,
>>>>> Reza
>>>>>
>>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <
>>>>> debasish.da...@gmail.com> wrote:
>>>>>
>>>>>> Hi Xiangrui,
>>>>>>
>>>>>> For tall skinny matrices, if I could pass a similarityMeasure to
>>>>>> computeGramian, I could re-use the SVD's computeGramian for similarity
>>>>>> computation as well...
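>>>>>>
>>>>>> Roughly what I have in mind (a purely hypothetical hook, not an
>>>>>> existing API):
>>>>>>
>>>>>> // Let the Gramian-style aggregation fold entry pairs through a
>>>>>> // pluggable measure instead of a hard-coded dot product.
>>>>>> trait SimilarityMeasure extends Serializable {
>>>>>>   def combine(iVal: Double, jVal: Double): Double
>>>>>> }
>>>>>>
>>>>>> object CosineKernel extends SimilarityMeasure {
>>>>>>   def combine(iVal: Double, jVal: Double): Double = iVal * jVal
>>>>>> }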
>>>>>>
>>>>>> Do you recommend this approach for tall skinny matrices, or should I
>>>>>> just use dimsum's routines?
>>>>>>
>>>>>> Right now RowMatrix does not have a loadRowMatrix function like the
>>>>>> one available for LabeledPoint...should I add one? I want to export
>>>>>> the matrix out of my stable code and then test dimsum...
>>>>>>
>>>>>> Thanks.
>>>>>> Deb
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I will add dice, overlap, and jaccard similarity in a future PR,
>>>>>>> probably still for 1.2.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <
>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Awesome...Let me try it out...
>>>>>>>>
>>>>>>>> Any plans to add other similarity measures in the future (jaccard
>>>>>>>> is something that would be useful)? I guess it makes sense to add
>>>>>>>> some similarity measures to mllib...
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, you're right: calling dimsum with gamma as PositiveInfinity
>>>>>>>>> turns it into the usual brute-force algorithm for cosine
>>>>>>>>> similarity; there is no sampling. This is by design.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <
>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I looked at the code: similarColumns(Double.PositiveInfinity) is
>>>>>>>>>> generating the brute force...
>>>>>>>>>>
>>>>>>>>>> Basically, will dimsum with gamma as PositiveInfinity produce
>>>>>>>>>> exactly the same result as taking the cartesian product of
>>>>>>>>>> RDD[(product, vector)] and computing similarities, or will there
>>>>>>>>>> be some approximation?
>>>>>>>>>>
>>>>>>>>>> Sorry I have not read your paper yet. Will read it over the
>>>>>>>>>> weekend.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> For 60M x 10K, both brute force and dimsum thresholding should
>>>>>>>>>>> be fine.
>>>>>>>>>>>
>>>>>>>>>>> For 60M x 10M, brute force probably won't work, depending on the
>>>>>>>>>>> cluster's power; dimsum thresholding should work with an
>>>>>>>>>>> appropriate threshold.
>>>>>>>>>>>
>>>>>>>>>>> Dimensionality reduction should help; how effective it is will
>>>>>>>>>>> depend on your application and domain. It's worth trying if the
>>>>>>>>>>> direct computation doesn't work.
>>>>>>>>>>>
>>>>>>>>>>> You can also try running KMeans clustering (perhaps after
>>>>>>>>>>> dimensionality reduction) if your goal is to find batches of 
>>>>>>>>>>> similar points
>>>>>>>>>>> instead of all pairs above a threshold.
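>>>>>>>>>>>
>>>>>>>>>>> For example (a sketch with mllib's KMeans; k and the iteration
>>>>>>>>>>> count are placeholders to tune):
>>>>>>>>>>>
>>>>>>>>>>> import org.apache.spark.mllib.clustering.KMeans
>>>>>>>>>>> import org.apache.spark.mllib.linalg.Vector
>>>>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>>>>
>>>>>>>>>>> // Cluster (possibly dimension-reduced) points so that similar
>>>>>>>>>>> // points share a cluster id, instead of computing all pairs.
>>>>>>>>>>> def clusterPoints(points: RDD[Vector], k: Int): RDD[Int] = {
>>>>>>>>>>>   val model = KMeans.train(points, k, 20) // 20 iterations
>>>>>>>>>>>   model.predict(points)
>>>>>>>>>>> }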
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <
>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am
>>>>>>>>>>>> considering running a matrix factorization to reduce the
>>>>>>>>>>>> dimension to, say, ~60M x 50 and then running all-pairs
>>>>>>>>>>>> similarity...
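>>>>>>>>>>>>
>>>>>>>>>>>> Roughly this (a sketch; rank 50 and 10 iterations are
>>>>>>>>>>>> placeholders):
>>>>>>>>>>>>
>>>>>>>>>>>> import org.apache.spark.mllib.linalg.Vectors
>>>>>>>>>>>> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>>>>>>>>>>>> import org.apache.spark.mllib.recommendation.{ALS, Rating}
>>>>>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>>>>>
>>>>>>>>>>>> // Factorize to rank 50, then treat the user factors as a
>>>>>>>>>>>> // dense ~60M x 50 matrix for the similarity step.
>>>>>>>>>>>> def reducedMatrix(ratings: RDD[Rating]): RowMatrix = {
>>>>>>>>>>>>   val model = ALS.train(ratings, 50, 10)
>>>>>>>>>>>>   new RowMatrix(model.userFeatures.map { case (_, f) =>
>>>>>>>>>>>>     Vectors.dense(f)
>>>>>>>>>>>>   })
>>>>>>>>>>>> }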
>>>>>>>>>>>>
>>>>>>>>>>>> Have you also tried similar ideas and seen positive results?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <
>>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ok...just to make sure: I have a RowMatrix of SparseVectors
>>>>>>>>>>>>> where rows are ~60M and columns are ~10M, say with a billion
>>>>>>>>>>>>> data points...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>>>>>>>
>>>>>>>>>>>>> I guess for the second one both all-pairs and dimsum will run
>>>>>>>>>>>>> fine...
>>>>>>>>>>>>>
>>>>>>>>>>>>> But for tall and wide, what do you suggest? Can dimsum handle
>>>>>>>>>>>>> it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I might need jaccard as well...can I plug that into the PR?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <
>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> You might want to wait until Wednesday, since the interface
>>>>>>>>>>>>>> will be changing in that PR before then (probably over the
>>>>>>>>>>>>>> weekend), so that you don't have to redo your code. Your call
>>>>>>>>>>>>>> if you need it sooner.
>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <
>>>>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Oh cool...all-pairs brute force is also part of this PR?
>>>>>>>>>>>>>>> Let me pull it in and test on our dataset...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <
>>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via
>>>>>>>>>>>>>>>> dimsum in this PR:
>>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <
>>>>>>>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Have you compared against a brute-force similarity
>>>>>>>>>>>>>>>>> computation in Spark, something like the following?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am adding cosine similarity computation, but I do want to
>>>>>>>>>>>>>>>>> compute all-pairs similarities...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes
>>>>>>>>>>>>>>>>> into matrix factorization), so I don't think the join and
>>>>>>>>>>>>>>>>> group-by on (product, product) will be a big issue for me...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Does it make sense to add all-pairs similarities alongside
>>>>>>>>>>>>>>>>> the dimsum-based similarity?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <
>>>>>>>>>>>>>>>>> r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via
>>>>>>>>>>>>>>>>>> the sampling scheme described in this paper:
>>>>>>>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336,
>>>>>>>>>>>>>>>>>> though it will need refactoring given the recent changes
>>>>>>>>>>>>>>>>>> to the matrix interface in MLlib. You may want to
>>>>>>>>>>>>>>>>>> implement the sampling scheme for your own app in the
>>>>>>>>>>>>>>>>>> meantime, since it's not much code.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <
>>>>>>>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried that method. I
>>>>>>>>>>>>>>>>>>> used 8 nodes, and every node has 8G of memory. The
>>>>>>>>>>>>>>>>>>> program just stopped at one stage for several hours
>>>>>>>>>>>>>>>>>>> without any further information. Maybe I need to find a
>>>>>>>>>>>>>>>>>>> more efficient way.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <
>>>>>>>>>>>>>>>>>>> and...@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that with 
>>>>>>>>>>>>>>>>>>>> itself.  Run the
>>>>>>>>>>>>>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), map 
>>>>>>>>>>>>>>>>>>>> to (user,
>>>>>>>>>>>>>>>>>>>> (score, otherUser)) and take the .top(k) for each user.
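>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In code, roughly (a sketch; the attribute type and the
>>>>>>>>>>>>>>>>>>>> similarity function are placeholders):
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> import org.apache.spark.SparkContext._ // pair RDD ops
>>>>>>>>>>>>>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> // Naive all-pairs: cartesian, score, top k per user.
>>>>>>>>>>>>>>>>>>>> def topK(users: RDD[(Long, Array[Double])], k: Int,
>>>>>>>>>>>>>>>>>>>>     sim: (Array[Double], Array[Double]) => Double) =
>>>>>>>>>>>>>>>>>>>>   users.cartesian(users)
>>>>>>>>>>>>>>>>>>>>     .filter { case ((u, _), (v, _)) => u != v }
>>>>>>>>>>>>>>>>>>>>     .map { case ((u, a), (v, b)) =>
>>>>>>>>>>>>>>>>>>>>       (u, (sim(a, b), v)) }
>>>>>>>>>>>>>>>>>>>>     .groupByKey()
>>>>>>>>>>>>>>>>>>>>     .mapValues(_.toSeq.sortBy(-_._1).take(k))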
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with
>>>>>>>>>>>>>>>>>>>> the 1T pairs though, so it might be worth looking at the 
>>>>>>>>>>>>>>>>>>>> literature for
>>>>>>>>>>>>>>>>>>>> recommender systems to see what else is out there.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <
>>>>>>>>>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one
>>>>>>>>>>>>>>>>>>>>> million users and need to compute the similarity
>>>>>>>>>>>>>>>>>>>>> between each pair of users using some of the users'
>>>>>>>>>>>>>>>>>>>>> attributes. For each user, I need to get the top k most
>>>>>>>>>>>>>>>>>>>>> similar users. What is the best way to implement this?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
