Hi Xiangrui,

For tall skinny matrices, if I can pass a similarityMeasure to
computeGrammian, I could re-use the SVD's computeGrammian for similarity
computation as well...

Do you recommend using this approach for tall skinny matrices or just use
the dimsum's routines ?

Right now RowMatrix does not have a loadRowMatrix function like the one
available in LabeledPoint...should I add one ? I want to export the matrix
out from my stable code and then test dimsum...

Thanks.
Deb



On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:

> I will add dice, overlap, and jaccard similarity in a future PR, probably
> still for 1.2
>
>
> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Awesome...Let me try it out...
>>
>> Any plans of putting other similarity measures in future (jaccard is
>> something that will be useful) ? I guess it makes sense to add some
>> similarity measures in mllib...
>>
>>
>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>>
>>> Yes you're right, calling dimsum with gamma as PositiveInfinity turns it
>>> into the usual brute force algorithm for cosine similarity, there is no
>>> sampling. This is by design.
>>>
>>>
>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>>
>>>> I looked at the code: similarColumns(Double.posInf) is generating the
>>>> brute force...
>>>>
>>>> Basically dimsum with gamma as PositiveInfinity will produce the exact
>>>> same result as doing catesian products of RDD[(product, vector)] and
>>>> computing similarities or there will be some approximation ?
>>>>
>>>> Sorry I have not read your paper yet. Will read it over the weekend.
>>>>
>>>>
>>>>
>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>
>>>>> For 60M x 10K brute force and dimsum thresholding should be fine.
>>>>>
>>>>> For 60M x 10M probably brute force won't work depending on the
>>>>> cluster's power, and dimsum thresholding should work with appropriate
>>>>> threshold.
>>>>>
>>>>> Dimensionality reduction should help, and how effective it is will
>>>>> depend on your application and domain, it's worth trying if the direct
>>>>> computation doesn't work.
>>>>>
>>>>> You can also try running KMeans clustering (perhaps after
>>>>> dimensionality reduction) if your goal is to find batches of similar 
>>>>> points
>>>>> instead of all pairs above a threshold.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Also for tall and wide (rows ~60M, columns 10M), I am considering
>>>>>> running a matrix factorization to reduce the dimension to say ~60M x 50 
>>>>>> and
>>>>>> then run all pair similarity...
>>>>>>
>>>>>> Did you also try similar ideas and saw positive results ?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <
>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Ok...just to make sure I have RowMatrix[SparseVector] where rows are
>>>>>>> ~ 60M and columns are 10M say with billion data points...
>>>>>>>
>>>>>>> I have another version that's around 60M and ~ 10K...
>>>>>>>
>>>>>>> I guess for the second one both all pair and dimsum will run fine...
>>>>>>>
>>>>>>> But for tall and wide, what do you suggest ? can dimsum handle it ?
>>>>>>>
>>>>>>> I might need jaccard as well...can I plug that in the PR ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You might want to wait until Wednesday since the interface will be
>>>>>>>> changing in that PR before Wednesday, probably over the weekend, so 
>>>>>>>> that
>>>>>>>> you don't have to redo your code. Your call if you need it before a 
>>>>>>>> week.
>>>>>>>> Reza
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <
>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Ohh cool....all-pairs brute force is also part of this PR ? Let me
>>>>>>>>> pull it in and test on our dataset...
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>> Deb
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Deb,
>>>>>>>>>>
>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in
>>>>>>>>>> this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>>
>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Reza
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <
>>>>>>>>>> debasish.da...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>
>>>>>>>>>>> Have you compared with the brute force algorithm for similarity
>>>>>>>>>>> computation with something like the following in Spark ?
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>>
>>>>>>>>>>> I am adding cosine similarity computation but I do want to
>>>>>>>>>>> compute an all pair similarities...
>>>>>>>>>>>
>>>>>>>>>>> Note that the data is sparse for me (the data that goes to
>>>>>>>>>>> matrix factorization) so I don't think joining and group-by on
>>>>>>>>>>> (product,product) will be a big issue for me...
>>>>>>>>>>>
>>>>>>>>>>> Does it make sense to add all pair similarities as well with
>>>>>>>>>>> dimsum based similarity ?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>> Deb
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>
>>>>>>>>>>>> There is a PR currently in progress to allow this, via the
>>>>>>>>>>>> sampling scheme described in this paper:
>>>>>>>>>>>> stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>
>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336 though
>>>>>>>>>>>> it will need refactoring given the recent changes to matrix 
>>>>>>>>>>>> interface in
>>>>>>>>>>>> MLlib. You may implement the sampling scheme for your own app 
>>>>>>>>>>>> since it's
>>>>>>>>>>>> much code.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Reza
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <
>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8
>>>>>>>>>>>>> nodes and every node has 8G memory. The program just stopped at a 
>>>>>>>>>>>>> stage for
>>>>>>>>>>>>> about several hours without any further information. Maybe I need 
>>>>>>>>>>>>> to find
>>>>>>>>>>>>> out a more efficient way.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <
>>>>>>>>>>>>> and...@andrewash.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The naive way would be to put all the users and their
>>>>>>>>>>>>>> attributes into an RDD, then cartesian product that with itself. 
>>>>>>>>>>>>>>  Run the
>>>>>>>>>>>>>> similarity score on every pair (1M * 1M => 1T scores), map to 
>>>>>>>>>>>>>> (user,
>>>>>>>>>>>>>> (score, otherUser)) and take the .top(k) for each user.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T
>>>>>>>>>>>>>> pairs though, so it might be worth looking at the literature for
>>>>>>>>>>>>>> recommender systems to see what else is out there.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <
>>>>>>>>>>>>>> lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one
>>>>>>>>>>>>>>> million users. I need to compute the similarity between each 
>>>>>>>>>>>>>>> pair of users
>>>>>>>>>>>>>>> using some user's attributes.  For each user, I need to get top 
>>>>>>>>>>>>>>> k most
>>>>>>>>>>>>>>> similar users. What is the best way to implement this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Reply via email to