I will add Dice, overlap, and Jaccard similarity in a future PR, probably still for 1.2.
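As background, on binary data these three measures reduce to simple set operations over the row indices at which each column is non-zero. A minimal Scala sketch of the definitions (not the future PR's implementation; colA and colB are hypothetical index sets):

  // Jaccard: |A ∩ B| / |A ∪ B|
  def jaccard(colA: Set[Long], colB: Set[Long]): Double = {
    val inter = (colA intersect colB).size.toDouble
    inter / (colA.size + colB.size - inter)
  }

  // Dice: 2|A ∩ B| / (|A| + |B|)
  def dice(colA: Set[Long], colB: Set[Long]): Double = {
    val inter = (colA intersect colB).size.toDouble
    2.0 * inter / (colA.size + colB.size)
  }

  // Overlap: |A ∩ B| / min(|A|, |B|)
  def overlap(colA: Set[Long], colB: Set[Long]): Double = {
    val inter = (colA intersect colB).size.toDouble
    inter / math.min(colA.size, colB.size)
  }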
On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:

> Awesome...Let me try it out...
>
> Any plans to add other similarity measures in the future (Jaccard is something that will be useful)? I guess it makes sense to add some similarity measures to MLlib...
>
> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.
>>
>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> I looked at the code: similarColumns(Double.posInf) is generating the brute force...
>>>
>>> Basically, will dimsum with gamma as PositiveInfinity produce exactly the same result as doing a cartesian product of RDD[(product, vector)] and computing the similarities, or will there be some approximation?
>>>
>>> Sorry, I have not read your paper yet. Will read it over the weekend.
>>>
>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> For 60M x 10K, brute force and dimsum thresholding should be fine.
>>>>
>>>> For 60M x 10M, brute force probably won't work, depending on the cluster's power; dimsum thresholding should work with an appropriate threshold.
>>>>
>>>> Dimensionality reduction should help, and how effective it is will depend on your application and domain; it's worth trying if the direct computation doesn't work.
>>>>
>>>> You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.
>>>>
>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50, and then running all-pairs similarity...
>>>>>
>>>>> Did you also try similar ideas and see positive results?
>>>>>
>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>
>>>>>> Ok...just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, with say a billion data points...
>>>>>>
>>>>>> I have another version that's around 60M x ~10K...
>>>>>>
>>>>>> I guess for the second one both all-pairs and dimsum will run fine...
>>>>>>
>>>>>> But for tall and wide, what do you suggest? Can dimsum handle it?
>>>>>>
>>>>>> I might need Jaccard as well...can I plug that into the PR?
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>
>>>>>>> You might want to wait until Wednesday, since the interface will be changing in that PR before then (probably over the weekend), so that you don't have to redo your code. Your call if you need it within the week.
>>>>>>>
>>>>>>> Reza
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ohh cool....all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset...
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>> Deb
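For reference, the thresholded all-pairs computation discussed above looks roughly like the sketch below against the RowMatrix API. This is an illustration, not the PR's code; the method is shown under the columnSimilarities name it later shipped with in MLlib, whereas the PR snapshot above still calls it similarColumns.

  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  // rows: one (sparse) vector per row; similarities are computed between columns.
  def allPairs(rows: RDD[Vector]) = {
    val mat = new RowMatrix(rows)

    // Brute force: exact cosine similarity for every column pair.
    val exact = mat.columnSimilarities()

    // DIMSUM sampling: pairs with cosine similarity at or above the
    // threshold are estimated accurately; lower-similarity pairs may
    // be dropped, in exchange for far less shuffle.
    val approx = mat.columnSimilarities(threshold = 0.1)

    (exact, approx)
  }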
>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Deb,
>>>>>>>>>
>>>>>>>>> We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>
>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Reza
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Reza,
>>>>>>>>>>
>>>>>>>>>> Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark?
>>>>>>>>>>
>>>>>>>>>> https://github.com/echen/scaldingale
>>>>>>>>>>
>>>>>>>>>> I am adding cosine similarity computation, but I do want to compute all-pairs similarities...
>>>>>>>>>>
>>>>>>>>>> Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think joining and grouping by (product, product) will be a big issue for me...
>>>>>>>>>>
>>>>>>>>>> Does it make sense to add all-pairs similarities alongside the dimsum-based similarity?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>> Deb
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>
>>>>>>>>>>> There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>
>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Reza
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes, and every node has 8G of memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the recommender-systems literature to see what else is out there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million users.
>>>>>>>>>>>>>> I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
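Andrew's naive cartesian approach above, as a rough Spark sketch: the cosine helper and the (userId, attributes) layout are assumptions for illustration, and as noted it is only feasible well below the 1M-user scale.

  import org.apache.spark.SparkContext._ // pair-RDD operations (groupByKey)
  import org.apache.spark.rdd.RDD

  // Cosine similarity between two dense attribute vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // For each user, the k most similar other users by cosine score.
  def topKSimilar(users: RDD[(Long, Array[Double])], k: Int): RDD[(Long, Seq[(Double, Long)])] =
    users.cartesian(users)                                 // N^2 pairs: 1M users => ~1T scores
      .filter { case ((idA, _), (idB, _)) => idA != idB }  // drop self-pairs
      .map { case ((idA, vA), (idB, vB)) => (idA, (cosine(vA, vB), idB)) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._1).take(k))            // keep the k highest scores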