Hi Reza,

In similarColumns, it seems that with cosine similarity I also need other numbers, such as intersection, Jaccard, and other measures... Right now I modified the code to generate Jaccard, but I had to run it twice due to the design of RowMatrix / CoordinateMatrix... I feel we should modify RowMatrix and CoordinateMatrix to be templated on the value type... Are you considering this in your design?

Thanks.
Deb

On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:

Better to do it in a PR of your own; it's not sufficiently related to dimsum.

On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Cool... can I add loadRowMatrix in your PR?

Thanks.
Deb

On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

Did you mean to message me instead of Xiangrui?

For TS matrices, dimsum with PositiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible and you'll want to start sampling with dimsum.

It would be helpful to have a loadRowMatrix function; I would use it.

Best,
Reza

On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Xiangrui,

For tall skinny matrices, if I can pass a similarityMeasure to computeGrammian, I could re-use the SVD's computeGrammian for the similarity computation as well...

Do you recommend using this approach for tall skinny matrices, or just using dimsum's routines?

Right now RowMatrix does not have a loadRowMatrix function like the one available for LabeledPoint... should I add one? I want to export the matrix out from my stable code and then test dimsum...

Thanks.
Deb

On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:

I will add dice, overlap, and Jaccard similarity in a future PR, probably still for 1.2.

On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Awesome... let me try it out...

Any plans for other similarity measures in the future (Jaccard is something that will be useful)? I guess it makes sense to add some similarity measures to MLlib...

On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:

Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.

On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:

I looked at the code: similarColumns(Double.posInf) is generating the brute force...

Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation?

Sorry, I have not read your paper yet. I will read it over the weekend.

On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:

For 60M x 10K, brute force and dimsum thresholding should be fine.

For 60M x 10M, brute force probably won't work depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold.
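[Aside: Reza's point above, that gamma = PositiveInfinity makes dimsum exact, follows from the sampling rule in the dimsum paper: each per-row contribution is kept with a probability capped at 1, so an infinite gamma keeps everything. Below is a toy single-machine sketch of that idea in plain Python. It is an illustrative importance-sampling variant, not the Spark implementation or the paper's exact estimator; the function name and data layout are made up.]

```python
import math
import random

def dimsum_cosine(rows, gamma, seed=0):
    """DIMSUM-style sampled column cosine similarities over dense rows.

    Each row's entry pair (i, j) is kept with probability
    min(1, gamma / (||c_i|| * ||c_j||)) and scaled by 1/p to stay unbiased,
    so gamma = math.inf keeps every pair and yields the exact cosine."""
    rng = random.Random(seed)
    n = len(rows[0])
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = {}
    for r in rows:
        for i in range(n):
            if r[i] == 0:
                continue
            for j in range(i + 1, n):
                if r[j] == 0:
                    continue
                p = min(1.0, gamma / (norms[i] * norms[j]))
                if rng.random() < p:
                    contrib = r[i] * r[j] / (p * norms[i] * norms[j])
                    sims[(i, j)] = sims.get((i, j), 0.0) + contrib
    return sims

rows = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
exact = dimsum_cosine(rows, math.inf)  # no sampling: exact cosine similarities
```

With gamma = math.inf, p is always 1, so the loop degenerates to the brute-force sum, which is the behavior discussed above.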
Dimensionality reduction should help; how effective it is will depend on your application and domain, and it's worth trying if the direct computation doesn't work.

You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.

On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Also, for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50, and then running all-pair similarity...

Did you also try similar ideas and see positive results?

On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:

OK... just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are 10M, say, with a billion data points...

I have another version that's around 60M x ~10K...

I guess for the second one both all-pairs and dimsum will run fine...

But for tall and wide, what do you suggest? Can dimsum handle it?

I might need Jaccard as well... can I plug that into the PR?

On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:

You might want to wait until Wednesday, since the interface will be changing in that PR before Wednesday (probably over the weekend), so that you don't have to redo your code. Your call if you need it sooner.

Reza

On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Ohh cool... all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset...

Thanks.
Deb

On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778

Your question wasn't entirely clear - does this answer it?

Best,
Reza

On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Reza,

Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark?

https://github.com/echen/scaldingale

I am adding cosine similarity computation, but I do want to compute all-pair similarities...

Note that the data is sparse for me (it is the data that goes into matrix factorization), so I don't think joining and group-by on (product, product) will be a big issue for me...

Does it make sense to add all-pair similarities as well, alongside dimsum-based similarity?

Thanks.
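[Aside: the intersection, Jaccard, and cosine numbers requested in this thread can all be derived from the same per-column-pair counts when the data is 0/1, which is why computing them in one pass (rather than running twice) is attractive. A minimal single-machine sketch in plain Python, not the RowMatrix API; the data layout and function name are made up for illustration.]

```python
from itertools import combinations
from math import sqrt

def column_similarities(columns):
    """One pass over sparse 0/1 columns ({name: set of row indices}):
    for every column pair, return (intersection, jaccard, cosine).
    For binary data the dot product equals the intersection size,
    so cosine falls out of the same counts that Jaccard needs."""
    out = {}
    for (a, rows_a), (b, rows_b) in combinations(columns.items(), 2):
        inter = len(rows_a & rows_b)  # co-occurrence count
        union = len(rows_a | rows_b)
        jaccard = inter / union if union else 0.0
        cosine = inter / sqrt(len(rows_a) * len(rows_b)) if rows_a and rows_b else 0.0
        out[(a, b)] = (inter, jaccard, cosine)
    return out

cols = {"x": {0, 1, 2}, "y": {1, 2, 3}, "z": {4}}
print(column_similarities(cols)[("x", "y")])  # intersection 2, jaccard 0.5, cosine ~0.667
```

The same co-occurrence counts would also cover dice and overlap, which is consistent with adding them together in one PR.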
Deb

On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Xiaoli,

There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.

Best,
Reza

On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi Andrew,

Thanks for your suggestion. I have tried the method. I used 8 nodes, and every node has 8G of memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.

On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:

The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.

I doubt that you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the literature on recommender systems to see what else is out there.

On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi all,

I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?

Thanks.
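[Aside: Andrew's naive recipe above, score every pair and keep the top k per user, can be sketched on a single machine as follows. This is plain Python rather than Spark RDDs; the cosine choice and data layout are illustrative.]

```python
import heapq
from math import sqrt

def top_k_similar(users, k):
    """users: {user_id: feature vector}. For each user, score every other
    user by cosine similarity and keep the top-k (score, otherUser) pairs,
    mirroring the cartesian-then-top(k) recipe on a single machine."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sqrt(sum(a * a for a in u))
        nv = sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    return {
        uid: heapq.nlargest(k, [(cosine(vec, users[oid]), oid)
                                for oid in users if oid != uid])
        for uid, vec in users.items()
    }

users = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
print(top_k_similar(users, 1)["a"])  # "b" is a's nearest neighbour (score ~0.707)
```

The scoring loop is quadratic in the number of users, which is exactly the 1M * 1M => 1T blow-up Andrew warns about; the dimsum thresholding discussed earlier in the thread exists to avoid materializing those pairs.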