Better to do it in a PR of your own; it's not sufficiently related to dimsum.
On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Cool... can I add loadRowMatrix in your PR?

Thanks.
Deb

On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

Did you mean to message me instead of Xiangrui?

For TS matrices, dimsum with gamma set to PositiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible and you'll want to start sampling with dimsum.

It would be helpful to have a loadRowMatrix function; I would use it.

Best,
Reza

On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Xiangrui,

For tall skinny matrices, if I could pass a similarityMeasure to computeGramian, I could reuse the SVD's computeGramian for similarity computation as well...

Do you recommend that approach for tall skinny matrices, or just using dimsum's routines?

Right now RowMatrix does not have a loadRowMatrix function like the one available for LabeledPoint... should I add one? I want to export the matrix from my stable code and then test dimsum...

Thanks.
Deb

On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:

I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2.

On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Awesome... let me try it out...

Any plans to add other similarity measures in the future (jaccard is something that will be useful)? I guess it makes sense to add some similarity measures to mllib...
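The Gramian route Reza mentions for tall-and-skinny matrices works because G = AᵀA already contains every pairwise column dot product, so cosine similarities fall out by normalising with the column norms sitting on G's diagonal: cos(i, j) = G(i, j) / sqrt(G(i, i) * G(j, j)). A minimal plain-Scala sketch of that identity (not the MLlib API; gramian and cosineFromGramian are illustrative names):

```scala
// Gramian of a small dense matrix; rows are data points, columns are features.
// g(i)(j) accumulates the dot product of column i and column j.
def gramian(rows: Array[Array[Double]]): Array[Array[Double]] = {
  val n = rows.head.length
  val g = Array.fill(n, n)(0.0)
  for (row <- rows; i <- 0 until n; j <- 0 until n)
    g(i)(j) += row(i) * row(j)
  g
}

// Cosine similarity of columns i and j read straight off the Gramian:
// cos(i, j) = G(i, j) / sqrt(G(i, i) * G(j, j)).
def cosineFromGramian(g: Array[Array[Double]], i: Int, j: Int): Double =
  g(i)(j) / math.sqrt(g(i)(i) * g(j)(j))
```

This is why, for TS matrices, computing the Gramian and computing exact cosine similarities cost essentially the same.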
On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:

Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.

On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:

I looked at the code: similarColumns(Double.posInf) is generating the brute force...

Basically, will dimsum with gamma as PositiveInfinity produce exactly the same result as taking the cartesian product of RDD[(product, vector)] and computing similarities, or will there be some approximation?

Sorry, I have not read your paper yet. Will read it over the weekend.

On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:

For 60M x 10K, brute force and dimsum thresholding should be fine.

For 60M x 10M, brute force probably won't work depending on the cluster's power, and dimsum thresholding should work with an appropriate threshold.

Dimensionality reduction should help, and how effective it is will depend on your application and domain; it's worth trying if the direct computation doesn't work.

You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.

On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Also, for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to say ~60M x 50 and then running all-pair similarity...
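The sampling behind dimsum can be pictured per row: each row's contribution to the (i, j) dot product is kept with a probability that grows with gamma and shrinks with the column norms, and kept values are rescaled so the estimate stays unbiased. The sketch below collapses the paper's scheme into a single keep probability and is not the MLlib implementation; it only illustrates why gamma = Double.PositiveInfinity degenerates to exact brute force, as stated above:

```scala
import scala.util.Random

// Simplified DIMSUM-style contribution of one row to the (i, j) similarity
// estimate: keep the product ai * aj with probability
// p = min(1, gamma / (normI * normJ)) and rescale by 1/p for unbiasedness.
// With gamma = Double.PositiveInfinity, p = 1, every contribution is kept
// unscaled, and the sum over rows is the exact brute-force dot product.
def sampledContribution(ai: Double, aj: Double,
                        normI: Double, normJ: Double,
                        gamma: Double, rng: Random): Double = {
  val p = math.min(1.0, gamma / (normI * normJ))
  if (rng.nextDouble() < p) (ai * aj) / p else 0.0
}
```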
Did you also try similar ideas and see positive results?

On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:

OK... just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, say with a billion data points...

I have another version that's around 60M x ~10K...

I guess for the second one both all-pairs and dimsum will run fine...

But for tall and wide, what do you suggest? Can dimsum handle it?

I might need jaccard as well... can I plug that into the PR?

On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:

You might want to wait until Wednesday, since the interface in that PR will be changing before Wednesday (probably over the weekend), so that you don't have to redo your code. Your call if you need it before then.
Reza

On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Ohh cool... all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset...

Thanks.
Deb

On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Deb,

We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778

Your question wasn't entirely clear; does this answer it?
Best,
Reza

On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:

Hi Reza,

Have you compared with a brute-force algorithm for similarity computation, something like the following in Spark?

https://github.com/echen/scaldingale

I am adding cosine similarity computation, but I do want to compute all-pair similarities...

Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think a join and group-by on (product, product) will be a big issue for me...

Does it make sense to add all-pair similarities as well, alongside dimsum-based similarity?

Thanks.
Deb

On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:

Hi Xiaoli,

There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf

The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.
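The scaldingale-style brute force linked above boils down to: flatten each sparse vector into (item, feature, value) entries, co-group on the shared feature, emit one product per co-occurring pair, and sum by (item, item) key. That is exactly the join and group-by on (product, product) mentioned in the thread, and it stays cheap when the data is sparse. A plain-collections sketch of the same shape (on Spark this would be an RDD join; allPairDotProducts is an illustrative name):

```scala
// Sparse input as (itemId, featureId, value) triples. Grouping by featureId
// plays the role of the join on the shared coordinate; each group emits one
// contribution per co-occurring item pair, and summing by (item, item) key
// yields every pairwise dot product -- the brute-force all-pairs computation.
def allPairDotProducts(entries: Seq[(String, Int, Double)]): Map[(String, String), Double] =
  entries.groupBy(_._2)                 // co-group entries sharing a feature
    .values
    .flatMap { hits =>
      for ((a, _, va) <- hits; (b, _, vb) <- hits if a < b)
        yield ((a, b), va * vb)         // one contribution per co-occurrence
    }
    .groupBy(_._1)
    .map { case (pair, vs) => pair -> vs.map(_._2).sum }
```

Only features where two items co-occur ever generate work, which is why sparsity keeps the group-by tractable.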
Best,
Reza

On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi Andrew,

Thanks for your suggestion. I have tried the method. I used 8 nodes and every node has 8G memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find a more efficient way.

On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:

The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.

I doubt you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the recommender-systems literature to see what else is out there.

On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:

Hi all,

I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some of the users' attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
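Andrew's recipe (score every pair, key by user, keep each user's k best) can be sketched on plain collections; on Spark the same shape is a cartesian product, a map to (user, (score, otherUser)), and a per-key top-k. cosine and topKSimilar are illustrative names, and this is the O(n^2) naive version, so it shows the shape rather than something that scales to 1M users:

```scala
// Naive top-k most similar users: compute cosine similarity for every pair,
// then for each user sort its candidates by score and keep the best k.
def topKSimilar(users: Map[String, Array[Double]], k: Int): Map[String, Seq[(Double, String)]] = {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    dot / (na * nb)
  }
  users.map { case (u, v) =>
    // "cartesian product" step: score u against every other user
    val scored = for ((o, w) <- users.toSeq if o != u) yield (cosine(v, w), o)
    u -> scored.sortBy(-_._1).take(k)   // per-user top-k by score
  }
}
```

The 1T-scores warning above applies unchanged: with 1M users this loop does ~10^12 cosine evaluations, which is why sampling or thresholding (dimsum) is the practical route.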
Thanks.