Hi Deb,

I am currently seeding the algorithm to be pseudo-random; this is an issue being addressed in the PR. If you pull the current version it will be deterministic, but potentially not pseudo-random. The PR will be updated today.

Best,
Reza
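(Aside, since determinism vs. pseudo-randomness keeps coming up in this thread: the usual way to get both is to derive each row's or partition's seed deterministically from a fixed base seed. The following is only a generic plain-Python sketch of that idea, not the PR's actual code; `sample_row`, `base_seed`, and `row_index` are names I made up for illustration.)

```python
import random

def sample_row(row, row_index, base_seed=42, p=0.5):
    """Sample entries of one row, seeding the RNG from (base_seed, row_index).

    Every run produces the same sample (deterministic), while different
    rows still draw from different pseudo-random streams.
    """
    rng = random.Random(base_seed * 1_000_003 + row_index)  # per-row seed
    return [x for x in row if rng.random() < p]

# Two runs over the same data give identical samples:
row = [1.0, 2.0, 3.0, 4.0, 5.0]
assert sample_row(row, 7) == sample_row(row, 7)
```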
On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> Hi Reza,
>
> Have you tested whether different runs of the algorithm produce different similarities (basically, whether the algorithm is deterministic)?
>
> This number does not look like a Monoid aggregation... iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)))
>
> I am noticing some weird behavior, as different runs are changing the results...
>
> Also, can columnMagnitudes produce non-deterministic results?
>
> Thanks.
> Deb
>
> On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh <r...@databricks.com> wrote:
>
>> Hi Deb,
>>
>> I am not templating RowMatrix/CoordinateMatrix, since that would be a big deviation from the PR. We can add Jaccard and other similarity measures in later PRs.
>>
>> In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measures from the dot product.
>>
>> Best,
>> Reza
>>
>> On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> Hi Reza,
>>>
>>> In similarColumns, it seems that with cosine similarity I also need other numbers such as intersection, Jaccard, and other measures...
>>>
>>> Right now I have modified the code to generate Jaccard, but I had to run it twice due to the design of RowMatrix / CoordinateMatrix... I feel we should modify RowMatrix and CoordinateMatrix to be templated on the value...
>>>
>>> Are you considering this in your design?
>>>
>>> Thanks.
>>> Deb
>>>
>>> On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>
>>>> Better to do it in a PR of your own; it's not sufficiently related to DIMSUM.
>>>>
>>>> On Tue, Sep 9, 2014 at 7:03 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Cool... can I add loadRowMatrix in your PR?
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Tue, Sep 9, 2014 at 1:14 AM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>
>>>>>> Hi Deb,
>>>>>>
>>>>>> Did you mean to message me instead of Xiangrui?
>>>>>>
>>>>>> For TS matrices, DIMSUM with positiveInfinity and computeGramian have the same cost, so you can do either one. For dense matrices with, say, 1M columns this won't be computationally feasible, and you'll want to start sampling with DIMSUM.
>>>>>>
>>>>>> It would be helpful to have a loadRowMatrix function; I would use it.
>>>>>>
>>>>>> Best,
>>>>>> Reza
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 12:05 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Xiangrui,
>>>>>>>
>>>>>>> For tall skinny matrices, if I could pass a similarityMeasure to computeGramian, I could re-use the SVD's computeGramian for similarity computation as well...
>>>>>>>
>>>>>>> Do you recommend using this approach for tall skinny matrices, or just using DIMSUM's routines?
>>>>>>>
>>>>>>> Right now RowMatrix does not have a loadRowMatrix function like the one available in LabeledPoint... should I add one? I want to export the matrix out from my stable code and then test DIMSUM...
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Deb
>>>>>>>
>>>>>>> On Fri, Sep 5, 2014 at 9:43 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>
>>>>>>>> I will add Dice, overlap, and Jaccard similarity in a future PR, probably still for 1.2.
>>>>>>>>
>>>>>>>> On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Awesome... let me try it out...
>>>>>>>>>
>>>>>>>>> Any plans of putting in other similarity measures in the future (Jaccard is something that will be useful)? I guess it makes sense to add some similarity measures in MLlib...
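(Editor's note on the un-normalization trick discussed above: once you have the dot products and the column magnitudes, Jaccard (for 0/1 data), Dice, and overlap all fall out with no extra pass over the data. A plain-Python sketch of the algebra follows; this is not the MLlib API, and the function names are made up for illustration.)

```python
import math

def unnormalize(cosine, mag_i, mag_j):
    """Recover the dot product from a cosine similarity and column magnitudes."""
    return cosine * mag_i * mag_j

def jaccard(dot, mag_i, mag_j):
    """Jaccard for 0/1 columns: |A & B| / |A | B|, since for a binary
    column dot = |A & B| and mag**2 = |A|."""
    return dot / (mag_i**2 + mag_j**2 - dot)

def dice(dot, mag_i, mag_j):
    return 2.0 * dot / (mag_i**2 + mag_j**2)

def overlap(dot, mag_i, mag_j):
    return dot / min(mag_i**2, mag_j**2)

# Two binary columns: A = {0,1,2}, B = {1,2,3,4} -> |A & B| = 2, |A | B| = 5
a = [1, 1, 1, 0, 0]
b = [0, 1, 1, 1, 1]
dot = sum(x * y for x, y in zip(a, b))                 # 2
mag_a, mag_b = math.sqrt(sum(a)), math.sqrt(sum(b))
cos = dot / (mag_a * mag_b)
assert abs(unnormalize(cos, mag_a, mag_b) - dot) < 1e-12
assert abs(jaccard(dot, mag_a, mag_b) - 2 / 5) < 1e-12
```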
>>>>>>>>>
>>>>>>>>> On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, you're right: calling DIMSUM with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design.
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I looked at the code: similarColumns(Double.posInf) is generating the brute force...
>>>>>>>>>>>
>>>>>>>>>>> Basically, will DIMSUM with gamma as PositiveInfinity produce the exact same result as doing cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation?
>>>>>>>>>>>
>>>>>>>>>>> Sorry, I have not read your paper yet. Will read it over the weekend.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:13 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For 60M x 10K, brute force and DIMSUM thresholding should be fine.
>>>>>>>>>>>>
>>>>>>>>>>>> For 60M x 10M, brute force probably won't work depending on the cluster's power, and DIMSUM thresholding should work with an appropriate threshold.
>>>>>>>>>>>>
>>>>>>>>>>>> Dimensionality reduction should help, and how effective it is will depend on your application and domain; it's worth trying if the direct computation doesn't work.
>>>>>>>>>>>>
>>>>>>>>>>>> You can also try running KMeans clustering (perhaps after dimensionality reduction) if your goal is to find batches of similar points instead of all pairs above a threshold.
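(Editor's note: to make the "gamma as PositiveInfinity means no sampling" point concrete, here is a small single-threaded plain-Python sketch of the DIMSUM estimator, built from the per-row formula quoted earlier in the thread, iVal * jVal / (min(sg, colMags(i)) * min(sg, colMags(j))) with sg = sqrt(gamma). It is a simplification for illustration, not the PR's implementation.)

```python
import math
import random
from collections import defaultdict

def dimsum(rows, gamma, seed=0):
    """Estimate all-pairs column cosine similarities via DIMSUM sampling.

    Each co-occurrence (i, j) within a row is kept with probability
    min(1, sg/||c_i||) * min(1, sg/||c_j||), where sg = sqrt(gamma), and
    contributes row[i] * row[j] / (min(sg, ||c_i||) * min(sg, ||c_j||)).
    """
    rng = random.Random(seed)
    n = len(rows[0])
    col_mags = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sg = math.sqrt(gamma)
    sims = defaultdict(float)
    for row in rows:
        for i in range(n):
            for j in range(i + 1, n):
                p = min(1.0, sg / col_mags[i]) * min(1.0, sg / col_mags[j])
                if rng.random() < p:
                    sims[(i, j)] += row[i] * row[j] / (
                        min(sg, col_mags[i]) * min(sg, col_mags[j]))
    return dict(sims)

rows = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]]
# With gamma = +inf every pair is kept with probability 1 and the sum
# telescopes to the exact cosine similarity, i.e. brute force:
exact = dimsum(rows, float("inf"))
```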
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Also, for tall and wide (rows ~60M, columns ~10M), I am considering running a matrix factorization to reduce the dimension to, say, ~60M x 50, and then running all-pairs similarity...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you also try similar ideas and see positive results?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> OK... just to make sure: I have RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, with say a billion data points...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have another version that's around 60M x ~10K...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess for the second one both all-pairs and DIMSUM will run fine...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But for tall and wide, what do you suggest? Can DIMSUM handle it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I might need Jaccard as well... can I plug that into the PR?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:48 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You might want to wait until Wednesday, since the interface will be changing in that PR before Wednesday, probably over the weekend, so that you don't have to redo your code. Your call if you need it sooner than that.
>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ohh cool... all-pairs brute force is also part of this PR?
>>>>>>>>>>>>>>>> Let me pull it in and test on our dataset...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Deb,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We are adding all-pairs and thresholded all-pairs via DIMSUM in this PR: https://github.com/apache/spark/pull/1778
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Your question wasn't entirely clear - does this answer it?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Reza,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Have you compared with the brute-force algorithm for similarity computation, with something like the following in Spark? https://github.com/echen/scaldingale
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am adding cosine similarity computation, but I do want to compute all-pairs similarities...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note that the data is sparse for me (the data that goes to matrix factorization), so I don't think joining and group-by on (product, product) will be a big issue for me...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Does it make sense to add all-pairs similarities as well, alongside the DIMSUM-based similarity?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>> Deb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:21 PM, Reza Zadeh <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Xiaoli,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There is a PR currently in progress to allow this, via the sampling scheme described in this paper: stanford.edu/~rezab/papers/dimsum.pdf
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The PR is at https://github.com/apache/spark/pull/336, though it will need refactoring given the recent changes to the matrix interface in MLlib. You may implement the sampling scheme for your own app, since it's not much code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Reza
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:17 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Andrew,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for your suggestion. I have tried the method. I used 8 nodes, and every node has 8G memory. The program just stopped at a stage for several hours without any further information. Maybe I need to find out a more efficient way.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 5:24 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The naive way would be to put all the users and their attributes into an RDD, then cartesian product that with itself. Run the similarity score on every pair (1M * 1M => 1T scores), map to (user, (score, otherUser)), and take the .top(k) for each user.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I doubt that you'll be able to take this approach with the 1T pairs, though, so it might be worth looking at the literature on recommender systems to see what else is out there.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2014 at 9:54 PM, Xiaoli Li <lixiaolima...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am implementing an algorithm using Spark. I have one million users. I need to compute the similarity between each pair of users using some user attributes. For each user, I need to get the top k most similar users. What is the best way to implement this?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks.
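(Editor's note: the naive cartesian approach Andrew describes - score every pair, map to (user, (score, otherUser)), take the top k per user - can be sketched in a few lines of plain Python. This shows only the shape of the computation, not Spark code, and the names are made up for illustration.)

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def top_k_similar(users, k):
    """Brute force: score every (user, otherUser) pair, keep the top k per user."""
    out = {}
    for name, vec in users.items():
        scores = [(cosine(vec, other_vec), other)
                  for other, other_vec in users.items() if other != name]
        out[name] = heapq.nlargest(k, scores)  # (score, otherUser), best first
    return out

users = {
    "a": [1.0, 0.0, 1.0],
    "b": [1.0, 0.0, 0.9],
    "c": [0.0, 1.0, 0.0],
}
top = top_k_similar(users, k=1)
assert top["a"][0][1] == "b"   # b is the most similar user to a
```

The per-user heap keeps memory at O(k) per user, but the pair count is still quadratic, which is exactly why this stalls at 1M users (1T pairs) and why sampling schemes like DIMSUM exist.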