[jira] [Commented] (SPARK-4823) rowSimilarities

2016-10-16 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581359#comment-15581359
 ] 

Debasish Das commented on SPARK-4823:
-

We use it in multiple use cases internally but have not had time to refactor the 
PR into 3 smaller PRs. I will update the PR to 2.0.

> rowSimilarities
> ---
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Reza Zadeh
> Attachments: MovieLensSimilarity Comparisons.pdf, 
> SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf
>
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a 
> method and do better than brute-forcing it. Note that when there are many rows 
> (> 10^6), brute force is unlikely to be feasible, since the output 
> will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4823) rowSimilarities

2015-10-30 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983330#comment-14983330
 ] 

Jerry Lam commented on SPARK-4823:
--

Hi [~debasish83], is this still a work in progress, or is it something that 
can be merged into 1.5 soon? Thank you.




[jira] [Commented] (SPARK-4823) rowSimilarities

2015-07-30 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648340#comment-14648340
 ] 

Debasish Das commented on SPARK-4823:
-

We ran more detailed experiments for the July 2015 Spark Meetup to understand the 
effect of shuffle on runtime. I attached the experiment data to the JIRA. I 
will update the PR as discussed with Reza. I am targeting one PR for Spark 1.5.





[jira] [Commented] (SPARK-4823) rowSimilarities

2015-05-17 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547318#comment-14547318
 ] 

Debasish Das commented on SPARK-4823:
-

I opened a PR that worked well for our datasets. It is still a brute-force 
computation, although we use a blocked cartesian and user-defined kernels to 
cut down on computation and shuffle. There are straightforward ways to go 
from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to the mllib 
BLAS, although I don't think that will give us the runtime boost we are looking 
for.

We are looking into the approximate KNN family of algorithms to improve the runtime 
further. A KD-tree is good for dense vectors with few features, but for sparse 
vectors in higher dimensions researchers did not find it useful.

LSH seems to be the most commonly used, and that's the direction we are looking 
into. I looked into papers, and the one that showed good recall values in its 
experiments compared to brute-force KNN is Google Correlate; that's the 
validation strategy we will focus on: 
https://www.google.com/trends/correlate/nnsearch.pdf. Please point to any other 
references you deem fit. There are Twitter papers using LSH as well, and an 
implementation is available in algebird. We will start with algebird LSH, but 
ideally we don't want to have a distance metric hardcoded in LSH.
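
For anyone following the LSH discussion, here is a minimal sketch of one common LSH family for cosine similarity (random-hyperplane signatures), together with a recall check against exact neighbours of the kind described above. It is plain Python, not the algebird implementation and not the method from the Google Correlate paper; the function names, the number of bits and the seed are assumptions made for illustration.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits.

    For random hyperplanes, P(bit differs) = angle / pi.
    """
    differing = sum(1 for a, b in zip(sig_a, sig_b) if a != b)
    angle = math.pi * differing / len(sig_a)
    return math.cos(angle)


def recall_at_k(exact_neighbors, approx_neighbors):
    """Fraction of the exact top-k neighbours that the approximation found."""
    exact = set(exact_neighbors)
    return len(exact & set(approx_neighbors)) / len(exact)
```

Because the signature only encodes angles, this particular family is tied to cosine similarity, which is exactly the "hardcoded metric" concern raised above; other metrics need a different hash family.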

If we get good recall using the LSH-based method compared to the rowSimilarities 
code from the PR, we will use the LSH-based method to approximately compute: 
similarities between dense/sparse rows using the cosine kernel; dense userFactor 
and productFactor similarities from factorization using the product kernel; and dense 
user/product factor similarities using the cosine kernel.

The kernel abstraction is part of the current PR, and right now we support 
Cosine, Product, Euclidean and RBF. Pearson is of interest but 
has not been added yet. For approximate row similarity I will open a separate 
JIRA.
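
As a rough illustration of what a kernel abstraction over the four measures named above could look like, here is a small Python sketch. It is not the API from the PR: the function names, the KERNELS registry and the gamma parameter are hypothetical, and Euclidean is treated here as a plain distance rather than a similarity.

```python
import math


def product_kernel(u, v):
    """Dot product of two dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_kernel(u, v):
    """Dot product normalised by both vector norms."""
    norm_u = math.sqrt(product_kernel(u, u))
    norm_v = math.sqrt(product_kernel(v, v))
    return product_kernel(u, v) / (norm_u * norm_v)


def euclidean_kernel(u, v):
    """Euclidean distance (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rbf_kernel(u, v, gamma=1.0):
    """exp(-gamma * ||u - v||^2); gamma is an assumed tuning parameter."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Hypothetical registry so callers can pick a kernel by name.
KERNELS = {
    "cosine": cosine_kernel,
    "product": product_kernel,
    "euclidean": euclidean_kernel,
    "rbf": rbf_kernel,
}


def row_similarity(u, v, kernel="cosine"):
    """Apply the named kernel to a pair of rows."""
    return KERNELS[kernel](u, v)
```

Keeping the kernel as a plain function of two rows means the pairing strategy (blocked cartesian, top-K, or an approximate method) can stay independent of the measure.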




[jira] [Commented] (SPARK-4823) rowSimilarities

2015-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546996#comment-14546996
 ] 

Apache Spark commented on SPARK-4823:
-

User 'debasish83' has created a pull request for this issue:
https://github.com/apache/spark/pull/6213




[jira] [Commented] (SPARK-4823) rowSimilarities

2015-03-07 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351948#comment-14351948
 ] 

Debasish Das commented on SPARK-4823:
-

[~mengxr] I need level 3 BLAS for this JIRA as well as for 
https://issues.apache.org/jira/browse/SPARK-4675. Specifically, I am looking 
for dense matrix x dense matrix and dense matrix x sparse matrix products. Does breeze 
CSCMatrix support a BLAS-3 based dense matrix x CSCMatrix product? I had some 
code with breeze dot and it was extremely slow. I will migrate the code to 
the netlib-java BLAS from mllib and update the results on the JIRA.
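
To make the operation in question concrete, here is a minimal pure-Python sketch of a dense matrix x CSC matrix product. It is not breeze or netlib-java code and says nothing about their performance; the function name and argument names are hypothetical, and only the standard CSC layout (column pointers, row indices, values) is assumed.

```python
def dense_times_csc(dense, col_ptrs, row_indices, values, num_cols):
    """Multiply a dense m x k matrix by a k x num_cols matrix stored in CSC form.

    Column j of the sparse matrix holds its non-zeros at positions
    col_ptrs[j] .. col_ptrs[j + 1] - 1 of row_indices / values.
    """
    num_rows = len(dense)
    result = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        for p in range(col_ptrs[j], col_ptrs[j + 1]):
            r = row_indices[p]   # row of the non-zero in the sparse matrix
            v = values[p]        # its value
            for i in range(num_rows):
                result[i][j] += dense[i][r] * v
    return result


# Sparse matrix [[1, 0], [0, 2], [3, 0]] in CSC form.
col_ptrs = [0, 2, 3]
row_indices = [0, 2, 1]
values = [1.0, 3.0, 2.0]
dense = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

product = dense_times_csc(dense, col_ptrs, row_indices, values, num_cols=2)
print(product)  # [[10.0, 4.0], [22.0, 10.0]]
```

The work is proportional to the number of non-zeros times the number of dense rows, which is what a level 3 routine would be expected to exploit.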




[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243207#comment-14243207
 ] 

Debasish Das commented on SPARK-4823:
-

Even for matrix factorization, userFactors are user x rank. With a modest rank 
of 50 and 10M users, I don't think it is possible to transpose the matrix 
and run column similarities. Doing it on the fly is still 
O(n*n) in complexity, right?
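
A quick back-of-the-envelope check of the sizes involved, written as a small Python sketch. The helper names are made up for illustration, and 8 bytes per value (double precision) is an assumption.

```python
def num_row_pairs(n):
    """Number of distinct unordered row pairs among n rows."""
    return n * (n - 1) // 2


def full_output_entries(n):
    """Entries in a full n x n similarity matrix."""
    return n * n


def dense_bytes(rows, cols, bytes_per_value=8):
    """Storage for a dense rows x cols matrix (assumes doubles by default)."""
    return rows * cols * bytes_per_value


users, rank = 10_000_000, 50
print(num_row_pairs(users))             # 49999995000000
print(dense_bytes(users, rank))         # 4000000000
print(full_output_entries(1_000_000))   # 1000000000000
```

So the factor matrix itself is only about 4 GB, while the number of user pairs is roughly 5 * 10^13, which is where the O(n*n) cost shows up.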




[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Reza Zadeh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243133#comment-14243133
 ] 

Reza Zadeh commented on SPARK-4823:
---

Given that we're talking about RowMatrices, computing rowSimilarities the same 
way as columnSimilarities would require transposing the matrix, which is 
dangerous when the original matrix has many rows. RowMatrix assumes a single 
row should fit in memory on a single machine, but this might not happen after 
transposing a RowMatrix.
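
The relationship described here can be shown on a tiny local example. The sketch below uses plain Python lists, not Spark's RowMatrix, and the helper names are hypothetical; it only illustrates that row similarities are the column similarities of the transpose, and that each transposed row is as long as the original row count.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def column_similarities(matrix):
    """Cosine similarity for every pair of columns, keyed by (i, j) with i < j."""
    columns = transpose(matrix)
    return {
        (i, j): cosine(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }


def row_similarities(matrix):
    """Row similarities computed as column similarities of the transpose."""
    return column_similarities(transpose(matrix))


matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 rows, 2 columns
similarities = row_similarities(matrix)
transposed_row_length = len(transpose(matrix)[0])  # equals the original row count
```

With 3 rows the transposed rows have 3 entries each; with millions of rows, each transposed row would have millions of entries, which is the memory concern raised above.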




[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243055#comment-14243055
 ] 

Sean Owen commented on SPARK-4823:
--

I don't think MapReduce matters here. You can compute pairs of similarities 
with any framework, or try to do it on the fly. It's not different than column 
similarities, right? I don't think there's anything more to it than applying a 
similarity metric to all pairs of vectors. I think the JIRA is about exposing a 
method just for API convenience, not because it's conceptually different.




[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-11 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243048#comment-14243048
 ] 

Debasish Das commented on SPARK-4823:
-

[~srowen] did you implement map-reduce row similarities for user factors? 
What algorithm did you use? Any pointers would be really helpful.




[jira] [Commented] (SPARK-4823) rowSimilarities

2014-12-10 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242031#comment-14242031
 ] 

Debasish Das commented on SPARK-4823:
-

I am considering a baseline version that is very close to brute 
force, but where we cut the computation with a topK number: for each user, come up 
with the topK users, where K is defined by the client. This will take care of the matrix 
factorization use case.

Basically, on the master we collect a set of user factors, broadcast it to every 
node and do a reduceByKey to generate the topK users for each user from this user 
block. We send a kernel function (cosine / polynomial / rbf) into this 
calculation.
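
The block-at-a-time top-K idea can be sketched locally in plain Python. This is not Spark code: the loop over blocks stands in for broadcasting one user block at a time, merge_top_k stands in for the reduceByKey step, and all names and the block_size parameter are assumptions for illustration.

```python
import heapq
import math


def cosine(u, v):
    """Cosine kernel; any function of two vectors could be passed instead."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_for_block(users, block, k, kernel):
    """Score every user against one block of users and keep the best k."""
    partial = {}
    for user_id, vector in users.items():
        scored = [
            (kernel(vector, other_vector), other_id)
            for other_id, other_vector in block.items()
            if other_id != user_id
        ]
        partial[user_id] = heapq.nlargest(k, scored)
    return partial


def merge_top_k(left, right, k):
    """Combine two partial top-k lists for the same user."""
    return heapq.nlargest(k, left + right)


def top_k_similar(users, k, kernel, block_size):
    """Top-k most similar users for each user, processing one block at a time."""
    ids = sorted(users)
    merged = {user_id: [] for user_id in ids}
    for start in range(0, len(ids), block_size):
        block = {uid: users[uid] for uid in ids[start:start + block_size]}
        for user_id, candidates in top_k_for_block(users, block, k, kernel).items():
            merged[user_id] = merge_top_k(merged[user_id], candidates, k)
    return merged


factors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.1, 1.0]}
nearest = top_k_similar(factors, k=1, kernel=cosine, block_size=2)
```

The total work is still quadratic in the number of users; only the output and the per-user state are capped at K entries.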

But this idea does not work for raw features, right? If we map features to a 
lower dimension using factorization then this approach should run fine, but I 
am not sure we can ask users to map their data into a lower dimension.

Is it possible to bring in ideas from fastfood and kitchen sink to do this?

