[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581359#comment-15581359 ]

Debasish Das commented on SPARK-4823:
-------------------------------------

We use it in multiple use cases internally but did not get time to refactor the PR into 3 smaller PRs. I will update the PR to 2.0.

> rowSimilarities
> ---------------
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Reza Zadeh
> Attachments: MovieLensSimilarity Comparisons.pdf, SparkMeetup2015-Experiments1.pdf, SparkMeetup2015-Experiments2.pdf
>
> RowMatrix has a columnSimilarities method to find cosine similarities between columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a method, better than brute-forcing it. Note that when there are many rows (> 10^6), it is unlikely that brute-force will be feasible, since the output will be of order 10^12.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983330#comment-14983330 ]

Jerry Lam commented on SPARK-4823:
----------------------------------

Hi [~debasish83], I wonder if this is still work in progress or something that can be merged to 1.5 soon? Thank you.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648340#comment-14648340 ]

Debasish Das commented on SPARK-4823:
-------------------------------------

We ran more detailed experiments for the July 2015 Spark Meetup to understand the effect of shuffle on runtime. I attached the experiment data to the JIRA. I will update the PR as discussed with Reza. I am targeting 1 PR for Spark 1.5.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547318#comment-14547318 ]

Debasish Das commented on SPARK-4823:
-------------------------------------

I opened a PR that worked well for our datasets. It is still a brute-force computation, although we use a blocked cartesian product and user-defined kernels to cut down on computation and shuffle. There are straightforward ways to go from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to mllib BLAS, but I don't think that will give us the runtime boost we are looking for.

We are looking into the approximate KNN family of algorithms to improve the runtime further. KD-trees work well for dense vectors with few features, but researchers did not find them useful for sparse vectors in higher dimensions. LSH seems to be the most commonly used approach, and that's the direction we are looking into. Of the papers I looked at, the one that showed good recall in its experiments compared to brute-force KNN is Google Correlate, and that is the validation strategy we will focus on: https://www.google.com/trends/correlate/nnsearch.pdf

Please point me to any other references you see fit. There are Twitter papers using LSH as well, and the implementation is available in algebird. We will start with algebird LSH, but ideally we don't want a distance metric hardcoded in LSH.

If the LSH-based method gets good recall compared to the rowSimilarities code from the PR, we will use it to approximately compute similarities between dense/sparse rows using the cosine kernel, between dense userFactor and productFactor from factorization using the product kernel, and between dense user/product factors using the cosine kernel. The kernel abstraction is part of the current PR, and right now we support Cosine, Product, Euclidean and RBF. Pearson is of interest but has not been added yet.

For approximate row similarity I will open a separate JIRA.
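The kernel abstraction and blocked cartesian computation mentioned in the comment above could look roughly like the following single-process sketch. It is not the code from the PR: every function name here is hypothetical, the blocks are plain Python lists rather than distributed partitions, and negating the Euclidean distance (so that larger always means more similar) is an assumption made only for this illustration.

```python
import math
from itertools import product


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine_kernel(x, y):
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    return dot(x, y) / (nx * ny) if nx and ny else 0.0


def product_kernel(x, y):
    return dot(x, y)


def euclidean_kernel(x, y):
    # Assumption: return the negated distance so that larger means more similar.
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rbf_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def blocks(rows, size):
    """Split rows into consecutive blocks, keeping each row's original index."""
    indexed = list(enumerate(rows))
    return [indexed[i:i + size] for i in range(0, len(indexed), size)]


def row_similarities(rows, kernel, block_size=2):
    """Brute force over the cartesian product of row blocks."""
    row_blocks = blocks(rows, block_size)
    sims = {}
    for left, right in product(row_blocks, row_blocks):
        for i, x in left:
            for j, y in right:
                if i < j:  # compute each unordered pair once
                    sims[(i, j)] = kernel(x, y)
    return sims


rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = row_similarities(rows, cosine_kernel)
```

Passing the kernel as a plain function keeps the pairwise loop independent of the metric, which is the property the comment asks for when it says a distance metric should not be hardcoded.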
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546996#comment-14546996 ]

Apache Spark commented on SPARK-4823:
-------------------------------------

User 'debasish83' has created a pull request for this issue:
https://github.com/apache/spark/pull/6213
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351948#comment-14351948 ]

Debasish Das commented on SPARK-4823:
-------------------------------------

[~mengxr] I need level 3 BLAS for this JIRA as well as for https://issues.apache.org/jira/browse/SPARK-4675. Specifically, I am looking for dense matrix x dense matrix and dense matrix x sparse matrix products. Does breeze CSCMatrix support a BLAS 3 based dense matrix x CSCMatrix product? I had some code using breeze dot and it was extremely slow. I will migrate the code to the netlib-java BLAS from mllib and update the results on the JIRA.
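As a rough illustration of why a level 3 BLAS product fits this problem: once rows are scaled to unit length, the cosine similarities between one block of rows and another are exactly the entries of a single matrix-matrix product, block_a times the transpose of block_b. The sketch below uses plain Python lists, and `matmul_transposed` is a hypothetical stand-in for a real gemm call; it is not breeze or netlib-java code.

```python
import math


def normalize(rows):
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    out = []
    for r in rows:
        n = math.sqrt(sum(v * v for v in r))
        out.append([v / n for v in r] if n else list(r))
    return out


def matmul_transposed(a, b):
    """Return a * b^T; stands in for a level 3 BLAS gemm call."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


block_a = normalize([[3.0, 4.0], [1.0, 0.0]])
block_b = normalize([[0.0, 2.0], [6.0, 8.0]])

# cosines[i][j] is the cosine similarity of block_a[i] and block_b[j].
cosines = matmul_transposed(block_a, block_b)
```

Computing a whole block of similarities in one product is what lets an optimized BLAS reuse cached data, instead of issuing one level 1 dot call per pair.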
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243207#comment-14243207 ]

Debasish Das commented on SPARK-4823:
-------------------------------------

Even for matrix factorization, userFactors is a user x rank matrix. With a modest rank of 50 and 10M users, I don't think it is possible to transpose the matrix and run column similarities. And doing it on the fly is still O(n*n) complexity-wise, right?
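To put numbers on the O(n*n) concern, the following back-of-the-envelope sketch counts the unordered row pairs and the multiply-adds needed for one dot product per pair, using the 10M users and rank 50 from the comment. The function name is hypothetical and the count ignores normalization and any pruning.

```python
def brute_force_cost(num_rows, rank):
    """Count unordered row pairs and the multiply-adds for one dot product per pair."""
    pairs = num_rows * (num_rows - 1) // 2
    multiply_adds = pairs * rank
    return pairs, multiply_adds


pairs, multiply_adds = brute_force_cost(10_000_000, 50)
print(f"{pairs:.3e} pairs, {multiply_adds:.3e} multiply-adds")
```

That is on the order of 5 x 10^13 pairs, so even storing one value per pair is impractical, which is why the later comments move toward keeping only the top K neighbours per row.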
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243133#comment-14243133 ]

Reza Zadeh commented on SPARK-4823:
-----------------------------------

Given that we're talking about RowMatrices, computing rowSimilarities the same way as columnSimilarities would require transposing the matrix, which is dangerous when the original matrix has many rows. RowMatrix assumes a single row should fit in memory on a single machine, but this might not happen after transposing a RowMatrix.
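The point about transposition can be seen on a toy matrix: row similarities are the column similarities of the transpose, but each row of the transpose is as long as the original matrix is tall. The sketch below is plain Python, not the RowMatrix API; all names are hypothetical, and the size estimate assumes dense rows of 8-byte doubles.

```python
import math


def transpose(rows):
    return [list(col) for col in zip(*rows)]


def cosine(x, y):
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / (nx * ny) if nx and ny else 0.0


def column_similarities(rows):
    """Cosine similarity between every pair of columns of a small local matrix."""
    cols = transpose(rows)
    return {(i, j): cosine(cols[i], cols[j])
            for i in range(len(cols)) for j in range(i + 1, len(cols))}


def row_similarities_via_transpose(rows):
    return column_similarities(transpose(rows))


def dense_row_bytes(num_cols, bytes_per_value=8):
    """Memory for one dense row, assuming 8-byte doubles."""
    return num_cols * bytes_per_value


matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 rows x 2 columns
sims = row_similarities_via_transpose(matrix)

before = dense_row_bytes(50)          # a row of a 10M x 50 matrix
after = dense_row_bytes(10_000_000)   # a row of its 50 x 10M transpose
```

Under these assumptions a single row grows from 400 bytes to 80 MB after transposing, and it keeps growing linearly with the number of original rows.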
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243055#comment-14243055 ]

Sean Owen commented on SPARK-4823:
----------------------------------

I don't think MapReduce matters here. You can compute pairs of similarities with any framework, or try to do it on the fly. It's not different than column similarities, right? I don't think there's anything more to it than applying a similarity metric to all pairs of vectors. I think the JIRA is about exposing a method just for API convenience, not because it's conceptually different.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243048#comment-14243048 ]

Debasish Das commented on SPARK-4823:
-------------------------------------

[~srowen] did you implement map-reduce row similarities for user factors? What algorithm did you use? Any pointers would be really helpful.
[jira] [Commented] (SPARK-4823) rowSimilarities
[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242031#comment-14242031 ]

Debasish Das commented on SPARK-4823:
-------------------------------------

I am considering a baseline version that is very close to brute force but cuts the computation with a topK number: for each user, come up with the topK most similar users, where K is defined by the client. This will take care of the matrix factorization use case.

Basically, on the master we collect a block of user factors, broadcast it to every node, and do a reduceByKey to generate the topK users for each user from this user block. We send a kernel function (cosine / polynomial / rbf) along with this calculation.

But this idea does not work for raw features, right? If we map the features to a lower dimension using factorization then this approach should run fine, but I am not sure we can ask users to map their data into a lower dimension.

Is it possible to bring in ideas from fastfood and kitchen sink to do this?
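The baseline described in the comment above can be sketched in a single process as follows. This is only an illustration under stated assumptions: the loop over blocks stands in for broadcasting each user block, `merge_top_k` stands in for the function that would be handed to reduceByKey, and all names and the tiny data set are hypothetical rather than taken from any Spark code.

```python
import heapq
import math


def cosine(x, y):
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / (nx * ny) if nx and ny else 0.0


def top_k_for_block(users, block, kernel, k):
    """Map step: score every user against one broadcast block, keep the k best."""
    partial = {}
    for uid, u in users.items():
        scored = [(kernel(u, v), vid) for vid, v in block.items() if vid != uid]
        partial[uid] = heapq.nlargest(k, scored)
    return partial


def merge_top_k(a, b, k):
    """Reduce step: combine two candidate lists for the same user."""
    return heapq.nlargest(k, a + b)


def top_k_similar(users, kernel, k, block_size):
    ids = sorted(users)
    result = {uid: [] for uid in users}
    for start in range(0, len(ids), block_size):
        # One block of user factors; in a cluster this would be broadcast.
        block = {i: users[i] for i in ids[start:start + block_size]}
        for uid, candidates in top_k_for_block(users, block, kernel, k).items():
            result[uid] = merge_top_k(result[uid], candidates, k)
    return result


users = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
neighbours = top_k_similar(users, cosine, k=1, block_size=2)
```

The work is still one kernel evaluation per pair, but the output shrinks from n*n values to n*K, which is the saving the comment is after.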