[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379513#comment-14379513 ]

Yu Ishikawa commented on SPARK-2429:

My implementation is affected by a bug in Breeze that occurs when a sparse vector is divided by a scalar value. We should upgrade Breeze to the next version. https://issues.apache.org/jira/browse/SPARK-6341

Hierarchical Implementation of KMeans
-------------------------------------

Key: SPARK-2429
URL: https://issues.apache.org/jira/browse/SPARK-2429
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
Labels: clustering
Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html

Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches:
* Top-down, recursive application of KMeans
* Reuse of the DecisionTree implementation with a different objective function
* Hierarchical SVD

It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
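The Breeze issue referenced above concerns dividing a sparse vector by a scalar. As a language-agnostic sketch (plain Python with a dict-backed sparse vector, not Breeze itself; the name `divide_sparse` is illustrative), this is the behavior such a division should have: only the stored non-zero entries change, since 0 / s == 0.

```python
def divide_sparse(vec, scalar):
    """Divide a sparse vector (index -> value dict) by a scalar.

    Only the stored (non-zero) entries need updating; implicit zeros
    stay zero, which is what makes the sparse fast path valid.
    """
    if scalar == 0:
        raise ZeroDivisionError("cannot divide a vector by zero")
    return {i: v / scalar for i, v in vec.items()}

v = {0: 2.0, 3: 4.0, 7: 6.0}   # sparse vector with 3 stored entries
print(divide_sparse(v, 2.0))    # {0: 1.0, 3: 2.0, 7: 3.0}
```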
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377840#comment-14377840 ]

Yu Ishikawa commented on SPARK-2429:

[~freeman-lab], [~mengxr], [~josephkb], [~rnowling] I'd like to divide this issue into three sub-issues in order to make the review more efficient. I think there are three independent parts to this issue. If you don't mind, I can send you a PR soon for the first sub-issue below.
1. This algorithm's implementation in the mllib package
2. Wrapper classes for the DataFrame API in the ml package
3. Examples and documentation

Thanks
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378877#comment-14378877 ]

Yu Ishikawa commented on SPARK-2429:

I got it. Thanks!
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378413#comment-14378413 ]

Joseph K. Bradley commented on SPARK-2429:
--

[~yuu.ishik...@gmail.com] That sounds good to me. Depending on how quickly things go, we may want to do (3) before (2). Thanks!
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362219#comment-14362219 ]

Yu Ishikawa commented on SPARK-2429:

I didn't change the algorithms very much; the new algorithm is a parallel-execution version of the old one. I can focus on implementing this, and I'd like to merge it into 1.4.0.
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362223#comment-14362223 ]

Yu Ishikawa commented on SPARK-2429:

Hi [~freeman-lab], thank you for commenting. I'll modify the new implementation for the new ML API. So I'll close the current PR and then send a new PR for the new implementation. Thanks
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357497#comment-14357497 ]

Jeremy Freeman commented on SPARK-2429:
---

Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I agree with [~josephkb] that it is worth bringing this into MLlib, as the algorithm itself will translate to future uses, and many groups (including ours!) will find it useful now. It might be worth adding it to spark-packages as well, especially if we expect the review to take a while; packages seem especially useful as a way to provide easy access for testing new pieces of functionality. But I'd probably prioritize just reviewing the patch. I also agree with the others that we should start a new PR with the new algorithm; 1000x faster is a lot! It is worth incorporating some of the comments from the old PR, if you haven't already and they are still relevant in the new version. I'd be happy to go through the new PR as I'm quite familiar with the problem and algorithm, but it would help if you could say a little more about what you did differently here, to guide me as I look at the code.
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357367#comment-14357367 ]

RJ Nowling commented on SPARK-2429:
---

Hi [~yuu.ishik...@gmail.com] I think the new implementation is great. Did you change the algorithm? I've spoken with [~srowen]. The hierarchical clustering would be valuable to the community -- I actually had a couple of people reach out to me about it. However, Spark is currently undergoing the transition to the new ML API, and as such, there is concern about accepting code into the older MLlib library. With the announcement of Spark packages, there is also a move to encourage external libraries instead of large commits into Spark itself. Would you be interested in publishing your hierarchical clustering implementation as an external library, as [~derrickburns] did for the [KMeans Mini Batch implementation|https://github.com/derrickburns/generalized-kmeans-clustering]? It could be listed in the [Spark packages index|http://spark-packages.org/] along with two other clustering packages so users can find it.
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357407#comment-14357407 ]

Joseph K. Bradley commented on SPARK-2429:
--

I'll try to prioritize it, though the next week or so will be difficult because of the Spark Summit. It will still be valuable to have your input since you're familiar with the PR. Thanks for both of your efforts on this!
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357389#comment-14357389 ]

RJ Nowling commented on SPARK-2429:
---

[~josephkb] I think it would be great to get the new implementation into Spark, but we need a champion for it. [~yuu.ishik...@gmail.com] did some great work, and I've been trying to shepherd the work, but we need a committer who wants to bring it in. If you want to do that, then I can step back and let you and [~yuu.ishik...@gmail.com] bring this across the finish line.
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357410#comment-14357410 ]

RJ Nowling commented on SPARK-2429:
---

I'm familiar with the community interest, but I'm not terribly familiar with the implementations (old or new). [~freeman-lab] may be the appropriate person to ask for help -- the original implementation was based on his gist.
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357380#comment-14357380 ]

Joseph K. Bradley commented on SPARK-2429:
--

But as far as the old vs. new algorithm, it sounds like it would make sense to replace the old one with the new one for this PR (though I have not yet had a chance to compare and understand them in detail).
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357377#comment-14357377 ]

Joseph K. Bradley commented on SPARK-2429:
--

I don't think we should discourage contributions to the current MLlib package. It's true we're experimenting with spark.ml and figuring out a better long-term API, but the main work in JIRAs/PRs like this one is in designing and implementing the algorithm itself, which we can easily copy over to the new API when the time comes. I'd vote for trying to get this into spark.mllib and then wrapping it for spark.ml when ready. (It'd be fine to do a Spark package too, but I hope this can get into MLlib itself.)
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356399#comment-14356399 ]

Yu Ishikawa commented on SPARK-2429:

[~rnowling] I apologize for the delay in replying. I'm still working on this, and I will reply to the feedback from Freeman and Owen ASAP. By the way, I have a question about the future course of action. I implemented another algorithm which is more scalable and 1000 times faster than the current one. Should we continue with the current PR, or replace it with the new implementation? https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark It is difficult to run the current one with an argument specifying a large number of clusters, such as 1, because the dividing processes are executed one by one. The dividing processes of the new one run in parallel, which is why it is more scalable and much faster than the current one. Thanks
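The structural difference described above can be sketched in plain Python (not the PR's Scala code; `split` stands in for a local 2-means call on one cluster's points, and all names are illustrative). Splitting every divisible cluster in the same pass needs on the order of log k passes over the data, while splitting one cluster at a time needs on the order of k passes; on Spark, each pass corresponds to a job over the cached RDD.

```python
def split(cluster):
    """Toy 2-way split: halve a list of points (stand-in for k-means, k=2)."""
    mid = len(cluster) // 2
    return [cluster[:mid], cluster[mid:]]

def divisive_parallel(points, k):
    """Split ALL divisible clusters in each pass (the new algorithm's shape).
    A real implementation would stop splitting once k clusters are reached."""
    clusters, passes = [points], 0
    while len(clusters) < k and any(len(c) > 1 for c in clusters):
        clusters = [part for c in clusters
                    for part in (split(c) if len(c) > 1 else [c])]
        passes += 1
    return clusters, passes

def divisive_sequential(points, k):
    """Split ONE cluster per pass (the old algorithm's shape)."""
    clusters, passes = [points], 0
    while len(clusters) < k:
        c = max(clusters, key=len)      # split the largest cluster next
        if len(c) < 2:
            break
        clusters.remove(c)
        clusters.extend(split(c))
        passes += 1
    return clusters, passes

pts = list(range(8))
_, p_par = divisive_parallel(pts, 8)
_, p_seq = divisive_sequential(pts, 8)
print(p_par, p_seq)   # 3 7
```

For k = 8, the parallel version needs 3 passes and the sequential version 7, and the gap widens as k grows, which is consistent with the scalability difference reported above.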
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343806#comment-14343806 ]

RJ Nowling commented on SPARK-2429:
---

[~yuu.ishik...@gmail.com] are you still working on this? Thanks!
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227371#comment-14227371 ]

Yu Ishikawa commented on SPARK-2429:

Hi [~rnowling], thank you for replying.
{quote}
I'm having trouble finding the function to cut a dendrogram – I see the tests but not the implementation.
{quote}
I'm very sorry. [Here|https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L390] is the implementation. The function is closely modeled on SciPy's dendrogram.
{quote}
If clusters at the same levels in the hierarchy do not overlap, you should be able to choose the closest cluster at each level until you find a leaf. I'm assuming that the children of a given cluster are contained within that cluster (spatially) – can you show this or find a reference for this? If so, then assignment should be faster for a larger number of clusters, as Jun was saying above.
{quote}
Exactly. I agree with you. I implemented the function for assignment as a recursive function that runs in O(log N) time. [https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L482] However, when I compared the assignment performance of my implementation against that of KMeans, this assignment implementation was slower. The result is [here|https://issues.apache.org/jira/browse/SPARK-2429?focusedCommentId=14190166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14190166]. The elapsed time for assignment (predict) is short to begin with; for example, the assignment time of KMeans in O(N) is 0.011 \[sec\], and that of my implementation in O(log N) is 0.307 \[sec\]. I'm very sorry if I misunderstood your question.

Thanks,
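The assignment strategy discussed above (descend the tree, choosing the closer child center at each node until a leaf is reached) can be sketched in plain Python. This is illustrative, not the PR's Scala code: `Node` and `assign` are hypothetical names, and 1-D centers are used for brevity.

```python
class Node:
    """A node of the cluster tree; leaves are final clusters."""
    def __init__(self, center, left=None, right=None):
        self.center = center                 # 1-D center for simplicity
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

def assign(node, x):
    """Greedy descent: O(depth) comparisons, i.e. O(log N) for a balanced
    tree of N leaves, versus O(N) center comparisons for flat KMeans."""
    while not node.is_leaf():
        if abs(x - node.left.center) <= abs(x - node.right.center):
            node = node.left
        else:
            node = node.right
    return node.center

tree = Node(5.0,
            Node(2.0, Node(1.0), Node(3.0)),
            Node(8.0, Node(7.0), Node(9.0)))
print(assign(tree, 6.5))   # 7.0
```

Note that, as the quoted discussion points out, greedy descent is only guaranteed to find the truly closest leaf when each child is spatially contained in its parent; otherwise it is an approximation. The asymptotic win also only shows up at large N, which is consistent with the small-N timings above where the flat scan was faster.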
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227352#comment-14227352 ]

Yu Ishikawa commented on SPARK-2429:

Hi [~yangjunpro],
{quote}
1. In your implementation, in each divisive step, there will be a copy operation to distribute the data nodes in the parent cluster tree to the split children cluster trees. When the document size is large, I think this copy cost is non-negligible, right?
{quote}
Exactly. The cached memory is twice the size of the original data. For example, if the data size is 10 GB, a Spark cluster always holds 20 GB of cached RDDs throughout the algorithm. The reason I cache the data nodes at each dividing step is elapsed time; that is, this algorithm is very slow without caching the data nodes.
{quote}
2. In your test code, the cluster size is not quite large (only about 100). Have you ever tested it with a big cluster size and a big document corpus? e.g., 1 clusters with 200 documents. What is the performance behavior facing this kind of use case?
{quote}
The test code deals with small data, as you said. I think the data size in unit tests should not be large, in order to keep test time short. Of course, I am willing to adapt this algorithm to fit the large input data and large cluster counts we want. Although I have never checked the performance of this implementation with large cluster counts, such as 1, the elapsed time could be long. I will check the performance under that condition. Or, if possible, could you check the performance?

Thanks
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213942#comment-14213942 ] Jun Yang commented on SPARK-2429: - Hi Yu Ishikawa, Thanks for your wonderful hierarchical implementation of KMeans, which meets one of our project requirements :) In our project we initially used an MPI-based HAC implementation to do bottom-up agglomerative hierarchical clustering, and since we want to migrate the entire back-end pipeline to Spark, we are looking for a comparable hierarchical clustering implementation on Spark; otherwise we need to write one ourselves. From a functionality perspective, your implementation looks pretty good (I have already read through the code), but I still have several questions regarding performance and scalability: 1. In your implementation, in each divisive step there is a copy operation to distribute the data points from the parent cluster tree to the split child cluster trees. When the document size is large, this copy cost is non-negligible, right? A potential optimization is to keep the entire document data cached and, in each divisive step, record only the indices of the documents in the ClusterTree object, so the cost could be lowered quite a lot. Does this idea make sense? 2. In your test code, the cluster size is not very large (only about 100). Have you tested it with a big cluster size and a big document corpus, e.g., 1 clusters with 200 documents? What is the performance behavior in that kind of use case? In production environments this use case is typical. Looking forward to your reply.
Thanks
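The index-based optimization Jun suggests can be sketched in plain Python (a hypothetical `ClusterTreeNode` class, not the actual Spark/Scala code): the dataset is held once, and each split copies only integer indices rather than feature vectors.

```python
# Hypothetical sketch of an index-based cluster tree (not the Spark code):
# nodes hold indices into one shared dataset, so splitting a node copies
# small index lists instead of the feature vectors themselves.

class ClusterTreeNode:
    def __init__(self, indices):
        self.indices = indices        # positions of this node's points
        self.children = []

    def split(self, data, side):
        """Partition this node's points into two children.
        `side` maps a point to 0 or 1; only indices are copied."""
        left, right = [], []
        for i in self.indices:
            (left if side(data[i]) == 0 else right).append(i)
        self.children = [ClusterTreeNode(left), ClusterTreeNode(right)]
        return self.children

data = [[0.0], [0.1], [5.0], [5.1]]   # cached once, never duplicated
root = ClusterTreeNode(list(range(len(data))))
left, right = root.split(data, lambda p: 0 if p[0] < 2.5 else 1)
```

Under this scheme, 10 GB of cached vectors would stay a single copy; each level of the tree adds only index lists.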
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213945#comment-14213945 ] RJ Nowling commented on SPARK-2429: --- Hi Yu, I'm having trouble finding the function to cut a dendrogram -- I see the tests but not the implementation. I feel that you should be able to assign values in O(log N) time with the hierarchical method vs O(N) with standard kmeans. So, say you train a model (this may be slower than kmeans), then assign additional points to clusters after training. If clusters at the same level in the hierarchy do not overlap, you should be able to choose the closest cluster at each level until you find a leaf. I'm assuming that the children of a given cluster are contained within that cluster (spatially) -- can you show this or find a reference for it? If so, then assignment should be faster for a larger number of clusters, as Jun was saying above. Do you agree with this? Or is there something I am misunderstanding? Thanks!
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207463#comment-14207463 ] Yu Ishikawa commented on SPARK-2429: [~rnowling] Sorry for commenting again. Could you tell me what you think about the new function to cut a dendrogram? (For example: you might think we don't need the new function, or that we should establish an advantage over KMeans from another point of view.) 1. This algorithm doesn't have an advantage over KMeans in assignment elapsed time. 2. The new function generates another model by cutting a dendrogram at a given height, without re-training with different parameters. Thanks,
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201990#comment-14201990 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], I have a suggestion about a new function. I think it is difficult for this algorithm to have an advantage in computational complexity. So I implemented a function to cut a cluster tree, i.e. the clustering result, at a given height. This function restructures the cluster tree without changing the original one. We can control the number of clusters in a cluster tree by height, without recomputation. This is an advantage over KMeans and other clustering algorithms. You can see a test at the URL below. [https://github.com/yu-iskw/spark/blob/8355f959f02ca67454c9cb070912480db0a44671/mllib/src/test/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModelSuite.scala#L116]
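The cut-by-height behavior described above can be illustrated in plain Python (a hypothetical `Node` class, not the Spark implementation): cutting returns the subtree roots formed at or below the given height, and leaves the original tree untouched.

```python
# Illustrative sketch: cut a dendrogram at a given height without
# modifying the tree. Nodes formed at or below the cut height become
# the returned cluster roots.

class Node:
    def __init__(self, height, children=()):
        self.height = height          # height at which this node was formed
        self.children = list(children)

def cut(node, height):
    """Return the cluster roots obtained by cutting the tree at `height`."""
    if not node.children or node.height <= height:
        return [node]
    clusters = []
    for child in node.children:
        clusters.extend(cut(child, height))
    return clusters

# a small dendrogram: four leaves, merges at heights 1.0, 2.0, and 3.0
a, b, c, d = Node(0), Node(0), Node(0), Node(0)
root = Node(3.0, [Node(1.0, [a, b]), Node(2.0, [c, d])])
```

Lowering the cut height yields more, finer clusters from the same trained model, which is the "no recomputation" advantage described above.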
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190166#comment-14190166 ] Yu Ishikawa commented on SPARK-2429: I compared the training and predicting elapsed times of the hierarchical clustering with those of kmeans. In theory, the computational complexity of hierarchical clustering assignment is smaller than that of kmeans. However, both the training time and the predicting time of the hierarchical clustering are slower than those of kmeans. I used the program at the URL below for this experiment. https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/37488e306d583d0e1743bff432165e8c1bf4465e/src/main/scala/CompareWithKMeansApp.scala
{noformat}
{maxCores : 160, numClusters : 50, dimension : 500, rows : 100, numPartitions : 160}
KMeans Training Elappsed Time: 28.179 [sec]
KMeans Predicting Elappsed Time: 0.011 [sec]
Hierarchical Training Elappsed Time: 46.539 [sec]
Hierarchical Predicting Elappsed Time: 0.3076923076923077 [sec]

{maxCores : 160, numClusters : 50, dimension : 500, rows : 500, numPartitions : 160}
KMeans Training Elappsed Time: 55.187 [sec]
KMeans Predicting Elappsed Time: 0.008 [sec]
Hierarchical Training Elappsed Time: 210.238 [sec]
Hierarchical Predicting Elappsed Time: 0.3906093906093906 [sec]
{noformat}
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188325#comment-14188325 ] RJ Nowling commented on SPARK-2429: --- The sparsity tests look good. Have you compared training and assignment time to KMeans yet? An improvement in assignment time will be important. Also, I don't see a breakdown of the total time by splitting clusters, assignments, etc. It doesn't need to be for every combination of parameters -- just one or two. That would be very helpful. Thanks!
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181167#comment-14181167 ] Apache Spark commented on SPARK-2429: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/2906
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181169#comment-14181169 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], I sent the PR. Could you review it? I added logging to show the execution time of each step. And I reduced the computational complexity of assignment by changing the way the closest cluster is found. (https://github.com/apache/spark/pull/2906/files#diff-ab0740a588f75cada28bd0772820e2dcR409) Thanks
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181183#comment-14181183 ] RJ Nowling commented on SPARK-2429: --- I added a couple of comments to the PR. I would say stick with Euclidean distance for now. For assignment, you should be able to do a binary search. E.g., if a center has children, which of the two children is the point closer to? Choose that center. Repeat until you hit a leaf (a cluster with no children). I saw that you added logging for timing, but can you update your report with the timing breakdown for each stage? Thanks!
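The binary-search assignment RJ describes can be sketched in plain Python (hypothetical classes, not the Spark API): descend from the root, at each node choosing the closer child center, until reaching a leaf. For a balanced tree over k leaf clusters this costs O(log k) distance computations instead of the O(k) of flat k-means.

```python
import math

# Illustrative sketch of top-down assignment: walk the center tree,
# always stepping to the child whose center is closest to the point.

class Center:
    def __init__(self, point, children=(), label=None):
        self.point = point
        self.children = list(children)
        self.label = label            # cluster id, set on leaves

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(root, point):
    node = root
    while node.children:
        node = min(node.children, key=lambda c: dist(c.point, point))
    return node.label

leaves = [Center([0.0], label=0), Center([1.0], label=1),
          Center([10.0], label=2), Center([11.0], label=3)]
left = Center([0.5], leaves[:2])
right = Center([10.5], leaves[2:])
root = Center([5.5], [left, right])
```

Note this matches the closest leaf only under RJ's assumption above that children are spatially contained in their parent; otherwise the greedy descent can pick a suboptimal branch.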
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179890#comment-14179890 ] RJ Nowling commented on SPARK-2429: --- A 6x performance improvement is a great improvement! Can you add a breakdown of the timings for each part of the algorithm (e.g., like you did to find out which parts were slowest)? You don't need to do a sweep over multiple data sizes or numbers of data points -- just pick a representative number of data points and rows. Have you compared the performance of the hierarchical KMeans vs the KMeans implemented in MLlib? I expect that the hierarchical version will be slower to cluster, but the assignment should be faster (O(log k) vs O(k)). This improvement in assignment speed is the motivation for including hierarchical KMeans in Spark. Thanks!
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179974#comment-14179974 ] Yu Ishikawa commented on SPARK-2429: {quote} Can you add a breakdown of the timings for each part of the algorithm (e.g., like you did to find out which parts were slowest)? You don't need to do a sweep over multiple data sizes or numbers of data points -- just pick a representative number of data points and rows. {quote} Yes, I can. But I have only embedded debug code in each part of the algorithm; I don't support logging messages in the algorithm yet. So I will be able to log the execution time of each part. Or would you like each part's time exposed as a variable, like `HierarchicalClusteringModel.trainTime`? {quote} Have you compared the performance of the hierarchical KMeans vs the KMeans implemented in MLlib? I expect that the hierarchical version will be slower to cluster, but the assignment should be faster (O(log k) vs O(k)). This improvement in assignment speed is the motivation for including hierarchical KMeans in Spark. {quote} No, I haven't yet. I agree that the hierarchical clustering will be slower than k-means. Currently, the assignment speed of the algorithm in `HierarchicalClusteringModel.predict` is only as fast as that of k-means. We should improve the `ClusterTree` data structure, like a B-tree, in order to search by its index. Do you have any good ideas for improving it? There is something I want to discuss with you. I know we should support distance metrics other than Euclidean, such as cosine distance, and I want to. However, I don't know the best way to support distance functions in MLlib as a common library yet. So I have a suggestion about the release of the algorithm: at first we merge it with only the Euclidean distance metric, and then we support other distance metrics in another issue. What do you think about it?
Thanks
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172486#comment-14172486 ] RJ Nowling commented on SPARK-2429: --- Great to know! I'm glad that isn't a bottleneck. Have you been able to benchmark each of the major steps? Which steps are most expensive?
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166593#comment-14166593 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], Thank you for your comments and advice. {quote} Ok, first off, let me make sure I understand what you're doing. You start with 2 centers. You assign all the points. You then apply KMeans recursively to each cluster, splitting each center into 2 centers. Each instance of KMeans stops when the error is below a certain value or a fixed number of iterations have been run. {quote} You are right. The algorithm runs as you said. {quote} I think your analysis of the overall run time is good and probably what we expect. Can you break down the timing to see which parts are the most expensive? Maybe we can figure out where to optimize it. {quote} OK. I will measure the execution time of each part of the implementation. {quote} 1. It might be good to convert everything to Breeze vectors before you do any operations -- otherwise you convert the same vectors over and over again. KMeans converts them at the beginning and converts the vectors for the centers back at the end. {quote} I agree with you. I am troubled by this problem. After training the model, the user will want to select the data in a cluster, which is part of the whole input data. I think there are three approaches to realize this: # We extract the centers and their `RDD[Vector]` data in each cluster during training, as in my current implementation. # We extract the centers and their `RDD[BV[Double]]` data, and then convert the data into `RDD[Vector]` at the end. The conversion from Breeze vectors to Spark vectors is very slow; that's why we didn't implement it this way. # We only extract the centers during training, not their data, and then apply the trained model to the input data with the `predict` method, like scikit-learn, in order to extract the part of the data in each cluster. This seems good, but we would have to keep the `RDD[BV[Double]]` data of each cluster throughout the clustering. Because we extract the `RDD[Vector]` data of each cluster after the training, I am worried that keeping the `RDD[BV[Double]]` data throughout the clustering is wasteful. And I am unsure how to elegantly save the data in progress during the clustering. {quote} 2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, look into using a broadcast variable. See the latest KMeans implementation. This could improve performance by 10%+. 3. You may want to look into using reduceByKey or similar RDD operations -- they will enable parallel reductions, which will be faster than a loop on the master. {quote} I will give it a try. Thanks!
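The reduceByKey pattern in point 3 can be illustrated in plain Python (a stand-in for the RDD version; function names here are hypothetical): map each point to a (cluster, (vector_sum, count)) pair, reduce per key, then take per-cluster means as the new centers.

```python
from collections import defaultdict

# Illustrative sketch of a map/reduce-style k-means center update:
# the per-key accumulation below is what reduceByKey would do in
# parallel across partitions instead of in a loop on the master.

def closest(centers, p):
    return min(range(len(centers)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(centers[i], p)))

def update_centers(points, centers):
    # "map": emit (cluster, (sum, count)); "reduceByKey": combine per key
    acc = defaultdict(lambda: ([0.0] * len(centers[0]), 0))
    for p in points:
        k = closest(centers, p)
        s, n = acc[k]
        acc[k] = ([a + b for a, b in zip(s, p)], n + 1)
    # new center = per-cluster mean
    return {k: [v / n for v in s] for k, (s, n) in acc.items()}

points = [[0.0], [1.0], [9.0], [10.0]]
new = update_centers(points, [[0.0], [10.0]])
```

In the RDD version, the sums and counts are combined associatively per partition, which is why the reduction parallelizes cleanly.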
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165480#comment-14165480 ] RJ Nowling commented on SPARK-2429: --- Great work, Yu! Ok, first off, let me make sure I understand what you're doing. You start with 2 centers. You assign all the points. You then apply KMeans recursively to each cluster, splitting each center into 2 centers. Each instance of KMeans stops when the error is below a certain value or a fixed number of iterations have been run. I think your analysis of the overall run time is good and probably what we expect. Can you break down the timing to see which parts are the most expensive? Maybe we can figure out where to optimize it. A few thoughts on optimization: 1. It might be good to convert everything to Breeze vectors before you do any operations -- otherwise you convert the same vectors over and over again. KMeans converts them at the beginning and converts the vectors for the centers back at the end. 2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, look into using a broadcast variable. See the latest KMeans implementation. This could improve performance by 10%+. 3. You may want to look into using reduceByKey or similar RDD operations -- they will enable parallel reductions, which will be faster than a loop on the master. If you look at the JIRAs and PRs, there is some recent work to speed up KMeans -- maybe some of it is applicable? I'll probably have more questions -- it's a good way of helping me understand what you're doing :)
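The recursive scheme summarized above can be sketched in plain Python (not the actual implementation; a fixed recursion depth stands in for the error-threshold stopping rule, and all names are hypothetical): run 2-means on a cluster's points, then recurse on each half.

```python
import random

# Illustrative sketch of divisive (bisecting) k-means: split a cluster
# into 2 with plain Lloyd iterations, then recurse on each half.

def two_means(points, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            d = [sum((c - x) ** 2 for c, x in zip(ctr, p)) for ctr in centers]
            halves[d.index(min(d))].append(p)
        # recompute each center as the mean of its half (keep old if empty)
        centers = [
            [sum(col) / len(h) for col in zip(*h)] if h else centers[i]
            for i, h in enumerate(halves)
        ]
    return halves

def bisect(points, depth):
    if depth == 0 or len(points) < 2:
        return [points]
    left, right = two_means(points)
    if not left or not right:          # degenerate split: stop recursing
        return [points]
    return bisect(left, depth - 1) + bisect(right, depth - 1)

points = [[0.0], [0.2], [5.0], [5.2], [20.0], [20.2], [25.0], [25.2]]
clusters = bisect(points, 2)           # up to 2^2 = 4 leaf clusters
```

Each recursive call sees only its cluster's points, which is where the copy-vs-index question discussed earlier in the thread comes from.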
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136671#comment-14136671 ] RJ Nowling commented on SPARK-2429: --- Great! I look forward to seeing your implementation. :)
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1411#comment-1411 ] RJ Nowling commented on SPARK-2429: --- Discussion on the dev list mentioned a community preference for implementing KMeans recursively (a divisive approach). Jeremy Freeman provided an example here: https://gist.github.com/freeman-lab/5947e7c53b368fe90371 The example needs to be optimized but provides a good starting point. For example, every time KMeans is called, the data is converted to Breeze vectors. Here are two papers on divisive KMeans: "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray" (2005) by Chen et al., and "Divisive Hierarchical K-Means" (2006) by Lamrous et al.