[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-25 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379513#comment-14379513
 ] 

Yu Ishikawa commented on SPARK-2429:


My implementation is affected by a bug in Breeze that occurs when a sparse 
vector is divided by a scalar value. We should upgrade Breeze to the next version.

https://issues.apache.org/jira/browse/SPARK-6341
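As a hedged illustration of the operation in question (plain Python, not Breeze itself; the `SparseVec` class is invented for this sketch): dividing a sparse vector by a scalar should only rescale the stored entries, leaving the implicit zeros untouched.

```python
# Illustrative sketch only -- not Breeze's actual SparseVector type.
# Division by a scalar touches only the stored (non-zero) entries;
# a bug in this path is the kind of issue the comment above refers to.
class SparseVec:
    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def __truediv__(self, s):
        # Only the stored values are rescaled; zeros stay zero.
        return SparseVec(self.size, self.indices, [v / s for v in self.values])

    def to_dense(self):
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense
```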

 Hierarchical Implementation of KMeans
 -------------------------------------

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean, 
 such as negative dot product or cosine distance, is necessary.
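The first approach above (top down, recursive application of KMeans) can be sketched in a few lines. This is an illustrative, in-memory sketch in plain Python, not the MLlib implementation; all function names are invented here.

```python
# Top-down (bisecting) KMeans sketch: repeatedly 2-means-split the largest
# leaf cluster until k leaves exist. Plain Python, no Spark.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))

def split(points, iters=10):
    """One 2-means split: seed with a point and its farthest neighbor,
    then run a few Lloyd iterations."""
    c0 = points[0]
    c1 = max(points, key=lambda p: dist(p, c0))
    for _ in range(iters):
        a = [p for p in points if dist(p, c0) <= dist(p, c1)]
        b = [p for p in points if dist(p, c0) > dist(p, c1)]
        if a:
            c0 = mean(a)
        if b:
            c1 = mean(b)
    a = [p for p in points if dist(p, c0) <= dist(p, c1)]
    b = [p for p in points if dist(p, c0) > dist(p, c1)]
    return a, b

def bisecting_kmeans(points, k):
    """Recursively bisect the largest leaf cluster until k clusters exist."""
    leaves = [list(points)]
    while len(leaves) < k:
        target = max(leaves, key=len)
        a, b = split(target)
        if not a or not b:  # cannot split this cluster further
            break
        leaves.remove(target)
        leaves += [a, b]
    return leaves
```

Swapping `dist` for a cosine or negative-dot-product metric is what the last point in the list above asks for.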



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-24 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377840#comment-14377840
 ] 

Yu Ishikawa commented on SPARK-2429:


[~freeman-lab], [~mengxr], [~josephkb], [~rnowling]

I'd like to divide this issue into three sub-issues to make the review more 
efficient; I think there are three independent parts. If you don't mind, I can 
send you a PR soon for the first sub-issue below.

1. The algorithm implementation in the mllib package
2. Wrapper classes for the DataFrame API in the ml package
3. Examples and documentation

Thanks




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-24 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378877#comment-14378877
 ] 

Yu Ishikawa commented on SPARK-2429:


I got it. Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378413#comment-14378413
 ] 

Joseph K. Bradley commented on SPARK-2429:
--

[~yuu.ishik...@gmail.com]  That sounds good to me.  Depending on how quickly 
things go, we may want to do (3) before (2).  Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-14 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362219#comment-14362219
 ] 

Yu Ishikawa commented on SPARK-2429:


I didn't change the algorithm very much; the new algorithm is a parallelized 
version of the old one. I can focus on implementing this, and I'd like to get 
it merged into 1.4.0.




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-14 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362223#comment-14362223
 ] 

Yu Ishikawa commented on SPARK-2429:


Hi [~freeman-lab], 
Thank you for commenting. I'll adapt the new implementation to the new ML API, 
so I'll close the current PR and then send a new PR for the new 
implementation. Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman commented on SPARK-2429:
---

Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding it to spark-packages, especially if we expect the 
review to take a while. Packages seem especially useful as a way to give users 
easy access to new functionality for testing. But I'd probably prioritize just 
reviewing the patch.

Also agree with the others that we should start a new PR with the new 
algorithm; 1000x faster is a lot! It's worth incorporating the relevant 
comments from the old PR into the new version, if you haven't already.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357367#comment-14357367
 ] 

RJ Nowling commented on SPARK-2429:
---

Hi [~yuu.ishik...@gmail.com]

I think the new implementation is great.  Did you change the algorithm?

I've spoken with [~srowen].  The hierarchical clustering would be valuable to 
the community -- I actually had a couple people reach out to me about it. 
However, Spark is currently undergoing the transition to the new ML API and as 
such, there is concern about accepting code into the older MLlib library.  With 
the announcement of Spark packages, there is also a move to encourage external 
libraries instead of large commits into Spark itself.

Would you be interested in publishing your hierarchical clustering 
implementation as an external library like [~derrickburns] did for the [KMeans 
Mini Batch 
implementation|https://github.com/derrickburns/generalized-kmeans-clustering]?  
 It could be listed in the [Spark packages index|http://spark-packages.org/] 
along with two other clustering packages so users can find it.




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357407#comment-14357407
 ] 

Joseph K. Bradley commented on SPARK-2429:
--

I'll try to prioritize it, though the next week or so will be difficult because 
of the Spark Summit.  It will be valuable to have your input still since you're 
familiar with the PR.  Thanks for both of your efforts on this!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357389#comment-14357389
 ] 

RJ Nowling commented on SPARK-2429:
---

[~josephkb]  I think it would be great to get the new implementation into Spark 
but we need a champion for it.  [~yuu.ishik...@gmail.com] did some great work, 
and I've been trying to shepherd the work, but we need a committer who wants to 
bring it in.  If you want to do that, then I can step back and let you and 
[~yuu.ishik...@gmail.com] bring this across the finish line.




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357410#comment-14357410
 ] 

RJ Nowling commented on SPARK-2429:
---

I'm familiar with the community interest but I'm not terribly familiar with the 
implementations (old or new).  [~freeman-lab] may be the appropriate person to 
ask for help -- the original implementation was based on his gist.




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357380#comment-14357380
 ] 

Joseph K. Bradley commented on SPARK-2429:
--

As for the old vs. new algorithm, it sounds like it would make sense to 
replace the old one with the new one in this PR (though I have not yet had a 
chance to compare and understand them in detail).




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357377#comment-14357377
 ] 

Joseph K. Bradley commented on SPARK-2429:
--

I don't think we should discourage contributions to the current MLlib package.  
It's true we're experimenting with spark.ml and figuring out a better long-term 
API, but the main work in JIRAs/PRs like this one is in designing and 
implementing the algorithm itself, which we can easily copy over to the new API 
when the time comes.  I'd vote for trying to get this into spark.mllib and then 
wrapping it for spark.ml when ready.  (It'd be fine to do a Spark package too, 
but I hope this can get into MLlib itself.)




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356399#comment-14356399
 ] 

Yu Ishikawa commented on SPARK-2429:


[~rnowling]

I apologize for the delay in replying. I'm still working on this.
I will reply to the feedback from Freeman and Owen ASAP.

By the way, I have a question about the future course of action.
I implemented another algorithm that is more scalable and 1000 times faster 
than the current one.
Should we continue with the current PR, or replace it with the new implementation?

https://github.com/yu-iskw/more-scalable-hierarchical-clustering-with-spark

It is difficult to run the current implementation with a large number of 
clusters, because the dividing steps are executed one by one. In the new 
implementation the dividing steps run in parallel, which makes it more 
scalable and much faster.

Thanks




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-02 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343806#comment-14343806
 ] 

RJ Nowling commented on SPARK-2429:
---

[~yuu.ishik...@gmail.com] are you still working on this?  Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-27 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227371#comment-14227371
 ] 

Yu Ishikawa commented on SPARK-2429:


Hi [~rnowling], 

Thank you for replying.

{quote}
I'm having trouble finding the function to cut a dendrogram – I see the tests 
but not the implementation.
{quote}

I'm very sorry. 
[Here|https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L390]
 is the implementation. I think the function is heavily specialized toward 
SciPy's dendrogram format.
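For context, a hedged sketch of what "cutting" a dendrogram means (plain Python, modeled loosely on SciPy's behavior; the `Merge`/`DLeaf` classes are invented here, not the PR's actual types): keep every subtree whose merge height is at or below the cut height, and each surviving subtree becomes one flat cluster.

```python
# Illustrative dendrogram cut. A dendrogram is a binary tree whose internal
# nodes record the height at which two sub-clusters were merged.
class DLeaf:
    height = 0.0
    def __init__(self, point_id):
        self.point_id = point_id

class Merge:
    def __init__(self, height, left, right):
        self.height, self.left, self.right = height, left, right

def cut(tree, h):
    """Return the roots of the flat clusters induced by cutting at height h."""
    if isinstance(tree, Merge) and tree.height > h:
        # This merge happened above the cut line: split it open.
        return cut(tree.left, h) + cut(tree.right, h)
    return [tree]
```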

{quote}
If clusters at the same levels in the hierarchy do not overlap, you should be 
able to choose the closest cluster at each level until you find a leaf. I'm 
assuming that the children of a given cluster are contained within that cluster 
(spacially) – can you show this or find a reference for this? If so, then 
assignment should be faster for a larger number of clusters as Jun was saying 
above.
{quote}

Exactly. I agree with you. 

I implemented assignment as a recursive function that runs in O(log N) 
time.
[https://github.com/yu-iskw/spark/blob/hierarchical/mllib%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fmllib%2Fclustering%2FHierarchicalClustering.scala#L482]

I compared the assignment performance of my implementation with that of 
KMeans, and the elapsed time of this assignment implementation is slower than 
that of KMeans. The result is 
[here|https://issues.apache.org/jira/browse/SPARK-2429?focusedCommentId=14190166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14190166].
The assignment (predict) time is short to begin with: for example, the O(N) 
assignment of KMeans takes 0.011 \[sec\], while the O(log N) assignment of my 
implementation takes 0.307 \[sec\].
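The O(log N) assignment can be sketched as a walk down a binary cluster tree, descending at each node into the child whose center is closer (a plain-Python sketch with invented `Leaf`/`Branch` types, not the linked Scala implementation):

```python
# Tree-descent assignment: a balanced tree over N leaf clusters needs only
# O(log N) distance comparisons per point, versus the O(N) scan of flat KMeans.
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class Leaf:
    def __init__(self, center, cluster_id):
        self.center, self.cluster_id = center, cluster_id

class Branch:
    def __init__(self, center, left, right):
        self.center, self.left, self.right = center, left, right

def assign(node, point):
    """Walk down the tree, always taking the closer child, until a leaf."""
    while isinstance(node, Branch):
        if sqdist(node.left.center, point) <= sqdist(node.right.center, point):
            node = node.left
        else:
            node = node.right
    return node.cluster_id
```

This also suggests why the benchmark above comes out the way it does: for small cluster counts, both approaches do only a handful of distance computations per point, so constant factors (tree traversal overhead) can make the O(log N) version slower in practice.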

I'm very sorry if I misunderstood your question.
Thanks,




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-26 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227352#comment-14227352
 ] 

Yu Ishikawa commented on SPARK-2429:


Hi [~yangjunpro], 

{quote}
1. In your implementation, in each divisive step, there will be a copy 
operation to distribute the data nodes from the parent cluster tree to the 
split children cluster trees; when the document size is large, I think this 
copy cost is non-negligible, right?
{quote}

Exactly. The cached data is twice the size of the original data. For example, 
if the data size is 10 GB, the Spark cluster always holds 20 GB of cached RDDs 
throughout the algorithm. The reason I cache the data nodes at each dividing 
step is elapsed time: this algorithm is very slow without caching the data 
nodes.

{quote}
2. In your test code, the cluster size is not very large (only about 100). 
Have you ever tested it with a big cluster count and a big document corpus, 
e.g., 1 clusters with 200 documents? What is the performance behavior in 
this kind of use case?
{quote}

The test code deals with small data, as you said. I think the data size in 
unit tests should not be large, in order to keep the test time down. Of 
course, I am willing to make this algorithm fit the large input data and large 
cluster counts we want. Although I have never checked the performance of this 
implementation with large cluster counts, such as 1, the elapsed time could be 
long. I will check the performance under that condition. Or, if possible, 
could you check the performance?

thanks





[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-16 Thread Jun Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213942#comment-14213942
 ] 

Jun Yang commented on SPARK-2429:
-

Hi Yu Ishikawa 

Thanks for your wonderful hierarchical implementation of KMeans, which just 
meets one of our project requirements :)

In our project, we initially used an MPI-based HAC implementation to do 
agglomerative bottom-up hierarchical clustering, and since we want to migrate 
the entire back-end pipeline to Spark, we looked for a similar hierarchical 
clustering implementation on Spark; otherwise we would need to write one 
ourselves.

From a functionality perspective, your implementation looks pretty good (I 
have already read through your code), but I still have several questions 
regarding performance and scalability:
1. In your implementation, each divisive step copies the data nodes of the 
parent cluster tree into the split child cluster trees. When the document size 
is large, I think this copy cost is non-negligible, right?
A potential optimization is to keep the entire document data cached and, in 
each divisive step, record only the indices of the documents in the 
ClusterTree object, so the cost could be lowered quite a lot.
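The index-based idea could be sketched roughly as below, in plain Scala with an illustrative `IdxClusterTree` type (not the PR's actual ClusterTree class): the documents live in one shared array, and each tree node records only the indices it owns, so a split never copies document payloads.

```scala
// Hypothetical sketch: nodes hold indices into a shared document array.
case class IdxClusterTree(
    indices: Vector[Int],                       // documents owned by this node
    children: List[IdxClusterTree] = Nil)

// Splitting assigns each index to one of two children; the document
// vectors themselves are never copied.
def split(node: IdxClusterTree, docs: Array[Array[Double]],
          assign: Array[Double] => Int): IdxClusterTree = {
  val (left, right) = node.indices.partition(i => assign(docs(i)) == 0)
  node.copy(children = List(IdxClusterTree(left), IdxClusterTree(right)))
}

val docs = Array(Array(0.0), Array(1.0), Array(10.0), Array(11.0))
val root = IdxClusterTree(Vector(0, 1, 2, 3))
// toy assignment: values below 5 go to child 0, the rest to child 1
val splitRoot = split(root, docs, v => if (v(0) < 5.0) 0 else 1)
```

On Spark the shared array would instead be a cached RDD, with the same principle: the split step shuffles small index sets, not full vectors.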

Does this idea make sense?

2. In your test code, the cluster size is not very large (only about 100). 
Have you ever tested it with a big cluster count and a big document corpus, 
e.g., 1 clusters with 200 documents? What is the performance behavior in 
this kind of use case?
Since in production environments, this use case is typical.

Look forward to your reply. 

Thanks 





[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-16 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213945#comment-14213945
 ] 

RJ Nowling commented on SPARK-2429:
---

Hi Yu,

I'm having trouble finding the function to cut a dendrogram -- I see the tests 
but not the implementation.

I feel that you should be able to assign values in O(log N) time with the 
hierarchical method vs O(N) with the standard kmeans.  So, say you train a 
model (this may be slower than kmeans) then assign additional points to 
clusters after training.  If clusters at the same levels in the hierarchy do 
not overlap, you should be able to choose the closest cluster at each level 
until you find a leaf.  I'm assuming that the children of a given cluster are 
contained within that cluster (spatially) -- can you show this or find a 
reference for this?  If so, then assignment should be faster for a larger 
number of clusters as Jun was saying above.
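The descent described above can be sketched as follows, assuming spatial containment holds; the `ClusterNode` type is illustrative, not the actual class in the PR:

```scala
// Hypothetical cluster-tree node with a center and optional children.
case class ClusterNode(center: Array[Double],
                       children: List[ClusterNode] = Nil)

def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Descend the tree: at each level pick the closest child until a leaf,
// giving O(log k) comparisons instead of O(k) over all leaves.
def assign(node: ClusterNode, point: Array[Double]): ClusterNode =
  if (node.children.isEmpty) node
  else assign(node.children.minBy(c => sqDist(c.center, point)), point)

val leafA = ClusterNode(Array(0.0, 0.0))
val leafB = ClusterNode(Array(10.0, 10.0))
val root  = ClusterNode(Array(5.0, 5.0), List(leafA, leafB))
val hit   = assign(root, Array(1.0, 1.0))
```

If containment does not hold, this greedy descent can pick a suboptimal leaf, which is why the containment proof or reference matters.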

Do you agree with this?  Or is there something I am misunderstanding?

Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-11 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14207463#comment-14207463
 ] 

Yu Ishikawa commented on SPARK-2429:


[~rnowling] Sorry for commenting again. Could you tell me what you think about 
the new function to cut a dendrogram?
(For example, you might think we don't need the new function, or that we 
should find an advantage over KMeans from another point of view.)

1. This algorithm has no advantage over KMeans in assignment elapsed time.
2. The new function generates another model by cutting a dendrogram at a given 
height, without re-training with other parameters.

Thanks,




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-11-07 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201990#comment-14201990
 ] 

Yu Ishikawa commented on SPARK-2429:


Hi [~rnowling], 

I have a suggestion about a new function. I think it is difficult for this 
algorithm to have an advantage in computational complexity. So I implemented a 
function to cut a cluster tree, the result of clustering, at a given height. 
This function restructures the cluster tree without changing the original one. 
We can control the number of clusters in a cluster tree by height, without 
recomputation. This is an advantage over KMeans and other clustering 
algorithms.

You can see test code at the URL below.
[https://github.com/yu-iskw/spark/blob/8355f959f02ca67454c9cb070912480db0a44671/mllib/src/test/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModelSuite.scala#L116]
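A height cut of this kind can be sketched in a few lines of plain Scala; the `Dendrogram` type here is illustrative, not the actual model class. Each node stores the height at which it was split, and cutting at height h keeps the shallowest nodes at or below h:

```scala
// Hypothetical dendrogram node: `height` is the node's split height.
case class Dendrogram(height: Double,
                      children: List[Dendrogram] = Nil)

// Cut at height h: keep a node if it sits at or below h (or is a leaf),
// otherwise recurse into its children. The original tree is untouched.
def cut(node: Dendrogram, h: Double): List[Dendrogram] =
  if (node.height <= h || node.children.isEmpty) List(node)
  else node.children.flatMap(cut(_, h))

val tree = Dendrogram(2.0, List(
  Dendrogram(1.0, List(Dendrogram(0.0), Dendrogram(0.0))),
  Dendrogram(0.5, List(Dendrogram(0.0), Dendrogram(0.0)))))

val twoClusters  = cut(tree, 1.5)   // the two subtrees under the root
val fourClusters = cut(tree, 0.2)   // all four leaves
```

Because the cut only selects existing subtrees, the number of clusters can be changed repeatedly without recomputation, which is the advantage described above.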




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-30 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14190166#comment-14190166
 ] 

Yu Ishikawa commented on SPARK-2429:


I compared the training and predicting elapsed times of the hierarchical 
clustering with those of kmeans.
In theory, the computational complexity of hierarchical clustering assignment 
is smaller than that of kmeans.
However, both the training time and the predicting time of the hierarchical 
clustering are slower than those of kmeans.

I used the program at the URL below for this experiment.
https://github.com/yu-iskw/hierarchical-clustering-with-spark/blob/37488e306d583d0e1743bff432165e8c1bf4465e/src/main/scala/CompareWithKMeansApp.scala

{noformat}
{maxCores : 160, numClusters : 50, dimension : 500, rows : 
100, numPartitions : 160}
KMeans Training Elappsed Time: 28.179 [sec]
KMeans Predicting Elappsed Time: 0.011 [sec]
Hierarchical Training Elappsed Time: 46.539 [sec]
Hierarchical Predicting Elappsed Time: 0.3076923076923077 [sec]

{maxCores : 160, numClusters : 50, dimension : 500, rows : 
500, numPartitions : 160}
KMeans Training Elappsed Time: 55.187 [sec]
KMeans Predicting Elappsed Time: 0.008 [sec]
Hierarchical Training Elappsed Time: 210.238 [sec]
Hierarchical Predicting Elappsed Time: 0.3906093906093906 [sec]
{noformat}





[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-29 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14188325#comment-14188325
 ] 

RJ Nowling commented on SPARK-2429:
---

The sparsity tests look good.  Have you compared training and assignment time 
to KMeans yet?  An improvement in the assignment time will be important.  Also, 
I don't see a breakdown of the total time by splitting clusters, assignments, 
etc.  It doesn't need to be done for every combination of parameters, just one 
or two.  
That would be very helpful.  Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181167#comment-14181167
 ] 

Apache Spark commented on SPARK-2429:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/2906




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-23 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181169#comment-14181169
 ] 

Yu Ishikawa commented on SPARK-2429:


Hi [~rnowling], 

I sent the PR. Could you review it?

I added logging to show the execution time of each step.
And I reduced the computational complexity of assignment by changing the way 
the closest cluster is found.
(https://github.com/apache/spark/pull/2906/files#diff-ab0740a588f75cada28bd0772820e2dcR409)

Thanks




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-23 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181183#comment-14181183
 ] 

RJ Nowling commented on SPARK-2429:
---

I added a couple comments to the PR.  

I would say stick with Euclidean distance for now.

For assignment, you should be able to do a binary search.  E.g., if a center 
has children, which of the two children is it closer to?  Choose that center.  
Repeat until you hit a leaf (cluster with no children).

I saw that you added logging for timing but can you update your report with the 
timing breakdown for each stage?

Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-22 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179890#comment-14179890
 ] 

RJ Nowling commented on SPARK-2429:
---

A 6x performance improvement is great!

Can you add a breakdown of the timings for each part of the algorithm?  (e.g., 
as you did to find out which parts were slowest?)  You don't need to do a 
sweep over multiple data sizes or numbers of data points -- just pick a 
representative number of data points and rows.

Have you compared the performance of the hierarchical KMeans vs the KMeans 
implemented in MLlib?  I expect that the hierarchical version will be slower 
to cluster, but the assignment should be faster (O(log k) vs O(k)).  This 
improvement in assignment speed is the motivation for including the 
hierarchical KMeans in Spark.

Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-22 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14179974#comment-14179974
 ] 

Yu Ishikawa commented on SPARK-2429:


{quote}
Can you add a breakdown of the timings for each part of the algorithm? (e.g., 
as you did to find out which parts were slowest?) You don't need to do a 
sweep over multiple data sizes or numbers of data points -- just pick a 
representative number of data points and rows.
{quote}

Yes, I can. But I have only embedded debug code in each part of the 
algorithm; I don't support logging messages in the algorithm yet. So I will 
log the execution time of each part of the algorithm. Or do you want to get 
each part's time as variables, like `HierarchicalClusteringModel.trainTime`?

{quote}
Have you compared the performance of the hierarchical KMeans vs the KMeans 
implemented in MLlib? I expect that the hierarchical version will be slower to 
cluster, but the assignment should be faster (O(log k) vs O(k)). This 
improvement in assignment speed is the motivation for including the 
hierarchical KMeans in Spark.
{quote}

No, I haven't yet. I agree that the hierarchical clustering will be slower 
than k-means. The assignment speed of the algorithm is currently only as fast 
as that of k-means in `HierarchicalClusteringModel.predict`. We should improve 
the `ClusterTree` data structure, like a B-tree, in order to search by its 
index. Do you have any good ideas for improving it?


There is something I want to discuss with you. I know we should support 
distance metrics other than Euclidean, such as cosine distance, and I want to. 
However, I don't yet know the best way to support distance functions in MLlib 
as a common library. So I have a suggestion about the release of the 
algorithm: at first, we merge it with only the Euclidean distance metric, and 
then we support other distance metrics in another issue. What do you think 
about it?

Thanks




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172486#comment-14172486
 ] 

RJ Nowling commented on SPARK-2429:
---

Great to know! I'm glad that isn't a bottleneck.

Have you been able to benchmark each of the major steps?  Which steps are
most expensive?






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-10 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166593#comment-14166593
 ] 

Yu Ishikawa commented on SPARK-2429:


Hi [~rnowling],

Thank you for your comments and advice.

{quote}
Ok, first off, let me make sure I understand what you're doing. You start with 
2 centers. You assign all the points. You then apply KMeans recursively to each 
cluster, splitting each center into 2 centers. Each instance of KMeans stops 
when the error is below a certain value or a fixed number of iterations have 
been run.
{quote}
You are right. The algorithm runs as you said.

{quote}
I think your analysis of the overall run time is good and probably what we 
expect. Can you break down the timing to see which parts are the most 
expensive? Maybe we can figure out where to optimize it.
{quote}
OK. I will measure the execution time of parts of the implementation.

{quote}
1. It might be good to convert everything to Breeze vectors before you do any 
operations -- otherwise you convert the same vectors over and over again. 
KMeans converts them at the beginning and converts the vectors for the centers 
back at the end.
{quote}
I agree with you. I am troubled by this problem. After training the model, the 
user will likely want to select the data in a cluster, which is part of the 
whole input data. I think there are three approaches to realize this, as 
below.

# We extract the centers and their `RDD \[Vector\]` data in each cluster 
through the training, as in my implementation.
# We extract the centers and their `RDD\[BV\[Double\]\]` data, and then 
convert the data into `RDD\[Vector\]` at the end.
Converting from Breeze vectors to Spark vectors is very slow; that's why I 
didn't implement it this way.
# We only extract the centers through the training, not their data, and then 
apply the trained model to the input data with the `predict` method, like 
scikit-learn, in order to extract the part of the data in each cluster.
This seems good. We would have to save the `RDD\[BV\[Double\]\]` data of each 
cluster throughout the clustering. Because we extract the `RDD\[Vector\]` data 
of each cluster after the training, I am worried that keeping the 
`RDD\[BV\[Double\]\]` data through the clustering is wasteful. And I am unsure 
how to elegantly save the data while the clustering is in progress.
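The third approach above can be sketched in plain Scala; the `FlatModel` type and its `predict` method are illustrative stand-ins for the model class, not the PR's actual API. The model keeps only the centers, and each cluster's members are recovered afterwards by predicting over the input:

```scala
// Hypothetical flat model holding only cluster centers.
case class FlatModel(centers: Array[Array[Double]]) {
  // Predict the index of the closest center (squared Euclidean distance).
  def predict(p: Array[Double]): Int =
    centers.indices.minBy { i =>
      centers(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum
    }
}

val model = FlatModel(Array(Array(0.0), Array(10.0)))
val data  = Seq(Array(1.0), Array(9.0), Array(0.5))

// Recover each cluster's members after training; on Spark this would be
// a groupBy/filter over the original RDD, so no per-cluster RDDs need
// to be kept alive during training.
val byCluster = data.groupBy(model.predict)
```

The trade-off is one extra pass over the data after training, in exchange for not materializing per-cluster data through the whole clustering.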

{quote}
2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, 
look into using a broadcast variable. See the latest KMeans implementation. 
This could improve performance by 10%+.

3. You may want to look into using reduceByKey or similar RDD operations – they 
will enable parallel reductions which will be faster than a loop on the master.
{quote}
I will give it a try. Thanks!




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-10-09 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165480#comment-14165480
 ] 

RJ Nowling commented on SPARK-2429:
---

Great work, Yu!

Ok, first off, let me make sure I understand what you're doing.  You start with 
2 centers.  You assign all the points.  You then apply KMeans recursively to 
each cluster, splitting each center into 2 centers.  Each instance of KMeans 
stops when the error is below a certain value or a fixed number of iterations 
have been run.
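The recursive scheme described above can be sketched in plain Python (no Spark; the names `bisecting_kmeans` and `kmeans2`, and the farthest-pair seeding, are illustrative assumptions, not MLlib API):

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(points, max_iter=20, tol=1e-9):
    """Split one cluster's points into two sub-clusters (plain 2-means).

    Seeds with the two points farthest apart -- a deterministic stand-in
    for the seeding a real implementation would use."""
    centers = list(max(
        ((p, q) for p in points for q in points),
        key=lambda pq: sqdist(*pq)))
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            nearer = 0 if sqdist(p, centers[0]) <= sqdist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            return None  # degenerate split; caller keeps the cluster whole
        new_centers = [tuple(sum(xs) / len(g) for xs in zip(*g))
                       for g in groups]
        shift = max(sqdist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # stop when the error stops improving
            break
    return list(groups)

def bisecting_kmeans(points, k):
    """Top-down hierarchical clustering: repeatedly 2-means-split the
    largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        split = kmeans2(target) if len(target) >= 2 else None
        if split is None:
            clusters.append(target)
            break  # cannot split further
        clusters.extend(split)
    return clusters
```

Each 2-means instance stops on its own tolerance or iteration cap, matching the stopping rule described above.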

I think your analysis of the overall run time is good and probably what we 
expect.  Can you break down the timing to see which parts are the most 
expensive?  Maybe we can figure out where to optimize it.

A few thoughts on optimization:
1. It might be good to convert everything to Breeze vectors before you do any 
operations -- otherwise you convert the same vectors over and over again.  KMeans 
converts them once at the beginning and converts the center vectors back at 
the end.

2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, 
look into using a broadcast variable.  See the latest KMeans implementation. 
This could improve performance by 10%+.

3. You may want to look into using reduceByKey or similar RDD operations -- 
they will enable parallel reductions which will be faster than a loop on the 
master.
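Points 1 and 3 can be illustrated together in plain Python: a per-key (sum, count) merge is the kind of associative operation `reduceByKey` applies in parallel across partitions, in place of a loop over all points on the master. This is a sketch of the aggregation pattern only, not Spark code, and the sample data is made up:

```python
from functools import reduce

# (cluster_id, vector) pairs, as a cluster-assignment step would emit them.
assigned = [
    (0, (1.0, 2.0)), (0, (3.0, 4.0)),
    (1, (10.0, 10.0)), (1, (12.0, 14.0)),
]

def merge(acc, pair):
    """Fold one (key, vector) pair into per-key (sum_vector, count)
    accumulators -- the associative merge reduceByKey would apply."""
    key, vec = pair
    s, n = acc.get(key, ((0.0,) * len(vec), 0))
    acc[key] = (tuple(a + b for a, b in zip(s, vec)), n + 1)
    return acc

sums = reduce(merge, assigned, {})
# New centers: per-cluster mean, computed from the merged accumulators.
centers = {k: tuple(x / n for x in s) for k, (s, n) in sums.items()}
```

Because the merge is associative and commutative, Spark can apply it within each partition first and only ship the small per-key accumulators to combine, rather than collecting every point to the driver.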

If you look at the JIRAs and PRs, there is some recent work to speed up KMeans 
-- maybe some of that is applicable?

I'll probably have more questions -- it's a good way of helping me understand 
what you're doing :)




[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-09-16 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136671#comment-14136671
 ] 

RJ Nowling commented on SPARK-2429:
---

Great!  I look forward to seeing your implementation.  :)










[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2014-08-27 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1411#comment-1411
 ] 

RJ Nowling commented on SPARK-2429:
---

Discussion on the dev list mentioned a community preference for implementing 
KMeans recursively (a divisive approach).  Jeremy Freeman provided an example 
here:

https://gist.github.com/freeman-lab/5947e7c53b368fe90371

The example needs to be optimized but provides a good starting point.  For 
example, every time KMeans is called, the data is converted to Breeze Vectors 
again.
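The cost of that repeated conversion can be shown with a toy sketch in plain Python: `convert` stands in for the Vector-to-Breeze conversion, and a simple midpoint split stands in for the recursive KMeans calls. Both names and the recursion shape are illustrative assumptions:

```python
raw = ["1.0,2.0", "1.1,2.1", "9.0,9.0", "9.1,9.2"]
calls = {"n": 0}

def convert(rec):
    # Stand-in for the per-record format conversion; counts invocations.
    calls["n"] += 1
    return tuple(float(x) for x in rec.split(","))

def split_naive(records, depth):
    # Re-converts the records at every recursion level.
    pts = [convert(r) for r in records]
    if depth == 0 or len(records) < 2:
        return [pts]
    mid = len(records) // 2
    return (split_naive(records[:mid], depth - 1)
            + split_naive(records[mid:], depth - 1))

def split_once(records, depth):
    # Converts once up front, then recurses over converted points only.
    def go(pts, depth):
        if depth == 0 or len(pts) < 2:
            return [pts]
        mid = len(pts) // 2
        return go(pts[:mid], depth - 1) + go(pts[mid:], depth - 1)
    return go([convert(r) for r in records], depth)
```

The naive pattern converts every record once per recursion level, so the conversion work grows with the depth of the hierarchy; hoisting it out makes the cost a flat one pass over the data.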

Here are two papers on divisive Kmeans:

A combined K-means and hierarchical clustering method for improving the 
clustering efficiency of microarray (2005) by Chen, et al.

Divisive Hierarchical K-Means (2006) by Lamrous, et al.

