[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-3504. --------------------------------- Resolution: Incomplete > KMeans optimization: track distances and unmoved cluster centers across > iterations > ---------------------------------------------------------------------------------- > > Key: SPARK-3504 > URL: https://issues.apache.org/jira/browse/SPARK-3504 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 1.0.2 > Reporter: Derrick Burns > Priority: Major > Labels: bulk-closed, clustering > > The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because > recomputes all distances to all cluster centers on each iteration. In later > iterations of Lloyd's algorithm, points don't change clusters and clusters > don't move. > By 1) tracking which clusters move and 2) tracking for each point which > cluster it belongs to and the distance to that cluster, one can avoid > recomputing distances in many cases with very little increase in memory > requirements. > I implemented this new algorithm and the results were fantastic. Using 16 > c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on > 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here > are the running times for the first 7 rounds: > 6 minutes and 42 second > 7 minutes and 7 seconds > 7 minutes 13 seconds > 1 minutes 18 seconds > 30 seconds > 18 seconds > 12 seconds > Without this improvement, all rounds would have taken roughly 7 minutes, > resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, > this improvement resulting in a reduction of roughly 75% in running time with > no loss of accuracy. > My implementation is a rewrite of the existing 1.0.2 implementation. It is > not a simple modification of the existing implementation. Please let me know > if you are interested in this new implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org