[ https://issues.apache.org/jira/browse/MATH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576710#comment-16576710 ]
Gilles commented on MATH-1465: ------------------------------ If you intend to implement this, don't forget to use the [development version of the code (git "master" branch)|https://git1-us-west.apache.org/repos/asf?p=commons-math.git;a=tree]. > Increase the initial sampling speed of KMeansPlusPlusClusterer for large k > -------------------------------------------------------------------------- > > Key: MATH-1465 > URL: https://issues.apache.org/jira/browse/MATH-1465 > Project: Commons Math > Issue Type: Improvement > Affects Versions: 3.6.1 > Reporter: yu zhang > Priority: Minor > Labels: performance > Original Estimate: 1,440h > Remaining Estimate: 1,440h > > As the major difference of KMeans++ to classic Kmeans, an initial distance > square sampling process is executed, the function is named > "chooseInitialCenters" in the current implementation. The time complexity is > O(dkn), where d is the dimension, k is the cluster number and n is the number > of points. > In paper "_Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot. > StreamKM++: A Clustering Algorithm for Data Streams. In Proceedings of the > 12th Workshop on Algorithm Engineering and Experiments (ALENEX 2010)_", the > authors introduced a data structure named "coreset tree" which can reduce the > complexity of the initial sampling to O(dn log k). This is useful in the > scenario when value of k is large, say 1000, then the run speed of the > algorithm will be much faster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)