Hi,

I am working on implementing a variant of the k-means algorithm, namely Bisecting K-means [1].

The basic premise is to run the original k-means algorithm on increasingly smaller subsets of the original input data. In each step of the outer loop, it splits the current cluster into two new, smaller clusters and deletes the corresponding parent cluster.

I am currently using a modified version of the existing k-means implementation from the Flink examples.

Pseudocode:

while currentClusterNumber < finalClusterNumber
    currentCluster = Pick current largest cluster
    for i = 1 to innerIterations
        Pick 2 random starting centroids
        Run k-means on currentCluster with centroids
        Store output and compute similarity of temporary result
    Pick the one innerIteration result with highest similarity from temporary results
    Replace currentCluster with the two smaller subsets
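
To make the control flow concrete, here is a minimal plain-Java sketch of the two nested loops (not Flink code), assuming non-empty in-memory data as List<double[]>, a simple Lloyd's-style 2-means as a stand-in for the modified Flink k-means job, and SSE as the similarity measure. The names (BisectingKMeansSketch, splitInTwo, sumOfSquaredErrors, ...) are made up purely for illustration:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class BisectingKMeansSketch {

    // A cluster is simply the set of points assigned to it.
    static class Cluster {
        final List<double[]> points;
        Cluster(List<double[]> points) { this.points = points; }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    static double[] centroid(List<double[]> points) {
        double[] c = new double[points.get(0).length];
        for (double[] p : points) {
            for (int d = 0; d < c.length; d++) c[d] += p[d] / points.size();
        }
        return c;
    }

    // One plain 2-means run (Lloyd's algorithm) started from two random points;
    // in the real setup this would be the modified Flink k-means job.
    static List<Cluster> splitInTwo(Cluster parent, int kMeansIterations, Random rnd) {
        List<double[]> pts = parent.points;
        double[] c0 = pts.get(rnd.nextInt(pts.size()));
        double[] c1 = pts.get(rnd.nextInt(pts.size()));
        List<double[]> left = new ArrayList<>(), right = new ArrayList<>();
        for (int it = 0; it < kMeansIterations; it++) {
            left = new ArrayList<>();
            right = new ArrayList<>();
            for (double[] p : pts) {
                (squaredDistance(p, c0) <= squaredDistance(p, c1) ? left : right).add(p);
            }
            if (left.isEmpty() || right.isEmpty()) break;  // degenerate split
            c0 = centroid(left);
            c1 = centroid(right);
        }
        return Arrays.asList(new Cluster(left), new Cluster(right));
    }

    // Sum of squared errors; lower means tighter (more similar) clusters.
    static double sumOfSquaredErrors(List<Cluster> clusters) {
        double sse = 0.0;
        for (Cluster c : clusters) {
            if (c.points.isEmpty()) return Double.MAX_VALUE;  // penalise degenerate splits
            double[] centre = centroid(c.points);
            for (double[] p : c.points) sse += squaredDistance(p, centre);
        }
        return sse;
    }

    static List<Cluster> bisect(List<double[]> data, int finalClusterNumber,
                                int innerIterations, int kMeansIterations) {
        Random rnd = new Random();
        List<Cluster> clusters = new ArrayList<>();
        clusters.add(new Cluster(data));

        while (clusters.size() < finalClusterNumber) {
            // Pick the current largest cluster and remove it (the parent is deleted).
            Cluster largest = clusters.stream()
                    .max(Comparator.comparingInt(c -> c.points.size())).get();
            clusters.remove(largest);

            // Inner loop: try several random splits and keep the best one.
            List<Cluster> bestSplit = null;
            double bestScore = Double.MAX_VALUE;
            for (int i = 0; i < innerIterations; i++) {
                List<Cluster> split = splitInTwo(largest, kMeansIterations, rnd);
                double score = sumOfSquaredErrors(split);
                if (bestSplit == null || score < bestScore) {
                    bestScore = score;
                    bestSplit = split;
                }
            }

            // Replace the parent cluster with its two smaller subsets.
            clusters.addAll(bestSplit);
        }
        return clusters;
    }
}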

It all comes down to nested iterations, which are not supported by Flink at the moment.

Does anyone have experience with this, or know of workarounds to avoid the issue?

Best,
Adrian

----

[1] M. Steinbach, G. Karypis, V. Kumar: A Comparison of Document Clustering Techniques - http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.125.9225
