K-means with large K

Buttler, David Mon, 28 Apr 2014 09:20:13 -0700

Hi,
I am trying to run the K-means code in MLlib, and it works very nicely with 
small K (less than 1000).  However, when I try a larger K (I am looking for 
2000-4000 clusters), the job seems to get partway through (perhaps only as 
far as the initialization step) and then freezes.  The compute nodes stop all 
CPU, network, and I/O activity, and nothing happens for hours.  I did 
something similar back in the days of Spark 0.6 and had no trouble going up 
to 4000 clusters with similar data.


This happens both on a standalone cluster and in local multi-core mode (with 
the node given 200GB of heap).  The job does eventually complete in local 
single-core mode.

Data statistics:
Rows: 166248
Columns: 108
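
For a rough sense of scale, here is a back-of-envelope sketch in Python. It 
assumes dense storage, 8-byte doubles, and a naive pass that compares every 
row with every center. Those assumptions are for illustration only and may 
not match what MLlib does internally; the function names are hypothetical.

```python
# Back-of-envelope sizing for the data described above.
# Assumptions (not taken from MLlib): dense storage, 8-byte doubles,
# and one naive pass that compares every row with every center.

BYTES_PER_DOUBLE = 8


def dense_bytes(rows, cols):
    """Bytes needed to hold a rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


def distance_evals(rows, k):
    """Row-to-center distance computations in one naive pass."""
    return rows * k


ROWS, COLS = 166248, 108

if __name__ == "__main__":
    print("data MiB:", dense_bytes(ROWS, COLS) / 2**20)
    for k in (1000, 2000, 4000):
        print(k, "centers:", dense_bytes(k, COLS), "bytes,",
              distance_evals(ROWS, k), "distance evaluations per pass")
```

Under these assumptions the full data set is well under 1 GB and the centers 
for K = 4000 take only a few MB, so raw data size alone would not seem to 
explain the stall.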

This is a test run before trying it out on much larger data.

Any ideas on what might be the cause of this?

Thanks,
Dave
