Re: [mahout 0.9 | k-means] methodology for selecting k to cluster very large datasets

2015-09-15 Thread Ted Dunning
My own feeling is that the right answer is to look at the average squared distance on your training data and on held-out data. As long as these values are nearly the same, your value of k is likely smaller than or equal to the optimal value. When the average squared distance is significantly less on the
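
The comparison described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Mahout code: the helper names (`avg_sq_dist`, `kmeans`, `train_vs_heldout`), the random initialization, and the fixed iteration count are all assumptions made for the example. It only reports the two averages for each k; deciding how large a gap counts as "significant" is left to the reader.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def avg_sq_dist(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations with random initial centroids (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # keep the previous centroid if a cluster ended up empty.
        centroids = [
            tuple(sum(xs) / len(members) for xs in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def train_vs_heldout(train, heldout, ks, seed=0):
    """Fit on the training points only, then score both sets for each k."""
    rows = []
    for k in ks:
        centroids = kmeans(train, k, seed=seed)
        rows.append((k, avg_sq_dist(train, centroids), avg_sq_dist(heldout, centroids)))
    return rows


if __name__ == "__main__":
    # Hypothetical toy data: three Gaussian blobs in 2-D.
    rng = random.Random(42)
    data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in [(0, 0), (8, 0), (4, 7)] for _ in range(200)]
    rng.shuffle(data)
    train, heldout = data[:450], data[450:]
    for k, train_cost, heldout_cost in train_vs_heldout(train, heldout, range(1, 9)):
        print(f"k={k}  train={train_cost:.3f}  heldout={heldout_cost:.3f}")
```

On a very large dataset the same idea would presumably be applied to a sample, with the clustering itself done by whatever implementation is already in use; only the scoring of training and held-out points against the resulting centroids matters here.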

[mahout 0.9 | k-means] methodology for selecting k to cluster very large datasets

2015-09-15 Thread hsharma mailinglists
Hello, I have some questions about large-scale clustering. I would like to arrive at a methodology for determining an appropriate value of K for K-means clustering (at least for my scenario, if not in general). More details follow below (apologies for the verbosity, but I