Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Jeff Eastman
I didn't notice the --clusters option when just reading the patch. If that puts the clusters into a specific directory, then fine. I was suggesting that the default be $output/state, rather than writing them all to $output as is done currently. If you want some help I'm available some before next week then more
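The directory default suggested above can be sketched roughly as follows. This is a minimal illustration, not Mahout code: the class `ClusterDirs` and its methods are hypothetical, the behavior of `--clusters` is assumed from the message ("if that puts the clusters into a specific directory"), and `java.nio.file` stands in for Hadoop's path handling so the example stays self-contained.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Mahout: resolves where per-iteration
// cluster files (clusters-0, clusters-1, ...) would be written.
public class ClusterDirs {

    // Assumption: an explicit --clusters value wins; otherwise default to
    // the "state" sub-directory of the output directory.
    static Path clustersDir(String output, String clustersOption) {
        if (clustersOption != null && !clustersOption.isEmpty()) {
            return Paths.get(clustersOption);
        }
        return Paths.get(output, "state");
    }

    // Directory for one iteration's clusters, e.g. <clustersDir>/clusters-3.
    static Path iterationDir(Path clustersDir, int iteration) {
        return clustersDir.resolve("clusters-" + iteration);
    }

    public static void main(String[] args) {
        Path dir = clustersDir("output", null);
        // Prints the default location for the first iteration's clusters.
        System.out.println(iterationDir(dir, 0));
    }
}
```

Keeping the per-iteration directories under one sub-directory leaves the top level of $output free for the final clustered points, which is the noise reduction the thread is after.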

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Grant Ingersoll
On Jun 26, 2009, at 3:04 PM, Jeff Eastman wrote:
> That looks reasonable, just reading the patch. You might also want to put the clusters-x files into a state (or clusters) sub-directory to reduce noise in the output directory and improve consistency with MS and Dirichlet (which do not thems

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Jeff Eastman
That looks reasonable, just reading the patch. You might also want to put the clusters-x files into a state (or clusters) sub-directory to reduce noise in the output directory and improve consistency with MS and Dirichlet (which do not themselves agree on which directory name to use). Grant I

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Grant Ingersoll
Check out the patch I just put up on M-138.
On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
> Grant Ingersoll wrote:
>> Isn't the KMeansJob pretty much redundant, assuming we add a parameter to KMeansDriver to take in the number of reduce tasks?
> The purpose of the clustering jobs, in general, was

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Grant Ingersoll
On Jun 26, 2009, at 12:32 PM, Jeff Eastman wrote:
> Grant Ingersoll wrote:
>> Isn't the KMeansJob pretty much redundant, assuming we add a parameter to KMeansDriver to take in the number of reduce tasks?
> The purpose of the clustering jobs, in general, was to simplify computing the clusters and t

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Ted Dunning
Of course, this should support assigning *any* input to clusters, not just the original input.
On Fri, Jun 26, 2009 at 9:32 AM, Jeff Eastman wrote:
> 2. Optionally cluster the input data points by assigning them to clusters.
> This would be with probabilities in the case of FuzzyKMeans and Dirich

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Jeff Eastman
Grant Ingersoll wrote:
> Isn't the KMeansJob pretty much redundant, assuming we add a parameter to KMeansDriver to take in the number of reduce tasks?
The purpose of the clustering jobs, in general, was to simplify computing the clusters and then clustering the data. It has been applied - and cha

Re: KMeansJob vs KMeansDriver

2009-06-26 Thread Grant Ingersoll
On Jun 26, 2009, at 11:32 AM, Grant Ingersoll wrote:
> Isn't the KMeansJob pretty much redundant, assuming we add a parameter to KMeansDriver to take in the number of reduce tasks?
Also, the variable naming in KMeansJob that the number of reduce tasks (numCentroids) is actually the "k" in k-