Thanks Suneel. I tried adding this flag (though I thought the clusteredPoints directory was supposed to be created by default?). Either way, whenever I add '-cl' (I tried it on several data sets), I get the following error: "There is no queue named default" (even though I do specify a queue via -Dmapred.job.queue.name=...). I don't get this error without the flag.
Has anyone ever encountered this error? Is there some sort of configuration I'm missing?

Thanks,
Galit.

-----Original Message-----
From: Suneel Marthi [mailto:[email protected]]
Sent: Wednesday, July 10, 2013 5:30 PM
To: [email protected]
Subject: Re: mahout kmeans not generating clusteredPoint dir?

Been a while since I last worked with this, I believe u r missing the clustering option '-cl'. Give that a try.

________________________________
From: "Fuhrmann Alpert, Galit" <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Wednesday, July 10, 2013 5:17 AM
Subject: mahout kmeans not generating clusteredPoint dir?

Hello,

I ran mahout kmeans (using rand seeds) on hadoop cluster. It ran successfully and created a directory containing clusters-*, including the last which was clusters-3-final. However, it did not create the clusteredPoints, or at least I cannot find it under the same dir (or anywhere else).

My call was:

    mahout kmeans -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds

Was there an extra argument I needed to specify in order for it to generate the clusteredPoints? (BTW I also can't see the outputSeeds. Was it created for seeds and then deleted?)

According to mahout in action:

The k-means clustering implementation creates two types of directories in the output folder. The clusters-* directories are formed at the end of each iteration: the clusters-0 directory is generated after the first iteration, clusters-1 after the second iteration, and so on. These directories contain information about the clusters: centroid, standard deviation, and so on. The clusteredPoints directory, on the other hand, contains the final mapping from cluster ID to document ID. This data is generated from the output of the last MapReduce operation.

The directory listing of the output folder looks something like this:

    $ ls -l reuters-kmeans-clusters
    drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-0
    drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-1
    drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-2
    ...
    drwxr-xr-x 4 user 5000 136 Feb 1 18:59 clusteredPoint

Again, my call did not generate the clusteredPoint directory. I would appreciate your help.

Thanks a lot,
Galit.
