Thanks Suneel.
I tried adding this flag (though I thought the clusteredPoints directory was 
supposed to be created by default?).
Either way, whenever I add '-cl' (I tried it on several data sets), I get the 
following error: 
"There is no queue named default"
(even though I do specify a queue with -Dmapred.job.queue.name=...).
I don't get this error otherwise.

Has anyone ever encountered this error?
Is there some sort of configuration I'm missing?

Thanks,

Galit.

-----Original Message-----
From: Suneel Marthi [mailto:[email protected]] 
Sent: Wednesday, July 10, 2013 5:30 PM
To: [email protected]
Subject: Re: mahout kmeans not generating clusteredPoint dir?

It's been a while since I last worked with this, but I believe you are missing 
the clustering option '-cl'.
Give that a try.
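
For reference, here is a minimal sketch of the call from the original message 
with '-cl' appended. The queue name "myqueue" is a hypothetical placeholder, and 
putting the -D option directly after "kmeans" is an assumption based on how 
Hadoop generic options are usually parsed. The snippet only prints the command 
so it can be checked before it is run.

```shell
# Dry run: build the k-means call from the original message with -cl added.
# QUEUE is a hypothetical placeholder; use a queue that exists on your cluster.
QUEUE="myqueue"

# Assumption: the Hadoop -D option goes right after the subcommand,
# ahead of the kmeans-specific options.
CMD="mahout kmeans -Dmapred.job.queue.name=${QUEUE} -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds -cl"

# Print the command for inspection; run it with: eval "$CMD"
echo "$CMD"
```

If the printed command looks right, it can be run as-is on the cluster.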




________________________________
 From: "Fuhrmann Alpert, Galit" <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Wednesday, July 10, 2013 5:17 AM
Subject: mahout kmeans not generating clusteredPoint dir?
 

Hello,

I ran mahout kmeans (using random seeds) on a Hadoop cluster. It ran 
successfully and created a directory containing clusters-*, including the last 
one, clusters-3-final.
However, it did not create the clusteredPoints directory, or at least I cannot 
find it under the same dir (or anywhere else).

My call was:
mahout kmeans -k 4000 -i inputSeq.dat -o outputPath --maxIter 3 --clusters outputSeeds

Was there an extra argument I needed to specify in order for it to generate the 
clusteredPoints?
(BTW I also can't see the outputSeeds. Was it created for seeds and then 
deleted?)

According to Mahout in Action:

The k-means clustering implementation creates two types of directories in the 
output folder. The clusters-* directories are formed at the end of each 
iteration: the clusters-0 directory is generated after the first iteration, 
clusters-1 after the second iteration, and so on. These directories contain 
information about the clusters: centroid, standard deviation, and so on. The 
clusteredPoints directory, on the other hand, contains the final mapping from 
cluster ID to document ID. This data is generated from the output of the last 
MapReduce operation.
The directory listing of the output folder looks something like this:
$ ls -l reuters-kmeans-clusters
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-0
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-1
drwxr-xr-x 4 user 5000 136 Feb 1 18:56 clusters-2
...
drwxr-xr-x 4 user 5000 136 Feb 1 18:59 clusteredPoint

Again, my call did not generate the clusteredPoint directory.
I would appreciate your help.

Thanks a lot,

Galit.
