Sounds like you might not be using the mahout-core-0.4-job.jar file? Also, we don't run on Hadoop 0.20.1, only 0.20.2. Finally, trunk always has the latest patches in it, and the clustering code is quite stable there.
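One way to confirm which jar is actually being submitted is to check whether it contains the class the tasks fail to load. The following is a minimal sketch using only the JDK; the class name `JarClassCheck` and its methods are hypothetical helpers, not part of Mahout or Hadoop. It assumes dependencies may be bundled either as unpacked classes or as nested jars under `lib/`, so it looks in both places.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarInputStream;

// Hypothetical diagnostic helper (not part of Mahout or Hadoop).
public class JarClassCheck {

    /** Converts a class name such as a.b.C to the jar entry path a/b/C.class. */
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the jar holds the class directly or inside a nested lib/*.jar. */
    static boolean containsClass(String jarPath, String className) throws IOException {
        String wanted = toEntryName(className);
        try (JarFile jar = new JarFile(jarPath)) {
            // Case 1: dependency classes unpacked directly into the jar.
            if (jar.getEntry(wanted) != null) {
                return true;
            }
            // Case 2 (assumed layout): dependency jars bundled under lib/.
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String name = entry.getName();
                if (name.startsWith("lib/") && name.endsWith(".jar")) {
                    try (JarInputStream nested = new JarInputStream(jar.getInputStream(entry))) {
                        JarEntry inner;
                        while ((inner = nested.getNextJarEntry()) != null) {
                            if (inner.getName().equals(wanted)) {
                                return true;
                            }
                        }
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarClassCheck <jar> [class name]");
            return;
        }
        // Defaults to the class named in the reported ClassNotFoundException.
        String className = args.length > 1 ? args[1] : "org.apache.mahout.math.Vector";
        System.out.println(className + " found: " + containsClass(args[0], className));
    }
}
```

Running it against the jar you pass to Hadoop (for example `java JarClassCheck mahout-core-0.4-job.jar`) reports whether `org.apache.mahout.math.Vector` is inside. This only inspects the jar file itself; it says nothing about how the cluster's task JVMs build their classpath.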
Jeff

-----Original Message-----
From: McConnell, Christopher (GE Global Research) [mailto:mccon...@ge.com]
Sent: Wednesday, February 02, 2011 11:35 AM
To: user@mahout.apache.org
Subject: KMeans Clustering Issues

All,

I've begun to look into Mahout on top of Hadoop, specifically for large scale cluster analysis. I am running into an issue however, attempting to run the KMeansDriver.run(Configuration, Path, Path, Path, DistanceMeasure, double, int, Boolean, Boolean) with the last (runSequential) false when the data is stored on HDFS. I've seen multiple listings about this claiming a fix within the KMeansDriver by adding the job.setJarByClass() method call, however I am still getting the typical ClassNotFoundException: org.apache.mahout.math.Vector.

A quick overview, we've created a Map job to take our current dataset and convert it into the Sequence files required for the driver to be executed. We have then tried a few different ways of calling the KMeansDriver.run() - either within the same driver as the previous MR job or separately for a new JVM. Both of these tests were run through the Hadoop environment. Next, I've tried running a standalone Java application, setting up the configuration to read from HDFS, but not run within the Hadoop environment - this gives us the same ClassNotFoundException.

Our versions are Mahout 0.4, Hadoop 0.20.1+169.89 and Hadoop 0.20.2 (We have multiple clusters for testing).

I have done other tests with the KMeansDriver that did work, for example, utilizing the method within memory works fine. We can also run the clustering over MapReduce, if the job is launched through a java -jar command and data stored locally. Finally, I can execute the mahout binary with the kmeans argument (./mahout kmeans -c path -i path -x #) which also works fine, however we do not want to rely on creating multiple stages/running multiple (and separate) applications.

Any thoughts are appreciated.
Thanks,
Chris

Christopher McConnell
Computer Scientist
Advanced Computing Lab
Edison Engineering Development Program
GE Global Research
T +1 518 387 5176
mccon...@ge.com
One Research Circle
Niskayuna, NY 12309
GE Imagination at Work