Did you try running with the version from trunk? It seems like you are running Mahout 0.5.
________________________________
From: Periya.Data <[email protected]>
To: [email protected]
Sent: Wednesday, December 21, 2011 12:53 PM
Subject: Error running KMeans clustering

Hi all,

I am getting similar issues while I run a Kmeans clustering. I have posted the same in manning-forums. But, thought this is a wider community to get answers from. I have seen similar problems posted in the forum, but, I am not clear about how it was resolved.

I am able to take a text file, convert to seq files and then to vector files properly. No issues there. The seq and vector file sizes are large and reasonable. But, when I run the Kmeans clustering, it gives a similar error of "IndexOutOfBoundsException: Index: 1, Size: 1".

- Ubuntu 11.10
- Mahout 0.5-cdh3u2
- Hadoop -0.20.2-cdh3u2
- using pseudo-distributed mode and I have my intermediate outputs to HDFS.

=============================================================
#!/bin/bash

$MAHOUT_HOME/bin/mahout kmeans --input /input/vectorized/tfidf-vectors \
    --output /output/kmeans/clusters \
    --clusters /output/kmeans/initialclusters \
    --maxIter 10 \
    --numClusters 100 \
    --clustering \
    --overwrite

wait

$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir /output/kmeans/clusters/clusters-1 \
    --pointsDir /output/kmeans/clusters/clusteredPoints \
    --numWords 5 \
    --dictionary /input/vectorized/dictionary.file-0 \
    --dictionaryType sequencefile
=================================================================

pd@PeriyaData:~/bigdata/examples/bin$ ./bigdata_kmeans.sh
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
11/12/20 22:38:26 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=/output/kmeans/initialclusters, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/input/vectorized/tfidf-vectors, --maxIter=10, --method=mapreduce, --numClusters=100, --output=/output/kmeans/clusters, --overwrite=null, --startPhase=0, --tempDir=temp}
11/12/20 22:38:27 INFO common.HadoopUtil: Deleting /output/kmeans/initialclusters
11/12/20 22:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/20 22:38:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/12/20 22:38:27 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
11/12/20 22:38:29 INFO common.AbstractJob: Command line arguments: {--dictionary=/input/vectorized/dictionary.file-0, --dictionaryType=sequencefile, --endPhase=2147483647, --numWords=5, --pointsDir=/output/kmeans/clusters/clusteredPoints, --seqFileDir=/output/kmeans/clusters/clusters-1, --startPhase=0, --tempDir=temp}
11/12/20 22:38:30 INFO driver.MahoutDriver: Program took 795 ms
pd@PeriyaData:~/bigdata/examples/bin$
====================================================================

Your suggestions are much appreciated.

Thanks,
PD
