Yes, I am running 0.5. I shall try installing from the trunk. My subsequent email on vectorization...might be a problem, since its size is very small compared to the associated seq file.
Thanks,
/PD.

On Wed, Dec 21, 2011 at 10:39 AM, Suneel Marthi <[email protected]> wrote:

> Did you try running with the version from trunk, seems like you are
> running Mahout 0.5?
>
> ________________________________
> From: Periya.Data <[email protected]>
> To: [email protected]
> Sent: Wednesday, December 21, 2011 12:53 PM
> Subject: Error running KMeans clustering
>
> Hi all,
> I am getting similar issues while I run a Kmeans clustering. I have posted
> the same in manning-forums. But, thought this is a wider community to get
> answers from. I have seen similar problems posted in the forum, but, I am
> not clear about how it was resolved.
>
> I am able to take a text file, convert to seq files and then to vector
> files properly. No issues there. The seq and vector file sizes are large
> and reasonable. But, when I run the Kmeans clustering, it gives a similar
> error of "IndexOutOfBoundsException: Index: 1, Size: 1".
>
> - Ubuntu 11.10
> - Mahout 0.5-cdh3u2
> - Hadoop -0.20.2-cdh3u2
> - using pseudo-distributed mode and I have my intermediate outputs to HDFS.
> =============================================================
>
> #!/bin/bash
>
> $MAHOUT_HOME/bin/mahout kmeans --input /input/vectorized/tfidf-vectors \
>     --output /output/kmeans/clusters \
>     --clusters /output/kmeans/initialclusters \
>     --maxIter 10 \
>     --numClusters 100 \
>     --clustering \
>     --overwrite
> wait
>
> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir /output/kmeans/clusters/clusters-1 \
>     --pointsDir /output/kmeans/clusters/clusteredPoints \
>     --numWords 5 \
>     --dictionary /input/vectorized/dictionary.file-0 \
>     --dictionaryType sequencefile
>
> =================================================================
>
> pd@PeriyaData:~/bigdata/examples/bin$ ./bigdata_kmeans.sh
> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
> 11/12/20 22:38:26 INFO common.AbstractJob: Command line arguments:
> {--clustering=null, --clusters=/output/kmeans/initialclusters,
> --convergenceDelta=0.5,
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --endPhase=2147483647, --input=/input/vectorized/tfidf-vectors,
> --maxIter=10, --method=mapreduce, --numClusters=100,
> --output=/output/kmeans/clusters, --overwrite=null, --startPhase=0,
> --tempDir=temp}
> 11/12/20 22:38:27 INFO common.HadoopUtil: Deleting /output/kmeans/initialclusters
> 11/12/20 22:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 11/12/20 22:38:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 11/12/20 22:38:27 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>     at java.util.ArrayList.get(ArrayList.java:322)
>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
> 11/12/20 22:38:29 INFO common.AbstractJob: Command line arguments:
> {--dictionary=/input/vectorized/dictionary.file-0,
> --dictionaryType=sequencefile, --endPhase=2147483647, --numWords=5,
> --pointsDir=/output/kmeans/clusters/clusteredPoints,
> --seqFileDir=/output/kmeans/clusters/clusters-1, --startPhase=0,
> --tempDir=temp}
> 11/12/20 22:38:30 INFO driver.MahoutDriver: Program took 795 ms
> pd@PeriyaData:~/bigdata/examples/bin$
> ====================================================================
>
> Your suggestions are much appreciated.
>
> Thanks,
> PD
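For anyone reading this thread later: the trace fails inside ArrayList.get, called from RandomSeedGenerator.buildRandom, with "Index: 1, Size: 1" while --numClusters is 100. One possible reading is that the seed step found only a single vector in the input and then tried to read back more seeds than it had collected. The sketch below is not Mahout's code. It is a simplified stand-in, with hypothetical class and method names and plain strings in place of vectors, that assumes the generator collects at most k items and then reads exactly k of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; names and logic are assumptions, not Mahout's source.
class SeedSelectionSketch {

    // Collect up to k items; once k are held, occasionally swap one out at random.
    static List<String> chooseSeeds(List<String> vectors, int k, Random random) {
        List<String> chosen = new ArrayList<String>();
        for (String v : vectors) {
            int currentSize = chosen.size();
            if (currentSize < k) {
                chosen.add(v);
            } else if (random.nextInt(currentSize + 1) == 0) {
                chosen.set(random.nextInt(currentSize), v);
            }
        }
        return chosen;
    }

    // Reads exactly k seeds without checking how many were collected.
    // Throws IndexOutOfBoundsException when chosen.size() < k.
    static List<String> readSeedsUnchecked(List<String> chosen, int k) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < k; i++) {
            out.add(chosen.get(i));
        }
        return out;
    }

    // Same read, but fails early with a message that names the real problem.
    static List<String> readSeedsChecked(List<String> chosen, int k) {
        if (chosen.size() < k) {
            throw new IllegalArgumentException(
                "Requested " + k + " seeds but the input held only " + chosen.size() + " vectors");
        }
        return new ArrayList<String>(chosen.subList(0, k));
    }

    public static void main(String[] args) {
        // A single input vector with k = 100, mirroring --numClusters 100.
        List<String> input = Arrays.asList("v0");
        List<String> chosen = chooseSeeds(input, 100, new Random(42));
        try {
            readSeedsUnchecked(chosen, 100);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Unchecked read failed: " + e.getMessage());
        }
        try {
            readSeedsChecked(chosen, 100);
        } catch (IllegalArgumentException e) {
            System.out.println("Checked read failed: " + e.getMessage());
        }
    }
}
```

If that reading is right, the thing to verify is how many vectors /input/vectorized/tfidf-vectors actually contains, which would fit the concern about the vector output being much smaller than the seq file.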
