Re: Error running KMeans clustering

Gary Snider Wed, 21 Dec 2011 12:18:34 -0800

I hope 0.6-SNAPSHOT works for you.  I could never run the Reuters example
out of the box when running on Hadoop; I had to set MAHOUT_LOCAL=true for
it to work.
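
A minimal sketch of that workaround, assuming the stock bin/mahout launcher,
which (as far as I know) runs the job locally instead of submitting it to
Hadoop whenever MAHOUT_LOCAL is non-empty:

```shell
# Assumption: bin/mahout only checks that MAHOUT_LOCAL is non-empty,
# so the exact value ("true" here) should not matter.
export MAHOUT_LOCAL=true

# Confirm the variable is visible to child processes such as bin/mahout.
env | grep '^MAHOUT_LOCAL='
```

In local mode the input and output paths would normally resolve against the
local filesystem rather than HDFS, so the data has to be there too.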


As for 0.5, I also got an IndexOutOfBoundsException running KMeans.  It
resolved itself eventually, and I will try to dig up what I did to fix it.
It definitely had to do with the seq files not being populated in HDFS.
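
One way to check whether those files are actually populated, as a rough
sketch: check_hdfs_dir is a hypothetical helper name, and the two paths are
the ones from the script quoted below. It only uses the standard
"hadoop fs -ls" and "hadoop fs -du" commands.

```shell
# Hypothetical helper: list an HDFS directory and print per-file sizes.
check_hdfs_dir() {
  hadoop fs -ls "$1" && hadoop fs -du "$1" \
    || echo "missing or unreadable: $1"
}

# Only run the checks if the hadoop client is on the PATH.
if command -v hadoop >/dev/null 2>&1; then
  check_hdfs_dir /input/vectorized/tfidf-vectors
  check_hdfs_dir /output/kmeans/initialclusters
fi
```

An empty listing or part files of only a few bytes would suggest the earlier
seq file or vectorization step did not write what KMeans expects to read.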

Can you post your classpath variable (if any)?

On Wed, Dec 21, 2011 at 2:35 PM, Periya.Data <[email protected]> wrote:

> Yes, I am running 0.5. I shall try installing from the trunk.
>
> My subsequent email is about vectorization, which might be the problem, since
> the vector output is very small compared to the associated seq file.
>
> Thanks,
> /PD.
>
> On Wed, Dec 21, 2011 at 10:39 AM, Suneel Marthi <[email protected]> wrote:
>
> > Did you try running with the version from trunk? It seems like you are
> > running Mahout 0.5.
> >
> >
> >
> > ________________________________
> >  From: Periya.Data <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, December 21, 2011 12:53 PM
> > Subject: Error running KMeans clustering
> >
> > Hi all,
> > I am getting similar issues when I run KMeans clustering. I have posted
> > the same question in the Manning forums, but thought this is a wider
> > community to get answers from. I have seen similar problems posted in the
> > forum, but I am not clear about how they were resolved.
> >
> > I am able to take a text file, convert it to seq files and then to vector
> > files properly. No issues there. The seq and vector file sizes are large
> > and reasonable. But when I run the KMeans clustering, it gives a similar
> > error of "IndexOutOfBoundsException: Index: 1, Size: 1".
> >
> > - Ubuntu 11.10
> > - Mahout 0.5-cdh3u2
> > - Hadoop 0.20.2-cdh3u2
> > - using pseudo-distributed mode, with my intermediate outputs written to
> >   HDFS.
> > =============================================================
> >
> > #!/bin/bash
> >
> > $MAHOUT_HOME/bin/mahout kmeans --input /input/vectorized/tfidf-vectors \
> > --output /output/kmeans/clusters \
> > --clusters /output/kmeans/initialclusters \
> > --maxIter 10 \
> > --numClusters 100 \
> > --clustering \
> > --overwrite
> > wait
> >
> > $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir /output/kmeans/clusters/clusters-1 \
> > --pointsDir /output/kmeans/clusters/clusteredPoints \
> > --numWords 5 \
> > --dictionary /input/vectorized/dictionary.file-0 \
> > --dictionaryType sequencefile
> >
> > =================================================================
> >
> > pd@PeriyaData:~/bigdata/examples/bin$ ./bigdata_kmeans.sh
> > Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> > HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> > MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
> > 11/12/20 22:38:26 INFO common.AbstractJob: Command line arguments:
> > {--clustering=null, --clusters=/output/kmeans/initialclusters,
> > --convergenceDelta=0.5,
> > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > --endPhase=2147483647, --input=/input/vectorized/tfidf-vectors,
> > --maxIter=10, --method=mapreduce, --numClusters=100,
> > --output=/output/kmeans/clusters, --overwrite=null, --startPhase=0,
> > --tempDir=temp}
> > 11/12/20 22:38:27 INFO common.HadoopUtil: Deleting /output/kmeans/initialclusters
> > 11/12/20 22:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> > 11/12/20 22:38:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> > 11/12/20 22:38:27 INFO compress.CodecPool: Got brand-new compressor
> > Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> > at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> > at java.util.ArrayList.get(ArrayList.java:322)
> > at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
> > at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:597)
> > at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> > HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> > MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
> > 11/12/20 22:38:29 INFO common.AbstractJob: Command line arguments:
> > {--dictionary=/input/vectorized/dictionary.file-0,
> > --dictionaryType=sequencefile, --endPhase=2147483647, --numWords=5,
> > --pointsDir=/output/kmeans/clusters/clusteredPoints,
> > --seqFileDir=/output/kmeans/clusters/clusters-1, --startPhase=0,
> > --tempDir=temp}
> > 11/12/20 22:38:30 INFO driver.MahoutDriver: Program took 795 ms
> > pd@PeriyaData:~/bigdata/examples/bin$
> > ====================================================================
> >
> > Your suggestions are much appreciated.
> >
> > Thanks,
> > PD
> >
>
