Re: Error running KMeans clustering

Periya.Data Wed, 21 Dec 2011 12:31:09 -0800

Hi Gary,
    Should I add anything to my classpath? I have MAHOUT_HOME defined
properly in my ~/.bash_profile.


CDH3_HOME=/home/pd/CDH3
MAHOUT_HOME=$CDH3_HOME/mahout
[...]

pd@PeriyaData:~/bigdata/examples/bin$ echo $CLASSPATH
.:/home/pd/CDH3/hive/lib:/home/pd/CDH3/hive/lib/*:/home/pd/CDH3/hadoop/hadoop-core-0.20.2-cdh3u2.jar
pd@PeriyaData:~/bigdata/examples/bin$
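
To make the CLASSPATH above easier to read, here is a minimal sketch that splits it on ':' and reports whether any entry mentions Mahout. The function names are hypothetical. Whether Mahout needs to be on CLASSPATH at all is an assumption I have not verified; the log in my original mail shows bin/mahout finding its job jar under /home/pd/CDH3/mahout without it.

```shell
#!/bin/bash
# Print each CLASSPATH entry on its own line (entries are ':'-separated).
# Falls back to $CLASSPATH when no argument is given.
list_classpath_entries() {
    local cp="${1:-$CLASSPATH}"
    printf '%s\n' "$cp" | tr ':' '\n'
}

# Succeed if any entry contains "mahout" (case-insensitive match).
has_mahout_entry() {
    list_classpath_entries "$1" | grep -qi 'mahout'
}

if has_mahout_entry "$CLASSPATH"; then
    echo "CLASSPATH has a Mahout entry"
else
    echo "CLASSPATH has no Mahout entry"
fi
```

With the value echoed above, none of the four entries mentions Mahout.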

So, I think I have three options to explore now:
- use the latest from the trunk, build it myself and use.
- use 0.6-SNAPSHOT
- use MAHOUT_LOCAL=true
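
The third option could be sketched roughly as below: the same kmeans invocation from my script, run with MAHOUT_LOCAL=true set for that one command. The function name and the DRY_RUN switch are hypothetical helpers. I am assuming that in local mode the /input and /output paths are resolved on the local filesystem rather than in HDFS, so the vectors would need to exist there too.

```shell
#!/bin/bash
# Run (or, with DRY_RUN set, just print) the kmeans step in local mode.
# The default MAHOUT_HOME is the value from my ~/.bash_profile.
run_kmeans_local() {
    local cmd=(
        "${MAHOUT_HOME:-/home/pd/CDH3/mahout}/bin/mahout" kmeans
        --input /input/vectorized/tfidf-vectors
        --output /output/kmeans/clusters
        --clusters /output/kmeans/initialclusters
        --maxIter 10 --numClusters 100 --clustering --overwrite
    )
    if [ -n "$DRY_RUN" ]; then
        # Show the command line without executing it.
        echo "MAHOUT_LOCAL=true ${cmd[*]}"
    else
        # MAHOUT_LOCAL applies to this one invocation only.
        MAHOUT_LOCAL=true "${cmd[@]}"
    fi
}

# Print the command that would be run; unset DRY_RUN to execute it.
DRY_RUN=1 run_kmeans_local
```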

Maybe my vectorized file (particularly the tfidf-vector) is the culprit...I
am still exploring.
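
Since Gary's problem came down to the seq files not being populated in HDFS, a quick sanity check before running kmeans might look like the sketch below. It only tests whether "hadoop fs -ls" can list each path. HADOOP_BIN and check_hdfs_path are hypothetical names, and a path that lists fine can still be empty or too small, so this is a first filter rather than a diagnosis.

```shell
#!/bin/bash
# HADOOP_BIN is a hypothetical override for the hadoop launcher;
# it defaults to whatever "hadoop" is found on PATH.
HADOOP_BIN="${HADOOP_BIN:-hadoop}"

# Print "OK <path>" if "hadoop fs -ls" can list the path,
# otherwise print "MISSING <path>" and return non-zero.
# Note: a missing hadoop binary is also reported as MISSING.
check_hdfs_path() {
    if "$HADOOP_BIN" fs -ls "$1" >/dev/null 2>&1; then
        echo "OK $1"
    else
        echo "MISSING $1"
        return 1
    fi
}

# Check every path given on the command line.
for p in "$@"; do
    check_hdfs_path "$p"
done
```

Saved under a hypothetical name such as check_hdfs.sh, it would be called with the kmeans inputs, e.g. /input/vectorized/tfidf-vectors and /input/vectorized/dictionary.file-0. Following up with "hadoop fs -du" on the same paths shows the sizes.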

Thanks,
/PD


On Wed, Dec 21, 2011 at 12:18 PM, Gary Snider <[email protected]> wrote:

> I hope 0.6-SNAPSHOT works for you.  I could never run the reuters example
> out of the box when running on hadoop.  I had to set MAHOUT_LOCAL=true for
> it to work
>
> As far as 0.5, I also got IndexOutOfBoundsException running KMeans.  But it
> resolved itself eventually and I will try to dig up what I did to fix it.
>  It definitely had to do with the seq files not being populated in hdfs.
>
> Can you post your classpath variable (if any)?
>
> On Wed, Dec 21, 2011 at 2:35 PM, Periya.Data <[email protected]>
> wrote:
>
> > Yes, I am running 0.5. I shall try installing from the trunk.
> >
> > My subsequent email on vectorization...might be a problem since its size is
> > very small as compared to associated seq file.
> >
> > Thanks,
> > /PD.
> >
> > On Wed, Dec 21, 2011 at 10:39 AM, Suneel Marthi <[email protected]> wrote:
> >
> > > Did you try running with the version from trunk, seems like you are
> > > running Mahout 0.5?
> > >
> > >
> > >
> > > ________________________________
> > >  From: Periya.Data <[email protected]>
> > > To: [email protected]
> > > Sent: Wednesday, December 21, 2011 12:53 PM
> > > Subject: Error running KMeans clustering
> > >
> > > Hi all,
> > > I am getting similar issues while I run a Kmeans clustering. I have posted
> > > the same in manning-forums. But, thought this is a wider community to get
> > > answers from. I have seen similar problems posted in the forum, but, I am
> > > not clear about how it was resolved.
> > >
> > > I am able to take a text file, convert to seq files and then to vector
> > > files properly. No issues there. The seq and vector file sizes are large
> > > and reasonable. But, when I run the Kmeans clustering, it gives a similar
> > > error of "IndexOutOfBoundsException: Index: 1, Size: 1".
> > >
> > > - Ubuntu 11.10
> > > - Mahout 0.5-cdh3u2
> > > - Hadoop 0.20.2-cdh3u2
> > > - using pseudo-distributed mode, with intermediate outputs written to HDFS.
> > > =============================================================
> > >
> > > #!/bin/bash
> > >
> > > $MAHOUT_HOME/bin/mahout kmeans --input /input/vectorized/tfidf-vectors \
> > > --output /output/kmeans/clusters \
> > > --clusters /output/kmeans/initialclusters \
> > > --maxIter 10 \
> > > --numClusters 100 \
> > > --clustering \
> > > --overwrite
> > > wait
> > >
> > > $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir /output/kmeans/clusters/clusters-1 \
> > > --pointsDir /output/kmeans/clusters/clusteredPoints \
> > > --numWords 5 \
> > > --dictionary /input/vectorized/dictionary.file-0 \
> > > --dictionaryType sequencefile
> > >
> > > =================================================================
> > >
> > > pd@PeriyaData:~/bigdata/examples/bin$ ./bigdata_kmeans.sh
> > > Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> > > HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> > > MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
> > > 11/12/20 22:38:26 INFO common.AbstractJob: Command line arguments:
> > > {--clustering=null, --clusters=/output/kmeans/initialclusters,
> > > --convergenceDelta=0.5,
> > > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > > --endPhase=2147483647, --input=/input/vectorized/tfidf-vectors,
> > > --maxIter=10, --method=mapreduce, --numClusters=100,
> > > --output=/output/kmeans/clusters, --overwrite=null, --startPhase=0,
> > > --tempDir=temp}
> > > 11/12/20 22:38:27 INFO common.HadoopUtil: Deleting
> > > /output/kmeans/initialclusters
> > > 11/12/20 22:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> > > 11/12/20 22:38:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> > > 11/12/20 22:38:27 INFO compress.CodecPool: Got brand-new compressor
> > > Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> > > at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> > > at java.util.ArrayList.get(ArrayList.java:322)
> > > at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
> > > at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
> > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > at java.lang.reflect.Method.invoke(Method.java:597)
> > > at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> > > Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> > > HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> > > MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
> > > 11/12/20 22:38:29 INFO common.AbstractJob: Command line arguments:
> > > {--dictionary=/input/vectorized/dictionary.file-0,
> > > --dictionaryType=sequencefile, --endPhase=2147483647, --numWords=5,
> > > --pointsDir=/output/kmeans/clusters/clusteredPoints,
> > > --seqFileDir=/output/kmeans/clusters/clusters-1, --startPhase=0,
> > > --tempDir=temp}
> > > 11/12/20 22:38:30 INFO driver.MahoutDriver: Program took 795 ms
> > > pd@PeriyaData:~/bigdata/examples/bin$
> > > ====================================================================
> > >
> > > Your suggestions are much appreciated.
> > >
> > > Thanks,
> > > PD
> > >
> >
>
