Re: Error running KMeans clustering

Suneel Marthi Wed, 21 Dec 2011 10:40:05 -0800

Did you try running with the version from trunk? It looks like you are
running Mahout 0.5.




________________________________
 From: Periya.Data <[email protected]>
To: [email protected] 
Sent: Wednesday, December 21, 2011 12:53 PM
Subject: Error running KMeans clustering
 
Hi all,
I am getting a similar error when I run KMeans clustering. I have posted
the same question on the Manning forums, but thought this wider community
might have answers. I have seen similar problems posted in the forum, but
it is not clear to me how they were resolved.

I am able to take a text file, convert it to seq files, and then to vector
files without issues. The seq and vector file sizes are large and
reasonable. But when I run the KMeans clustering, it fails with
"IndexOutOfBoundsException: Index: 1, Size: 1".
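
One possible reading of "Index: 1, Size: 1" is that the seed step found only
one vector to sample from while --numClusters asked for 100, which could
happen if a single text file became a single document. That is an assumption,
not something confirmed in this thread. A minimal sketch of a pre-check
follows; check_k_against_vectors and parse_count are hypothetical helper
names, and the seqdumper flags, the "Count:" output line, and the part file
name in the usage comment are assumptions to verify against your Mahout
version.

```shell
#!/bin/bash

# Hypothetical helper: succeed only if there are at least as many
# vectors as requested clusters.
check_k_against_vectors() {
  local num_vectors="$1" num_clusters="$2"
  case "$num_vectors" in
    ''|*[!0-9]*) echo "vector count is not a number: '$num_vectors'" >&2; return 2 ;;
  esac
  if [ "$num_vectors" -lt "$num_clusters" ]; then
    echo "only $num_vectors vector(s) for --numClusters $num_clusters" >&2
    return 1
  fi
  return 0
}

# Hypothetical helper: read text on stdin and print the number from the
# last line of the form "Count: N" (assumed output format).
parse_count() {
  sed -n 's/^Count:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | tail -n 1
}

# Possible usage (flags and file name are assumptions, not verified here):
#   n=$($MAHOUT_HOME/bin/mahout seqdumper \
#         --seqFile /input/vectorized/tfidf-vectors/part-r-00000 --count \
#       | parse_count)
#   check_k_against_vectors "$n" 100 || exit 1
```

If the count turns out to be smaller than the requested number of clusters,
either a smaller --numClusters or an input split into more documents would be
the things to try first.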

- Ubuntu 11.10
- Mahout 0.5-cdh3u2
- Hadoop 0.20.2-cdh3u2
- Pseudo-distributed mode, with intermediate outputs written to HDFS.
=============================================================

#!/bin/bash

$MAHOUT_HOME/bin/mahout kmeans --input /input/vectorized/tfidf-vectors \
--output /output/kmeans/clusters \
--clusters /output/kmeans/initialclusters \
--maxIter 10 \
--numClusters 100 \
--clustering \
--overwrite
wait

$MAHOUT_HOME/bin/mahout clusterdump \
--seqFileDir /output/kmeans/clusters/clusters-1 \
--pointsDir /output/kmeans/clusters/clusteredPoints \
--numWords 5 \
--dictionary /input/vectorized/dictionary.file-0 \
--dictionaryType sequencefile

=================================================================

pd@PeriyaData:~/bigdata/examples/bin$ ./bigdata_kmeans.sh
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
11/12/20 22:38:26 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=/output/kmeans/initialclusters,
--convergenceDelta=0.5,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=/input/vectorized/tfidf-vectors,
--maxIter=10, --method=mapreduce, --numClusters=100,
--output=/output/kmeans/clusters, --overwrite=null, --startPhase=0,
--tempDir=temp}
11/12/20 22:38:27 INFO common.HadoopUtil: Deleting /output/kmeans/initialclusters
11/12/20 22:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/12/20 22:38:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/12/20 22:38:27 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
MAHOUT-JOB: /home/pd/CDH3/mahout/mahout-examples-0.5-cdh3u2-job.jar
11/12/20 22:38:29 INFO common.AbstractJob: Command line arguments:
{--dictionary=/input/vectorized/dictionary.file-0,
--dictionaryType=sequencefile, --endPhase=2147483647, --numWords=5,
--pointsDir=/output/kmeans/clusters/clusteredPoints,
--seqFileDir=/output/kmeans/clusters/clusters-1, --startPhase=0,
--tempDir=temp}
11/12/20 22:38:30 INFO driver.MahoutDriver: Program took 795 ms
pd@PeriyaData:~/bigdata/examples/bin$
====================================================================
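
The log above shows clusterdump starting right after the kmeans step threw
its exception, so the second step ran against clusters that were never
written. A minimal sketch of stopping the script at the first failing step
follows. run_step is a hypothetical helper name, and the sketch assumes
bin/mahout exits with a non-zero status when the job throws, which is not
confirmed in this thread.

```shell
#!/bin/bash

# Hypothetical helper: run one command, report a failure on stderr,
# and hand the command's exit status back to the caller.
run_step() {
  echo "running: $*" >&2
  "$@" || {
    local status=$?
    echo "step failed with status $status: $*" >&2
    return "$status"
  }
}

# Possible usage with the commands from the script above; the second
# step only runs if the first one succeeded:
#   run_step "$MAHOUT_HOME/bin/mahout" kmeans \
#       --input /input/vectorized/tfidf-vectors \
#       --output /output/kmeans/clusters \
#       --clusters /output/kmeans/initialclusters \
#       --maxIter 10 --numClusters 100 --clustering --overwrite \
#   && run_step "$MAHOUT_HOME/bin/mahout" clusterdump \
#       --seqFileDir /output/kmeans/clusters/clusters-1 \
#       --pointsDir /output/kmeans/clusters/clusteredPoints \
#       --numWords 5 \
#       --dictionary /input/vectorized/dictionary.file-0 \
#       --dictionaryType sequencefile
```

Chaining with && keeps the original two-step layout while making the
dependency between the steps explicit; putting "set -e" at the top of the
script would be a blunter alternative.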

Your suggestions are much appreciated.

Thanks,
PD
