Exceptions when running kmeans from the mahout launcher

Ahmad Ammari Wed, 16 Nov 2011 02:05:54 -0800

Hello,

When I run the kmeans program from the mahout launcher (bin/mahout
kmeans) to cluster the reuters dataset on a single-node Hadoop cluster
(HDFS) on my Ubuntu laptop, I get an IndexOutOfBoundsException.
The input directory (reuters-vectors/tfidf-vectors/) resides on HDFS.
I have checked, and it does exist.


Here is what I run:

bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ \
  -c reuters-initial-clusters -o reuters-kmeans-clusters \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -cd 1.0 -k 20 -x 20 -cl

And here is what I get:

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/src/conf
11/11/15 14:37:05 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/user/admin2/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
11/11/15 14:37:05 INFO common.HadoopUtil: Deleting reuters-initial-clusters
11/11/15 14:37:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/11/15 14:37:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/11/15 14:37:06 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

The previous mahout launcher programs, which create the sequence files
(seqdirectory) and the vectors (seq2sparse) from the reuters directory of
text files, worked fine for me. I have checked my HDFS file system,
and the input directory reuters-vectors/tfidf-vectors exists there.

I then copied the input directory (reuters-vectors/tfidf-vectors)
from HDFS to the local file system and ran the mahout launcher script
again. Now I get a FileNotFoundException instead of the
IndexOutOfBoundsException:

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/11/15 19:19:42 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=examples/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
11/11/15 19:19:42 INFO common.HadoopUtil: Deleting reuters-initial-clusters
Exception in thread "main" java.io.FileNotFoundException: File does not exist: examples/reuters-vectors/tfidf-vectors
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:67)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

What could be wrong? Why isn't mahout finding the input directory/files?
And where should the input directory be: on HDFS or on the local file
system?

Many thanks in advance,
Ahmad
