Hello,

When I run the kmeans program from the Mahout launcher (bin/mahout
kmeans) to cluster the Reuters dataset on a single-node Hadoop cluster
(HDFS) on my Ubuntu laptop, I get an IndexOutOfBoundsException.
The input directory (reuters-vectors/tfidf-vectors/) resides on the Hadoop
HDFS file system. I have checked, and it does exist.

Here is what I run:

bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c
reuters-initial-clusters -o reuters-kmeans-clusters -dm
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0
-k 20 -x 20 -cl

And here is what I get:

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/src/conf
11/11/15 14:37:05 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=reuters-initial-clusters,
--convergenceDelta=1.0,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=/user/admin2/reuters-vectors/tfidf-vectors/,
--maxIter=20, --method=mapreduce, --numClusters=20,
--output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
11/11/15 14:37:05 INFO common.HadoopUtil: Deleting reuters-initial-clusters
11/11/15 14:37:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/11/15 14:37:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/11/15 14:37:06 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

The earlier Mahout launcher programs, which create the sequence files
(seqdirectory) and the vectors (seq2sparse) from the Reuters directory of
text files, worked fine for me. I have checked my HDFS file system,
and the input directory reuters-vectors/tfidf-vectors exists there.
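
For anyone reproducing the check described above, here is a minimal sketch of how the directory can be verified from the shell. It assumes the `hadoop` command is on the PATH; the `HADOOP` variable and the `check_hdfs_input` function name are made up for illustration and are not part of Hadoop or Mahout. It only shows whether the path exists and what it contains; it does not prove the vector files are readable.

```shell
#!/bin/sh
# Sketch: confirm that a directory exists on the configured file system
# and list its contents. HADOOP is an assumed override so that a
# different binary can be substituted for the default 'hadoop' command.
HADOOP="${HADOOP:-hadoop}"

check_hdfs_input() {
    dir="$1"
    # 'fs -test -e' exits with status 0 when the path exists.
    if ! "$HADOOP" fs -test -e "$dir"; then
        echo "missing: $dir"
        return 1
    fi
    # List the directory so the part files and their sizes can be inspected.
    "$HADOOP" fs -ls "$dir"
}

# Example call (commented out so that sourcing this file has no side effects):
# check_hdfs_input reuters-vectors/tfidf-vectors
```

A relative path such as reuters-vectors/tfidf-vectors is resolved against the HDFS home directory, which matches the --input=/user/admin2/reuters-vectors/tfidf-vectors/ value printed in the log above.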

I then copied the input directory (reuters-vectors/tfidf-vectors)
from HDFS to the local file system and ran the Mahout launcher
script again. Now I get a FileNotFoundException instead of the
IndexOutOfBoundsException:

Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/11/15 19:19:42 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=reuters-initial-clusters,
--convergenceDelta=1.0,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=examples/reuters-vectors/tfidf-vectors/,
--maxIter=20, --method=mapreduce, --numClusters=20,
--output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
11/11/15 19:19:42 INFO common.HadoopUtil: Deleting reuters-initial-clusters
Exception in thread "main" java.io.FileNotFoundException: File does not exist: examples/reuters-vectors/tfidf-vectors
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:67)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

What could be wrong? Why isn't Mahout finding the input directory and its
files? And where should the input directory be: on HDFS or on the local
file system?

Many thanks in advance,
Ahmad
