Usually we see this error when the expected input vectors are not present. Often it is a configuration issue. Verify your paths exist and that there are vectors where you expect them.
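A rough sketch of that check from the shell, in case it helps. The function name `check_vectors` and the `HADOOP_CMD` variable are made up for this example; `hadoop fs -test -d` and `hadoop fs -ls` are the standard HDFS shell commands. It assumes a relative path is resolved against your HDFS home directory, as the `--input=/user/admin2/...` value in the log below suggests.

```shell
# Sketch only: check_vectors and HADOOP_CMD are hypothetical names for this example.
# HADOOP_CMD lets you point at a specific hadoop launcher; it defaults to "hadoop".
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

check_vectors() {
  dir="$1"
  # "hadoop fs -test -d" exits 0 only if the path exists on HDFS and is a directory.
  if ! "$HADOOP_CMD" fs -test -d "$dir"; then
    echo "missing: $dir"
    return 1
  fi
  # List the directory so the part files and their sizes can be inspected by eye.
  "$HADOOP_CMD" fs -ls "$dir"
}
```

Calling `check_vectors reuters-vectors/tfidf-vectors/` should either print "missing: ..." or list the directory. An empty listing, or part files of zero length, would be consistent with the "Index: 0, Size: 0" failure.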
-----Original Message-----
From: Ahmad Ammari [mailto:[email protected]]
Sent: Wednesday, November 16, 2011 1:58 AM
To: [email protected]
Subject: Exceptions when running kmeans from the mahout launcher

Hello,

When trying to run the kmeans program from the mahout launcher (bin/mahout kmeans) to cluster the reuters dataset on a single-node Hadoop cluster (HDFS) on my Ubuntu laptop, I got an IndexOutOfBoundsException. The input directory (reuters-vectors/tfidf-vectors/) resides on the Hadoop HDFS file system. I have checked that, and it does exist.

Here is what I run:

    bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl

And here is what I get:

    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    No HADOOP_CONF_DIR set, using /usr/local/hadoop/src/conf
    11/11/15 14:37:05 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=/user/admin2/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
    11/11/15 14:37:05 INFO common.HadoopUtil: Deleting reuters-initial-clusters
    11/11/15 14:37:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    11/11/15 14:37:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    11/11/15 14:37:06 INFO compress.CodecPool: Got brand-new compressor
    Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:108)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:58)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

The previous mahout launcher programs, which create the sequence files (seqdirectory) and the vectors (seq2sparse) from the reuters directory containing the text files, worked fine for me. I have checked my HDFS file system, and the input directory reuters-vectors/tfidf-vectors exists there.

I then tried copying the input directory (reuters-vectors/tfidf-vectors) from HDFS to the local file system and ran the mahout launcher script again. Now I am getting a FileNotFoundException instead of the IndexOutOfBoundsException:

    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
    11/11/15 19:19:42 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=examples/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
    11/11/15 19:19:42 INFO common.HadoopUtil: Deleting reuters-initial-clusters
    Exception in thread "main" java.io.FileNotFoundException: File does not exist: examples/reuters-vectors/tfidf-vectors
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:67)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

What could be wrong? Why isn't Mahout finding the input directory and files? And where should the input directory be: on HDFS or on the local file system?

Many thanks in advance,
Ahmad
