Divya, are you using just one input file? As far as I understand, seqdirectory creates one document per file in your input directory. When you try to cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException while the random initial clusters are being generated (RandomSeedGenerator.buildRandom in your stack trace). That is just as well, because the output wouldn't be very interesting anyway.
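To illustrate the constraint: seeding k-means with k random starting points needs at least k input documents. The snippet below is only a minimal sketch of that idea, not Mahout's RandomSeedGenerator; the function name `pick_random_seeds` and its parameters are hypothetical.

```python
import random


def pick_random_seeds(documents, k, rng=None):
    """Pick k distinct documents to act as initial cluster centers.

    Hypothetical helper for illustration only; it is not Mahout code.
    """
    if rng is None:
        rng = random.Random()
    if len(documents) < k:
        # With fewer documents than clusters there is nothing left to pick,
        # so fail with a clear message instead of an index error.
        raise ValueError(
            "need at least %d documents to seed %d clusters, got %d"
            % (k, k, len(documents))
        )
    # random.sample returns k distinct elements chosen without replacement.
    return rng.sample(documents, k)
```

With a single document and k = 10 this raises the ValueError; with 10 or more documents it returns 10 distinct seeds.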
Break the XML into at least 10 documents, and you should have better luck.

-Matt

On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:
> Hi,
>
> Steps I am following for K Means clustering :
>
> I am using one of the chunk of Wikipedia as an input
>
> Convert XML into sequence format
>
> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
>
> Convert Sequence format to Vector format
>
> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
>
> Cluster data
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
>
> Whenever I am trying to run Kmeans clustering having XML file as an input
> I am getting following error
>
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> Am I not suppose to use XML file as an input?
>
> Regards,
> Divya
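
One possible way to break the XML into separate documents, as suggested at the top of this reply, is to write each `<page>` element of the dump to its own file. This is only a rough sketch: it assumes the articles are wrapped in `<page>` elements (as in MediaWiki export dumps), and the function name `split_pages` and the output file naming are made up for illustration.

```python
import os
import xml.etree.ElementTree as ET


def split_pages(dump_path, out_dir, max_pages=None):
    """Write every <page> element in dump_path to its own file in out_dir.

    Returns the number of files written. Hypothetical helper, not part of Mahout.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the file, so a large dump is not loaded all at once.
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        # Drop any "{namespace}" prefix before comparing the tag name.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        count += 1
        out_path = os.path.join(out_dir, "page_%07d.xml" % count)
        ET.ElementTree(elem).write(out_path, encoding="utf-8", xml_declaration=True)
        # Release the children of the page that was just written.
        elem.clear()
        if max_pages is not None and count >= max_pages:
            break
    return count
```

Pointing the `-i` argument of seqdirectory at the resulting directory should then give one document per page, provided seqdirectory really does create one document per input file.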
