Re: Kmeans Clustering error with XML input

Matt Spitz Wed, 03 Nov 2010 05:54:35 -0700

Divya-

Are you using just one input file?  As far as I understand, seqdirectory
creates one document per file in your input directory.  When you try to
cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
when generating the random input clusters.  That's just as well, because
the output wouldn't be very interesting anyway.
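
To illustrate the failure, here is a rough Python sketch. It is not Mahout's
RandomSeedGenerator code; the function name pick_seeds and the
reservoir-sampling approach are assumptions made for illustration only. The
point is that reading back k seeds fails when fewer than k vectors were seen.

```python
import random


def pick_seeds(vectors, k, rng=None):
    """Hypothetical seed picker: sample up to k vectors, then read back exactly k."""
    rng = rng or random.Random(0)
    chosen = []
    for i, vector in enumerate(vectors):
        if len(chosen) < k:
            # Fill the reservoir until it holds k candidates.
            chosen.append(vector)
        else:
            # Replace an existing candidate with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                chosen[j] = vector
    # Reading k entries raises IndexError if fewer than k vectors were available.
    return [chosen[i] for i in range(k)]
```

With a single input vector and k = 10, the final read-back raises an
IndexError at index 1, which resembles the "Index: 1, Size: 1" message in
the trace below. With 10 or more vectors it returns 10 seeds.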


Break the XML into at least 10 documents, and you should have better luck.
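
One possible way to do the split, as a minimal Python sketch. It assumes the
dump follows the usual MediaWiki export layout (page elements containing
title and text elements); the function name split_pages and the output file
naming are made up for illustration.

```python
import os
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir):
    """Write one text file per <page> element and return the number written."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the file, so the whole dump is not parsed up front.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        out_path = os.path.join(out_dir, "page-%06d.txt" % count)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(title + "\n\n" + text)
        count += 1
        # Drop the page's children once written to limit memory use.
        elem.clear()
    return count
```

Pointing seqdirectory at the resulting directory should then give one
document per page, so there are enough documents to seed 10 clusters.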

-Matt

On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:

> Hi,
>
>
>
> Steps I am following for K Means clustering:
>
> I am using one of the chunks of Wikipedia as input.
>
>
>
> Convert XML into sequence format
>
> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
>
>
>
> Convert Sequence format to Vector format
>
> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
>
>
>
> Cluster data
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
>
>
>
>
>
> Whenever I try to run Kmeans clustering with an XML file as input, I get
> the following error:
>
>
>
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
>
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>        at java.util.ArrayList.get(ArrayList.java:322)
>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
>
>
>
>
>
>
> Am I not supposed to use an XML file as input?
>
>
>
>
>
> Regards,
>
> Divya
>
>
