Re: RE: Kmeans Clustering error with XML input

Matt Spitz Wed, 03 Nov 2010 18:44:09 -0700

Yes. One file = one document.

Break the file into meaningful documents, one per file, and you should be
golden.  The algorithm will then cluster these documents.
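
A minimal sketch of that split, in Python, assuming the chunk follows the usual MediaWiki export layout (one <page> element per article, with a <title> and a <text> somewhere inside it). The function name split_pages, the doc-NNNNNN.txt file naming, and the min_docs check are my own illustration and not part of Mahout:

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(dump_path, out_dir, min_docs=10):
    """Write the title and text of each <page> in dump_path to its own file.

    Returns the number of files written. Raises ValueError when fewer than
    min_docs pages were found, since picking k random seeds needs at least
    k documents.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse reads the file incrementally and yields each element
    # once its closing tag has been seen.
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        out_path = os.path.join(out_dir, "doc-%06d.txt" % count)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(title + "\n\n" + text)
        count += 1
        elem.clear()  # drop the page's children once they are written
    if count < min_docs:
        raise ValueError("only %d documents; need at least %d" % (count, min_docs))
    return count
```

Pointing the -i option of seqdirectory at the output directory should then give one document per page. The min_docs default mirrors the -k 10 in the kmeans command quoted below.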


---
Sent while mobile. Please forgive brevity and typos.
On Nov 3, 2010 9:37 PM, "Divya" <[email protected]> wrote:
> Hi,
>
> My XML input file is just 64 MB, i.e. I am using one of the chunks from the
> Wikipedia example.
> Do I still need to break this XML up to get rid of the error below?
>
>
> Thanks in advance
> Regards,
> Divya
>
> -----Original Message-----
> From: Matt Spitz [mailto:[email protected]]
> Sent: Wednesday, November 03, 2010 8:54 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Kmeans Clustering error with XML input
>
> Divya-
>
> Are you using just one input file? As far as I understand, seqdirectory
> creates one document per file in your input directory. When you try to
> cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> when generating the random input clusters. Which is just as well, because
> your output won't be very interesting, anyway.
>
> Break the XML into at least 10 documents, and you should have better luck.
>
> -Matt
>
> On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:
>
>> Hi,
>>
>>
>>
>> Steps I am following for k-means clustering:
>>
>> I am using one of the Wikipedia chunks as input.
>>
>>
>>
>> Convert XML into sequence format
>>
>> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
>>
>>
>>
>> Convert Sequence format to Vector format
>>
>> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
>>
>>
>>
>> Cluster data
>>
>> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
>>
>>
>>
>>
>>
>> Whenever I try to run k-means clustering with an XML file as input, I get
>> the following error:
>>
>>
>>
>> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
>> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
>> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
>> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
>> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
>>
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>>   at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>   at java.util.ArrayList.get(ArrayList.java:322)
>>   at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>>   at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>   at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>
>>
>>
>>
>>
>>
>> Am I not supposed to use an XML file as input?
>>
>>
>>
>>
>>
>> Regards,
>>
>> Divya
>>
>>
>
