RE: RE: Kmeans Clustering error with XML input

Divya Sun, 07 Nov 2010 22:40:29 -0800

Hi Matt,

I have an XML input file similar to the Wikipedia XML dump, and I am trying
to find similar documents using k-means clustering.


But if I pass the whole XML file (64 MB) as input to k-means clustering, I
get an error.

According to your short answer, if I have thousands of documents in one XML
file, I should split that XML file into thousands of files, one per document.

 

Is there any other way I can find similar documents?

Regards,

Divya 

 

From: Matt Spitz [mailto:[email protected]] 
Sent: Thursday, November 04, 2010 8:46 PM
To: Divya
Cc: [email protected]
Subject: Re: RE: Kmeans Clustering error with XML input

 

Divya-

 

A document is what the clustering algorithm operates on.  It finds
similarities among the documents and places similar documents into clusters.
The 'seqdirectory' command expects you to have a single document in every
file in the input directory.  What do you expect to happen with your
Wikipedia clustering?  What are you trying to do?

 

Short answer: yes, split the XML file by the <page> tags, putting each
<page> element in its own separate file.
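
A minimal sketch of that split, in Python rather than anything Mahout
provides. The function name split_pages, the output file naming, and the
use of the standard-library xml.etree.ElementTree parser are assumptions
made for illustration; only the <page> element name comes from the thread.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir, element="page"):
    """Write each <page> element of xml_path to its own file in out_dir.

    Hypothetical helper; returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the input, so the whole dump is not held in memory.
    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == element:
            elem.tail = None  # do not carry trailing whitespace into the file
            out_path = os.path.join(out_dir, "page-%06d.xml" % count)
            ET.ElementTree(elem).write(out_path, encoding="utf-8")
            count += 1
            root.clear()  # discard pages that have already been written
    return count
```

Calling split_pages("input.xml", "pages") would leave one file per <page>
in the pages directory, which can then be given to seqdirectory as the
input directory. The files still contain XML markup; writing only the text
of each page may give cleaner vectors, but that choice is not covered in
this thread.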

 

-Matt

 

On Wed, Nov 3, 2010 at 10:26 PM, Divya <[email protected]> wrote:

Hi Matt,
I have split my file into 10 chunks of 10 MB each, but I am still getting
the error.
Do you mean that I should split the XML file by element (in the Wikipedia
example, <page> </page>)?

I didn't understand what "one file = one document" means.

Regards,
Divya





$ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors \
    -o D:/MahoutResult/wikipedia/Kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -c D:/MahoutResult/wikipedia/Kmeans \
    -k 10 -x 20 -ow -cl

Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf

10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
10/11/04 10:21:22 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/Kmeans
10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
       at java.util.ArrayList.RangeCheck(ArrayList.java:547)
       at java.util.ArrayList.get(ArrayList.java:322)
       at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
       at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
       at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

-----Original Message-----
From: Matt Spitz [mailto:[email protected]]
Sent: Thursday, November 04, 2010 9:44 AM
To: [email protected]
Cc: [email protected]
Subject: Re: RE: Kmeans Clustering error with XML input

Yes. One file = one document.

Break the file into meaningful documents, one per file, and you should be
golden.  The algorithm will then cluster these documents.
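
To illustrate why the document count matters, here is a small Python
sketch of random seeding that checks the count up front. It is not
Mahout's RandomSeedGenerator; the function name pick_initial_centroids
and its parameters are hypothetical, chosen only to show the constraint
that k clusters need at least k documents to seed from.

```python
import random


def pick_initial_centroids(documents, k, seed=None):
    """Choose k distinct documents as starting centroids.

    Hypothetical helper: fails with a clear message when there are fewer
    documents than requested clusters, instead of an index error.
    """
    if len(documents) < k:
        raise ValueError(
            "need at least %d documents to seed %d clusters, got %d"
            % (k, k, len(documents))
        )
    # sample() returns k distinct items without replacement.
    return random.Random(seed).sample(documents, k)
```

With a single input file treated as one document, a call such as
pick_initial_centroids(["whole-dump.xml"], 10) raises the ValueError,
which mirrors the situation behind the IndexOutOfBoundsException in this
thread.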

---
Sent while mobile. Please forgive brevity and typos.
On Nov 3, 2010 9:37 PM, "Divya" <[email protected]> wrote:
> Hi,
>
> My XML input file is just 64 MB; I am using one of the chunks from the
> Wikipedia example.
> Do I still need to break up this XML file to get rid of the error below?
>
>
> Thanks in advance
> Regards,
> Divya
>
> -----Original Message-----
> From: Matt Spitz [mailto:[email protected]]
> Sent: Wednesday, November 03, 2010 8:54 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Kmeans Clustering error with XML input
>
> Divya-
>
> Are you using just one input file? As far as I understand, seqdirectory
> creates one document per file in your input directory. When you try to
> cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> when generating the random input clusters. Which is just as well, because
> your output won't be very interesting, anyway.
>
> Break the XML into at least 10 documents, and you should have better luck.
>
> -Matt
>
> On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:
>
>> Hi,
>>
>>
>>
>> Steps I am following for k-means clustering:
>>
>> I am using one of the chunks of Wikipedia as input.
>>
>> Convert XML into sequence format
>>
>> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
>>
>> Convert Sequence format to Vector format
>>
>> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
>>
>> Cluster data
>>
>> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
>>
>>
>>
>>
>>
>> Whenever I try to run k-means clustering with an XML file as input, I
>> get the following error:
>>
>>
>>
>> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
>> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
>> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
>> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
>> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>        at java.util.ArrayList.get(ArrayList.java:322)
>>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>
>> Am I not supposed to use an XML file as input?
>>
>>
>>
>>
>>
>> Regards,
>>
>> Divya
>>
>>
>
