Re: RE: Kmeans Clustering error with XML input

Matt Spitz Mon, 08 Nov 2010 05:53:09 -0800

Divya,

'seqdirectory' creates a document for every file in the directory you pass
in.  If there's just one file, there's just one document, and that's not
very interesting.


You basically have two options:
1) Parse the XML file once and break it into 1000s of little files (one per
document, however you define it)
2) Write a new 'seqdirectory' that creates a sequence file based on parsed
XML input.  This actually isn't too difficult, as the seqdirectory code is
pretty straightforward (thanks to whoever did that!).
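
For option 1, a minimal Python sketch of the split might look like the
following. It assumes the documents are delimited by <page> elements, as in
the Wikipedia dump. The function name `split_pages`, the `doc-NNNNNN.txt`
file naming, and the choice to keep only the text content of each page are
assumptions made for illustration, not anything Mahout requires.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir, tag="page"):
    """Write each <page> element of xml_path to its own file in out_dir.

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the file, so the whole dump is not held in memory.
    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == tag:
            # Keep only the text content (an assumption: markup is dropped).
            text = " ".join(t.strip() for t in elem.itertext() if t.strip())
            out_path = os.path.join(out_dir, "doc-%06d.txt" % count)
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(text)
            count += 1
            root.clear()  # discard processed elements to keep memory flat
    return count
```

You would then point 'seqdirectory' at the output directory rather than at
the original XML file, and make sure the number of files written is at least
the number of clusters you ask for.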

-Matt

On Mon, Nov 8, 2010 at 1:39 AM, Divya <[email protected]> wrote:

>  Hi Matt,
>
> I have an XML input file like Wikipedia XML and try to find similar
> documents using K means clustering.
>
> But if I pass the whole XML file (size 64 MB) as input to k-means clustering, I
> get an error.
>
> According to your short answer, if I have 1000s of documents in an XML file,
> I should split my XML file into 1000s of chunks.
>
> Is there any other way I can get similar documents?
>
> Regards,
>
> Divya
>
>
>
> *From:* Matt Spitz [mailto:[email protected]]
> *Sent:* Thursday, November 04, 2010 8:46 PM
> *To:* Divya
> *Cc:* [email protected]
>
> *Subject:* Re: RE: Kmeans Clustering error with XML input
>
>
>
> Divya-
>
>
>
> A document is what the clustering algorithm operates on.  It finds
> similarities among the documents and places similar documents into clusters.
>  The 'seqdirectory' command expects you to have a single document in every
> file in the input directory.  What do you expect to happen with your
> Wikipedia clustering?  What are you trying to do?
>
>
>
> Short answer: yes, split the XML file by the <page> tags, putting each
> <page> element in its own separate file.
>
>
>
> -Matt
>
>
>
> On Wed, Nov 3, 2010 at 10:26 PM, Divya <[email protected]> wrote:
>
> Hi Matt,
> I have split my file into 10 chunks of 10 MB each.
> I am still getting the error.
> Do you mean that I should split the XML file by element (in the Wikipedia
> example, <page> </page>)?
>
> I didn't understand what "one file = one document" meant.
>
> Regards,
> Divya
>
>
>
>
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors \
>     -o D:/MahoutResult/wikipedia/Kmeans \
>     -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>     -c D:/MahoutResult/wikipedia/Kmeans -k 10 -x 20 -ow -cl
>
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
>
> 10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 10/11/04 10:21:22 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/Kmeans
> 10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
>
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>        at java.util.ArrayList.get(ArrayList.java:322)
>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> -----Original Message-----
> From: Matt Spitz [mailto:[email protected]]
>
> Sent: Thursday, November 04, 2010 9:44 AM
> To: [email protected]
> Cc: [email protected]
>
> Subject: Re: RE: Kmeans Clustering error with XML input
>
> Yes. One file = one document.
>
> Break the file into meaningful documents, one per file, and you should be
> golden.  The algorithm will then cluster these documents.
>
> ---
> Sent while mobile. Please forgive brevity and typos.
> On Nov 3, 2010 9:37 PM, "Divya" <[email protected]> wrote:
> > Hi,
> >
> > My XML input file is just 64 MB, i.e. I am using one of the chunks of the
> > Wikipedia example.
> > Do I still need to break up this XML to get rid of the error below?
> >
> >
> > Thanks in advance
> > Regards,
> > Divya
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:[email protected]]
> > Sent: Wednesday, November 03, 2010 8:54 PM
> > To: [email protected]
> > Cc: [email protected]
> > Subject: Re: Kmeans Clustering error with XML input
> >
> > Divya-
> >
> > Are you using just one input file? As far as I understand, seqdirectory
> > creates one document per file in your input directory. When you try to
> > cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> > when generating the random input clusters. Which is just as well, because
> > your output won't be very interesting, anyway.
> >
> > Break the XML into at least 10 documents, and you should have better
> > luck.
> >
> > -Matt
> >
> > On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:
> >
> >> Hi,
> >>
> >>
> >>
> >> Steps I am following for k-means clustering:
> >>
> >> I am using one of the Wikipedia chunks as input.
> >>
> >>
> >>
> >> Convert XML into sequence format
> >>
> >> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input \
> >>     -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
> >>
> >>
> >>
> >> Convert Sequence format to Vector format
> >>
> >> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml \
> >>     -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
> >>
> >>
> >>
> >> Cluster data
> >>
> >> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors \
> >>     -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans \
> >>     -k 10 -x 20 -ow -cl
> >>
> >>
> >>
> >>
> >>
> >> Whenever I try to run k-means clustering with an XML file as input,
> >> I get the following error:
> >>
> >>
> >>
> >> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> >> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> >> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> >> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
> >> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
> >>
> >> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> >>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> >>     at java.util.ArrayList.get(ArrayList.java:322)
> >>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
> >>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> Am I not supposed to use an XML file as input?
> >>
> >>
> >>
> >>
> >>
> >> Regards,
> >>
> >> Divya
> >>
> >>
> >
>
>
>
