Divya, 'seqdirectory' creates a document for every file in the directory you pass in. If there's just one file, there's just one document, and that's not very interesting.
You basically have two options:

1) Parse the XML file once and break it into thousands of little files (one per document, however you define it).
2) Write a new 'seqdirectory' that creates a sequence file from parsed XML input. This actually isn't too difficult, as the seqdirectory code is pretty straightforward (thanks to whoever wrote it!).

-Matt

On Mon, Nov 8, 2010 at 1:39 AM, Divya <[email protected]> wrote:

> Hi Matt,
>
> I have an XML input file like Wikipedia XML and try to find similar
> documents using K means clustering.
>
> But If pass whole XML file(size 64 MB) as an during kmeans clustering I am
> getting error.
>
> According to your short answer , if I have 1000 s documents in an XML file
> I should split my XML file in 1000s chunks.
>
> Is there any other way I can get similar documents ?
>
> Regards,
> Divya
>
> *From:* Matt Spitz [mailto:[email protected]]
> *Sent:* Thursday, November 04, 2010 8:46 PM
> *To:* Divya
> *Cc:* [email protected]
> *Subject:* Re: RE: Kmeans Clustering error with XML input
>
> Divya-
>
> A document is what the clustering algorithm operates on. It finds
> similarities among the documents and places similar documents into clusters.
> The 'seqdirectory' command expects you to have a single document in every
> file in the input directory. What do you expect to happen with your
> Wikipedia clustering? What are you trying to do?
>
> Short answer: yes, split the XML file by the <page> tags, putting each
> <page> element in its own separate file.
>
> -Matt
>
> On Wed, Nov 3, 2010 at 10:26 PM, Divya <[email protected]> wrote:
>
> Hi Matt,
> I have Split my file in 10 chunks of 10 MB each.
> Still getting the error.
> Do you mean the I should split XML file in (in wikipeadia example <page>
> </page>).
>
> I didn't understand what one file = one document meant to.
>
> Regards,
> Divya
>
> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/Kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -c D:/MahoutResult/wikipedia/Kmeans -k 10 -x 20 -ow -cl
>
> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> 10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> 10/11/04 10:21:22 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/Kmeans
> 10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>     at java.util.ArrayList.get(ArrayList.java:322)
>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> -----Original Message-----
> From: Matt Spitz [mailto:[email protected]]
> Sent: Thursday, November 04, 2010 9:44 AM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: RE: Kmeans Clustering error with XML input
>
> Yes. One file = one document.
>
> Break the file into meaningful documents, one per file, and you should be
> golden. The algorithm will then cluster these documents.
>
> ---
> Sent while mobile. Please forgive brevity and typos.
>
> On Nov 3, 2010 9:37 PM, "Divya" <[email protected]> wrote:
> > Hi,
> >
> > My XML input file is just 64 MB i.e. I am using one of the chunk of
> > Wikipedia example.
> > Still I need to break this XML to get rid of the below error?
> >
> > Thanks in advance
> > Regards,
> > Divya
> >
> > -----Original Message-----
> > From: Matt Spitz [mailto:[email protected]]
> > Sent: Wednesday, November 03, 2010 8:54 PM
> > To: [email protected]
> > Cc: [email protected]
> > Subject: Re: Kmeans Clustering error with XML input
> >
> > Divya-
> >
> > Are you using just one input file? As far as I understand, seqdirectory
> > creates one document per file in your input directory. When you try to
> > cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> > when generating the random input clusters. Which is just as well, because
> > your output won't be very interesting, anyway.
> >
> > Break the XML into at least 10 documents, and you should have better luck.
> >
> > -Matt
> >
> > On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Steps I am following for K Means clustering :
> >>
> >> I am using one of the chunk of Wikipedia as an input
> >>
> >> Convert XML into sequence format
> >>
> >> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
> >>
> >> Convert Sequence format to Vector format
> >>
> >> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
> >>
> >> Cluster data
> >>
> >> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
> >>
> >> Whenever I am trying to run Kmeans clustering having XML file as an input
> >> I am getting following error
> >>
> >> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
> >> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
> >> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
> >> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
> >> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
> >> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> >>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> >>     at java.util.ArrayList.get(ArrayList.java:322)
> >>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
> >>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
> >>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>
> >> Am I not suppose to use XML file as an input?
> >>
> >> Regards,
> >> Divya
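
P.S. A rough sketch of option 1 (one file per <page> element), in case it is useful. It assumes Python 3 and that each <page> is what you want to treat as a document. The function name split_pages, the page-NNNNNN.xml file naming, and the paths in the usage note are made up for illustration; none of this comes from Mahout itself.

```python
import os
import xml.etree.ElementTree as ET


def split_pages(xml_path, out_dir, tag="page"):
    """Write each <page> element of xml_path to its own file in out_dir.

    Hypothetical helper, not part of Mahout. Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the input instead of loading the whole dump at once.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        # Wikipedia dumps declare an XML namespace, so strip any
        # "{namespace}" prefix before comparing tag names.
        if elem.tag.rsplit("}", 1)[-1] != tag:
            continue
        out_path = os.path.join(out_dir, "page-%06d.xml" % count)
        with open(out_path, "wb") as out:
            out.write(ET.tostring(elem, encoding="utf-8"))
        count += 1
        elem.clear()  # drop the element's children once they are on disk
    return count


if __name__ == "__main__":
    # Illustrative paths only; substitute your own.
    written = split_pages("enwiki-chunk.xml", "wikipedia-pages")
    print("wrote %d page files" % written)
```

Pointing seqdirectory's -i at the output directory should then yield one document per page rather than one document in total, so a -k of 10 has enough documents to pick its random seeds from, provided the file contains at least 10 pages.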
