Hi Matt,

I have an XML input file like the Wikipedia XML dump, and I am trying to find similar documents using k-means clustering.
But if I pass the whole XML file (size 64 MB) as input during k-means clustering, I get an error. According to your short answer, if I have thousands of documents in one XML file, I should split that file into thousands of chunks. Is there any other way I can get similar documents?

Regards,
Divya

From: Matt Spitz [mailto:[email protected]]
Sent: Thursday, November 04, 2010 8:46 PM
To: Divya
Cc: [email protected]
Subject: Re: RE: Kmeans Clustering error with XML input

Divya-

A document is what the clustering algorithm operates on. It finds similarities among the documents and places similar documents into clusters. The 'seqdirectory' command expects you to have a single document in every file in the input directory.

What do you expect to happen with your Wikipedia clustering? What are you trying to do?

Short answer: yes, split the XML file by the <page> tags, putting each <page> element in its own separate file.

-Matt

On Wed, Nov 3, 2010 at 10:26 PM, Divya <[email protected]> wrote:

Hi Matt,

I have split my file into 10 chunks of 10 MB each. I am still getting the error. Do you mean that I should split the XML file by document (in the Wikipedia example, by <page> </page>)? I didn't understand what "one file = one document" meant.
Regards,
Divya

$ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/Kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -c D:/MahoutResult/wikipedia/Kmeans -k 10 -x 20 -ow -cl
Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
10/11/04 10:21:21 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/Kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/Kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
10/11/04 10:21:22 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/Kmeans
10/11/04 10:21:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10/11/04 10:21:22 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 5, Size: 5
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

-----Original Message-----
From: Matt Spitz [mailto:[email protected]]
Sent: Thursday, November 04, 2010 9:44 AM
To: [email protected]
Cc: [email protected]
Subject: Re: RE: Kmeans Clustering error with XML input

Yes. One file = one document. Break the file into meaningful documents, one per file, and you should be golden. The algorithm will then cluster these documents.

---
Sent while mobile. Please forgive brevity and typos.
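
For anyone following the "one file = one document" advice, here is a rough sketch of the split in Python. It is not part of Mahout; the function names `split_pages` and `local_name` and the `page-NNNNNN.xml` naming scheme are made up for illustration, and it assumes the dump wraps each article in a <page> element (possibly inside an XML namespace).

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir):
    """Write each <page> element of xml_path to its own file in out_dir.

    Hypothetical helper: returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse reads the file incrementally instead of loading it in one go.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        out_path = os.path.join(out_dir, "page-%06d.xml" % count)
        ET.ElementTree(elem).write(out_path, encoding="utf-8")
        elem.clear()  # drop the text and children of the page just written
        count += 1
    return count
```

Calling `split_pages` with the XML chunk and an empty output directory should leave one file per page there, and that directory can then be given to seqdirectory as its -i argument. The files still contain the XML markup; pulling out only the article text would be a further step that this sketch does not attempt.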
On Nov 3, 2010 9:37 PM, "Divya" <[email protected]> wrote:
> Hi,
>
> My XML input file is just 64 MB, i.e. I am using one of the chunks of the
> Wikipedia example.
> Do I still need to break up this XML to get rid of the error below?
>
> Thanks in advance
> Regards,
> Divya
>
> -----Original Message-----
> From: Matt Spitz [mailto:[email protected]]
> Sent: Wednesday, November 03, 2010 8:54 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Kmeans Clustering error with XML input
>
> Divya-
>
> Are you using just one input file? As far as I understand, seqdirectory
> creates one document per file in your input directory. When you try to
> cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> when generating the random input clusters. Which is just as well, because
> your output won't be very interesting, anyway.
>
> Break the XML into at least 10 documents, and you should have better luck.
>
> -Matt
>
> On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:
>
>> Hi,
>>
>> Steps I am following for K Means clustering:
>>
>> I am using one of the chunks of Wikipedia as input.
>>
>> Convert XML into sequence format
>>
>> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
>>
>> Convert Sequence format to Vector format
>>
>> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
>>
>> Cluster data
>>
>> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
>>
>> Whenever I try to run Kmeans clustering with an XML file as input,
>> I get the following error:
>>
>> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
>> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
>> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors, --maxIter=20, --method=mapreduce, --numClusters=10, --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null, --startPhase=0, --tempDir=temp}
>> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
>> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>         at java.util.ArrayList.get(ArrayList.java:322)
>>         at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>>         at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> Am I not supposed to use an XML file as input?
>>
>> Regards,
>> Divya
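
To make the failure mode in this thread concrete: the two traces report `Size: 1` and `Size: 5`, which appears to be the number of input vectors available, and both runs ask for `-k 10`. The toy sketch below is not Mahout's RandomSeedGenerator; `pick_seeds` is a hypothetical stand-in that only illustrates why choosing k initial clusters cannot work when there are fewer than k documents.

```python
import random


def pick_seeds(documents, k, seed=0):
    """Pick k distinct documents to act as initial cluster centres.

    Hypothetical illustration only; raises ValueError when k exceeds the
    number of documents instead of running past the end of the list.
    """
    if k > len(documents):
        raise ValueError(
            "cannot pick %d seeds from %d document(s)" % (k, len(documents))
        )
    # A fixed seed keeps the choice repeatable between runs.
    return random.Random(seed).sample(documents, k)
```

With one document per <page> there should normally be far more documents than clusters, so a check like this would pass and the remaining question becomes which value of k gives useful groupings.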
