Yes. One file = one document, regardless of file size, so a single 64 MB XML file still counts as just one document. Break it into meaningful documents, one per file, and you should be golden. The algorithm will then cluster these documents.
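A minimal sketch of one way to do the split, assuming the input is a MediaWiki-style dump in which each article is a `<page>` element containing a `<title>` and a `<text>`. The function name `split_pages` and the `doc-NNNNNN.txt` naming are made up for illustration; nothing here is part of Mahout.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def split_pages(xml_path, out_dir, limit=None):
    """Write each <page> to its own text file and return how many were written.

    Assumption: pages look like <page><title>..</title>..<text>..</text></page>.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the input, so the dump is not loaded in one piece.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        # Hypothetical naming scheme: doc-000000.txt, doc-000001.txt, ...
        out_path = os.path.join(out_dir, "doc-%06d.txt" % count)
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(title + "\n\n" + text)
        count += 1
        elem.clear()  # drop the finished page's children to save memory
        if limit is not None and count >= limit:
            break
    return count
```

Then point `seqdirectory -i` at the output directory instead of the single XML file. As long as it holds at least as many documents as the `-k` value, the random seed step has enough vectors to choose from.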
--- Sent while mobile. Please forgive brevity and typos.

On Nov 3, 2010 9:37 PM, "Divya" <[email protected]> wrote:
> Hi,
>
> My XML input file is just 64 MB i.e. I am using one of the chunk of
> Wikipedia example.
> Still I need to break this XML to get rid of the below error?
>
> Thanks in advance
> Regards,
> Divya
>
> -----Original Message-----
> From: Matt Spitz [mailto:[email protected]]
> Sent: Wednesday, November 03, 2010 8:54 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Kmeans Clustering error with XML input
>
> Divya-
>
> Are you using just one input file? As far as I understand, seqdirectory
> creates one document per file in your input directory. When you try to
> cluster 1 document into 10 clusters, you get an IndexOutOfBoundsException
> when generating the random input clusters. Which is just as well, because
> your output won't be very interesting, anyway.
>
> Break the XML into at least 10 documents, and you should have better luck.
>
> -Matt
>
> On Wed, Nov 3, 2010 at 5:44 AM, Divya <[email protected]> wrote:
>
>> Hi,
>>
>> Steps I am following for K Means clustering :
>>
>> I am using one of the chunk of Wikipedia as an input
>>
>> Convert XML into sequence format
>>
>> $ bin/mahout seqdirectory -i D:/MahoutResult/wikipedia/input -o D:/MahoutResult/wikipedia/sequencefiles -chunk 30 -c UTF-8
>>
>> Convert Sequence format to Vector format
>>
>> $ bin/mahout seqdirectory -i D:/Downloads/Mahout/j-mahout/apache-mahout-examples/wikipedia/enwiki-20070527-pages-articles1.xml -o D:/MahoutResult/wikipedia/sequencefiles -chunk 100 -c UTF-8
>>
>> Cluster data
>>
>> $ bin/mahout kmeans -i D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors -o D:/MahoutResult/wikipedia/kmeans -c D:/MahoutResult/wikipedia/kmeans -k 10 -x 20 -ow -cl
>>
>> Whenever I am trying to run Kmeans clustering having XML file as an input
>> I am getting following error
>>
>> Running on hadoop, using HADOOP_HOME=C:\cygwin\home\Divya\hadoop-0.20.2
>> HADOOP_CONF_DIR=C:\cygwin\home\Divya\hadoop-0.20.2\conf
>> 10/11/03 17:35:53 INFO common.AbstractJob: Command line arguments:
>>   {--clustering=null, --clusters=D:/MahoutResult/wikipedia/kmeans,
>>   --convergenceDelta=0.5,
>>   --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>   --endPhase=2147483647,
>>   --input=D:/MahoutResult/wikipedia/seq2sparse/tfidf-vectors,
>>   --maxIter=20, --method=mapreduce, --numClusters=10,
>>   --output=D:/MahoutResult/wikipedia/kmeans, --overwrite=null,
>>   --startPhase=0, --tempDir=temp}
>> 10/11/03 17:35:55 INFO common.HadoopUtil: Deleting D:/MahoutResult/wikipedia/kmeans
>> 10/11/03 17:35:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 10/11/03 17:35:56 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>>     at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>     at java.util.ArrayList.get(ArrayList.java:322)
>>     at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
>>     at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:96)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>> Am I not suppose to use XML file as an input?
>>
>> Regards,
>> Divya
