The only clusters-i directory whose models should ever show n=0 is clusters-0, which just holds the prior models. More iterations won't help; it looks like no input vectors are being read at all. What is your k value (6?), and how many input vectors do you have? During each iteration, every input vector gets assigned to one of the clusters, so within each clusters-i directory the n= values should sum to the total number of input vectors. Depending upon the prior and model distribution you chose, it's actually more likely that you'd end up with all the points in a single cluster (n = #vectors) than with n = 0. With a bit more detail about your problem, I could help more.
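As a quick sanity check, you can open the SequenceFile you are passing to dirichlet and confirm that its value class really is VectorWritable and that it actually contains records. This is a rough, untested sketch; the class name is made up, and the path argument should point at whatever file your seq2sparse run produced (something like output/tfidf-vectors/part-r-00000, but check your own output directory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Prints the key/value classes of a SequenceFile and counts its records.
// Usage: CountInputVectors <path-to-vector-seqfile>
public class CountInputVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);  // placeholder; point it at your vector file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    // For clustering input, the value class must be
    // org.apache.mahout.math.VectorWritable, not Text.
    System.out.println("key class:   " + reader.getKeyClassName());
    System.out.println("value class: " + reader.getValueClassName());
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    int count = 0;
    while (reader.next(key, value)) {
      count++;
    }
    reader.close();
    System.out.println(count + " input vectors");
  }
}

If the value class comes back as org.apache.hadoop.io.Text, or the record count is 0, the clustering job has no vectors to assign, which would explain the empty (n=0) models.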
-----Original Message-----
From: Keith Thompson [mailto:[email protected]]
Sent: Wednesday, May 11, 2011 5:56 PM
To: [email protected]
Subject: Re: Dirichlet Clustering

Thank you for your help. I was able to run seq2sparse on my SequenceFile and
then run dirichlet on the resulting output. I guess I just need to spend a lot
more time reading through the help docs and API.

I then used the clusterdumper to view the output. Unfortunately, I think
something must have gone wrong somewhere along the way because the 6 cluster
centers are shown but n = 0 for all but one of them and n = 1 for the other.
Maybe it's just because I only ran 10 iterations to see if I had it working or
not. Perhaps with more iterations, I would get better output.

On Wed, May 11, 2011 at 4:37 PM, Dhruv <[email protected]> wrote:
> The clustering classes require the input data in the form of
> VectorWritable.
>
> You can convert the data to a SequenceFile, and then to the VectorWritable
> format using the SparseVectorsFromSequenceFiles class:
>
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles
>
> All this can be a little confusing, so here are some pointers:
>
> Writable is an interface used by Hadoop and (Mahout) classes for
> serialization.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html
>
> SequenceFile is a type of input to a Hadoop application which consists of
> compressed binary key value pairs.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
>
> VectorWritable is a class in Mahout which implements Vectors in space in a
> Writable (serializable) format.
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
>
> On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]> wrote:
> > I am trying to run the example at
> > https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
> > .
> > Since I am new to both Hadoop and Mahout, my problem is most likely an
> > inadequate understanding of Hadoop at this point. I have converted the
> > input file to a sequence file and am now trying to run the Dirichlet
> > clustering algorithm. It seems to want a VectorWritable rather than a
> > text. How do I make the necessary adjustments?
> > k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o output -x 10 -k 6
> > Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> > No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> > 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> > {--alpha=1.0,
> > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> > --maxIter=10, --method=mapreduce,
> > --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> > --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> > --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> > --threshold=0}
> > Exception in thread "main" java.lang.ClassCastException:
> > org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
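Regarding the ClassCastException in the quoted transcript above: output/chunk-0 is a SequenceFile whose values are Text, and DirichletDriver expects the values to already be VectorWritable, which is why the cast fails. For text documents, seq2sparse (SparseVectorsFromSequenceFiles) is the right conversion, as Dhruv said. For plain numeric data like the synthetic control set, the conversion is roughly the sketch below; I haven't run this exact code, and the class name and paths are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch: turn lines of whitespace-separated numbers (e.g. synthetic_control.data)
// into a SequenceFile of <Text, VectorWritable>, which is the input format the
// clustering drivers expect.
public class TextToVectorWritable {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);   // e.g. input-vectors/part-00000 (placeholder)
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    int i = 0;
    while ((line = in.readLine()) != null) {
      if (line.trim().isEmpty()) {
        continue;                   // skip blank lines
      }
      String[] tokens = line.trim().split("\\s+");
      double[] values = new double[tokens.length];
      for (int j = 0; j < tokens.length; j++) {
        values[j] = Double.parseDouble(tokens[j]);
      }
      Vector v = new DenseVector(values);
      writer.append(new Text("v" + i++), new VectorWritable(v));
    }
    in.close();
    writer.close();
  }
}

After that, point dirichlet's -i at the directory holding the converted file instead of output/chunk-0, and the Text-to-VectorWritable cast error should go away.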
