The only clusters-i directory whose models should ever show n=0 is clusters-0, which just holds the prior models. More iterations won't help; it looks like no input vectors are being read at all. What is your k value (6?), and how many input vectors do you have? During each iteration, every input vector gets assigned to one of the clusters, so within each clusters-i directory the n= values should sum to the total number of input vectors. Depending upon the prior and model distribution you chose, it's actually more likely that you'd end up with all the points in a single cluster (n = #vectors) than with n = 0. With a bit more detail about your problem, I could help more.
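As a quick sanity check, you can open the SequenceFile you are passing to dirichlet and confirm that its value class really is VectorWritable and that it actually contains records. This is a rough, untested sketch; the class name is made up, and the path argument should point at whatever file your seq2sparse run produced (something like output/tfidf-vectors/part-r-00000, but check your own output directory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Prints the key/value classes of a SequenceFile and counts its records.
// Usage: CountInputVectors <path-to-vector-seqfile>
public class CountInputVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);  // placeholder; point it at your vector file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    // For clustering input, the value class must be
    // org.apache.mahout.math.VectorWritable, not Text.
    System.out.println("key class:   " + reader.getKeyClassName());
    System.out.println("value class: " + reader.getValueClassName());
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    int count = 0;
    while (reader.next(key, value)) {
      count++;
    }
    reader.close();
    System.out.println(count + " input vectors");
  }
}

If the value class comes back as org.apache.hadoop.io.Text, or the record count is 0, the clustering job has no vectors to assign, which would explain the empty (n=0) models.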
-----Original Message-----
From: Keith Thompson [mailto:[email protected]]
Sent: Wednesday, May 11, 2011 5:56 PM
To: [email protected]
Subject: Re: Dirichlet Clustering

Thank you for your help. I was able to run seq2sparse on my SequenceFile and
then run dirichlet on the resulting output. I guess I just need to spend a lot
more time reading through the help docs and API.

I then used the clusterdumper to view the output. Unfortunately, I think
something must have gone wrong somewhere along the way because the 6 cluster
centers are shown but n = 0 for all but one of them and n = 1 for the other.
Maybe it's just because I only ran 10 iterations to see if I had it working or
not. Perhaps with more iterations, I would get better output.

On Wed, May 11, 2011 at 4:37 PM, Dhruv <[email protected]> wrote:
> The clustering classes require the input data in the form of
> VectorWritable.
>
> You can convert the data to a SequenceFile, and then to the VectorWritable
> format using the SparseVectorsFromSequenceFiles class:
>
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles
>
> All this can be a little confusing, so here are some pointers:
>
> Writable is an interface used by Hadoop and (Mahout) classes for
> serialization.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html
>
> SequenceFile is a type of input to a Hadoop application which consists of
> compressed binary key value pairs.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
>
> VectorWritable is a class in Mahout which implements Vectors in space in a
> Writable (serializable) format.
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
>
> On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]> wrote:
> > I am trying to run the example at
> > https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
> > .
> > Since I am new to both Hadoop and Mahout, my problem is most likely an
> > inadequate understanding of Hadoop at this point. I have converted the
> > input file to a sequence file and am now trying to run the Dirichlet
> > clustering algorithm. It seems to want a VectorWritable rather than a
> > text. How do I make the necessary adjustments?
> > k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o output -x 10 -k 6
> > Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> > No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> > 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> > {--alpha=1.0,
> > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> > --maxIter=10, --method=mapreduce,
> > --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> > --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> > --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> > --threshold=0}
> > Exception in thread "main" java.lang.ClassCastException:
> > org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
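Regarding the ClassCastException in the quoted transcript above: output/chunk-0 is a SequenceFile whose values are Text, and DirichletDriver expects the values to already be VectorWritable, which is why the cast fails. For text documents, seq2sparse (SparseVectorsFromSequenceFiles) is the right conversion, as Dhruv said. For plain numeric data like the synthetic control set, the conversion is roughly the sketch below; I haven't run this exact code, and the class name and paths are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch: turn lines of whitespace-separated numbers (e.g. synthetic_control.data)
// into a SequenceFile of <Text, VectorWritable>, which is the input format the
// clustering drivers expect.
public class TextToVectorWritable {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);   // e.g. input-vectors/part-00000 (placeholder)
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    int i = 0;
    while ((line = in.readLine()) != null) {
      if (line.trim().isEmpty()) {
        continue;                   // skip blank lines
      }
      String[] tokens = line.trim().split("\\s+");
      double[] values = new double[tokens.length];
      for (int j = 0; j < tokens.length; j++) {
        values[j] = Double.parseDouble(tokens[j]);
      }
      Vector v = new DenseVector(values);
      writer.append(new Text("v" + i++), new VectorWritable(v));
    }
    in.close();
    writer.close();
  }
}

After that, point dirichlet's -i at the directory holding the converted file instead of output/chunk-0, and the Text-to-VectorWritable cast error should go away.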
