The -p option should point to the clusteredPoints directory as you have done,
but unless you add a -cl argument to the Dirichlet job you won't actually
produce any clustered points. Many people miss this and are confused by the
missing clustered output. Perhaps the default should be to cluster points,
but it can be time-consuming, and often people are interested only in the
models themselves. Since you did not specify a model, you got the default
GaussianClusterDistribution, with a prior sampled from a Gaussian
distribution with stddev = 1.

I'm still not sure what's happening with your cluster n values. The
DisplayDirichlet example uses a small, randomly generated dataset and
produces nonzero n values. If you run it, be sure to set the runClusterer
boolean to true; otherwise you will get the new, experimental
ClusterClassifier version. Both implementations produce nonzero n values.
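As a toy illustration of the invariant discussed in this thread (every input vector is assigned to exactly one cluster each iteration, so the n= values must sum to the number of input vectors), here is a stand-alone Chinese-restaurant-process sketch in plain Java. It has no Mahout dependency; the class name and seed are purely illustrative, and alpha = 1.0 matches the job's default shown later in the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy Chinese-restaurant-process assignment. Every point joins exactly one
// cluster, so the per-cluster counts (the "n=" values clusterdump prints)
// always sum to the number of input points.
public class CrpSketch {
    public static int[] assign(int numPoints, double alpha, long seed) {
        Random rng = new Random(seed);
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < numPoints; i++) {
            // probability of opening a new cluster: alpha / (i + alpha)
            if (counts.isEmpty() || rng.nextDouble() < alpha / (i + alpha)) {
                counts.add(1);
            } else {
                // otherwise join an existing cluster, proportional to its size
                double r = rng.nextDouble() * i;
                double cum = 0;
                for (int j = 0; j < counts.size(); j++) {
                    cum += counts.get(j);
                    if (r < cum) {
                        counts.set(j, counts.get(j) + 1);
                        break;
                    }
                }
            }
        }
        int[] out = new int[counts.size()];
        for (int j = 0; j < out.length; j++) out[j] = counts.get(j);
        return out;
    }

    public static void main(String[] args) {
        int[] n = assign(600, 1.0, 42L);
        int total = 0;
        for (int c : n) total += c;
        // counts always sum to the number of points; with alpha = 1.0 a few
        // clusters typically dominate while the rest stay small
        System.out.println("clusters=" + n.length + " total=" + total);
    }
}
```

With 600 points, as in your dataset, the counts sum to 600 every run; what varies with the prior is how those 600 are spread across clusters, which is why n=0 for nearly every cluster suggests no vectors were read rather than an unlucky prior.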

-----Original Message-----
From: Keith Thompson [mailto:[email protected]] 
Sent: Wednesday, May 11, 2011 7:09 PM
To: [email protected]
Subject: Re: Dirichlet Clustering

Yes, I used k = 6 with x = 10 iterations.  The data has n=600 records. The
basic steps I used were these (if I recall correctly):

I created directories testdata, seqFileOutput, vectorOutput, dirichletOutput,
and clusteredPoints in HDFS, then ran:

trunk/bin/mahout seqdirectory -i testdata/control_input_data -o seqFileOutput
trunk/bin/mahout seq2sparse -i seqFileOutput/chunk-0 -o vectorOutput
trunk/bin/mahout dirichlet -i vectorOutput/part-00000 -o dirichletOutput
trunk/bin/mahout clusterdump -s dirichletOutput/clusters-10 -p clusteredPoints -o /home/k_thomp/Documents/dirichletCluster.txt

I was a little confused about what the -p option was supposed to do.

My output in the dirichletCluster.txt file was something like:

C-0: GC:0{n=1 c=[] r=[0.699, -0.290 ...

C-1: GC:1{n=0 c=[0.533, -0 ...

C-2: GC:2{n=0 c=[0.781, 0.395, ...

up to ...

C-5: GC:5{n=0 c=[0.010, -0.093

On Wed, May 11, 2011 at 9:06 PM, Jeff Eastman <[email protected]> wrote:

> The only clusters-n directory which should ever have n=0 would be the prior
> directory clusters-0. More iterations won't help; it kind of looks like
> there are no input vectors being read. What's your k value (6?)? How many
> input vectors do you have? During each iteration, all the input vectors are
> assigned to one of the clusters, so the n= values should sum to the number
> of input vectors in each iteration (clusters-i). Depending upon your prior
> and the Model chosen, it's more likely that you'd get all the points in the
> same cluster (n = #vectors) than n=0. With a bit more about your problem, I
> could help more.
>
> -----Original Message-----
> From: Keith Thompson [mailto:[email protected]]
> Sent: Wednesday, May 11, 2011 5:56 PM
> To: [email protected]
> Subject: Re: Dirichlet Clustering
>
> Thank you for your help.  I was able to run seq2sparse on my SequenceFile
> and then run dirichlet on the resulting output. I guess I just need to
> spend a lot more time reading through the help docs and API.  I then used
> the clusterdumper to view the output.  Unfortunately, I think something
> must have gone wrong somewhere along the way, because the 6 cluster centers
> are shown but n = 0 for all but one of them and n = 1 for the other. Maybe
> it's just because I only ran 10 iterations to see whether I had it working.
> Perhaps with more iterations, I would get better output.
>
>
>
> On Wed, May 11, 2011 at 4:37 PM, Dhruv <[email protected]> wrote:
>
> > The clustering classes require the input data in the form of
> > VectorWritable.
> >
> > You can convert the data to a SequenceFile, and then to the
> VectorWritable
> > format using the SparseVectorsFromSequenceFiles class:
> >
> >
> >
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles
> >
> > All this can be a little confusing, so here are some pointers:
> >
> > Writable is the interface that Hadoop (and Mahout) classes use for
> > serialization.
> >
> >
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html
> >
> > SequenceFile is a Hadoop file format that stores binary key/value pairs,
> > optionally compressed.
> >
> >
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
> >
> > VectorWritable is a Mahout class that wraps a Vector in a Writable
> > (serializable) format.
> >
> >
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
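To make the Writable idea above concrete, here is a stand-alone sketch of the pattern in plain Java. It deliberately does not depend on Hadoop, and the class name is made up; but the write(DataOutput) / readFields(DataInput) shape, plus the required no-arg constructor, mirrors what the Writable contract (and hence VectorWritable) requires:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// A dense vector that serializes itself in the Writable style: write() emits
// the fields to a DataOutput, readFields() rebuilds them from a DataInput.
public class DenseVectorSketch {
    private double[] values;

    public DenseVectorSketch() {}                     // no-arg ctor, as Writable requires
    public DenseVectorSketch(double[] v) { values = v; }
    public double[] get() { return values; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(values.length);                  // length prefix, then elements
        for (double d : values) out.writeDouble(d);
    }

    public void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        values = new double[len];
        for (int i = 0; i < len; i++) values[i] = in.readDouble();
    }

    public static void main(String[] args) throws IOException {
        DenseVectorSketch v = new DenseVectorSketch(new double[] {0.5, -0.25, 3.0});
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        v.write(new DataOutputStream(bytes));         // serialize

        DenseVectorSketch copy = new DenseVectorSketch();
        copy.readFields(new DataInputStream(          // round-trip back
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(Arrays.toString(copy.get()));  // [0.5, -0.25, 3.0]
    }
}
```

In a real job the bytes would land inside a SequenceFile as the value of a key/value pair, which is exactly what seq2sparse produces and what the Dirichlet driver expects to read back.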
> >
> >
> >
> > On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]
> > >wrote:
> >
> > > I am trying to run the example at
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
> > > .
> > > Since I am new to both Hadoop and Mahout, my problem is most likely an
> > > inadequate understanding of Hadoop at this point.  I have converted the
> > > input file to a sequence file and am now trying to run the Dirichlet
> > > clustering algorithm.  It seems to want VectorWritable rather than
> > > Text.  How do I make the necessary adjustments?
> > >
> > > k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o
> > > output
> > > -x 10 -k 6
> > > Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> > > No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> > > 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> > > {--alpha=1.0,
> > >
> > >
> >
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > > --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> > > --maxIter=10, --method=mapreduce,
> > >
> > >
> >
> --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> > > --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> > > --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> > > --threshold=0}
> > > Exception in thread "main" java.lang.ClassCastException:
> > > org.apache.hadoop.io.Text cannot be cast to
> > > org.apache.mahout.math.VectorWritable
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
> > >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >        at
> > >
> > >
> >
> org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
> > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >        at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >        at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > >        at
> > >
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >        at
> > > org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >        at
> > org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> > >        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >        at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >        at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >        at java.lang.reflect.Method.invoke(Method.java:597)
> > >        at org.apache.hadoop.util.RunJar.main(RunJar.java:156
> > >
> >
>
