Yes, I used k = 6 with x = 10 iterations. The data has n=600 records. The
basic steps I used were these (if I recall correctly):
I created the directories testdata, seqFileOutput, vectorOutput,
dirichletOutput, and clusteredPoints in HDFS.
trunk/bin/mahout seqdirectory -i testdata/control_input_data -o seqFileOutput
trunk/bin/mahout seq2sparse -i seqFileOutput/chunk-0 -o vectorOutput
trunk/bin/mahout dirichlet -i vectorOutput/part-00000 -o dirichletOutput
trunk/bin/mahout clusterdump -s dirichletOutput/clusters-10 -p clusteredPoints -o /home/k_thomp/Documents/dirichletCluster.txt
I was a little confused about what the -p option was supposed to do.
My output in the dirichletCluster.txt file was something like:
C-0: GC:0{n=1 c=[] r=[0.699, -0.290 ...
C-1: GC:1{n=0 c=[0.533, -0 ...
C-2: GC:2{n=0 c=[0.781, 0.395, ...
up to ...
C-5: GC:5{n=0 c=[0.010, -0.093
On Wed, May 11, 2011 at 9:06 PM, Jeff Eastman <[email protected]> wrote:
> The only clusters-n directory which should ever have n=0 would be the prior
> directory, clusters-0. More iterations won't help; it kind of looks like there
> are no input vectors being read. What's your k value (6?)? How many input
> vectors do you have? During each iteration, all the input vectors are
> assigned to one of the clusters, so the n= values should sum to the
> number of input vectors in each iteration (clusters-i). Depending upon the
> prior and Model chosen, it's more likely that you'd get all the points into
> the same cluster (n=#vectors) than n=0. With a bit more about your problem,
> I could help more.
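That invariant (the per-cluster n= counts summing to the number of input vectors) can be sanity-checked against clusterdump output. The helper below is purely illustrative; the class and method names are hypothetical and not part of Mahout:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: sums the n= counts from clusterdump output lines.
// With healthy input, the sum should equal the total number of input
// vectors (600 in this thread); here it comes out to only 1.
public class ClusterCountCheck {
    private static final Pattern N_FIELD = Pattern.compile("n=(\\d+)");

    public static int sumCounts(String[] dumpLines) {
        int total = 0;
        for (String line : dumpLines) {
            Matcher m = N_FIELD.matcher(line);
            if (m.find()) {
                total += Integer.parseInt(m.group(1));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] lines = {
            "C-0: GC:0{n=1 c=[] r=[0.699, -0.290]}",
            "C-1: GC:1{n=0 c=[0.533, -0.1]}",
        };
        System.out.println(sumCounts(lines)); // prints 1, far short of 600
    }
}
```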
>
> -----Original Message-----
> From: Keith Thompson [mailto:[email protected]]
> Sent: Wednesday, May 11, 2011 5:56 PM
> To: [email protected]
> Subject: Re: Dirichlet Clustering
>
> Thank you for your help. I was able to run seq2sparse on my SequenceFile
> and then run dirichlet on the resulting output. I guess I just need to
> spend a lot more time reading through the help docs and API. I then used
> the clusterdumper to view the output. Unfortunately, I think something
> must have gone wrong somewhere along the way because the 6 cluster
> centers are shown but n = 0 for all but one of them and n = 1 for the
> other. Maybe it's just because I only ran 10 iterations to see if I had
> it working or not. Perhaps with more iterations, I would get better output.
>
>
>
> On Wed, May 11, 2011 at 4:37 PM, Dhruv <[email protected]> wrote:
>
> > The clustering classes require the input data in the form of
> > VectorWritable.
> >
> > You can convert the data to a SequenceFile, and then to the
> > VectorWritable format using the SparseVectorsFromSequenceFiles class:
> >
> > https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles
> >
> > All this can be a little confusing, so here are some pointers:
> >
> > Writable is an interface used by Hadoop (and Mahout) classes for
> > serialization:
> >
> > http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html
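The contract can be sketched with plain java.io. The real org.apache.hadoop.io.Writable declares the same two methods, write(DataOutput) and readFields(DataInput); the SimpleWritable interface and IntPoint class below are stand-ins for illustration:

```java
import java.io.*;

// Local stand-in for org.apache.hadoop.io.Writable: same two methods.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// A toy serializable record: two ints written and read in a fixed order.
class IntPoint implements SimpleWritable {
    int x, y;

    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }
}

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        IntPoint p = new IntPoint();
        p.x = 3;
        p.y = 4;
        // Serialize to bytes, then deserialize into a fresh instance.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        IntPoint q = new IntPoint();
        q.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(q.x + "," + q.y); // prints 3,4
    }
}
```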
> >
> > SequenceFile is a type of input to a Hadoop application which consists
> > of binary key-value pairs, optionally compressed:
> >
> > http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
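The key-value idea can be sketched without Hadoop at all. This is an illustrative format only, not SequenceFile's actual on-disk layout (the real format adds a header, sync markers, and compression metadata):

```java
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

// A toy "sequence file": a record count followed by (key, value) pairs.
public class KeyValueStream {
    static byte[] writePairs(Map<String, String> pairs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(pairs.size()); // record-count header
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        return bytes.toByteArray();
    }

    static Map<String, String> readPairs(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, String> pairs = new LinkedHashMap<>();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            pairs.put(in.readUTF(), in.readUTF());
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("doc-1", "0.699 -0.290");
        // Round-trips through the binary form and back.
        System.out.println(readPairs(writePairs(m)));
    }
}
```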
> >
> > VectorWritable is a class in Mahout which wraps a Vector in a
> > Writable (serializable) format:
> >
> > https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
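The same serialization idea applied to a vector can be sketched as a size header followed by the elements. Note this is an assumed illustrative layout, not VectorWritable's actual wire format:

```java
import java.io.*;
import java.util.Arrays;

// Toy dense-vector serialization: length first, then each double in order.
public class VectorRoundTrip {
    static byte[] writeVector(double[] v) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(v.length); // size header
        for (double d : v) {
            out.writeDouble(d);
        }
        return bytes.toByteArray();
    }

    static double[] readVector(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        double[] v = new double[in.readInt()];
        for (int i = 0; i < v.length; i++) {
            v[i] = in.readDouble();
        }
        return v;
    }

    public static void main(String[] args) throws IOException {
        double[] v = {0.533, -0.290, 0.781};
        double[] w = readVector(writeVector(v));
        System.out.println(Arrays.toString(w)); // same elements back out
    }
}
```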
> >
> >
> >
> > On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]> wrote:
> >
> > > I am trying to run the example at
> > > https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
> > > Since I am new to both Hadoop and Mahout, my problem is most likely an
> > > inadequate understanding of Hadoop at this point. I have converted the
> > > input file to a sequence file and am now trying to run the Dirichlet
> > > clustering algorithm. It seems to want a VectorWritable rather than
> > > Text. How do I make the necessary adjustments?
> > >
> > > k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o output -x 10 -k 6
> > > Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> > > No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> > > 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> > > {--alpha=1.0,
> > > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > > --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> > > --maxIter=10, --method=mapreduce,
> > > --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> > > --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> > > --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> > > --threshold=0}
> > > Exception in thread "main" java.lang.ClassCastException:
> > > org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> > >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
> > >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
> > >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >     at java.lang.reflect.Method.invoke(Method.java:597)
> > >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >     at java.lang.reflect.Method.invoke(Method.java:597)
> > >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > >
> >
>