The clustering classes require the input data in the form of VectorWritable.
You can convert the data to a SequenceFile, and then to the VectorWritable
format, using the SparseVectorsFromSequenceFiles class:
https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles

All of this can be a little confusing, so here are some pointers:

- Writable is an interface that Hadoop (and Mahout) classes implement for
  serialization:
  http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html

- A SequenceFile is a Hadoop input file consisting of binary key/value pairs
  (optionally compressed):
  http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html

- VectorWritable is the Mahout class that wraps a Vector in a Writable
  (serializable) form:
  https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
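To make that concrete, here is a minimal sketch of writing clustering input
in the expected shape: a SequenceFile whose values are VectorWritable. The
output path, the Text keys, and the sample point are all made up for
illustration; the part that matters is the value class.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("vectors/part-00000"); // hypothetical output path

    // The value class must be VectorWritable; this is what the clustering
    // drivers try to cast each value to when they read their input.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, VectorWritable.class);
    try {
      double[] point = {28.7, 34.4, 31.3}; // one made-up data point
      writer.append(new Text("point-0"),
                    new VectorWritable(new DenseVector(point)));
    } finally {
      writer.close();
    }
  }
}

Pointing -i at a file written like this should get you past the
ClassCastException below.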
On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]> wrote:

> I am trying to run the example at
> https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data.
> Since I am new to both Hadoop and Mahout, my problem is most likely an
> inadequate understanding of Hadoop at this point. I have converted the
> input file to a sequence file and am now trying to run the Dirichlet
> clustering algorithm. It seems to want a VectorWritable rather than a
> text. How do I make the necessary adjustments?
>
> k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o output -x 10 -k 6
> Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> {--alpha=1.0,
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> --maxIter=10, --method=mapreduce,
> --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> --threshold=0}
> Exception in thread "main" java.lang.ClassCastException:
> org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.math.VectorWritable
>     at org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
>     at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
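A quick way to confirm the diagnosis is to open the file you passed to -i
and print its key and value classes; for a chunk produced straight from text
it should report org.apache.hadoop.io.Text as the value class, which is
exactly what the cast is complaining about. A small sketch, reusing the path
from your command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class InspectSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Open the sequence file the dirichlet job was pointed at.
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("output/chunk-0"), conf);
    try {
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
    } finally {
      reader.close();
    }
  }
}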
