Nice summary. One suggestion: there are not many tools for converting user data in other formats into the format Mahout accepts. This is the first problem many Mahout users run into (I did) when they try data other than the text-document examples. It would be helpful to add "generic" input converters.
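For concreteness: the clustering jobs expect a SequenceFile whose values
are VectorWritable, and right now the only fully "generic" way to produce
one is to write it yourself. Here is a minimal, untested sketch of what
that looks like (paths, class name, and data are all made up):

// Write a few points as a SequenceFile of VectorWritable -- the input
// format the clustering drivers expect. Paths and data are made up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class PointsToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("testdata/points/part-00000"); // made-up path

    // The key type is arbitrary (Text here); clustering reads the values.
    SequenceFile.Writer writer = new SequenceFile.Writer(
        fs, conf, output, Text.class, VectorWritable.class);
    try {
      double[][] points = { {1.0, 1.0}, {8.0, 8.0}, {9.0, 8.0} }; // made-up data
      for (int i = 0; i < points.length; i++) {
        writer.append(new Text(String.valueOf(i)),
                      new VectorWritable(new DenseVector(points[i])));
      }
    } finally {
      writer.close();
    }
  }
}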
Seq2sparse is designed for one specific problem and does much more than just convert the inputs (the seq2sparse tool calls SparseVectorsFromSequenceFiles). The only other conversion tool I have seen is the org.apache.mahout.clustering.conversion.InputDriver class, used in the synthetic control data example:
https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/clustering/conversion/InputDriver.html

It works with InputMapper.java, which specifies how to read the input data; for other formats one has to write one's own Mapper class (a rough sketch is at the bottom of this mail). I think it would be helpful for beginners to have some basic conversion tools in the Mahout library.

Thanks,
Konstantin

On Wed, May 11, 2011 at 1:37 PM, Dhruv <[email protected]> wrote:

> The clustering classes require the input data in the form of
> VectorWritable.
>
> You can convert the data to a SequenceFile, and then to the VectorWritable
> format using the SparseVectorsFromSequenceFiles class:
>
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles
>
> All this can be a little confusing, so here are some pointers:
>
> Writable is an interface used by Hadoop (and Mahout) classes for
> serialization.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html
>
> SequenceFile is a type of input to a Hadoop application which consists of
> compressed binary key-value pairs.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
>
> VectorWritable is a Mahout class which wraps a Vector in a Writable
> (serializable) form.
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
>
> On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]> wrote:
>
> > I am trying to run the example at
> > https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
> > Since I am new to both Hadoop and Mahout, my problem is most likely an
> > inadequate understanding of Hadoop at this point. I have converted the
> > input file to a sequence file and am now trying to run the Dirichlet
> > clustering algorithm. It seems to want a VectorWritable rather than a
> > Text. How do I make the necessary adjustments?
> >
> > k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o output -x 10 -k 6
> > Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> > No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> > 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> > {--alpha=1.0,
> > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> > --maxIter=10, --method=mapreduce,
> > --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> > --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> > --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> > --threshold=0}
> > Exception in thread "main" java.lang.ClassCastException:
> > org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >
> > --
> > ksh
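P.S. To make the "write your own Mapper" point above concrete, here is a
rough, untested sketch of a conversion Mapper in the spirit of
org.apache.mahout.clustering.conversion.InputMapper. The class name is
mine, and it assumes one whitespace- or comma-delimited numeric point per
input line:

// Convert text lines of delimited numbers into VectorWritable values.
// Rough sketch only; Mahout's own InputMapper is the reference.
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class TextToVectorMapper
    extends Mapper<LongWritable, Text, Text, VectorWritable> {

  private static final Pattern DELIMITER = Pattern.compile("[,\\s]+");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = DELIMITER.split(line.toString().trim());
    double[] values = new double[fields.length];
    for (int i = 0; i < fields.length; i++) {
      values[i] = Double.parseDouble(fields[i]);
    }
    // The key is just the line's byte offset; clustering reads the value.
    context.write(new Text(offset.toString()),
                  new VectorWritable(new DenseVector(values)));
  }
}

Plug a Mapper like this into a simple driver job whose output format is
SequenceFileOutputFormat with VectorWritable values, and any line-oriented
numeric data becomes clusterable input.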
