The clustering classes require the input data in the form of VectorWritable.
You can convert the data to a SequenceFile, and then to the VectorWritable
format, using the SparseVectorsFromSequenceFiles class:
https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles

All of this can be a little confusing, so here are some pointers:

- Writable is an interface that Hadoop (and Mahout) classes implement for
  serialization:
  http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html

- A SequenceFile is a Hadoop input file consisting of binary key/value pairs
  (optionally compressed):
  http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html

- VectorWritable is the Mahout class that wraps a Vector in a Writable
  (serializable) form:
  https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
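To make that concrete, here is a minimal sketch of writing clustering input
in the expected shape: a SequenceFile whose values are VectorWritable. The
output path, the Text keys, and the sample point are all made up for
illustration; the part that matters is the value class.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("vectors/part-00000"); // hypothetical output path

    // The value class must be VectorWritable; this is what the clustering
    // drivers try to cast each value to when they read their input.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, VectorWritable.class);
    try {
      double[] point = {28.7, 34.4, 31.3}; // one made-up data point
      writer.append(new Text("point-0"),
                    new VectorWritable(new DenseVector(point)));
    } finally {
      writer.close();
    }
  }
}

Pointing -i at a file written like this should get you past the
ClassCastException below.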
On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]> wrote:

> I am trying to run the example at
> https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data.
> Since I am new to both Hadoop and Mahout, my problem is most likely an
> inadequate understanding of Hadoop at this point. I have converted the
> input file to a sequence file and am now trying to run the Dirichlet
> clustering algorithm. It seems to want a VectorWritable rather than a
> text. How do I make the necessary adjustments?
>
> k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o output -x 10 -k 6
> Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> {--alpha=1.0,
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> --maxIter=10, --method=mapreduce,
> --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> --threshold=0}
> Exception in thread "main" java.lang.ClassCastException:
> org.apache.hadoop.io.Text cannot be cast to
> org.apache.mahout.math.VectorWritable
>     at org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
>     at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
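A quick way to confirm the diagnosis is to open the file you passed to -i
and print its key and value classes; for a chunk produced straight from text
it should report org.apache.hadoop.io.Text as the value class, which is
exactly what the cast is complaining about. A small sketch, reusing the path
from your command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class InspectSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Open the sequence file the dirichlet job was pointed at.
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("output/chunk-0"), conf);
    try {
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
    } finally {
      reader.close();
    }
  }
}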
