Nice summary. One suggestion: there are not many tools for converting user data in other formats into the format Mahout accepts. This is the first problem many Mahout users run into (I did) when they try data other than the text-document examples. It would be helpful to add "generic" input converters.
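For concreteness: the clustering jobs expect a SequenceFile whose values
are VectorWritable, and right now the only fully "generic" way to produce
one is to write it yourself. Here is a minimal, untested sketch of what
that looks like (paths, class name, and data are all made up):

// Write a few points as a SequenceFile of VectorWritable -- the input
// format the clustering drivers expect. Paths and data are made up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class PointsToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("testdata/points/part-00000"); // made-up path

    // The key type is arbitrary (Text here); clustering reads the values.
    SequenceFile.Writer writer = new SequenceFile.Writer(
        fs, conf, output, Text.class, VectorWritable.class);
    try {
      double[][] points = { {1.0, 1.0}, {8.0, 8.0}, {9.0, 8.0} }; // made-up data
      for (int i = 0; i < points.length; i++) {
        writer.append(new Text(String.valueOf(i)),
                      new VectorWritable(new DenseVector(points[i])));
      }
    } finally {
      writer.close();
    }
  }
}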
Seq2sparse is designed for one specific problem and does much more than just convert the inputs (the seq2sparse tool calls SparseVectorsFromSequenceFiles). The only other conversion tool I have seen is the org.apache.mahout.clustering.conversion.InputDriver class, used in the synthetic control data example:
https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/clustering/conversion/InputDriver.html

It works with InputMapper.java, which specifies how to read the input data; for other formats one has to write one's own Mapper class (a rough sketch is at the bottom of this mail). I think it would be helpful for beginners to have some basic conversion tools in the Mahout library.

Thanks,
Konstantin

On Wed, May 11, 2011 at 1:37 PM, Dhruv <[email protected]> wrote:

> The clustering classes require the input data in the form of
> VectorWritable.
>
> You can convert the data to a SequenceFile, and then to the VectorWritable
> format using the SparseVectorsFromSequenceFiles class:
>
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.html#SparseVectorsFromSequenceFiles
>
> All this can be a little confusing, so here are some pointers:
>
> Writable is an interface used by Hadoop (and Mahout) classes for
> serialization.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/Writable.html
>
> SequenceFile is a type of input to a Hadoop application which consists of
> compressed binary key-value pairs.
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/io/SequenceFile.html
>
> VectorWritable is a Mahout class which wraps a Vector in a Writable
> (serializable) form.
> https://builds.apache.org/hudson/job/Mahout-Quality/clover/org/apache/mahout/math/VectorWritable.html#VectorWritable
>
> On Wed, May 11, 2011 at 2:59 PM, Keith Thompson <[email protected]> wrote:
>
> > I am trying to run the example at
> > https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data
> > Since I am new to both Hadoop and Mahout, my problem is most likely an
> > inadequate understanding of Hadoop at this point. I have converted the
> > input file to a sequence file and am now trying to run the Dirichlet
> > clustering algorithm. It seems to want a VectorWritable rather than a
> > Text. How do I make the necessary adjustments?
> >
> > k_thomp@linux-8awa:~> trunk/bin/mahout dirichlet -i output/chunk-0 -o output -x 10 -k 6
> > Running on hadoop, using HADOOP_HOME=/usr/local/hadoop-0.20.2
> > No HADOOP_CONF_DIR set, using /usr/local/hadoop-0.20.2/src/conf
> > 11/05/10 14:40:01 INFO common.AbstractJob: Command line arguments:
> > {--alpha=1.0,
> > --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> > --emitMostLikely=true, --endPhase=2147483647, --input=output/chunk-0,
> > --maxIter=10, --method=mapreduce,
> > --modelDist=org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution,
> > --modelPrototype=org.apache.mahout.math.RandomAccessSparseVector,
> > --numClusters=6, --output=output, --startPhase=0, --tempDir=temp,
> > --threshold=0}
> > Exception in thread "main" java.lang.ClassCastException:
> > org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.readPrototypeSize(DirichletDriver.java:250)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.run(DirichletDriver.java:112)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.clustering.dirichlet.DirichletDriver.main(DirichletDriver.java:67)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >
> > --
> > ksh
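P.S. To make the "write your own Mapper" point above concrete, here is a
rough, untested sketch of a conversion Mapper in the spirit of
org.apache.mahout.clustering.conversion.InputMapper. The class name is
mine, and it assumes one whitespace- or comma-delimited numeric point per
input line:

// Convert text lines of delimited numbers into VectorWritable values.
// Rough sketch only; Mahout's own InputMapper is the reference.
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class TextToVectorMapper
    extends Mapper<LongWritable, Text, Text, VectorWritable> {

  private static final Pattern DELIMITER = Pattern.compile("[,\\s]+");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = DELIMITER.split(line.toString().trim());
    double[] values = new double[fields.length];
    for (int i = 0; i < fields.length; i++) {
      values[i] = Double.parseDouble(fields[i]);
    }
    // The key is just the line's byte offset; clustering reads the value.
    context.write(new Text(offset.toString()),
                  new VectorWritable(new DenseVector(values)));
  }
}

Plug a Mapper like this into a simple driver job whose output format is
SequenceFileOutputFormat with VectorWritable values, and any line-oriented
numeric data becomes clusterable input.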
