Hello, I want to cluster documents. Lucene looks like a great help for the preprocessing: 'StandardAnalyzer' to remove stop words, do stemming, etc.
Mahout requires input in its vector format, and it provides the following tool for that, which I have used: org.apache.mahout.utils.vectors.lucene.Driver (lucene.vector: Generate Vectors from a Lucene index). However, running

hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.utils.vectors.lucene.Driver --dir /mnt/news/index --field text --dictOut /mnt/news/dict.txt --output /mnt/news/out.txt

gives me the following error:

Warning: $HADOOP_HOME is deprecated.
13/04/01 15:55:06 INFO lucene.Driver: Output File: /mnt/news/out.txt
13/04/01 15:55:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/01 15:55:07 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 15:55:07 INFO compress.CodecPool: Got brand-new compressor
13/04/01 15:55:07 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for text

I then tried the approach described under "Converting existing vectors to Mahout's format" at https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html, which says "Probably the easiest way to go would be to implement your own Iterable<Vector> (called VectorIterable in the example below)" and shows:

VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);

However, I am not able to locate the class SparseVector in the Mahout/Hadoop jars (mahout-examples-0.7-job.jar, hadoop-core-1.0.4.jar). Could you please let me know how to implement the Iterable<Vector> mentioned above?

Thanks,
Rajesh
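P.S. For what it is worth, here is a rough sketch of the kind of Iterable<Vector> implementation I have in mind. I am assuming that in Mahout 0.7 the old SparseVector from the wiki page has become RandomAccessSparseVector (in org.apache.mahout.math) and that vectors go into the sequence file wrapped in VectorWritable; the class name DocVectorIterable and the toy term-id/weight inputs are just my own placeholders, so please correct me if any of this is wrong:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch of an Iterable<Vector>: turns each document (already tokenized
// into term ids and weights) into a sparse Mahout vector.
public class DocVectorIterable implements Iterable<Vector> {

  private final List<int[]> termIds;     // term ids per document
  private final List<double[]> weights;  // matching term weights (e.g. TF)
  private final int cardinality;         // vocabulary size

  public DocVectorIterable(List<int[]> termIds, List<double[]> weights, int cardinality) {
    this.termIds = termIds;
    this.weights = weights;
    this.cardinality = cardinality;
  }

  @Override
  public Iterator<Vector> iterator() {
    final Iterator<int[]> ids = termIds.iterator();
    final Iterator<double[]> wts = weights.iterator();
    return new Iterator<Vector>() {
      @Override public boolean hasNext() { return ids.hasNext(); }
      @Override public Vector next() {
        int[] id = ids.next();
        double[] w = wts.next();
        Vector v = new RandomAccessSparseVector(cardinality);
        for (int i = 0; i < id.length; i++) {
          v.setQuick(id[i], w[i]);   // set weight of term id[i] in this doc
        }
        return v;
      }
      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }

  // Write the vectors as <LongWritable, VectorWritable> pairs, which is
  // (as far as I understand) the sequence-file layout the clustering jobs expect.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/mnt/news/out.seq");

    // Two toy documents over a 5-term vocabulary, just to exercise the code.
    Iterable<Vector> docs = new DocVectorIterable(
        Arrays.asList(new int[] {0, 3}, new int[] {1, 2}),
        Arrays.asList(new double[] {1.0, 2.0}, new double[] {1.0, 1.0}),
        5);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, VectorWritable.class);
    try {
      long key = 0;
      for (Vector v : docs) {
        writer.append(new LongWritable(key++), new VectorWritable(v));
      }
    } finally {
      writer.close();
    }
  }
}
```

Does writing <LongWritable, VectorWritable> pairs this way produce a file that kmeans/canopy will accept?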
