Hello, I want to cluster documents. Lucene looks like a great help for the preprocessing: 'StandardAnalyzer' to remove stop words, do stemming, etc.
Mahout requires input in its vector format, and it provides the following tool for that, which I have used: org.apache.mahout.utils.vectors.lucene.Driver (lucene.vector: Generate Vectors from a Lucene index). However, running

hadoop jar mahout-examples-0.7-job.jar org.apache.mahout.utils.vectors.lucene.Driver --dir /mnt/news/index --field text --dictOut /mnt/news/dict.txt --output /mnt/news/out.txt

gives me the following error:

Warning: $HADOOP_HOME is deprecated.
13/04/01 15:55:06 INFO lucene.Driver: Output File: /mnt/news/out.txt
13/04/01 15:55:07 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/04/01 15:55:07 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/04/01 15:55:07 INFO compress.CodecPool: Got brand-new compressor
13/04/01 15:55:07 ERROR lucene.LuceneIterator: There are too many documents that do not have a term vector for text

I then tried the approach described under "Converting existing vectors to Mahout's format" at https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html, which says "Probably the easiest way to go would be to implement your own Iterable<Vector> (called VectorIterable in the example below)" and shows:

VectorWriter vectorWriter = SequenceFile.createWriter(filesystem, configuration, outfile, LongWritable.class, SparseVector.class);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);

However, I am not able to locate the class SparseVector in the Mahout/Hadoop jars (mahout-examples-0.7-job.jar, hadoop-core-1.0.4.jar). Could you please let me know how to implement the Iterable<Vector> mentioned above?

Thanks,
Rajesh
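P.S. For what it is worth, here is a rough sketch of the kind of Iterable<Vector> implementation I have in mind. I am assuming that in Mahout 0.7 the old SparseVector from the wiki page has become RandomAccessSparseVector (in org.apache.mahout.math) and that vectors go into the sequence file wrapped in VectorWritable; the class name DocVectorIterable and the toy term-id/weight inputs are just my own placeholders, so please correct me if any of this is wrong:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch of an Iterable<Vector>: turns each document (already tokenized
// into term ids and weights) into a sparse Mahout vector.
public class DocVectorIterable implements Iterable<Vector> {

  private final List<int[]> termIds;     // term ids per document
  private final List<double[]> weights;  // matching term weights (e.g. TF)
  private final int cardinality;         // vocabulary size

  public DocVectorIterable(List<int[]> termIds, List<double[]> weights, int cardinality) {
    this.termIds = termIds;
    this.weights = weights;
    this.cardinality = cardinality;
  }

  @Override
  public Iterator<Vector> iterator() {
    final Iterator<int[]> ids = termIds.iterator();
    final Iterator<double[]> wts = weights.iterator();
    return new Iterator<Vector>() {
      @Override public boolean hasNext() { return ids.hasNext(); }
      @Override public Vector next() {
        int[] id = ids.next();
        double[] w = wts.next();
        Vector v = new RandomAccessSparseVector(cardinality);
        for (int i = 0; i < id.length; i++) {
          v.setQuick(id[i], w[i]);   // set weight of term id[i] in this doc
        }
        return v;
      }
      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }

  // Write the vectors as <LongWritable, VectorWritable> pairs, which is
  // (as far as I understand) the sequence-file layout the clustering jobs expect.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/mnt/news/out.seq");

    // Two toy documents over a 5-term vocabulary, just to exercise the code.
    Iterable<Vector> docs = new DocVectorIterable(
        Arrays.asList(new int[] {0, 3}, new int[] {1, 2}),
        Arrays.asList(new double[] {1.0, 2.0}, new double[] {1.0, 1.0}),
        5);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, LongWritable.class, VectorWritable.class);
    try {
      long key = 0;
      for (Vector v : docs) {
        writer.append(new LongWritable(key++), new VectorWritable(v));
      }
    } finally {
      writer.close();
    }
  }
}
```

Does writing <LongWritable, VectorWritable> pairs this way produce a file that kmeans/canopy will accept?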
