Help needed on TF IDF.

Junaid Surve Sun, 08 Jan 2012 16:45:53 -0800

Hi

I got your email address from one of the Mahout forums.


I need some help.

I have about 60 documents for which I am calculating TF-IDF.

The steps I am following:
1. Convert the files into a sequence file using the SequenceFilesFromDirectory run() method.
2. Tokenize the generated sequence file using the DocumentProcessor tokenizeDocuments() method.
3. Create term frequency vectors using the DictionaryVectorizer createTermFrequencyVectors() method.
4. Create the TF-IDF vectors using the TFIDFConverter processTfIdf() method.
5. Create the matrix using code from RowIdJob.
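
For reference, here is a minimal sketch of what steps 2 to 4 compute, written in plain Java rather than against the Mahout API. The class name TfIdfSketch, the method tfIdf, and the weighting tf * ln(N / df) are assumptions made for illustration; Mahout's own weighting may differ in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
public class TfIdfSketch {

    // Returns one term -> weight map per document.
    // Assumed weighting: weight = tf * ln(N / df), where N is the number of
    // documents and df is the number of documents containing the term.
    public static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: count each term once per document.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Made-up, already tokenized documents.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("mahout", "tf", "idf"),
                Arrays.asList("mahout", "vector"),
                Arrays.asList("mahout", "tf"));
        System.out.println(tfIdf(docs));
    }
}
```

With this weighting, a term that occurs in every document gets weight 0, so it contributes nothing when documents are compared.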

What else needs to be done?

*I want to find the similarity between each pair of documents, something like:*
*Doc 1 - Doc 2 is XXX similar*
*Doc 1 - Doc 3 is YYY similar*
*Doc 2 - Doc 3 is ZZZ similar*
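
The pairwise comparison described above is commonly done with cosine similarity over the TF-IDF rows. The following is a minimal sketch in plain Java, not Mahout code: the class PairwiseCosine, its methods, and the sample vectors are all made up for illustration, and it assumes the TF-IDF matrix fits in memory as dense arrays, which is plausible for about 60 documents.

```java
// Hypothetical helper, not part of Mahout.
public class PairwiseCosine {

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns 0.0 if either vector is all zeros (a convention chosen here).
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Builds the symmetric document-by-document similarity matrix.
    public static double[][] pairwise(double[][] rows) {
        int n = rows.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = cosine(rows[i], rows[i]);
            for (int j = i + 1; j < n; j++) {
                double s = cosine(rows[i], rows[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Made-up TF-IDF rows: one row per document, one column per term.
        double[][] tfidf = {
                {0.0, 1.2, 0.4},
                {0.0, 1.2, 0.0},
                {0.9, 0.0, 0.0}
        };
        double[][] sim = pairwise(tfidf);
        for (int i = 0; i < sim.length; i++) {
            for (int j = i + 1; j < sim.length; j++) {
                System.out.printf("Doc %d - Doc %d is %.3f similar%n",
                        i + 1, j + 1, sim[i][j]);
            }
        }
    }
}
```

For non-negative TF-IDF weights the score ranges from 0 (no shared terms) to 1 (same direction), which maps directly onto the "Doc i - Doc j is X similar" output.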
Can you please help?

-- 
Regards
Junaid
