Hi, I got your email address from one of the Mahout forums.
I need some help. I have about 60 documents for which I am calculating TF-IDF. These are the steps I am following:

1. Convert the files into a sequence file using the SequenceFilesFromDirectory run() method.
2. Tokenize the generated sequence file using the DocumentProcessor tokenizeDocuments() method.
3. Create the term frequency vectors using the DictionaryVectorizer createTermFrequencyVectors() method.
4. Create the TF-IDF vectors using the TFIDFConverter processTfIdf() method.
5. Create the matrix using code from RowIdJob.

What more needs to be done? I want to find the similarity between each pair of documents, something like:

Doc 1 - Doc 2 is XXX similar
Doc 1 - Doc 3 is YYY similar
Doc 2 - Doc 3 is ZZZ similar

Can you please help?

--
Regards
Junaid
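
To make the desired output concrete, here is a minimal sketch of a pairwise comparison using cosine similarity, which is one common choice for TF-IDF vectors. It is plain Java and does not call Mahout. The class name PairwiseCosine, the helper vector(), and the weights in main() are made up for illustration and are not the output of the steps above. For larger collections, Mahout's RowSimilarityJob may be a better fit for this step, but that is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PairwiseCosine {

    // Hypothetical helper: builds a sparse vector (term index -> weight)
    // from dense weights, skipping zero entries.
    public static Map<Integer, Double> vector(double... weights) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] != 0.0) {
                v.put(i, weights[i]);
            }
        }
        return v;
    }

    // Euclidean length of a sparse vector.
    public static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns 0.0 when either vector is empty.
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot / (na * nb);
    }

    public static void main(String[] args) {
        // Made-up TF-IDF weights standing in for three documents.
        List<Map<Integer, Double>> docs = new ArrayList<>();
        docs.add(vector(0.5, 0.0, 1.2, 0.0));
        docs.add(vector(0.4, 0.9, 0.0, 0.0));
        docs.add(vector(0.0, 0.0, 1.0, 0.7));

        // Compare each unordered pair of documents exactly once.
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                double sim = cosine(docs.get(i), docs.get(j));
                System.out.printf("Doc %d - Doc %d is %.3f similar%n", i + 1, j + 1, sim);
            }
        }
    }
}
```

With 60 documents this double loop makes 60 * 59 / 2 = 1770 comparisons, which is small enough to run in memory.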
