On 04.01.2012 23:34, Grant Ingersoll wrote:
On Jan 4, 2012, at 3:22 PM, Dmitriy Lyubimov wrote:
Also, via the command line, the same processing is (I think) achieved by
the seqdirectory command.
./bin/mahout seqdirectory will convert to sequence files
./bin/mahout seq2sparse will do the TF-IDF conversion
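For a prototype, it may help to see what a TF-IDF weight is before running the tools. The sketch below uses the textbook formula tf * log(N / df) on documents that are already tokenized; it is only an illustration and makes no claim to match the exact weighting or tokenization that seq2sparse applies. The function name tf_idf and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook tf * log(N / df))."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical, already-tokenized documents.
docs = [["mahout", "vector", "vector"], ["mahout", "cluster"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document (here "mahout") gets weight 0 under this formula, which is the point of the IDF factor.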
You can find the mapping between the command name and the driver class
in the driver.classes.props file (in the conf dir of your mahout
distribution, or src/conf if you have mahout trunk). This is how mahout
finds the name of the class to run.
For example: ./bin/mahout seqdirectory will run the class
org.apache.mahout.text.SequenceFilesFromDirectory as described by the line:
org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
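To make the lookup concrete, here is a rough sketch of how such a line can be split into class name, short command name and description. It assumes every entry has the shape "class = shortName : description" on a single line, as in the entry quoted above; the real file may contain other forms. parse_driver_props is a hypothetical helper written for this example, not part of Mahout.

```python
def parse_driver_props(text):
    """Map each short command name to a (class name, description) pair.

    Assumes entries look like: fully.qualified.Class = shortName : description
    """
    commands = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not a key = value entry.
        if not line or line.startswith("#") or "=" not in line:
            continue
        class_name, _, rest = line.partition("=")
        short_name, _, description = rest.partition(":")
        commands[short_name.strip()] = (class_name.strip(), description.strip())
    return commands


# The entry quoted in this thread, on one line.
SAMPLE = (
    "org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : "
    "Generate sequence files (of Text) from a directory\n"
)

commands = parse_driver_props(SAMPLE)
print(commands["seqdirectory"][0])
```

Looking up "seqdirectory" in the resulting dict gives the class that ./bin/mahout seqdirectory would run.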
See examples/bin/cluster-reuters, amongst others, for examples of these in
action.
On Wed, Jan 4, 2012 at 8:31 AM, Grant Ingersoll wrote:
Hi Junaid,
Have a look at the SparseVectorsFromSequenceFiles class, which already does
this in combination with SequenceFilesFromDirectory, which can convert text
files to SequenceFiles.
-Grant
On Jan 4, 2012, at 8:30 AM, Junaid Surve wrote:
Hi
I want to develop a prototype to calculate the TF-IDF of the documents
present in a directory.
Can you please help me with the steps to go about it using Apache Mahout?
Thank you.
--
Regards
Junaid
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
--
Ioan Eugen Stan
http://ieugen.blogspot.com