On 04.01.2012 23:34, Grant Ingersoll wrote:

On Jan 4, 2012, at 3:22 PM, Dmitriy Lyubimov wrote:

Also, from the command line, the same processing is (I think) achieved by
the seqdirectory command.

./bin/mahout seqdirectory will convert to sequence files
./bin/mahout seq2sparse will do the TF-IDF conversion
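
To make the two-step pipeline above concrete, here is a minimal sketch that only builds the two command lines and prints them (it does not run Mahout). The directory names and the helper name build_tfidf_pipeline are hypothetical. The -i, -o and -wt tfidf options are what I would expect these drivers to accept, but that is an assumption; please check them against the --help output of your Mahout version.

```python
import shlex


def build_tfidf_pipeline(mahout_bin, text_dir, seq_dir, vec_dir):
    """Return the two Mahout command lines as argument lists (nothing is executed).

    Step 1: seqdirectory turns a directory of text files into sequence files.
    Step 2: seq2sparse turns those sequence files into TF-IDF vectors.
    The option names below are assumptions; verify them with --help.
    """
    to_sequence = [mahout_bin, "seqdirectory", "-i", text_dir, "-o", seq_dir]
    to_vectors = [mahout_bin, "seq2sparse", "-i", seq_dir, "-o", vec_dir,
                  "-wt", "tfidf"]
    return [to_sequence, to_vectors]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in build_tfidf_pipeline("./bin/mahout", "docs", "docs-seq", "docs-vectors"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

The output directory of the first step is the input directory of the second, which is the only coupling between the two commands.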


You can find the mapping between the command name and the driver class in the driver.classes.props file (in the conf dir of your Mahout distribution, or src/conf if you have Mahout trunk). This is how mahout finds the name of the class to run.

For example: ./bin/mahout seqdirectory will run the class org.apache.mahout.text.SequenceFilesFromDirectory as described by the line:

org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : Generate sequence files (of Text) from a directory
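
As a rough illustration of that lookup, the sketch below parses lines of the form "class = name : description" and builds a name-to-class map. It is not Mahout's own loader. The function names are hypothetical, and it only handles the line format shown above, so entries written differently in the real file may need extra handling.

```python
def parse_driver_line(line):
    """Split 'class = shortname : description' into (shortname, class, description)."""
    class_name, _, rest = line.partition("=")
    short_name, _, description = rest.partition(":")
    return short_name.strip(), class_name.strip(), description.strip()


def load_driver_map(lines):
    """Map each command name to its driver class, skipping blanks and comments."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        short_name, class_name, _ = parse_driver_line(line)
        mapping[short_name] = class_name
    return mapping


if __name__ == "__main__":
    sample = [
        "org.apache.mahout.text.SequenceFilesFromDirectory = seqdirectory : "
        "Generate sequence files (of Text) from a directory",
    ]
    print(load_driver_map(sample)["seqdirectory"])
```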

See examples/bin/cluster-reuters, amongst others, for examples of these in 
action.


On Wed, Jan 4, 2012 at 8:31 AM, Grant Ingersoll <[email protected]> wrote:
Hi Junaid,

Have a look at the SparseVectorsFromSequenceFiles class, which already does
this, in combination with SequenceFilesFromDirectory, which can convert text
files to SequenceFiles.

-Grant
On Jan 4, 2012, at 8:30 AM, Junaid Surve wrote:

Hi

I want to develop a prototype to calculate the TF-IDF of the documents
present in a directory.

Can you please help me with the steps to go about it using Apache Mahout?
Thank you.

--
Regards
Junaid

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com

--
Ioan Eugen Stan
http://ieugen.blogspot.com
