The *seqdirectory* command takes every file in the specified directory and converts it into Hadoop Sequence File <http://wiki.apache.org/hadoop/SequenceFile> format. Sequence Files store key-value pairs; when you convert a list of files, the file name becomes the key and the file contents become the value. However, this is quite impractical if your corpus is large, as disk reading and writing can become painfully slow. You might want to have a look at this discussion on StackOverflow <http://stackoverflow.com/questions/11645294/how-can-i-use-mahouts-sequencefile-api-code/>, which covers how to use the Sequence File API to transform a key-value CSV file into Sequence Files.
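To make the key/value layout concrete, here is a small Python sketch of the mapping described above. It is only an illustration, not Mahout's code: it does not write a real Sequence File, and the function name `directory_to_pairs`, the `charset` parameter, and the leading "/" on each key are my own assumptions.

```python
import os


def directory_to_pairs(root, charset="utf-8"):
    """Yield (key, value) pairs for every file under root.

    Hypothetical helper: the key is the file path relative to root,
    the value is the decoded text content of that file.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            # Build a "/"-separated key relative to the input directory.
            relative = os.path.relpath(full_path, root)
            key = "/" + relative.replace(os.sep, "/")
            with open(full_path, encoding=charset) as handle:
                yield key, handle.read()
```

Calling `dict(directory_to_pairs("my_corpus"))` on a directory of text files (the directory name is a placeholder) would give one entry per document, which is the shape of data that seq2sparse then turns into vectors.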
The *seq2sparse* Mahout shell command converts text documents in Sequence File format to vectors, using either TF or TF-IDF <http://en.wikipedia.org/wiki/Tf*idf> weighting, with n-gram generation. I suggest looking at this quick tour <https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html> for now, but I would strongly recommend reading the Mahout in Action book <http://manning.com/owen/>, specifically chapter 8.

Hope this helps.

On Mon, Sep 17, 2012 at 11:18 AM, David Scarlatti <[email protected]> wrote:

> Hi, I'd appreciate any hint on the best source of reference information...
> I've found different examples and quick guides, but if I want to know, e.g.,
> what seqdirectory or seq2sparse exactly does and which are the different
> command line options with a detailed description, I can't find the place...
> Is this something still to do in Mahout? Should I look at the source code
> to know this?
>
> Thanks in advance.
>
> --
> -----
> David.
>
