Re: Pointer to Reference Docs

Julian Ortega Mon, 17 Sep 2012 02:33:28 -0700

The *seqdirectory* command takes every file in the specified directory and
writes it to a Hadoop Sequence File
<http://wiki.apache.org/hadoop/SequenceFile>. Sequence Files store key-value
pairs; when you convert a list of files, the file name becomes the key and
the file contents become the value. However, this is quite impractical if
your corpus is large, as disk reading and writing can become painfully slow.
You might want to have a look at this discussion on StackOverflow
<http://stackoverflow.com/questions/11645294/how-can-i-use-mahouts-sequencefile-api-code/>,
which covers how to use the Sequence File API to transform a key-value
CSV file into Sequence Files.
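
To make the key/value mapping concrete, here is a rough Python sketch of the
idea. It is only a conceptual model, not Mahout's code: the function name
directory_to_pairs is made up for illustration, it assumes UTF-8 text files,
and the exact key format Mahout writes may differ from the relative path
used here.

```python
import os


def directory_to_pairs(root):
    """Yield (key, value) pairs for every file under root.

    Key: the file path relative to root (an assumption for this sketch).
    Value: the full text content of the file.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                yield os.path.relpath(path, root), handle.read()
```

Each yielded pair corresponds to one record that would end up in the
Sequence File.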

The *seq2sparse* Mahout shell command converts the text documents in
Sequence File format to vectors using either TF or TF-IDF
<http://en.wikipedia.org/wiki/Tf*idf> weighting with n-gram generation.
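
As a rough illustration of what TF-IDF weighting with n-grams means, here is
a small Python sketch. It uses the textbook formula tf * log(N / df) and
naive whitespace tokenization; both are assumptions for the example, and
Mahout's actual tokenizer and weighting formula may differ. The function
names ngrams and tfidf_vectors are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(docs, max_n=1):
    """Map each document key to a sparse {term: weight} vector.

    docs: dict of key -> text (the same key/value shape as a Sequence File).
    Weight: raw term count times log(N / document frequency).
    """
    # Term frequencies per document, counted over all generated n-grams.
    term_counts = {
        key: Counter(ngrams(text.lower().split(), max_n))
        for key, text in docs.items()
    }
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    return {
        key: {
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        for key, counts in term_counts.items()
    }
```

With max_n=2 the vectors contain both single words and word pairs; a term
that appears in every document gets weight zero under this formula.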

I suggest looking at this quick tour
<https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html>
for now, but I would strongly recommend reading the Mahout in Action book
<http://manning.com/owen/>, specifically chapter 8.

Hope this helps.

On Mon, Sep 17, 2012 at 11:18 AM, David Scarlatti <[email protected]> wrote:

> Hi, I'd appreciate any hint on the best source of reference information...
> I've found different examples and quick guides, but if I want to know, e.g.,
> what seqdirectory or seq2sparse exactly does and what the different
> command line options are, with a detailed description, I can't find the
> place... Is this something still to do in Mahout? Should I look at the
> source code to know this?
>
> Thanks in advance.
>
> --
> -----
> David.
>
