Re: Mahout Seqfile format

Lance Norskog Wed, 28 Dec 2011 12:35:36 -0800

From sequence file to sparse vector file is the fun part: there are
(roughly) two phases:
1) parse the text file and decide what is a document
2) analyze the document with the Lucene text search API and create
vectors from the output.


#1 you can figure out from the example code, like the Wikipedia, Reuters
and Newsgroups code.
#2 takes some technical background, but you can use it as a black box.
It is explained in Chapter 14 of Mahout in Action.
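
To make the two phases less of a black box, here is a minimal sketch in
plain Python. It is not Mahout's code and uses neither Hadoop sequence
files nor Lucene: the function names (read_documents, tokenize,
build_dictionary, to_sparse_vector) are made up for illustration, the
regex tokenizer is only a stand-in for a real Lucene analyzer, and the
vectors hold raw term counts rather than any particular weighting. In
Mahout itself the second phase is, as far as I know, run with the
seq2sparse command.

```python
import os
import re
import tempfile
from collections import Counter


def read_documents(root):
    """Phase 1: walk a directory tree and yield (file path, file contents).

    Here one file is treated as one document, which mirrors the
    { filepath, contents } key/value pairs described for seqdirectory.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                yield path, handle.read()


def tokenize(text):
    """Hypothetical stand-in for an analyzer: lowercase, keep runs of a-z."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(token_lists):
    """Give every distinct term an integer index, in sorted term order."""
    terms = sorted({term for tokens in token_lists for term in tokens})
    return {term: index for index, term in enumerate(terms)}


def to_sparse_vector(tokens, dictionary):
    """Phase 2: turn a token list into {term index: term count}.

    Only terms that occur in the document get an entry, which is what
    makes the vector sparse.
    """
    counts = Counter(tokens)
    return {dictionary[term]: count for term, count in counts.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Two tiny made-up documents, one file each.
        samples = {"a.txt": "Mahout builds vectors.", "b.txt": "Vectors from text."}
        for name, text in samples.items():
            with open(os.path.join(root, name), "w", encoding="utf-8") as out:
                out.write(text)

        documents = list(read_documents(root))
        token_lists = [tokenize(text) for _path, text in documents]
        dictionary = build_dictionary(token_lists)
        for (path, _text), tokens in zip(documents, token_lists):
            print(os.path.basename(path), to_sparse_vector(tokens, dictionary))
```

The dictionary is built over the whole collection before any vector is
created, because every document has to agree on which index means which
term. That is also why the real thing is a multi-step job rather than a
single pass.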

On Wed, Dec 28, 2011 at 8:57 AM, Josh Patterson <[email protected]> wrote:
> Rahul,
> Currently the text file to sequence file functionality is contained in
> (as of Mahout 0.6 / trunk):
>
> org.apache.mahout.text.SequenceFilesFromDirectory
>
> and it writes a K/V pair to a standard sequence file in the form of:
>
> { filepath (Text), contents of file (Text) }
>
> In the current single process form of the code it uses a custom
> PathFilter (SequenceFilesFromDirectoryFilter) to recursively walk down
> a directory and its child directories to write the contained files
> into a series of sequence files based on a variety of options like
> "chunk size".
>
> An example of running this would be:
>
> bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
>
> Josh
>
> On Wed, Dec 28, 2011 at 7:00 AM, rahul raghavendhra
> <[email protected]> wrote:
>> I am new to Mahout.. i just want to know how text file is converted into
>> seqfile and then to sparse vectors..
>> any kind of text file can  be converted into seq file using ./mahout
>> seqdirectory ?
>>
>> thanks in advance..
>>
>> ./rahul
>
>
>
> --
> Twitter: @jpatanooga
> Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com



-- 
Lance Norskog
[email protected]
