From sequence file to sparse vector file is the fun part. There are (roughly) two phases:
1) parse the text file and decide what is a document
2) analyze the document with the Lucene text search API and create vectors from the output.
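
To make the two phases concrete, here is a rough sketch in plain Python. It is not Mahout or Lucene code: the function names (read_documents, tokenize, vectorize) are made up for illustration, the regex tokenizer is only a stand-in for a real Lucene analyzer, and the vectors are simple term counts with no weighting.

```python
import os
import re
from collections import Counter


def read_documents(root, encoding="utf-8"):
    """Phase 1: walk a directory tree and yield (path, text) pairs.

    Here every file counts as one document; names are sorted so the
    order is deterministic.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding=encoding) as handle:
                yield path, handle.read()


def tokenize(text):
    """Stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(documents):
    """Phase 2: build a term dictionary and sparse term-count vectors.

    Returns (dictionary, vectors) where dictionary maps term -> index
    and vectors maps document key -> {term_index: count}.
    """
    dictionary = {}
    vectors = {}
    for key, text in documents:
        counts = Counter(tokenize(text))
        vector = {}
        for term, count in counts.items():
            # Assign the next free index the first time a term is seen.
            index = dictionary.setdefault(term, len(dictionary))
            vector[index] = count
        vectors[key] = vector
    return dictionary, vectors
```

Calling vectorize(read_documents("reuters/")) would give one sparse vector per file, keyed by file path, which mirrors the { filepath, contents } pairs that seqdirectory writes.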
#1 you can figure out from example code, like the Wikipedia, Reuters and Newsgroups code. #2 takes some technical background, but you can use it as a black box. It is explained in Chapter 14 of Mahout in Action.

On Wed, Dec 28, 2011 at 8:57 AM, Josh Patterson <[email protected]> wrote:
> Rahul,
>
> Currently the text file to sequence file functionality is contained in
> (as of Mahout 0.6 / trunk):
>
> org.apache.mahout.text.SequenceFilesFromDirectory
>
> and it writes a K/V pair to a standard sequence file in the form of:
>
> { filepath (Text), contents of file (Text) }
>
> In the current single process form of the code it uses a custom
> PathFilter (SequenceFilesFromDirectoryFilter) to recursively walk down
> a directory and its child directories to write the contained files
> into a series of sequence files based on a variety of options like
> "chunk size".
>
> An example of running this would be:
>
> bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
>
> Josh
>
> On Wed, Dec 28, 2011 at 7:00 AM, rahul raghavendhra
> <[email protected]> wrote:
>> I am new to Mahout.. i just want to know how text file is converted into
>> seqfile and then to sparse vectors..
>> any kind of text file can be converted into seq file using ./mahout
>> seqdirectory ?
>>
>> thanks in advance..
>>
>> ./rahul
>
> --
> Twitter: @jpatanooga
> Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com

--
Lance Norskog
[email protected]
