Charly,

Thank you for the script. I am going to try it to create vectors for the
classification and LDA algorithms.  I think I need to pay attention to the
variable type of the key.  Anyway, I'll give it a shot. Thanks again.
Nick
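
For anyone following the thread, here is a minimal sketch of the per-line
parsing that the mapper quoted below performs, written in plain Java with no
Hadoop dependency so it can be tried on its own. The class and method names
(TabLineParser, parseLine, parseAll) are hypothetical and not part of Mahout
or Hadoop. It assumes the "docId {tab} document text" layout proposed in the
original question, and it splits at the first tab only, which is a choice made
here so that tabs inside the document text survive. On the key type: as far as
I know, Mahout's vectorization step expects Text keys and Text values, which is
why both halves are kept as strings.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout or Hadoop.
class TabLineParser {

    // Splits "docId<TAB>document text" at the first tab only.
    // Returns null when the line has no tab or the id is empty.
    static Map.Entry<String, String> parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            return null;
        }
        return new SimpleEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    // Parses every line and silently skips the malformed ones,
    // mirroring the length check in the mapper.
    static List<Map.Entry<String, String>> parseAll(List<String> lines) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String line : lines) {
            Map.Entry<String, String> entry = parseLine(line);
            if (entry != null) {
                out.add(entry);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "docId1\tfirst document text",
                "a line without a tab",
                "docId2\ttext with\tan embedded tab");
        for (Map.Entry<String, String> entry : parseAll(lines)) {
            System.out.println(entry.getKey() + " => " + entry.getValue());
        }
    }
}
```

In the MapReduce version, each (id, text) pair would be wrapped in Text objects
and passed to context.write instead of being collected in a list.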


> From: [email protected]
> Date: Wed, 31 Oct 2012 17:30:48 -0300
> Subject: Re: Converting one large text file with multiple documents to 
> SequenceFile format
> To: [email protected]
> 
> I don't think you need that. Just a simple mapper.
> 
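> import java.io.IOException;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
>
>     @Override
>     protected void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         // Each input line is "docId<TAB>document text". Split at the first
>         // tab only, so tabs inside the document text are kept.
>         String[] fields = value.toString().split("\t", 2);
>         if (fields.length >= 2) {
>             // Emit the id as the key and the text as the value.
>             context.write(new Text(fields[0]), new Text(fields[1]));
>         }
>     }
> }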
> 
> and then run a simple job:
> 
>         Job text2SequenceFileJob = this.prepareJob(
>                 this.getInputPath(), this.getOutputPath(),
>                 TextInputFormat.class, IdentityMapper.class,
>                 Text.class, Text.class, SequenceFileOutputFormat.class);
>
>         // Map-only job: the mapper output goes straight to the SequenceFile.
>         text2SequenceFileJob.setOutputKeyClass(Text.class);
>         text2SequenceFileJob.setOutputValueClass(Text.class);
>         text2SequenceFileJob.setNumReduceTasks(0);
>
>         text2SequenceFileJob.waitForCompletion(true);
> 
> Cheers!
> Charly
> 
> On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
> 
> >
> > Yeah, I've looked at filter classes, but nothing worked.  I guess I'll do
> > something similar and continuously save each line into a file and then run
> > seqdirectory.  The running time won't look good, but at least it should
> > work.  Thanks for the response.
> >
> > Nick
> >
> > > From: [email protected]
> > > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > > Subject: Re: Converting one large text file with multiple documents to
> > SequenceFile format
> > > To: [email protected]
> > >
> > > I had the exact same issue and I tried to use the seqdirectory command
> > > with a different filter class, but it did not work. It seems there's a
> > > bug in the mahout-0.6 code.
> > >
> > > I ended up writing a custom map-reduce program that performs just that.
> > >
> > > Greetings!
> > > Charly
> > >
> > > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]>
> > wrote:
> > >
> > > >
> > > > I have done a lot of searching on the web for this, but I've found
> > > > nothing, even though I feel like it has to be somewhat common. I have
> > > > used Mahout's 'seqdirectory' command to convert a folder containing
> > > > text files (each file is a separate document) in the past. But in this
> > > > case there are so many documents (in the 100,000s) that I have one very
> > > > large text file in which each line is a document. How can I convert
> > > > this large file to SequenceFile format so that Mahout understands that
> > > > each line should be considered a separate document?  Would it be better
> > > > if the file was structured like so:
> > > >
> > > > docId1 {tab} document text
> > > > docId2 {tab} document text
> > > > docId3 {tab} document text
> > > > ...
> > > >
> > > > Thank you very much for any help.
> > > > Nick
> > > >
> >
> >

                                          
