Dear Nick,

I experienced the same problem. The thing is that when you call cvb, it expects as input the matrix folder inside the output folder of seq2sparse. So instead of

    mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10

please try:

    mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10

Let me know if that solves it ;)
cheers,
Diego

On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
>
> Diego,
>
> Thank you so much for the script. I used it to convert my large text
> file to a sequence file. I have been trying to use that sequence file to
> feed Mahout's LDA implementation (Mahout 0.7, so the CVB implementation).
> I first converted the sequence file to vectors with
>
>     mahout seq2sparse -i input/processedaa.seq -o output -ow -wt tf -nr 7
>
> and then ran the LDA with
>
>     mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
>
> The seq2sparse command produces the tf vectors fine, but no matter what
> parameters I use, the LDA job sits at map 0% reduce 0% for an hour before
> failing with the error below, a ClassCastException from Text to
> IntWritable. My question is: when you said that the key is the line
> number, what variable type is the key? Is it Text?
>
> My output:
>
> 12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments:
> {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0],
> --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001],
> --endPhase=[2147483647], --input=[output/tf-vectors],
> --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10],
> --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4],
> --num_update_threads=[1], --output=[topics], --overwrite=null,
> --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001],
> --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> 12/11/11 16:10:53 INFO mapred.JobClient: map 0% reduce 0%
> 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Thank you again for your help!
> Nick
>
>
>> From: [email protected]
>> Date: Thu, 1 Nov 2012 01:07:29 +0100
>> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
>> To: [email protected]
>>
>> Hey Nick,
>> I had exactly the same problem ;)
>> I wrote a simple command line utility to create a sequence
>> file where each line of the input document is an entry
>> (the key is the line number).
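
By the way, since you asked about the key type: the utility is essentially just a loop that appends each input line to a SequenceFile.Writer. A simplified sketch of the idea (not the exact code from the jar; here the key is the line number written as Text, and the value is the line itself):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class LinesToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: input text file, one document per line
        // args[1]: output sequence file
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, Text.class);
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path(args[0]))));
        try {
          String line;
          long lineNo = 0;
          while ((line = in.readLine()) != null) {
            // key: the line number, value: the document on that line
            writer.append(new Text(Long.toString(lineNo++)), new Text(line));
          }
        } finally {
          in.close();
          writer.close();
        }
      }
    }

Note that seq2sparse carries those Text keys through as the document ids of the tf-vectors, which is, I think, where the Text in your cast error comes from.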
>>
>> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
>>
>> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI \
>>     -input tweets -output tweets.seq
>>
>> enjoy ;)
>> Diego
>>
>> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde <[email protected]> wrote:
>> > I don't think you need that. Just a simple mapper:
>> >
>> > import java.io.IOException;
>> > import org.apache.hadoop.io.LongWritable;
>> > import org.apache.hadoop.io.Text;
>> > import org.apache.hadoop.mapreduce.Mapper;
>> > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>> > import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>> >
>> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
>> >
>> >   @Override
>> >   protected void map(LongWritable key, Text value, Context context)
>> >       throws IOException, InterruptedException {
>> >     // each input line is "docId<tab>document text"
>> >     String[] fields = value.toString().split("\t");
>> >     if (fields.length >= 2) {
>> >       context.write(new Text(fields[0]), new Text(fields[1]));
>> >     }
>> >   }
>> > }
>> >
>> > and then run a simple job:
>> >
>> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
>> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
>> >     Text.class, Text.class, SequenceFileOutputFormat.class);
>> >
>> > text2SequenceFileJob.setOutputKeyClass(Text.class);
>> > text2SequenceFileJob.setOutputValueClass(Text.class);
>> > text2SequenceFileJob.setNumReduceTasks(0);
>> >
>> > text2SequenceFileJob.waitForCompletion(true);
>> >
>> > Cheers!
>> > Charly
>> >
>> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
>> >
>> >> Yeah, I've looked at filter classes, but nothing worked. I guess I'll do
>> >> something similar and continuously save each line into a file and then run
>> >> seqdirectory. The running time won't look good, but at least it should
>> >> work. Thanks for the response.
>> >>
>> >> Nick
>> >>
>> >> > From: [email protected]
>> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
>> >> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
>> >> > To: [email protected]
>> >> >
>> >> > I had the exact same issue, and I tried to use the seqdirectory command
>> >> > with a different filter class, but it did not work. It seems there's a
>> >> > bug in the mahout-0.6 code.
>> >> >
>> >> > I ended up writing a custom map-reduce program that does just that.
>> >> >
>> >> > Greetings!
>> >> > Charly
>> >> >
>> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
>> >> >
>> >> > > I have done a lot of searching on the web for this, but I've found
>> >> > > nothing, even though I feel like it has to be somewhat common. I have
>> >> > > used Mahout's 'seqdirectory' command to convert a folder containing
>> >> > > text files (each file is a separate document) in the past. But in this
>> >> > > case there are so many documents (in the 100,000s) that I have one very
>> >> > > large text file in which each line is a document. How can I convert
>> >> > > this large file to SequenceFile format so that Mahout understands that
>> >> > > each line should be considered a separate document? Would it be better
>> >> > > if the file was structured like so?
>> >> > >
>> >> > > docId1 {tab} document text
>> >> > > docId2 {tab} document text
>> >> > > docId3 {tab} document text
>> >> > > ...
>> >> > >
>> >> > > Thank you very much for any help.
>> >> > > Nick
>> >>
>>
>

--
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
________________________________________
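
P.S. About the ClassCastException in your log: CachingCVB0Mapper reads IntWritable document ids, while the tf-vectors produced by seq2sparse are keyed by Text, and that is exactly the cast that fails. If you don't find a matrix folder, you can re-key the vectors yourself. A rough sketch (sequential, single process; the class name is made up, and if I remember correctly Mahout's rowid utility does roughly this and also writes a docIndex):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.VectorWritable;

    public class TextKeysToIntKeys {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: a part file inside tf-vectors, e.g. output/tf-vectors/part-r-00000
        // args[1]: output path for the re-keyed vectors
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), IntWritable.class, VectorWritable.class);
        Text docName = new Text();
        VectorWritable vector = new VectorWritable();
        IntWritable docId = new IntWritable();
        int i = 0;
        while (reader.next(docName, vector)) {
          docId.set(i++);                // sequential int id per document
          writer.append(docId, vector);  // CVB wants <IntWritable, VectorWritable>
        }
        reader.close();
        writer.close();
      }
    }

You lose the original document names this way, so keep a mapping somewhere if you need them back.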
