Dear Nick,

I experienced the same problem. The thing is that when you call cvb, it expects as input the matrix folder inside the output folder of seq2sparse. So instead of

    mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10

please try:

    mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10

Let me know if that solves it ;)
cheers,
Diego

On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
>
> Diego,
>
> Thank you so much for the script. I used it to convert my large text
> file to a sequence file. I have been trying to use that sequence file to
> feed Mahout's LDA implementation (Mahout 0.7, so the CVB implementation).
> I first converted the sequence file to vectors with
>
>     mahout seq2sparse -i input/processedaa.seq -o output -ow -wt tf -nr 7
>
> and then ran the LDA with
>
>     mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
>
> The seq2sparse command produces the tf vectors fine, but no matter what
> parameters I use, the LDA job sits at map 0% reduce 0% for an hour before
> failing with the error below, a ClassCastException from Text to
> IntWritable. My question is: when you said that the key is the line
> number, what variable type is the key? Is it Text?
>
> My output:
>
> 12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments:
> {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0],
> --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001],
> --endPhase=[2147483647], --input=[output/tf-vectors],
> --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10],
> --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4],
> --num_update_threads=[1], --output=[topics], --overwrite=null,
> --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001],
> --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> 12/11/11 16:10:53 INFO mapred.JobClient: map 0% reduce 0%
> 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Thank you again for your help!
> Nick
>
>
>> From: [email protected]
>> Date: Thu, 1 Nov 2012 01:07:29 +0100
>> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
>> To: [email protected]
>>
>> Hey Nick,
>> I had exactly the same problem ;)
>> I wrote a simple command line utility to create a sequence
>> file where each line of the input document is an entry
>> (the key is the line number).
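
By the way, since you asked about the key type: the utility is essentially just a loop that appends each input line to a SequenceFile.Writer. A simplified sketch of the idea (not the exact code from the jar; here the key is the line number written as Text, and the value is the line itself):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class LinesToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: input text file, one document per line
        // args[1]: output sequence file
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, Text.class);
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path(args[0]))));
        try {
          String line;
          long lineNo = 0;
          while ((line = in.readLine()) != null) {
            // key: the line number, value: the document on that line
            writer.append(new Text(Long.toString(lineNo++)), new Text(line));
          }
        } finally {
          in.close();
          writer.close();
        }
      }
    }

Note that seq2sparse carries those Text keys through as the document ids of the tf-vectors, which is, I think, where the Text in your cast error comes from.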
>>
>> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
>>
>> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI \
>>     -input tweets -output tweets.seq
>>
>> enjoy ;)
>> Diego
>>
>> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde <[email protected]> wrote:
>> > I don't think you need that. Just a simple mapper:
>> >
>> > import java.io.IOException;
>> > import org.apache.hadoop.io.LongWritable;
>> > import org.apache.hadoop.io.Text;
>> > import org.apache.hadoop.mapreduce.Mapper;
>> > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>> > import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>> >
>> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
>> >
>> >   @Override
>> >   protected void map(LongWritable key, Text value, Context context)
>> >       throws IOException, InterruptedException {
>> >     // each input line is "docId<tab>document text"
>> >     String[] fields = value.toString().split("\t");
>> >     if (fields.length >= 2) {
>> >       context.write(new Text(fields[0]), new Text(fields[1]));
>> >     }
>> >   }
>> > }
>> >
>> > and then run a simple job:
>> >
>> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
>> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
>> >     Text.class, Text.class, SequenceFileOutputFormat.class);
>> >
>> > text2SequenceFileJob.setOutputKeyClass(Text.class);
>> > text2SequenceFileJob.setOutputValueClass(Text.class);
>> > text2SequenceFileJob.setNumReduceTasks(0);
>> >
>> > text2SequenceFileJob.waitForCompletion(true);
>> >
>> > Cheers!
>> > Charly
>> >
>> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
>> >
>> >> Yeah, I've looked at filter classes, but nothing worked. I guess I'll do
>> >> something similar and continuously save each line into a file and then run
>> >> seqdirectory. The running time won't look good, but at least it should
>> >> work. Thanks for the response.
>> >>
>> >> Nick
>> >>
>> >> > From: [email protected]
>> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
>> >> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
>> >> > To: [email protected]
>> >> >
>> >> > I had the exact same issue, and I tried to use the seqdirectory command
>> >> > with a different filter class, but it did not work. It seems there's a
>> >> > bug in the mahout-0.6 code.
>> >> >
>> >> > I ended up writing a custom map-reduce program that does just that.
>> >> >
>> >> > Greetings!
>> >> > Charly
>> >> >
>> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
>> >> >
>> >> > > I have done a lot of searching on the web for this, but I've found
>> >> > > nothing, even though I feel like it has to be somewhat common. I have
>> >> > > used Mahout's 'seqdirectory' command to convert a folder containing
>> >> > > text files (each file is a separate document) in the past. But in this
>> >> > > case there are so many documents (in the 100,000s) that I have one very
>> >> > > large text file in which each line is a document. How can I convert
>> >> > > this large file to SequenceFile format so that Mahout understands that
>> >> > > each line should be considered a separate document? Would it be better
>> >> > > if the file was structured like so?
>> >> > >
>> >> > > docId1 {tab} document text
>> >> > > docId2 {tab} document text
>> >> > > docId3 {tab} document text
>> >> > > ...
>> >> > >
>> >> > > Thank you very much for any help.
>> >> > > Nick
>> >>
>>
>

--
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
________________________________________
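
P.S. About the ClassCastException in your log: CachingCVB0Mapper reads IntWritable document ids, while the tf-vectors produced by seq2sparse are keyed by Text, and that is exactly the cast that fails. If you don't find a matrix folder, you can re-key the vectors yourself. A rough sketch (sequential, single process; the class name is made up, and if I remember correctly Mahout's rowid utility does roughly this and also writes a docIndex):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.VectorWritable;

    public class TextKeysToIntKeys {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: a part file inside tf-vectors, e.g. output/tf-vectors/part-r-00000
        // args[1]: output path for the re-keyed vectors
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), IntWritable.class, VectorWritable.class);
        Text docName = new Text();
        VectorWritable vector = new VectorWritable();
        IntWritable docId = new IntWritable();
        int i = 0;
        while (reader.next(docName, vector)) {
          docId.set(i++);                // sequential int id per document
          writer.append(docId, vector);  // CVB wants <IntWritable, VectorWritable>
        }
        reader.close();
        writer.close();
      }
    }

You lose the original document names this way, so keep a mapping somewhere if you need them back.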
