Dan,

Thank you. Specifying the matrix folder did the trick. After a few test runs it's clear that I will have to dramatically reduce the size of my corpus: running LDA on a 200 MB chunk took more than 24 hours, and the total corpus is almost 2 GB. Combining LDA runs from separate slices of the corpus doesn't sound feasible, so I'll have to rethink my approach. But thanks again for the help.

Nick
> Date: Mon, 12 Nov 2012 09:27:32 -0800
> From: [email protected]
> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> To: [email protected]
> CC: [email protected]
>
> CVB requires the vector input to be Key=IntWritable, Value=VectorWritable.
> rowid will convert the seq2sparse output to this format, as you assumed. But
> when you ran rowid, I assume the vector output was written to this file:
> output/matrix/Matrix
>
> So, try running CVB like this:
>
> mahout cvb -i output/matrix/Matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7
>
> rowid creates a Matrix file, but it also creates a docIndex file that maps the
> original sparse vector keys (Text) to the integer ids that rowid created.
> I'm guessing CVB is blowing up because it is trying to process that docIndex
> file. So explicitly specify the Matrix file as input to CVB, or move the
> docIndex file to some other folder, as I have done (before starting CVB):
>
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
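>
> For example, something along these lines (the folder name here is just an example):
>
>   hadoop fs -mkdir output/docIndex-moved
>   hadoop fs -mv output/matrix/docIndex output/docIndex-moved/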
>
> You may also want to check that output/matrix/Matrix contains data, e.g.:
>
> mahout seqdumper -s output/matrix/Matrix
>
> Dan
>
>
> ________________________________
> From: Nick Woodward <[email protected]>
> To: [email protected]
> Sent: Monday, November 12, 2012 11:52 AM
> Subject: RE: Converting one large text file with multiple documents to SequenceFile format
>
> Diego,
>
> Thank you for your response. There was no matrix folder in output/tf-vectors
> from seq2sparse, so I created one with "mahout rowid -i output/tf-vectors -o
> output/matrix". Then I tried cvb again with the matrix folder: "mahout cvb -i
> output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt
> states -ow -k 100 --num_reduce_tasks 7". The results were similar, though
> this time it failed after 1%.
>
> c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
> Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
> MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
> Warning: $HADOOP_HOME is deprecated.
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> 12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
> 12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored topics. Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
> p(topic|docId) will be stored documents
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
> 12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
> 12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
> 12/11/12 09:35:17 INFO mapred.JobClient: map 0% reduce 0%
> 12/11/12 10:25:25 INFO mapred.JobClient: map 1% reduce 0%
> 12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601 seconds. Killing!
>
> Any ideas?
>
> Regards,
> Nick
>
>
> > From: [email protected]
> > Date: Mon, 12 Nov 2012 13:21:21 +0100
> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > To: [email protected]
> >
> > Dear Nick,
> >
> > I ran into the same problem. The issue is that when you call cvb, it
> > expects as input the matrix folder inside the output folder of seq2sparse.
> > So instead of
> >
> > mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> >
> > please try:
> >
> > mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> >
> > Let me know if that solves it ;)
> >
> > cheers,
> > Diego
> >
> > On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> > >
> > > Diego,
> > >
> > > Thank you so much for the script. I used it to convert my large text file
> > > to a sequence file, and I have been trying to use that sequence file to
> > > feed Mahout's LDA implementation (Mahout 0.7, so the CVB implementation).
> > > I first converted the sequence file to vectors with "mahout seq2sparse -i
> > > input/processedaa.seq -o output -ow -wt tf -nr 7" and then ran LDA with
> > > "mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics
> > > -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10". The
> > > seq2sparse command produces the tf vectors all right, but the problem is
> > > that no matter what parameters I use, the LDA job sits at map 0% reduce 0%
> > > for an hour before outputting the error below.
> > > It has an error casting Text to IntWritable. My question is: when you said
> > > that the key is the line number, what variable type is the key? Is it Text?
> > >
> > > My output:
> > >
> > > 12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > > 12/11/11 16:10:53 INFO mapred.JobClient: map 0% reduce 0%
> > > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > >     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > >
> > > Thank you again for your help!
> > >
> > > Nick
> > >
> > >
> > >> From: [email protected]
> > >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> > >> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > >> To: [email protected]
> > >>
> > >> Hei Nick,
> > >>
> > >> I had exactly the same problem ;) I wrote a simple command line utility
> > >> to create a sequence file where each line of the input document becomes
> > >> an entry (the key is the line number).
> > >>
> > >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> > >>
> > >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq
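> > >>
> > >> In essence the utility does something like the sketch below (a
> > >> simplified version for illustration, not the actual source):
> > >>
> > >> import java.io.BufferedReader;
> > >> import java.io.FileReader;
> > >> import org.apache.hadoop.conf.Configuration;
> > >> import org.apache.hadoop.fs.FileSystem;
> > >> import org.apache.hadoop.fs.Path;
> > >> import org.apache.hadoop.io.SequenceFile;
> > >> import org.apache.hadoop.io.Text;
> > >>
> > >> public class LinesToSequenceFile {
> > >>   public static void main(String[] args) throws Exception {
> > >>     Configuration conf = new Configuration();
> > >>     FileSystem fs = FileSystem.get(conf);
> > >>     // one SequenceFile entry per input line
> > >>     SequenceFile.Writer writer = new SequenceFile.Writer(
> > >>         fs, conf, new Path(args[1]), Text.class, Text.class);
> > >>     BufferedReader in = new BufferedReader(new FileReader(args[0]));
> > >>     try {
> > >>       String line;
> > >>       long n = 0;
> > >>       while ((line = in.readLine()) != null) {
> > >>         // key = the line number (written as Text), value = the document text
> > >>         writer.append(new Text(Long.toString(n++)), new Text(line));
> > >>       }
> > >>     } finally {
> > >>       in.close();
> > >>       writer.close();
> > >>     }
> > >>   }
> > >> }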
> > >>
> > >> enjoy ;)
> > >> Diego
> > >>
> > >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde <[email protected]> wrote:
> > >> > I don't think you need that. Just a simple mapper:
> > >> >
> > >> > import java.io.IOException;
> > >> > import org.apache.hadoop.io.LongWritable;
> > >> > import org.apache.hadoop.io.Text;
> > >> > import org.apache.hadoop.mapreduce.Job;
> > >> > import org.apache.hadoop.mapreduce.Mapper;
> > >> > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> > >> > import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
> > >> >
> > >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> > >> >
> > >> >   @Override
> > >> >   protected void map(LongWritable key, Text value, Context context)
> > >> >       throws IOException, InterruptedException {
> > >> >     // each input line looks like "docId<tab>document text"
> > >> >     String[] fields = value.toString().split("\t");
> > >> >     if (fields.length >= 2) {
> > >> >       context.write(new Text(fields[0]), new Text(fields[1]));
> > >> >     }
> > >> >   }
> > >> > }
> > >> >
> > >> > and then run a simple job:
> > >> >
> > >> > // prepareJob/getInputPath/getOutputPath come from Mahout's AbstractJob,
> > >> > // so this assumes the enclosing class extends AbstractJob
> > >> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> > >> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> > >> >     Text.class, Text.class, SequenceFileOutputFormat.class);
> > >> >
> > >> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> > >> > text2SequenceFileJob.setOutputValueClass(Text.class);
> > >> > text2SequenceFileJob.setNumReduceTasks(0);
> > >> >
> > >> > text2SequenceFileJob.waitForCompletion(true);
> > >> >
> > >> > Cheers!
> > >> > Charly
> > >> >
> > >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
> > >> >>
> > >> >> Yeah, I've looked at filter classes, but nothing worked. I guess I'll
> > >> >> do something similar and continuously save each line into its own file
> > >> >> and then run seqdirectory. The running time won't look good, but at
> > >> >> least it should work.
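> > >> >> For example, something along these lines (a rough, untested sketch;
> > >> >> the file names are just placeholders):
> > >> >>
> > >> >>   mkdir docs
> > >> >>   split -l 1 -a 6 corpus.txt docs/doc-
> > >> >>   mahout seqdirectory -i docs -o corpus-seq -c UTF-8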
> > >> >>
> > >> >> Thanks for the response.
> > >> >>
> > >> >> Nick
> > >> >>
> > >> >> > From: [email protected]
> > >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > >> >> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > >> >> > To: [email protected]
> > >> >> >
> > >> >> > I had the exact same issue. I tried to use the seqdirectory command
> > >> >> > with a different filter class, but it did not work; it seems there's
> > >> >> > a bug in the mahout-0.6 code.
> > >> >> >
> > >> >> > I ended up writing a custom map-reduce program that does just that.
> > >> >> >
> > >> >> > Greetings!
> > >> >> > Charly
> > >> >> >
> > >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
> > >> >> > >
> > >> >> > > I have done a lot of searching on the web for this, but I've found
> > >> >> > > nothing, even though I feel it has to be a fairly common task. I
> > >> >> > > have used Mahout's 'seqdirectory' command in the past to convert a
> > >> >> > > folder of text files (each file a separate document). But in this
> > >> >> > > case there are so many documents (in the 100,000s) that I have one
> > >> >> > > very large text file in which each line is a document. How can I
> > >> >> > > convert this large file to SequenceFile format so that Mahout
> > >> >> > > understands that each line should be considered a separate
> > >> >> > > document? Would it be better if the file were structured like so:
> > >> >> > >
> > >> >> > > docId1 {tab} document text
> > >> >> > > docId2 {tab} document text
> > >> >> > > docId3 {tab} document text
> > >> >> > > ...
> > >> >> > >
> > >> >> > > Thank you very much for any help.
> > >> >> > >
> > >> >> > > Nick
> > >> >> >
> > >> >>
> > >>
> > >> --
> > >> Computers are useless. They can only give you answers.
> > >> (Pablo Picasso)
> > >> _______________
> > >> Diego Ceccarelli
> > >> High Performance Computing Laboratory
> > >> Information Science and Technologies Institute (ISTI)
> > >> Italian National Research Council (CNR)
> > >> Via Moruzzi, 1
> > >> 56124 - Pisa - Italy
> > >>
> > >> Phone: +39 050 315 3055
> > >> Fax: +39 050 315 2040
> > >> ________________________________________
> >
> >
> > --
> > Computers are useless. They can only give you answers.
> > (Pablo Picasso)
> > _______________
> > Diego Ceccarelli
> > High Performance Computing Laboratory
> > Information Science and Technologies Institute (ISTI)
> > Italian National Research Council (CNR)
> > Via Moruzzi, 1
> > 56124 - Pisa - Italy
> >
> > Phone: +39 050 315 3055
> > Fax: +39 050 315 2040
> > ________________________________________
