Nick, I helped write the cvb0 implementation with Jake Mannix over the summer of 2011. I generally create all input data via Pig, as this is considerably more flexible than the existing data-wrangling utilities in Mahout. I have a Pig script that generates a tf-idf doc-term matrix from input data and partitions its rows evenly across a specifiable number of part files. Unfortunately, I haven't looked into other mechanisms (e.g. modifying rowid) to achieve this.

Andy
@sagemintblue
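For readers in Nick's position below (a single Matrix file and no Pig pipeline), here is a minimal illustrative splitter. It is not a Mahout utility and not from this thread: it assumes the rowid output layout described downthread (IntWritable row ids, VectorWritable rows) and round-robins rows into a chosen number of part files so the input splits stay balanced.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.VectorWritable;

    // Illustrative splitter, not a Mahout tool: rebalances one
    // (IntWritable, VectorWritable) SequenceFile into N part files.
    public class SplitMatrix {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);        // e.g. output/matrix/Matrix
        Path outDir = new Path(args[1]);    // e.g. output/matrix-split
        int parts = Integer.parseInt(args[2]);

        // One writer per target part file.
        SequenceFile.Writer[] writers = new SequenceFile.Writer[parts];
        for (int i = 0; i < parts; i++) {
          writers[i] = SequenceFile.createWriter(fs, conf,
              new Path(outDir, String.format("part-r-%05d", i)),
              IntWritable.class, VectorWritable.class);
        }

        // Round-robin rows across parts so each input split gets
        // roughly the same number of documents.
        IntWritable key = new IntWritable();
        VectorWritable row = new VectorWritable();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
        try {
          int n = 0;
          while (reader.next(key, row)) {
            writers[n++ % parts].append(key, row);
          }
        } finally {
          reader.close();
          for (SequenceFile.Writer w : writers) {
            w.close();
          }
        }
      }
    }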
On Wed, Nov 14, 2012 at 9:02 AM, Nick Woodward <[email protected]> wrote:
>
> Andy,
>
> Yes, I see now that 'rowid' creates a single Matrix file (not
> part-r-xxxxx files). This big file is probably getting passed to all of
> the mappers. I searched for this issue and saw a mod to rowid from Dan
> Helm in July that creates separate part files for input to 'cvb'. Has
> this been incorporated into 0.8, or will I need to write something to
> split up the Matrix file into parts?
>
> Nick
>
> > Date: Tue, 13 Nov 2012 11:12:56 -0800
> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > From: [email protected]
> > To: [email protected]
> >
> > Nick,
> >
> > Make sure your matrix part files are balanced in terms of the number
> > of vectors (docs) per part file. The organization of the CVB
> > computation will benefit from balanced input splits to the map phase
> > of each iteration's job. If you can increase the number of mappers
> > (by increasing the number of input splits) you may see improved
> > throughput.
> >
> > Andy
> >
> > On Tue, Nov 13, 2012 at 9:04 AM, Nick Woodward <[email protected]> wrote:
> > >
> > > Dan,
> > >
> > > Thank you. Specifying the matrix folder did the trick. After a few
> > > test runs I'm figuring out that I will have to dramatically reduce
> > > the size of my corpus. Running LDA on a 200 MB chunk of the corpus
> > > took more than 24 hours, and the total corpus is almost 2 GB.
> > > Combining LDA runs from slices of the corpus doesn't sound very
> > > feasible, so I'll have to rethink my approach. But thanks again for
> > > the help.
> > >
> > > Nick
> > >
> > > > Date: Mon, 12 Nov 2012 09:27:32 -0800
> > > > From: [email protected]
> > > > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > To: [email protected]
> > > > CC: [email protected]
> > > >
> > > > CVB requires the vector input to be Key=IntWritable,
> > > > Value=VectorWritable. rowid will convert the seq2sparse output to
> > > > this format, as you assumed. But when you ran rowid, I assume the
> > > > vector output was written to this file: output/matrix/Matrix
> > > >
> > > > So, try running CVB like this:
> > > >
> > > > mahout cvb -i output/matrix/Matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7
> > > >
> > > > rowid creates a Matrix file, but it also creates a docIndex file
> > > > that maps the original sparse vector keys (Text) to the integer
> > > > ids that rowid created. I'm guessing CVB is blowing up because it
> > > > is trying to process that docIndex file. So explicitly specify the
> > > > Matrix file as input to CVB, or move the docIndex file to some
> > > > other folder as I have done (before starting CVB):
> > > >
> > > > http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
> > > >
> > > > You may also want to check that output/matrix/Matrix contains data, e.g.,
> > > >
> > > > mahout seqdumper -s output/matrix/Matrix
> > > >
> > > > Dan
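Dan's "move the docIndex aside" step can also be done from code rather than the shell. A trivial sketch (not from this thread) using the standard Hadoop FileSystem API, equivalent to hadoop fs -mv; the paths mirror the ones in this thread and the destination name is arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MoveDocIndex {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Relocate docIndex so a directory-level CVB input sees only Matrix.
        Path docIndex = new Path("output/matrix/docIndex");
        Path saved = new Path("output/docIndex-saved");  // arbitrary destination
        if (fs.exists(docIndex)) {
          fs.rename(docIndex, saved);
        }
      }
    }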
> > > > ________________________________
> > > > From: Nick Woodward <[email protected]>
> > > > To: [email protected]
> > > > Sent: Monday, November 12, 2012 11:52 AM
> > > > Subject: RE: Converting one large text file with multiple documents to SequenceFile format
> > > >
> > > > Diego,
> > > >
> > > > Thank you for your response. There was no matrix folder in
> > > > /tf-vectors from seq2sparse, so I created one with "mahout rowid
> > > > -i output/tf-vectors -o output/matrix". Then I tried cvb again
> > > > with the matrix folder: "mahout cvb -i output/matrix -dict
> > > > output/dictionary.file-0 -o topics -dt documents -mt states -ow
> > > > -k 100 --num_reduce_tasks 7". The results were similar, though
> > > > this time it failed after 1%.
> > > >
> > > > c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
> > > > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > > > Warning: $HADOOP_HOME is deprecated.
> > > > Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
> > > > MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
> > > > Warning: $HADOOP_HOME is deprecated.
> > > > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> > > > 12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
> > > > 12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored topics. Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
> > > > p(topic|docId) will be stored documents
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
> > > > 12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
> > > > 12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
> > > > 12/11/12 09:35:17 INFO mapred.JobClient:  map 0% reduce 0%
> > > > 12/11/12 10:25:25 INFO mapred.JobClient:  map 1% reduce 0%
> > > > 12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
> > > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> > > >     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > > >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > > Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601 seconds. Killing!
> > > >
> > > > Any ideas?
> > > >
> > > > Regards,
> > > > Nick
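Note the "Total input paths to process : 2" line above: the job is reading both Matrix and docIndex, and docIndex's Text values are what trigger the cast failure, exactly as Dan suspected. Dan's seqdumper check works fine; purely for illustration, here is a hypothetical sketch that prints each file's key/value classes so a mismatch like this shows up before an hour of wasted map time. The class name is mine, not a Mahout tool.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class CheckKeyValueTypes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // CVB's mapper needs IntWritable keys and VectorWritable values;
        // any file under the input dir reporting other classes will break the cast.
        for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
          if (stat.isDir()) {   // Hadoop 1.x-era API
            continue;
          }
          SequenceFile.Reader reader =
              new SequenceFile.Reader(fs, stat.getPath(), conf);
          try {
            System.out.println(stat.getPath() + ": key=" + reader.getKeyClassName()
                + " value=" + reader.getValueClassName());
          } finally {
            reader.close();
          }
        }
      }
    }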
> > > > > From: [email protected]
> > > > > Date: Mon, 12 Nov 2012 13:21:21 +0100
> > > > > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > > To: [email protected]
> > > > >
> > > > > Dear Nick,
> > > > >
> > > > > I experienced the same problem. The fact is that when you call
> > > > > cvb, it expects as input the matrix folder inside the output
> > > > > folder of seq2sparse, so instead of
> > > > >
> > > > > mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> > > > >
> > > > > please try:
> > > > >
> > > > > mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> > > > >
> > > > > let me know if it solved ;)
> > > > >
> > > > > cheers,
> > > > > Diego
> > > > > On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> > > > > >
> > > > > > Diego,
> > > > > >
> > > > > > Thank you so much for the script. I used it to convert my large
> > > > > > text file to a sequence file. I have been trying to use the
> > > > > > sequence file to feed Mahout's LDA implementation (Mahout 0.7,
> > > > > > so the CVB implementation). I first converted the sequence file
> > > > > > to vectors with "mahout seq2sparse -i input/processedaa.seq -o
> > > > > > output -ow -wt tf -nr 7" and then ran the LDA with "mahout cvb
> > > > > > -i output/tf-vectors -dict output/dictionary.file-0 -o topics
> > > > > > -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10".
> > > > > > The seq2sparse command produces the tf vectors all right, but
> > > > > > the problem is that no matter what parameters I use, the LDA
> > > > > > job sits at map 0% reduce 0% for an hour before outputting the
> > > > > > error below. It fails casting Text to IntWritable. My question
> > > > > > is: when you said that the key is the line number, what
> > > > > > variable type is the key? Is it Text?
> > > > > >
> > > > > > My output:
> > > > > >
> > > > > > 12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > > > > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > > > > > 12/11/11 16:10:53 INFO mapred.JobClient:  map 0% reduce 0%
> > > > > > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > > > > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > > > > >     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > > > > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > > > > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > > > > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > > > > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > > > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > > > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > > > > >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > > > >
> > > > > > Thank you again for your help!
> > > > > > Nick
> > > > > >
> > > > > >> From: [email protected]
> > > > > >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> > > > > >> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > > >> To: [email protected]
> > > > > >>
> > > > > >> Hi Nick,
> > > > > >> I had exactly the same problem ;)
> > > > > >> I wrote a simple command line utility to create a sequence
> > > > > >> file where each line of the input document is an entry
> > > > > >> (the key is the line number).
> > > > > >>
> > > > > >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> > > > > >>
> > > > > >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq
> > > > > >>
> > > > > >> enjoy ;)
> > > > > >> Diego
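Since the Dropbox link above may not last: here is a hypothetical sketch of the same idea, not Diego's actual lda-helper source. It also bears on Nick's type question: the "Text cannot be cast to IntWritable" error quoted above indicates the line-number key is written as Text, i.e. the same (Text docId, Text body) layout that seqdirectory produces. seq2sparse accepts that, and rowid then converts the Text keys into the IntWritable ids CVB needs.

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch of a lines-to-SequenceFile converter (not the lda-helper code):
    // each input line becomes one (Text lineNumber, Text line) entry.
    public class LinesToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            new Path(args[1]), Text.class, Text.class);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
          String line;
          long lineNo = 0;
          while ((line = in.readLine()) != null) {
            writer.append(new Text(Long.toString(lineNo++)), new Text(line));
          }
        } finally {
          in.close();
          writer.close();
        }
      }
    }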
> > > > > >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde <[email protected]> wrote:
> > > > > >> > I don't think you need that. Just a simple mapper:
> > > > > >> >
> > > > > >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> > > > > >> >
> > > > > >> >   @Override
> > > > > >> >   protected void map(LongWritable key, Text value, Context context)
> > > > > >> >       throws IOException, InterruptedException {
> > > > > >> >     // Lines look like "docId{tab}document text"; emit docId as the key.
> > > > > >> >     // Limit the split to 2 so tabs inside the document body survive.
> > > > > >> >     String[] fields = value.toString().split("\t", 2);
> > > > > >> >     if (fields.length >= 2) {
> > > > > >> >       context.write(new Text(fields[0]), new Text(fields[1]));
> > > > > >> >     }
> > > > > >> >   }
> > > > > >> > }
> > > > > >> >
> > > > > >> > and then run a simple job:
> > > > > >> >
> > > > > >> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> > > > > >> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> > > > > >> >     Text.class, Text.class, SequenceFileOutputFormat.class);
> > > > > >> >
> > > > > >> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> > > > > >> > text2SequenceFileJob.setOutputValueClass(Text.class);
> > > > > >> > text2SequenceFileJob.setNumReduceTasks(0);
> > > > > >> >
> > > > > >> > text2SequenceFileJob.waitForCompletion(true);
> > > > > >> >
> > > > > >> > Cheers!
> > > > > >> > Charly
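Charly's snippet assumes a class extending Mahout's AbstractJob, which is where prepareJob, getInputPath, and getOutputPath come from. For anyone not using AbstractJob, a plain-Hadoop driver for the same mapper might look like the sketch below; it assumes IdentityMapper has been pulled out as a top-level class, and the driver's class name is mine. With zero reduce tasks, the mapper output is written directly as SequenceFile part files, one per input split.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class Text2SequenceFileDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "text2seqfile");  // Hadoop 1.x-era constructor
        job.setJarByClass(Text2SequenceFileDriver.class);
        job.setMapperClass(IdentityMapper.class); // Charly's mapper, made top-level
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);                 // map-only job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }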
> > > > > >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
> > > > > >> >>
> > > > > >> >> Yeah, I've looked at filter classes, but nothing worked. I
> > > > > >> >> guess I'll do something similar and continuously save each
> > > > > >> >> line into a file and then run seqdirectory. The running time
> > > > > >> >> won't look good, but at least it should work. Thanks for the
> > > > > >> >> response.
> > > > > >> >>
> > > > > >> >> Nick
> > > > > >> >>
> > > > > >> >> > From: [email protected]
> > > > > >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > > > > >> >> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > > >> >> > To: [email protected]
> > > > > >> >> >
> > > > > >> >> > I had the exact same issue. I tried to use the seqdirectory
> > > > > >> >> > command with a different filter class, but it did not work;
> > > > > >> >> > it seems there's a bug in the mahout-0.6 code.
> > > > > >> >> >
> > > > > >> >> > I ended up writing a custom map-reduce program that does
> > > > > >> >> > just that.
> > > > > >> >> >
> > > > > >> >> > Greetings!
> > > > > >> >> > Charly
> > > > > >> >> >
> > > > > >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
> > > > > >> >> > >
> > > > > >> >> > > I have done a lot of searching on the web for this, but
> > > > > >> >> > > I've found nothing, even though I feel like it has to be
> > > > > >> >> > > somewhat common. In the past I have used Mahout's
> > > > > >> >> > > 'seqdirectory' command to convert a folder containing
> > > > > >> >> > > text files (each file a separate document). But in this
> > > > > >> >> > > case there are so many documents (in the 100,000s) that I
> > > > > >> >> > > have one very large text file in which each line is a
> > > > > >> >> > > document. How can I convert this large file to
> > > > > >> >> > > SequenceFile format so that Mahout understands that each
> > > > > >> >> > > line should be considered a separate document? Would it
> > > > > >> >> > > be better if the file was structured like so:
> > > > > >> >> > >
> > > > > >> >> > > docId1 {tab} document text
> > > > > >> >> > > docId2 {tab} document text
> > > > > >> >> > > docId3 {tab} document text
> > > > > >> >> > > ...
> > > > > >> >> > >
> > > > > >> >> > > Thank you very much for any help.
> > > > > >> >> > > Nick
> > > > > >>
> > > > > >> --
> > > > > >> Computers are useless. They can only give you answers.
> > > > > >> (Pablo Picasso)
> > > > > >> _______________
> > > > > >> Diego Ceccarelli
> > > > > >> High Performance Computing Laboratory
> > > > > >> Information Science and Technologies Institute (ISTI)
> > > > > >> Italian National Research Council (CNR)
> > > > > >> Via Moruzzi, 1
> > > > > >> 56124 - Pisa - Italy
> > > > > >>
> > > > > >> Phone: +39 050 315 3055
> > > > > >> Fax: +39 050 315 2040
> > > > > >> ________________________________________
