Dan,

Thank you. Specifying the matrix folder did the trick. After a few test runs it's clear that I will have to dramatically reduce the size of my corpus: running LDA on a 200 MB chunk took more than 24 hours, and the total corpus is almost 2 GB. Combining LDA runs from separate slices of the corpus doesn't sound feasible, so I'll have to rethink my approach. But thanks again for the help.

Nick
> Date: Mon, 12 Nov 2012 09:27:32 -0800
> From: [email protected]
> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> To: [email protected]
> CC: [email protected]
>
> CVB requires the vector input to be Key=IntWritable, Value=VectorWritable.
> rowid will convert the seq2sparse output to this format, as you assumed. But
> when you ran rowid, I assume the vector output was written to this file:
> output/matrix/Matrix
>
> So, try running CVB like this:
>
> mahout cvb -i output/matrix/Matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7
>
> rowid creates a Matrix file, but it also creates a docIndex file that maps the
> original sparse vector keys (Text) to the integer ids that rowid created.
> I'm guessing CVB is blowing up because it is trying to process that docIndex
> file. So explicitly specify the Matrix file as input to CVB, or move the
> docIndex file to some other folder, as I have done (before starting CVB):
>
> http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
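>
> For example, something along these lines (the folder name here is just an example):
>
>   hadoop fs -mkdir output/docIndex-moved
>   hadoop fs -mv output/matrix/docIndex output/docIndex-moved/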
>
> You may also want to check that output/matrix/Matrix contains data, e.g.:
>
> mahout seqdumper -s output/matrix/Matrix
>
> Dan
>
>
> ________________________________
> From: Nick Woodward <[email protected]>
> To: [email protected]
> Sent: Monday, November 12, 2012 11:52 AM
> Subject: RE: Converting one large text file with multiple documents to SequenceFile format
>
> Diego,
>
> Thank you for your response. There was no matrix folder in output/tf-vectors
> from seq2sparse, so I created one with "mahout rowid -i output/tf-vectors -o
> output/matrix". Then I tried cvb again with the matrix folder: "mahout cvb -i
> output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt
> states -ow -k 100 --num_reduce_tasks 7". The results were similar, though
> this time it failed after 1%.
>
> c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
> Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
> MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
> Warning: $HADOOP_HOME is deprecated.
> WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> 12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
> 12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored topics. Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
> p(topic|docId) will be stored documents
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
> 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
> 12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
> 12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
> 12/11/12 09:35:17 INFO mapred.JobClient: map 0% reduce 0%
> 12/11/12 10:25:25 INFO mapred.JobClient: map 1% reduce 0%
> 12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
>     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601 seconds. Killing!
>
> Any ideas?
>
> Regards,
> Nick
>
>
> > From: [email protected]
> > Date: Mon, 12 Nov 2012 13:21:21 +0100
> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > To: [email protected]
> >
> > Dear Nick,
> >
> > I ran into the same problem. The issue is that when you call cvb, it
> > expects as input the matrix folder inside the output folder of seq2sparse.
> > So instead of
> >
> > mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> >
> > please try:
> >
> > mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> >
> > Let me know if that solves it ;)
> >
> > cheers,
> > Diego
> >
> > On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> > >
> > > Diego,
> > >
> > > Thank you so much for the script. I used it to convert my large text file
> > > to a sequence file, and I have been trying to use that sequence file to
> > > feed Mahout's LDA implementation (Mahout 0.7, so the CVB implementation).
> > > I first converted the sequence file to vectors with "mahout seq2sparse -i
> > > input/processedaa.seq -o output -ow -wt tf -nr 7" and then ran LDA with
> > > "mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics
> > > -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10". The
> > > seq2sparse command produces the tf vectors all right, but the problem is
> > > that no matter what parameters I use, the LDA job sits at map 0% reduce 0%
> > > for an hour before outputting the error below.
> > > It has an error casting Text to IntWritable. My question is: when you said
> > > that the key is the line number, what variable type is the key? Is it Text?
> > >
> > > My output:
> > >
> > > 12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > > 12/11/11 16:10:53 INFO mapred.JobClient: map 0% reduce 0%
> > > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > >     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > >
> > > Thank you again for your help!
> > >
> > > Nick
> > >
> > >
> > >> From: [email protected]
> > >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> > >> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > >> To: [email protected]
> > >>
> > >> Hei Nick,
> > >>
> > >> I had exactly the same problem ;) I wrote a simple command line utility
> > >> to create a sequence file where each line of the input document becomes
> > >> an entry (the key is the line number).
> > >>
> > >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> > >>
> > >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq
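> > >>
> > >> In essence the utility does something like the sketch below (a
> > >> simplified version for illustration, not the actual source):
> > >>
> > >> import java.io.BufferedReader;
> > >> import java.io.FileReader;
> > >> import org.apache.hadoop.conf.Configuration;
> > >> import org.apache.hadoop.fs.FileSystem;
> > >> import org.apache.hadoop.fs.Path;
> > >> import org.apache.hadoop.io.SequenceFile;
> > >> import org.apache.hadoop.io.Text;
> > >>
> > >> public class LinesToSequenceFile {
> > >>   public static void main(String[] args) throws Exception {
> > >>     Configuration conf = new Configuration();
> > >>     FileSystem fs = FileSystem.get(conf);
> > >>     // one SequenceFile entry per input line
> > >>     SequenceFile.Writer writer = new SequenceFile.Writer(
> > >>         fs, conf, new Path(args[1]), Text.class, Text.class);
> > >>     BufferedReader in = new BufferedReader(new FileReader(args[0]));
> > >>     try {
> > >>       String line;
> > >>       long n = 0;
> > >>       while ((line = in.readLine()) != null) {
> > >>         // key = the line number (written as Text), value = the document text
> > >>         writer.append(new Text(Long.toString(n++)), new Text(line));
> > >>       }
> > >>     } finally {
> > >>       in.close();
> > >>       writer.close();
> > >>     }
> > >>   }
> > >> }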
> > >>
> > >> enjoy ;)
> > >> Diego
> > >>
> > >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde <[email protected]> wrote:
> > >> > I don't think you need that. Just a simple mapper:
> > >> >
> > >> > import java.io.IOException;
> > >> > import org.apache.hadoop.io.LongWritable;
> > >> > import org.apache.hadoop.io.Text;
> > >> > import org.apache.hadoop.mapreduce.Job;
> > >> > import org.apache.hadoop.mapreduce.Mapper;
> > >> > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> > >> > import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
> > >> >
> > >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> > >> >
> > >> >   @Override
> > >> >   protected void map(LongWritable key, Text value, Context context)
> > >> >       throws IOException, InterruptedException {
> > >> >     // each input line looks like "docId<tab>document text"
> > >> >     String[] fields = value.toString().split("\t");
> > >> >     if (fields.length >= 2) {
> > >> >       context.write(new Text(fields[0]), new Text(fields[1]));
> > >> >     }
> > >> >   }
> > >> > }
> > >> >
> > >> > and then run a simple job:
> > >> >
> > >> > // prepareJob/getInputPath/getOutputPath come from Mahout's AbstractJob,
> > >> > // so this assumes the enclosing class extends AbstractJob
> > >> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> > >> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> > >> >     Text.class, Text.class, SequenceFileOutputFormat.class);
> > >> >
> > >> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> > >> > text2SequenceFileJob.setOutputValueClass(Text.class);
> > >> > text2SequenceFileJob.setNumReduceTasks(0);
> > >> >
> > >> > text2SequenceFileJob.waitForCompletion(true);
> > >> >
> > >> > Cheers!
> > >> > Charly
> > >> >
> > >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
> > >> >>
> > >> >> Yeah, I've looked at filter classes, but nothing worked. I guess I'll
> > >> >> do something similar and continuously save each line into its own file
> > >> >> and then run seqdirectory. The running time won't look good, but at
> > >> >> least it should work.
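> > >> >> For example, something along these lines (a rough, untested sketch;
> > >> >> the file names are just placeholders):
> > >> >>
> > >> >>   mkdir docs
> > >> >>   split -l 1 -a 6 corpus.txt docs/doc-
> > >> >>   mahout seqdirectory -i docs -o corpus-seq -c UTF-8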
> > >> >>
> > >> >> Thanks for the response.
> > >> >>
> > >> >> Nick
> > >> >>
> > >> >> > From: [email protected]
> > >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > >> >> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > >> >> > To: [email protected]
> > >> >> >
> > >> >> > I had the exact same issue. I tried to use the seqdirectory command
> > >> >> > with a different filter class, but it did not work; it seems there's
> > >> >> > a bug in the mahout-0.6 code.
> > >> >> >
> > >> >> > I ended up writing a custom map-reduce program that does just that.
> > >> >> >
> > >> >> > Greetings!
> > >> >> > Charly
> > >> >> >
> > >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
> > >> >> > >
> > >> >> > > I have done a lot of searching on the web for this, but I've found
> > >> >> > > nothing, even though I feel it has to be a fairly common task. I
> > >> >> > > have used Mahout's 'seqdirectory' command in the past to convert a
> > >> >> > > folder of text files (each file a separate document). But in this
> > >> >> > > case there are so many documents (in the 100,000s) that I have one
> > >> >> > > very large text file in which each line is a document. How can I
> > >> >> > > convert this large file to SequenceFile format so that Mahout
> > >> >> > > understands that each line should be considered a separate
> > >> >> > > document? Would it be better if the file were structured like so:
> > >> >> > >
> > >> >> > > docId1 {tab} document text
> > >> >> > > docId2 {tab} document text
> > >> >> > > docId3 {tab} document text
> > >> >> > > ...
> > >> >> > >
> > >> >> > > Thank you very much for any help.
> > >> >> > >
> > >> >> > > Nick
> > >> >> >
> > >> >>
> > >>
> > >> --
> > >> Computers are useless. They can only give you answers.
> > >> (Pablo Picasso)
> > >> _______________
> > >> Diego Ceccarelli
> > >> High Performance Computing Laboratory
> > >> Information Science and Technologies Institute (ISTI)
> > >> Italian National Research Council (CNR)
> > >> Via Moruzzi, 1
> > >> 56124 - Pisa - Italy
> > >>
> > >> Phone: +39 050 315 3055
> > >> Fax: +39 050 315 2040
> > >> ________________________________________
> >
> >
> > --
> > Computers are useless. They can only give you answers.
> > (Pablo Picasso)
> > _______________
> > Diego Ceccarelli
> > High Performance Computing Laboratory
> > Information Science and Technologies Institute (ISTI)
> > Italian National Research Council (CNR)
> > Via Moruzzi, 1
> > 56124 - Pisa - Italy
> >
> > Phone: +39 050 315 3055
> > Fax: +39 050 315 2040
> > ________________________________________
