Diego,

Thank you for your response. There was no matrix folder in output/tf-vectors
from seq2sparse, so I created one with "mahout rowid -i output/tf-vectors -o
output/matrix". Then I tried cvb again with the matrix folder: "mahout cvb -i
output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt
states -ow -k 100 --num_reduce_tasks 7 -x 150". The results were similar,
though this time it failed after reaching map 1%.
c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
Warning: $HADOOP_HOME is deprecated.
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored topics. Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
p(topic|docId) will be stored documents
12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
12/11/12 09:35:17 INFO mapred.JobClient:  map 0% reduce 0%
12/11/12 10:25:25 INFO mapred.JobClient:  map 1% reduce 0%
12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
        at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601
seconds. Killing!
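
If I understand rowid correctly, it writes both a matrix file and a docIndex
file under output/matrix, so I wonder whether cvb is reading the docIndex as
well; its values would be Text rather than VectorWritable, and that would
also fit the "Total input paths to process : 2" line above. A quick way to
check would be to dump the key/value classes each file declares in its
header, something like this (an untested sketch against the Hadoop 1.x API;
the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileClasses {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Print the key/value classes recorded in each file's header,
    // e.g. for output/matrix/matrix and output/matrix/docIndex.
    for (String arg : args) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(arg), conf);
      System.out.println(arg + ": " + reader.getKeyClassName()
          + " / " + reader.getValueClassName());
      reader.close();
    }
  }
}
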
Any ideas?
Regards,
Nick
> From: [email protected]
> Date: Mon, 12 Nov 2012 13:21:21 +0100
> Subject: Re: Converting one large text file with multiple documents to
> SequenceFile format
> To: [email protected]
>
> Dear Nick,
>
> I experienced the same problem. When you call cvb, it expects as input the
> matrix folder inside the output folder of seq2sparse, so instead of
>
> mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o
> topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
>
> please try:
>
> mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0
> -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x
> 10
>
> let me know if that solves it ;)
>
> cheers,
> Diego
>
> On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> >
> > Diego,
> >
> > Thank you so much for the script. I used it to convert my large text
> > file to a sequence file, and I have been trying to use that sequence
> > file to feed Mahout's LDA implementation (Mahout 0.7, so the CVB
> > implementation). I first converted the sequence file to vectors with
> > "mahout seq2sparse -i input/processedaa.seq -o output -ow -wt tf -nr 7"
> > and then ran the LDA with "mahout cvb -i output/tf-vectors -dict
> > output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100
> > --num_reduce_tasks 7 -x 10". The seq2sparse command produces the tf
> > vectors alright, but no matter what parameters I use, the LDA job sits
> > at map 0% reduce 0% for an hour before printing the error below, a
> > ClassCastException from Text to IntWritable. My question: when you said
> > that the key is the line number, what type is the key? Is it Text?
> >
> > My output:
> >
> > "12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > 12/11/11 16:10:53 INFO mapred.JobClient:  map 0% reduce 0%
> > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> >         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)"
> >
> > Thank you again for your help!
> >
> > Nick
> >
> >
> >> From: [email protected]
> >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> >> Subject: Re: Converting one large text file with multiple documents to
> >> SequenceFile format
> >> To: [email protected]
> >>
> >> Hei Nick,
> >> I had exactly the same problem ;)
> >> I wrote a simple command-line utility to create a sequence file where
> >> each line of the input document is an entry (the key is the line number).
> >>
> >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> >>
> >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI
> >> -input tweets -output tweets.seq
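> >>
> >> The core of the utility is just something like this (a rough sketch from
> >> memory, not the exact code in the jar; Hadoop and java.io imports
> >> omitted). It writes the line number as a Text key:
> >>
> >> Configuration conf = new Configuration();
> >> FileSystem fs = FileSystem.get(conf);
> >> SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
> >>     new Path("tweets.seq"), Text.class, Text.class);
> >> BufferedReader in = new BufferedReader(new FileReader("tweets"));
> >> long lineNo = 0;
> >> // one entry per input line: key = line number, value = line text
> >> for (String line = in.readLine(); line != null; line = in.readLine()) {
> >>   writer.append(new Text(Long.toString(lineNo++)), new Text(line));
> >> }
> >> in.close();
> >> writer.close();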
> >>
> >> enjoy ;)
> >> Diego
> >>
> >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde
> >> <[email protected]> wrote:
> >> > I don't think you need that. Just a simple mapper.
> >> >
> >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> >> >
> >> >   @Override
> >> >   protected void map(LongWritable key, Text value, Context context)
> >> >       throws IOException, InterruptedException {
> >> >     // input line is "docId<TAB>document text"; emit docId -> text
> >> >     // (limit the split to 2 so tabs inside the text are preserved)
> >> >     String[] fields = value.toString().split("\t", 2);
> >> >     if (fields.length >= 2) {
> >> >       context.write(new Text(fields[0]), new Text(fields[1]));
> >> >     }
> >> >   }
> >> > }
> >> >
> >> > and then run a simple job..
> >> >
> >> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> >> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> >> >     Text.class, Text.class, SequenceFileOutputFormat.class);
> >> >
> >> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> >> > text2SequenceFileJob.setOutputValueClass(Text.class);
> >> > text2SequenceFileJob.setNumReduceTasks(0);
> >> >
> >> > text2SequenceFileJob.waitForCompletion(true);
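> >> >
> >> > (prepareJob() and getInputPath()/getOutputPath() come from Mahout's
> >> > AbstractJob, so this assumes the driver class extends it and is run
> >> > through ToolRunner.)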
> >> >
> >> > Cheers!
> >> > Charly
> >> >
> >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]>
> >> > wrote:
> >> >
> >> >>
> >> >> Yeah, I've looked at filter classes, but nothing worked. I guess I'll
> >> >> do something similar: save each line into its own file and then run
> >> >> seqdirectory. The running time won't look good, but at least it should
> >> >> work. Thanks for the response.
> >> >>
> >> >> Nick
> >> >>
> >> >> > From: [email protected]
> >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> >> >> > Subject: Re: Converting one large text file with multiple documents
> >> >> > to SequenceFile format
> >> >> > To: [email protected]
> >> >> >
> >> >> > I had the exact same issue. I tried to use the seqdirectory command
> >> >> > with a different filter class, but it did not work; it seems there's
> >> >> > a bug in the mahout-0.6 code.
> >> >> >
> >> >> > I ended up writing a custom map-reduce program that does just that.
> >> >> >
> >> >> > Greetings!
> >> >> > Charly
> >> >> >
> >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]>
> >> >> > wrote:
> >> >> >
> >> >> > >
> >> >> > > I have done a lot of searching on the web for this, but I've found
> >> >> > > nothing, even though I feel like it has to be a fairly common need.
> >> >> > > In the past I have used Mahout's 'seqdirectory' command to convert
> >> >> > > a folder containing text files (each file a separate document). But
> >> >> > > in this case there are so many documents (in the 100,000s) that I
> >> >> > > have one very large text file in which each line is a document. How
> >> >> > > can I convert this large file to SequenceFile format so that Mahout
> >> >> > > understands that each line should be considered a separate
> >> >> > > document? Would it be better if the file were structured like so:
> >> >> > >
> >> >> > > docId1 {tab} document text
> >> >> > > docId2 {tab} document text
> >> >> > > docId3 {tab} document text
> >> >> > > ...
> >> >> > >
> >> >> > > Thank you very much for any help.
> >> >> > >
> >> >> > > Nick
> >> >> > >
> >> >>
> >> >>
> >>
> >>
> >>
> >> --
> >> Computers are useless. They can only give you answers.
> >> (Pablo Picasso)
> >> _______________
> >> Diego Ceccarelli
> >> High Performance Computing Laboratory
> >> Information Science and Technologies Institute (ISTI)
> >> Italian National Research Council (CNR)
> >> Via Moruzzi, 1
> >> 56124 - Pisa - Italy
> >>
> >> Phone: +39 050 315 3055
> >> Fax: +39 050 315 2040
> >> ________________________________________
> >
>
>
>
> --
> Computers are useless. They can only give you answers.
> (Pablo Picasso)
> _______________
> Diego Ceccarelli
> High Performance Computing Laboratory
> Information Science and Technologies Institute (ISTI)
> Italian National Research Council (CNR)
> Via Moruzzi, 1
> 56124 - Pisa - Italy
>
> Phone: +39 050 315 3055
> Fax: +39 050 315 2040
> ________________________________________