Andy,

Yes, I see now that 'rowid' creates a single Matrix file (not part-r-xxxxx 
files).  This big file is probably getting passed to all of the mappers.  I 
searched for this issue and saw a mod to rowid from Dan Helm in July that 
creates separate part files for input to 'cvb'.  Has this been incorporated 
into 0.8 or will I need to write something to split up the Matrix file into 
parts?
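
If it hasn't been, here is the kind of thing I have in mind: an identity
map/reduce over the Matrix file, so the reducer count sets the number of part
files.  This is only a rough sketch; the class name and output path are made
up, and I haven't run it yet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

// NOTE: class name and paths are illustrative only.
public class SplitMatrix {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-matrix");
    job.setJarByClass(SplitMatrix.class);
    // No mapper/reducer classes are set, so Hadoop uses the identity Mapper
    // and Reducer; the records are simply rewritten across the reducers.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(VectorWritable.class);
    // Eight part files, so up to eight mappers in each CVB iteration.
    // Hash-partitioning the integer row ids keeps the parts balanced.
    job.setNumReduceTasks(8);
    SequenceFileInputFormat.addInputPath(job, new Path("output/matrix/Matrix"));
    SequenceFileOutputFormat.setOutputPath(job, new Path("output/matrix-split"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}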

Nick


> Date: Tue, 13 Nov 2012 11:12:56 -0800
> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> From: [email protected]
> To: [email protected]
> 
> Nick,
> 
> Make sure your matrix part files are balanced in terms of number of vectors
> (docs) per part file. The organization of the CVB computation here will
> benefit from balanced input splits to the map phase of each iteration's
> job. If you can increase the number of mappers here (by increasing the
> number of input splits) you may see improved throughput.
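> 
> As a quick check, something along these lines will report how many vectors
> each part file holds (a rough sketch using the plain SequenceFile reader
> API; the class name is just for illustration, and the directory is passed
> as an argument):
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Writable;
> import org.apache.hadoop.util.ReflectionUtils;
> 
> // Sketch only: counts records per part file so you can see the balance.
> public class CountVectorsPerPart {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     for (FileStatus status : fs.listStatus(new Path(args[0]))) {
>       if (!status.getPath().getName().startsWith("part-")) {
>         continue;  // skip _SUCCESS, _logs, docIndex, and friends
>       }
>       SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
>       Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>       Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
>       long count = 0;
>       while (reader.next(key, value)) {
>         count++;
>       }
>       reader.close();
>       System.out.println(status.getPath().getName() + ": " + count + " vectors");
>     }
>   }
> }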
> 
> Andy
> 
> 
> On Tue, Nov 13, 2012 at 9:04 AM, Nick Woodward <[email protected]> wrote:
> 
> >
> > Dan,
> >
> > Thank you.  Specifying the matrix folder did the trick.  After a few
> > test runs I'm figuring out that I will have to dramatically reduce the size
> > of my corpus.  Running LDA on a 200 MB chunk of the corpus took more than
> > 24 hours, and the total corpus is almost 2 GB.  Combining LDA runs from
> > slices of the corpus doesn't sound very feasible, so I'll have to rethink
> > my approach.  But thanks again for the help.
> >
> >
> > Nick
> >
> > > Date: Mon, 12 Nov 2012 09:27:32 -0800
> > > From: [email protected]
> > > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > To: [email protected]
> > > CC: [email protected]
> > >
> > > CVB requires the vector input to be Key=IntWritable, Value=VectorWritable.
> > > rowid will convert the seq2sparse output to this format as you assumed.
> > > But when you ran rowid I assume the vector output was written to this
> > > file: output/matrix/Matrix
> > >
> > > So, try running CVB like this:
> > >
> > > mahout cvb -i output/matrix/Matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7
> > >
> > > rowid creates a Matrix file but also creates a docIndex file that maps
> > > the original sparse vector keys (Text) to the integer ids that rowid
> > > created.  I'm guessing CVB is blowing up because it is trying to process
> > > that docIndex file.  So explicitly specify the Matrix file as input to
> > > CVB, or move the docIndex file to some other folder as I have done
> > > (before starting CVB):
> > >
> > > http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
> > >
> > > You may also want to check if output/matrix/Matrix contains data, e.g.,
> > >
> > > mahout seqdumper -s output/matrix/Matrix
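> > >
> > > If you want to confirm the key/value types directly, here is a small
> > > sketch using the SequenceFile reader API (the class name is just for
> > > illustration; adjust the path as needed):
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.hadoop.io.SequenceFile;
> > >
> > > // Sketch only: prints the key/value classes stored in the Matrix file.
> > > public class CheckMatrixTypes {
> > >   public static void main(String[] args) throws Exception {
> > >     Configuration conf = new Configuration();
> > >     FileSystem fs = FileSystem.get(conf);
> > >     SequenceFile.Reader reader =
> > >         new SequenceFile.Reader(fs, new Path("output/matrix/Matrix"), conf);
> > >     // CVB wants IntWritable keys and VectorWritable values here.
> > >     System.out.println("key class:   " + reader.getKeyClassName());
> > >     System.out.println("value class: " + reader.getValueClassName());
> > >     reader.close();
> > >   }
> > > }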
> > >
> > > Dan
> > >
> > >
> > >
> > > ________________________________
> > > From: Nick Woodward <[email protected]>
> > > To: [email protected]
> > > Sent: Monday, November 12, 2012 11:52 AM
> > > Subject: RE: Converting one large text file with multiple documents to SequenceFile format
> > >
> > >
> > > Diego,
> > >
> > > Thank you for your response.  There was no matrix folder in
> > > output/tf-vectors from seq2sparse, so I created one with "mahout rowid -i
> > > output/tf-vectors -o output/matrix".  Then I tried cvb again with the
> > > matrix folder: "mahout cvb -i output/matrix -dict output/dictionary.file-0
> > > -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7".  The
> > > results were similar, though this time it failed after 1%.
> > >
> > > c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
> > > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > > Warning: $HADOOP_HOME is deprecated.
> > > Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
> > > MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
> > > Warning: $HADOOP_HOME is deprecated.
> > > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> > > 12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
> > > 12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0.  Topic model output (p(term|topic) for each topic) will be stored topics.  Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
> > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
> > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: p(topic|docId) will be stored documents
> > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
> > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
> > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
> > > 12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
> > > 12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
> > > 12/11/12 09:35:17 INFO mapred.JobClient:  map 0% reduce 0%
> > > 12/11/12 10:25:25 INFO mapred.JobClient:  map 1% reduce 0%
> > > 12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
> > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> > >         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > >         at java.security.AccessController.doPrivileged(Native Method)
> > >         at javax.security.auth.Subject.doAs(Subject.java:396)
> > >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601 seconds. Killing!
> > >
> > > Any ideas?
> > >
> > > Regards,
> > > Nick
> > >
> > >
> > > > From: [email protected]
> > > > Date: Mon, 12 Nov 2012 13:21:21 +0100
> > > > Subject: Re: Converting one large text file with multiple documents to
> > SequenceFile format
> > > > To: [email protected]
> > > >
> > > > Dear Nick,
> > > >
> > > > I experienced the same problem. When you call cvb, it expects as input
> > > > the matrix folder inside the output folder of seq2sparse, so instead of
> > > >
> > > > mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> > > >
> > > > please try:
> > > >
> > > > mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> > > >
> > > > let me know if that solves it ;)
> > > >
> > > > cheers,
> > > > Diego
> > > >
> > > > On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> > > > >
> > > > > Diego,
> > > > >
> > > > > Thank you so much for the script. I used it to convert my large text
> > > > > file to a sequence file. I have been trying to use the sequence file
> > > > > to feed Mahout's LDA implementation (Mahout 0.7, so the CVB
> > > > > implementation).  I first converted the sequence file to vectors with
> > > > > "mahout seq2sparse -i input/processedaa.seq -o output -ow -wt tf -nr 7"
> > > > > and then ran the LDA with "mahout cvb -i output/tf-vectors -dict
> > > > > output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100
> > > > > --num_reduce_tasks 7 -x 10".  The seq2sparse command produces the tf
> > > > > vectors all right, but the problem is that no matter what I use for
> > > > > parameters, the LDA job sits at map 0% reduce 0% for an hour before
> > > > > outputting the error below.  It has an error casting Text to
> > > > > IntWritable.  My question is: when you said that the key is the line
> > > > > number, what variable type is the key?  Is it Text?
> > > > >
> > > > > My output:
> > > > >
> > > > > 12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > > > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > > > > 12/11/11 16:10:53 INFO mapred.JobClient:  map 0% reduce 0%
> > > > > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > > > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > > > >         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > > > >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > > > >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > > > >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > > >         at java.security.AccessController.doPrivileged(Native Method)
> > > > >         at javax.security.auth.Subject.doAs(Subject.java:396)
> > > > >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > > > >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > > >
> > > > > Thank you again for your help!
> > > > >
> > > > > Nick
> > > > >
> > > > >
> > > > >> From: [email protected]
> > > > >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> > > > >> Subject: Re: Converting one large text file with multiple documents
> > to SequenceFile format
> > > > >> To: [email protected]
> > > > >>
> > > > >> Hi Nick,
> > > > >> I had exactly the same problem ;)
> > > > >> I wrote a simple command line utility to create a sequence
> > > > >> file where each line of the input document is an entry
> > > > >> (the key is the line number).
> > > > >>
> > > > >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> > > > >>
> > > > >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq
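> > > > >>
> > > > >> (If you would rather roll your own, the heart of the utility is
> > > > >> essentially this; a rough sketch, with the class and file names made
> > > > >> up.  It writes the line number as a Text key, since seq2sparse
> > > > >> expects Text keys:)
> > > > >>
> > > > >> import java.io.BufferedReader;
> > > > >> import java.io.FileReader;
> > > > >> import org.apache.hadoop.conf.Configuration;
> > > > >> import org.apache.hadoop.fs.FileSystem;
> > > > >> import org.apache.hadoop.fs.Path;
> > > > >> import org.apache.hadoop.io.SequenceFile;
> > > > >> import org.apache.hadoop.io.Text;
> > > > >>
> > > > >> // Sketch only: names are illustrative.
> > > > >> public class LinesToSequenceFile {
> > > > >>   public static void main(String[] args) throws Exception {
> > > > >>     Configuration conf = new Configuration();
> > > > >>     FileSystem fs = FileSystem.get(conf);
> > > > >>     SequenceFile.Writer writer = SequenceFile.createWriter(
> > > > >>         fs, conf, new Path("tweets.seq"), Text.class, Text.class);
> > > > >>     BufferedReader in = new BufferedReader(new FileReader("tweets"));
> > > > >>     long lineNo = 0;
> > > > >>     for (String line; (line = in.readLine()) != null; lineNo++) {
> > > > >>       // One document per line; the line number becomes the doc id.
> > > > >>       writer.append(new Text(Long.toString(lineNo)), new Text(line));
> > > > >>     }
> > > > >>     in.close();
> > > > >>     writer.close();
> > > > >>   }
> > > > >> }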
> > > > >>
> > > > >> enjoy ;)
> > > > >> Diego
> > > > >>
> > > > >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde <[email protected]> wrote:
> > > > >> > I don't think you need that. Just a simple mapper.
> > > > >> >
> > > > >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> > > > >> >
> > > > >> >     @Override
> > > > >> >     protected void map(LongWritable key, Text value, Context context)
> > > > >> >             throws IOException, InterruptedException {
> > > > >> >         // Each input line is "docId<TAB>document text"; emit the id
> > > > >> >         // as the key and the text as the value.
> > > > >> >         String[] fields = value.toString().split("\t");
> > > > >> >         if (fields.length >= 2) {
> > > > >> >             context.write(new Text(fields[0]), new Text(fields[1]));
> > > > >> >         }
> > > > >> >     }
> > > > >> > }
> > > > >> >
> > > > >> > and then run a simple job:
> > > > >> >
> > > > >> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> > > > >> >         this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> > > > >> >         Text.class, Text.class, SequenceFileOutputFormat.class);
> > > > >> >
> > > > >> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> > > > >> > text2SequenceFileJob.setOutputValueClass(Text.class);
> > > > >> > text2SequenceFileJob.setNumReduceTasks(0);  // map-only, no shuffle needed
> > > > >> >
> > > > >> > text2SequenceFileJob.waitForCompletion(true);
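> > > > >> >
> > > > >> > (prepareJob and getInputPath/getOutputPath come from Mahout's
> > > > >> > AbstractJob base class, so this assumes your driver extends it;
> > > > >> > with plain Hadoop you would configure the Job by hand instead.)
> > > > >> >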
> > > > >> > Cheers!
> > > > >> > Charly
> > > > >> >
> > > > >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
> > > > >> >
> > > > >> >>
> > > > >> >> Yeah, I've looked at filter classes, but nothing worked.  I guess
> > > > >> >> I'll do something similar and continuously save each line into a
> > > > >> >> file and then run seqdirectory.  The running time won't look good,
> > > > >> >> but at least it should work.  Thanks for the response.
> > > > >> >>
> > > > >> >> Nick
> > > > >> >>
> > > > >> >> > From: [email protected]
> > > > >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > > > >> >> > Subject: Re: Converting one large text file with multiple
> > documents to
> > > > >> >> SequenceFile format
> > > > >> >> > To: [email protected]
> > > > >> >> >
> > > > >> >> > I had the exact same issue, and I tried to use the seqdirectory
> > > > >> >> > command with a different filter class, but it did not work.  It
> > > > >> >> > seems there's a bug in the mahout-0.6 code.
> > > > >> >> >
> > > > >> >> > I ended up writing a custom map-reduce program that performs
> > > > >> >> > just that.
> > > > >> >> >
> > > > >> >> > Greetings!
> > > > >> >> > Charly
> > > > >> >> >
> > > > >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
> > > > >> >> >
> > > > >> >> > >
> > > > >> >> > > I have done a lot of searching on the web for this, but I've
> > > > >> >> > > found nothing, even though I feel like it has to be somewhat
> > > > >> >> > > common. I have used Mahout's 'seqdirectory' command to convert
> > > > >> >> > > a folder containing text files (each file is a separate
> > > > >> >> > > document) in the past. But in this case there are so many
> > > > >> >> > > documents (in the 100,000s) that I have one very large text
> > > > >> >> > > file in which each line is a document. How can I convert this
> > > > >> >> > > large file to SequenceFile format so that Mahout understands
> > > > >> >> > > that each line should be considered a separate document?
> > > > >> >> > > Would it be better if the file was structured like so?
> > > > >> >> > >
> > > > >> >> > > docId1 {tab} document text
> > > > >> >> > > docId2 {tab} document text
> > > > >> >> > > docId3 {tab} document text
> > > > >> >> > >
> > > > >> >> > > Thank you very much for any help.
> > > > >> >> > >
> > > > >> >> > > Nick
> > > > >> >> > >
> > > > >> >>
> > > > >> >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Computers are useless. They can only give you answers.
> > > > >> (Pablo Picasso)
> > > > >> _______________
> > > > >> Diego Ceccarelli
> > > > >> High Performance Computing Laboratory
> > > > >> Information Science and Technologies Institute (ISTI)
> > > > >> Italian National Research Council (CNR)
> > > > >> Via Moruzzi, 1
> > > > >> 56124 - Pisa - Italy
> > > > >>
> > > > >> Phone: +39 050 315 3055
> > > > >> Fax: +39 050 315 2040
> > > > >> ________________________________________
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Computers are useless. They can only give you answers.
> > > > (Pablo Picasso)
> > > > _______________
> > > > Diego Ceccarelli
> > > > High Performance Computing Laboratory
> > > > Information Science and Technologies Institute (ISTI)
> > > > Italian National Research Council (CNR)
> > > > Via Moruzzi, 1
> > > > 56124 - Pisa - Italy
> > > >
> > > > Phone: +39 050 315 3055
> > > > Fax: +39 050 315 2040
> > > > ________________________________________
> >
> >