Re: Converting one large text file with multiple documents to SequenceFile format

Andy Schlaikjer Tue, 13 Nov 2012 11:13:26 -0800

Nick,

Make sure your matrix part files are balanced in terms of number of vectors
(docs) per part file. The organization of the CVB computation here will
benefit from balanced input splits to the map phase of each iteration's
job. If you can increase the number of mappers here (by increasing the
number of input splits) you may see improved throughput.
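
As a rough illustration of what "balanced" means here, the sketch below computes how many vectors each part file would hold if the docs were spread as evenly as possible. It is plain Java with no Mahout or Hadoop calls; the class name BalancedParts and the method partSizes are made up for this example, and actually rewriting the part files is left out.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout or Hadoop: it only does the
// arithmetic for an even spread of vectors across part files.
class BalancedParts {

    /** Returns per-part vector counts that differ from each other by at most one. */
    static int[] partSizes(int numVectors, int numParts) {
        if (numParts <= 0) {
            throw new IllegalArgumentException("numParts must be positive");
        }
        if (numVectors < 0) {
            throw new IllegalArgumentException("numVectors must not be negative");
        }
        int[] sizes = new int[numParts];
        int base = numVectors / numParts;      // every part gets at least this many
        int remainder = numVectors % numParts; // the first 'remainder' parts get one extra
        for (int i = 0; i < numParts; i++) {
            sizes[i] = base + (i < remainder ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Example figures only: 100,000 docs spread over 7 part files.
        System.out.println(Arrays.toString(partSizes(100_000, 7)));
    }
}
```

If the part files you actually have differ a lot from counts like these, some mappers will get far more docs than others and each iteration will wait on the slowest one.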


Andy


On Tue, Nov 13, 2012 at 9:04 AM, Nick Woodward <[email protected]> wrote:

>
> Dan, Thank you.  Specifying the matrix folder did the trick.  After a few
> test runs I'm figuring out that I will have to dramatically reduce the size
> of my corpus.  Running LDA on a 200 MB chunk of the corpus took more than
> 24 hours, and the total corpus is almost 2 GB.  Combining LDA runs from
> slices of the corpus doesn't sound very feasible, so I'll have to rethink
> my approach.  But thanks again for the help.
>
>
> Nick
>
> > Date: Mon, 12 Nov 2012 09:27:32 -0800
> > From: [email protected]
> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > To: [email protected]
> > CC: [email protected]
> >
> > CVB requires the vector input to be Key=IntWritable,
> > Value=VectorWritable.  rowid will convert the seq2sparse output to this
> > format as you assumed.  But when you ran rowid I assume the vector output
> > was written to this file: output/matrix/Matrix
> >
> > So, try running CVB like this:
> >
> > mahout cvb -i output/matrix/Matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7
> >
> > rowid creates a Matrix file but also creates a docIndex file that maps
> > the original sparse vector keys (Text) to the integer ids that rowid
> > created.  I'm guessing CVB is blowing up because it is trying to process
> > that docIndex file.  So explicitly specify the Matrix file as input to CVB
> > or move the docIndex file to some other folder as I have done (before
> > starting CVB):
> >
> > http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
> >
> > You may also want to check if output/matrix/Matrix  contains data, e.g.,
> >
> > mahout seqdumper -s output/matrix/Matrix
> >
> > Dan
> >
> >
> >
> > ________________________________
> >  From: Nick Woodward <[email protected]>
> > To: [email protected]
> > Sent: Monday, November 12, 2012 11:52 AM
> > Subject: RE: Converting one large text file with multiple documents to SequenceFile format
> >
> >
> > Diego, Thank you for your response.  There was no matrix folder in
> > /tf-vectors from seq2sparse so I created it with this, "mahout rowid -i
> > output/tf-vectors -o output/matrix".  Then I tried cvb again with the
> > matrix folder, "mahout cvb -i output/matrix -dict output/dictionary.file-0
> > -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7".  The
> > results were similar, though this time it failed after 1%.
> >
> > c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
> > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > Warning: $HADOOP_HOME is deprecated.
> > Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
> > MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
> > Warning: $HADOOP_HOME is deprecated.
> > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> > 12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
> > 12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0.  Topic model output (p(term|topic) for each topic) will be stored topics.  Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
> > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
> > p(topic|docId) will be stored documents
> > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
> > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
> > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
> > 12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
> > 12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
> > 12/11/12 09:35:17 INFO mapred.JobClient:  map 0% reduce 0%
> > 12/11/12 10:25:25 INFO mapred.JobClient:  map 1% reduce 0%
> > 12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
> > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> >         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601 seconds. Killing!
> >
> > Any ideas?
> > Regards,
> > Nick
> >
> >
> > > From: [email protected]
> > > Date: Mon, 12 Nov 2012 13:21:21 +0100
> > > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > To: [email protected]
> > >
> > > Dear Nick,
> > >
> > > I experienced the same problem. When you call cvb, it expects as input
> > > the matrix folder inside the output folder of seq2sparse,
> > > so instead of
> > >
> > > mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o
> > > topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> > >
> > > please try:
> > >
> > > mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0
> > > -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x
> > > 10
> > >
> > > let me know if that solves it ;)
> > >
> > > cheers,
> > > Diego
> > >
> > > On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> > > >
> > > > Diego,
> > > >
> > > > Thank you so much for the script. I used it to convert my large text
> > > > file to a sequence file. I have been trying to use the sequence file to
> > > > feed Mahout's LDA implementation (Mahout 0.7 so the CVB implementation).
> > > > I first converted the sequence file to vectors with this,
> > > > "mahout seq2sparse -i input/processedaa.seq -o output -ow -wt tf -nr 7"
> > > > and then ran the LDA with this,
> > > > "mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10".
> > > > The seq2sparse command produces the tf vectors alright, but the problem
> > > > is that no matter what I use for parameters, the LDA job sits at map 0%
> > > > reduce 0% for an hour before outputting the error below.  It has an
> > > > error casting Text to IntWritable.  My question is when you said that
> > > > the key is the line number, what variable type is the key?  Is it Text?
> > > >
> > > > My output...
> > > > "12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > > > 12/11/11 16:10:53 INFO mapred.JobClient:  map 0% reduce 0%
> > > > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > > >         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > > >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > > >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > > >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > >         at java.security.AccessController.doPrivileged(Native Method)
> > > >         at javax.security.auth.Subject.doAs(Subject.java:396)
> > > >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > > >         at org.apache.hadoop.mapred.Child.main(Child.java:249)"
> > > >
> > > > Thank you again for your help!
> > > > Nick
> > > >
> > > >
> > > >> From: [email protected]
> > > >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> > > >> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > >> To: [email protected]
> > > >>
> > > >> Hey Nick,
> > > >> I had exactly the same problem ;)
> > > >> I wrote a simple command line utility to create a sequence
> > > >> file where each line of the input document is an entry
> > > >> (the key is the line number).
> > > >>
> > > >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> > > >>
> > > >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq
> > > >>
> > > >> enjoy ;)
> > > >> Diego
> > > >>
> > > >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde
> > > >> <[email protected]> wrote:
> > > >> > I don't think you need that. Just a simple mapper.
> > > >> >
> > > >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> > > >> >
> > > >> >     @Override
> > > >> >     protected void map(LongWritable key, Text value, Context context)
> > > >> >             throws IOException, InterruptedException {
> > > >> >         // Each input line is "docId<TAB>document text"; the limit of 2
> > > >> >         // keeps any further tabs as part of the document text.
> > > >> >         String[] fields = value.toString().split("\t", 2);
> > > >> >         if (fields.length >= 2) {
> > > >> >             // Emit the doc id as the key and the text as the value.
> > > >> >             context.write(new Text(fields[0]), new Text(fields[1]));
> > > >> >         }
> > > >> >     }
> > > >> > }
> > > >> >
> > > >> > and then run a simple job..
> > > >> >
> > > >> >     Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> > > >> >             this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> > > >> >             Text.class, Text.class, SequenceFileOutputFormat.class);
> > > >> >
> > > >> >     text2SequenceFileJob.setOutputKeyClass(Text.class);
> > > >> >     text2SequenceFileJob.setOutputValueClass(Text.class);
> > > >> >     text2SequenceFileJob.setNumReduceTasks(0);
> > > >> >
> > > >> >     text2SequenceFileJob.waitForCompletion(true);
> > > >> >
> > > >> > Cheers!
> > > >> > Charly
> > > >> >
> > > >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
> > > >> >
> > > >> >>
> > > >> >> Yeah, I've looked at filter classes, but nothing worked.  I guess
> > > >> >> I'll do something similar and continuously save each line into a
> > > >> >> file and then run seqdirectory.  The running time won't look good,
> > > >> >> but at least it should work.  Thanks for the response.
> > > >> >>
> > > >> >> Nick
> > > >> >>
> > > >> >> > From: [email protected]
> > > >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > > >> >> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > >> >> > To: [email protected]
> > > >> >> >
> > > >> >> > I had the exact same issue and I tried to use the seqdirectory
> > > >> >> > command with a different filter class, but it did not work. It
> > > >> >> > seems there's a bug in the mahout-0.6 code.
> > > >> >> >
> > > >> >> > I ended up writing a custom map-reduce program that performs just
> > > >> >> > that.
> > > >> >> >
> > > >> >> > Greetings!
> > > >> >> > Charly
> > > >> >> >
> > > >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
> > > >> >> >
> > > >> >> > >
> > > >> >> > > I have done a lot of searching on the web for this, but I've found
> > > >> >> > > nothing, even though I feel like it has to be somewhat common. I have
> > > >> >> > > used Mahout's 'seqdirectory' command to convert a folder containing
> > > >> >> > > text files (each file is a separate document) in the past. But in this
> > > >> >> > > case there are so many documents (in the 100,000s) that I have one
> > > >> >> > > very large text file in which each line is a document. How can I
> > > >> >> > > convert this large file to SequenceFile format so that Mahout
> > > >> >> > > understands that each line should be considered a separate document?
> > > >> >> > > Would it be better if the file was structured like so....
> > > >> >> > >
> > > >> >> > > docId1 {tab} document text
> > > >> >> > > docId2 {tab} document text
> > > >> >> > > docId3 {tab} document text
> > > >> >> > > ...
> > > >> >> > >
> > > >> >> > > Thank you very much for any help.
> > > >> >> > >
> > > >> >> > > Nick
> > > >> >> > >
> > > >> >>
> > > >> >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> Computers are useless. They can only give you answers.
> > > >> (Pablo Picasso)
> > > >> _______________
> > > >> Diego Ceccarelli
> > > >> High Performance Computing Laboratory
> > > >> Information Science and Technologies Institute (ISTI)
> > > >> Italian National Research Council (CNR)
> > > >> Via Moruzzi, 1
> > > >> 56124 - Pisa - Italy
> > > >>
> > > >> Phone: +39 050 315 3055
> > > >> Fax: +39 050 315 2040
> > > >> ________________________________________
> > > >
> > >
> > >
> > >
> > > --
> > > Computers are useless. They can only give you answers.
> > > (Pablo Picasso)
> > > _______________
> > > Diego Ceccarelli
> > > High Performance Computing Laboratory
> > > Information Science and Technologies Institute (ISTI)
> > > Italian National Research Council (CNR)
> > > Via Moruzzi, 1
> > > 56124 - Pisa - Italy
> > >
> > > Phone: +39 050 315 3055
> > > Fax: +39 050 315 2040
> > > ________________________________________
>
>
