Nick, I helped write the cvb0 implementation with Jake Mannix over the summer of 2011. I generally create all input data via Pig, as this is considerably more flexible than the existing data-wrangling utilities in Mahout. I have a Pig script that generates a tf-idf doc-term matrix from input data and partitions its rows evenly across a specifiable number of part files. Unfortunately, I haven't looked into other mechanisms (e.g. modifying rowid) to achieve this.

Andy
@sagemintblue
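For readers in Nick's position below (a single Matrix file and no Pig pipeline), here is a minimal illustrative splitter. It is not a Mahout utility and not from this thread: it assumes the rowid output layout described downthread (IntWritable row ids, VectorWritable rows) and round-robins rows into a chosen number of part files so the input splits stay balanced.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.VectorWritable;

    // Illustrative splitter, not a Mahout tool: rebalances one
    // (IntWritable, VectorWritable) SequenceFile into N part files.
    public class SplitMatrix {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);        // e.g. output/matrix/Matrix
        Path outDir = new Path(args[1]);    // e.g. output/matrix-split
        int parts = Integer.parseInt(args[2]);

        // One writer per target part file.
        SequenceFile.Writer[] writers = new SequenceFile.Writer[parts];
        for (int i = 0; i < parts; i++) {
          writers[i] = SequenceFile.createWriter(fs, conf,
              new Path(outDir, String.format("part-r-%05d", i)),
              IntWritable.class, VectorWritable.class);
        }

        // Round-robin rows across parts so each input split gets
        // roughly the same number of documents.
        IntWritable key = new IntWritable();
        VectorWritable row = new VectorWritable();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
        try {
          int n = 0;
          while (reader.next(key, row)) {
            writers[n++ % parts].append(key, row);
          }
        } finally {
          reader.close();
          for (SequenceFile.Writer w : writers) {
            w.close();
          }
        }
      }
    }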
On Wed, Nov 14, 2012 at 9:02 AM, Nick Woodward <[email protected]> wrote:
>
> Andy,
>
> Yes, I see now that 'rowid' creates a single Matrix file (not
> part-r-xxxxx files). This big file is probably getting passed to all of
> the mappers. I searched for this issue and saw a mod to rowid from Dan
> Helm in July that creates separate part files for input to 'cvb'. Has
> this been incorporated into 0.8, or will I need to write something to
> split up the Matrix file into parts?
>
> Nick
>
> > Date: Tue, 13 Nov 2012 11:12:56 -0800
> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > From: [email protected]
> > To: [email protected]
> >
> > Nick,
> >
> > Make sure your matrix part files are balanced in terms of the number
> > of vectors (docs) per part file. The organization of the CVB
> > computation will benefit from balanced input splits to the map phase
> > of each iteration's job. If you can increase the number of mappers
> > (by increasing the number of input splits) you may see improved
> > throughput.
> >
> > Andy
> >
> > On Tue, Nov 13, 2012 at 9:04 AM, Nick Woodward <[email protected]> wrote:
> > >
> > > Dan,
> > >
> > > Thank you. Specifying the matrix folder did the trick. After a few
> > > test runs I'm figuring out that I will have to dramatically reduce
> > > the size of my corpus. Running LDA on a 200 MB chunk of the corpus
> > > took more than 24 hours, and the total corpus is almost 2 GB.
> > > Combining LDA runs from slices of the corpus doesn't sound very
> > > feasible, so I'll have to rethink my approach. But thanks again for
> > > the help.
> > >
> > > Nick
> > >
> > > > Date: Mon, 12 Nov 2012 09:27:32 -0800
> > > > From: [email protected]
> > > > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > To: [email protected]
> > > > CC: [email protected]
> > > >
> > > > CVB requires the vector input to be Key=IntWritable,
> > > > Value=VectorWritable. rowid will convert the seq2sparse output to
> > > > this format, as you assumed. But when you ran rowid, I assume the
> > > > vector output was written to this file: output/matrix/Matrix
> > > >
> > > > So, try running CVB like this:
> > > >
> > > > mahout cvb -i output/matrix/Matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7
> > > >
> > > > rowid creates a Matrix file, but it also creates a docIndex file
> > > > that maps the original sparse vector keys (Text) to the integer
> > > > ids that rowid created. I'm guessing CVB is blowing up because it
> > > > is trying to process that docIndex file. So explicitly specify the
> > > > Matrix file as input to CVB, or move the docIndex file to some
> > > > other folder as I have done (before starting CVB):
> > > >
> > > > http://comments.gmane.org/gmane.comp.apache.mahout.user/13112
> > > >
> > > > You may also want to check that output/matrix/Matrix contains data, e.g.,
> > > >
> > > > mahout seqdumper -s output/matrix/Matrix
> > > >
> > > > Dan
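Dan's "move the docIndex aside" step can also be done from code rather than the shell. A trivial sketch (not from this thread) using the standard Hadoop FileSystem API, equivalent to hadoop fs -mv; the paths mirror the ones in this thread and the destination name is arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MoveDocIndex {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Relocate docIndex so a directory-level CVB input sees only Matrix.
        Path docIndex = new Path("output/matrix/docIndex");
        Path saved = new Path("output/docIndex-saved");  // arbitrary destination
        if (fs.exists(docIndex)) {
          fs.rename(docIndex, saved);
        }
      }
    }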
> > > > ________________________________
> > > > From: Nick Woodward <[email protected]>
> > > > To: [email protected]
> > > > Sent: Monday, November 12, 2012 11:52 AM
> > > > Subject: RE: Converting one large text file with multiple documents to SequenceFile format
> > > >
> > > > Diego,
> > > >
> > > > Thank you for your response. There was no matrix folder in
> > > > /tf-vectors from seq2sparse, so I created one with "mahout rowid
> > > > -i output/tf-vectors -o output/matrix". Then I tried cvb again
> > > > with the matrix folder: "mahout cvb -i output/matrix -dict
> > > > output/dictionary.file-0 -o topics -dt documents -mt states -ow
> > > > -k 100 --num_reduce_tasks 7". The results were similar, though
> > > > this time it failed after 1%.
> > > >
> > > > c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
> > > > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > > > Warning: $HADOOP_HOME is deprecated.
> > > > Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
> > > > MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
> > > > Warning: $HADOOP_HOME is deprecated.
> > > > WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
> > > > 12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
> > > > 12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored topics. Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
> > > > p(topic|docId) will be stored documents
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
> > > > 12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
> > > > 12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
> > > > 12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
> > > > 12/11/12 09:35:17 INFO mapred.JobClient:  map 0% reduce 0%
> > > > 12/11/12 10:25:25 INFO mapred.JobClient:  map 1% reduce 0%
> > > > 12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
> > > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
> > > >     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > > >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > > Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601 seconds. Killing!
> > > >
> > > > Any ideas?
> > > >
> > > > Regards,
> > > > Nick
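Note the "Total input paths to process : 2" line above: the job is reading both Matrix and docIndex, and docIndex's Text values are what trigger the cast failure, exactly as Dan suspected. Dan's seqdumper check works fine; purely for illustration, here is a hypothetical sketch that prints each file's key/value classes so a mismatch like this shows up before an hour of wasted map time. The class name is mine, not a Mahout tool.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class CheckKeyValueTypes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // CVB's mapper needs IntWritable keys and VectorWritable values;
        // any file under the input dir reporting other classes will break the cast.
        for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
          if (stat.isDir()) {   // Hadoop 1.x-era API
            continue;
          }
          SequenceFile.Reader reader =
              new SequenceFile.Reader(fs, stat.getPath(), conf);
          try {
            System.out.println(stat.getPath() + ": key=" + reader.getKeyClassName()
                + " value=" + reader.getValueClassName());
          } finally {
            reader.close();
          }
        }
      }
    }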
> > > > > From: [email protected]
> > > > > Date: Mon, 12 Nov 2012 13:21:21 +0100
> > > > > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > > To: [email protected]
> > > > >
> > > > > Dear Nick,
> > > > >
> > > > > I experienced the same problem. The fact is that when you call
> > > > > cvb, it expects as input the matrix folder inside the output
> > > > > folder of seq2sparse, so instead of
> > > > >
> > > > > mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> > > > >
> > > > > please try:
> > > > >
> > > > > mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
> > > > >
> > > > > let me know if it solved ;)
> > > > >
> > > > > cheers,
> > > > > Diego
> > > > > On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> > > > > >
> > > > > > Diego,
> > > > > >
> > > > > > Thank you so much for the script. I used it to convert my large
> > > > > > text file to a sequence file. I have been trying to use the
> > > > > > sequence file to feed Mahout's LDA implementation (Mahout 0.7,
> > > > > > so the CVB implementation). I first converted the sequence file
> > > > > > to vectors with "mahout seq2sparse -i input/processedaa.seq -o
> > > > > > output -ow -wt tf -nr 7" and then ran the LDA with "mahout cvb
> > > > > > -i output/tf-vectors -dict output/dictionary.file-0 -o topics
> > > > > > -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10".
> > > > > > The seq2sparse command produces the tf vectors all right, but
> > > > > > the problem is that no matter what parameters I use, the LDA
> > > > > > job sits at map 0% reduce 0% for an hour before outputting the
> > > > > > error below. It fails casting Text to IntWritable. My question
> > > > > > is: when you said that the key is the line number, what
> > > > > > variable type is the key? Is it Text?
> > > > > >
> > > > > > My output:
> > > > > >
> > > > > > 12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > > > > > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > > > > > 12/11/11 16:10:53 INFO mapred.JobClient:  map 0% reduce 0%
> > > > > > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > > > > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> > > > > >     at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> > > > > >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> > > > > >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> > > > > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> > > > > >     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > > > >     at javax.security.auth.Subject.doAs(Subject.java:396)
> > > > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> > > > > >     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > > > >
> > > > > > Thank you again for your help!
> > > > > > Nick
> > > > > >
> > > > > >> From: [email protected]
> > > > > >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> > > > > >> Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > > >> To: [email protected]
> > > > > >>
> > > > > >> Hi Nick,
> > > > > >> I had exactly the same problem ;)
> > > > > >> I wrote a simple command line utility to create a sequence
> > > > > >> file where each line of the input document is an entry
> > > > > >> (the key is the line number).
> > > > > >>
> > > > > >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> > > > > >>
> > > > > >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI -input tweets -output tweets.seq
> > > > > >>
> > > > > >> enjoy ;)
> > > > > >> Diego
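Since the Dropbox link above may not last: here is a hypothetical sketch of the same idea, not Diego's actual lda-helper source. It also bears on Nick's type question: the "Text cannot be cast to IntWritable" error quoted above indicates the line-number key is written as Text, i.e. the same (Text docId, Text body) layout that seqdirectory produces. seq2sparse accepts that, and rowid then converts the Text keys into the IntWritable ids CVB needs.

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch of a lines-to-SequenceFile converter (not the lda-helper code):
    // each input line becomes one (Text lineNumber, Text line) entry.
    public class LinesToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            new Path(args[1]), Text.class, Text.class);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
          String line;
          long lineNo = 0;
          while ((line = in.readLine()) != null) {
            writer.append(new Text(Long.toString(lineNo++)), new Text(line));
          }
        } finally {
          in.close();
          writer.close();
        }
      }
    }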
> > > > > >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde <[email protected]> wrote:
> > > > > >> > I don't think you need that. Just a simple mapper:
> > > > > >> >
> > > > > >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> > > > > >> >
> > > > > >> >   @Override
> > > > > >> >   protected void map(LongWritable key, Text value, Context context)
> > > > > >> >       throws IOException, InterruptedException {
> > > > > >> >     // Lines look like "docId{tab}document text"; emit docId as the key.
> > > > > >> >     // Limit the split to 2 so tabs inside the document body survive.
> > > > > >> >     String[] fields = value.toString().split("\t", 2);
> > > > > >> >     if (fields.length >= 2) {
> > > > > >> >       context.write(new Text(fields[0]), new Text(fields[1]));
> > > > > >> >     }
> > > > > >> >   }
> > > > > >> > }
> > > > > >> >
> > > > > >> > and then run a simple job:
> > > > > >> >
> > > > > >> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> > > > > >> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> > > > > >> >     Text.class, Text.class, SequenceFileOutputFormat.class);
> > > > > >> >
> > > > > >> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> > > > > >> > text2SequenceFileJob.setOutputValueClass(Text.class);
> > > > > >> > text2SequenceFileJob.setNumReduceTasks(0);
> > > > > >> >
> > > > > >> > text2SequenceFileJob.waitForCompletion(true);
> > > > > >> >
> > > > > >> > Cheers!
> > > > > >> > Charly
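Charly's snippet assumes a class extending Mahout's AbstractJob, which is where prepareJob, getInputPath, and getOutputPath come from. For anyone not using AbstractJob, a plain-Hadoop driver for the same mapper might look like the sketch below; it assumes IdentityMapper has been pulled out as a top-level class, and the driver's class name is mine. With zero reduce tasks, the mapper output is written directly as SequenceFile part files, one per input split.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class Text2SequenceFileDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "text2seqfile");  // Hadoop 1.x-era constructor
        job.setJarByClass(Text2SequenceFileDriver.class);
        job.setMapperClass(IdentityMapper.class); // Charly's mapper, made top-level
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);                 // map-only job
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }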
> > > > > >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]> wrote:
> > > > > >> >>
> > > > > >> >> Yeah, I've looked at filter classes, but nothing worked. I
> > > > > >> >> guess I'll do something similar and continuously save each
> > > > > >> >> line into a file and then run seqdirectory. The running time
> > > > > >> >> won't look good, but at least it should work. Thanks for the
> > > > > >> >> response.
> > > > > >> >>
> > > > > >> >> Nick
> > > > > >> >>
> > > > > >> >> > From: [email protected]
> > > > > >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> > > > > >> >> > Subject: Re: Converting one large text file with multiple documents to SequenceFile format
> > > > > >> >> > To: [email protected]
> > > > > >> >> >
> > > > > >> >> > I had the exact same issue. I tried to use the seqdirectory
> > > > > >> >> > command with a different filter class, but it did not work;
> > > > > >> >> > it seems there's a bug in the mahout-0.6 code.
> > > > > >> >> >
> > > > > >> >> > I ended up writing a custom map-reduce program that does
> > > > > >> >> > just that.
> > > > > >> >> >
> > > > > >> >> > Greetings!
> > > > > >> >> > Charly
> > > > > >> >> >
> > > > > >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]> wrote:
> > > > > >> >> > >
> > > > > >> >> > > I have done a lot of searching on the web for this, but
> > > > > >> >> > > I've found nothing, even though I feel like it has to be
> > > > > >> >> > > somewhat common. In the past I have used Mahout's
> > > > > >> >> > > 'seqdirectory' command to convert a folder containing
> > > > > >> >> > > text files (each file a separate document). But in this
> > > > > >> >> > > case there are so many documents (in the 100,000s) that I
> > > > > >> >> > > have one very large text file in which each line is a
> > > > > >> >> > > document. How can I convert this large file to
> > > > > >> >> > > SequenceFile format so that Mahout understands that each
> > > > > >> >> > > line should be considered a separate document? Would it
> > > > > >> >> > > be better if the file was structured like so:
> > > > > >> >> > >
> > > > > >> >> > > docId1 {tab} document text
> > > > > >> >> > > docId2 {tab} document text
> > > > > >> >> > > docId3 {tab} document text
> > > > > >> >> > > ...
> > > > > >> >> > >
> > > > > >> >> > > Thank you very much for any help.
> > > > > >> >> > > Nick
> > > > > >>
> > > > > >> --
> > > > > >> Computers are useless. They can only give you answers.
> > > > > >> (Pablo Picasso)
> > > > > >> _______________
> > > > > >> Diego Ceccarelli
> > > > > >> High Performance Computing Laboratory
> > > > > >> Information Science and Technologies Institute (ISTI)
> > > > > >> Italian National Research Council (CNR)
> > > > > >> Via Moruzzi, 1
> > > > > >> 56124 - Pisa - Italy
> > > > > >>
> > > > > >> Phone: +39 050 315 3055
> > > > > >> Fax: +39 050 315 2040
> > > > > >> ________________________________________
