Diego,

Thank you for your response. There was no matrix folder in output/tf-vectors
from seq2sparse, so I created one with "mahout rowid -i output/tf-vectors -o
output/matrix". Then I tried cvb again with the matrix folder: "mahout cvb -i
output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt
states -ow -k 100 --num_reduce_tasks 7 -x 150". The results were similar,
though this time it failed after reaching map 1%.
c211-109$ mahout cvb -i output/matrix -dict output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 150
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.
Running on hadoop, using /home/01541/levar1/xsede/hadoop-1.0.3/bin/hadoop and HADOOP_CONF_DIR=/home/01541/levar1/.hadoop2/conf/
MAHOUT-JOB: /scratch/01541/levar1/lib/mahout-distribution-0.7/mahout-examples-0.7-job.jar
Warning: $HADOOP_HOME is deprecated.
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
12/11/12 09:35:13 WARN driver.MahoutDriver: No cvb.props found on classpath, will use command-line arguments only
12/11/12 09:35:13 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/matrix], --iteration_block_size=[10], --maxIter=[150], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
12/11/12 09:35:15 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes (0th-derivative approximation) learning for LDA on output/matrix (numTerms: 699072), finding 100-topics, with document/topic prior 1.0E-4, topic/term prior 1.0E-4. Maximum iterations to run will be 150, unless the change in perplexity is less than 0.0. Topic model output (p(term|topic) for each topic) will be stored topics. Random initialization seed is 7355, holding out 0.0 of the data for perplexity check
12/11/12 09:35:15 INFO cvb.CVB0Driver: Dictionary to be used located output/dictionary.file-0
p(topic|docId) will be stored documents
12/11/12 09:35:15 INFO cvb.CVB0Driver: Current iteration number: 0
12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run iteration 1 of 150
12/11/12 09:35:15 INFO cvb.CVB0Driver: About to run: Iteration 1 of 150, input path: states/model-0
12/11/12 09:35:16 INFO input.FileInputFormat: Total input paths to process : 2
12/11/12 09:35:16 INFO mapred.JobClient: Running job: job_201211120919_0005
12/11/12 09:35:17 INFO mapred.JobClient:  map 0% reduce 0%
12/11/12 10:25:25 INFO mapred.JobClient:  map 1% reduce 0%
12/11/12 10:35:46 INFO mapred.JobClient: Task Id : attempt_201211120919_0005_m_000001_0, Status : FAILED
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
        at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Task attempt_201211120919_0005_m_000001_0 failed to report status for 3601
seconds. Killing!
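
If I understand rowid correctly, it writes both a matrix file and a docIndex
file under output/matrix, so I wonder whether cvb is reading the docIndex as
well; its values would be Text rather than VectorWritable, and that would
also fit the "Total input paths to process : 2" line above. A quick way to
check would be to dump the key/value classes each file declares in its
header, something like this (an untested sketch against the Hadoop 1.x API;
the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileClasses {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Print the key/value classes recorded in each file's header,
    // e.g. for output/matrix/matrix and output/matrix/docIndex.
    for (String arg : args) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(arg), conf);
      System.out.println(arg + ": " + reader.getKeyClassName()
          + " / " + reader.getValueClassName());
      reader.close();
    }
  }
}
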
Any ideas?
Regards,
Nick
> From: [email protected]
> Date: Mon, 12 Nov 2012 13:21:21 +0100
> Subject: Re: Converting one large text file with multiple documents to
> SequenceFile format
> To: [email protected]
>
> Dear Nick,
>
> I experienced the same problem. When you call cvb, it expects as input the
> matrix folder inside the output folder of seq2sparse, so instead of
>
> mahout cvb -i output/tf-vectors -dict output/dictionary.file-0 -o
> topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x 10
>
> please try:
>
> mahout cvb -i output/tf-vectors/matrix -dict output/dictionary.file-0
> -o topics -dt documents -mt states -ow -k 100 --num_reduce_tasks 7 -x
> 10
>
> let me know if that solves it ;)
>
> cheers,
> Diego
>
> On Mon, Nov 12, 2012 at 1:18 AM, Nick Woodward <[email protected]> wrote:
> >
> > Diego,
> >
> > Thank you so much for the script. I used it to convert my large text
> > file to a sequence file, and I have been trying to use that sequence
> > file to feed Mahout's LDA implementation (Mahout 0.7, so the CVB
> > implementation). I first converted the sequence file to vectors with
> > "mahout seq2sparse -i input/processedaa.seq -o output -ow -wt tf -nr 7"
> > and then ran the LDA with "mahout cvb -i output/tf-vectors -dict
> > output/dictionary.file-0 -o topics -dt documents -mt states -ow -k 100
> > --num_reduce_tasks 7 -x 10". The seq2sparse command produces the tf
> > vectors alright, but no matter what parameters I use, the LDA job sits
> > at map 0% reduce 0% for an hour before printing the error below, a
> > ClassCastException from Text to IntWritable. My question: when you said
> > that the key is the line number, what type is the key? Is it Text?
> >
> > My output:
> >
> > "12/11/11 16:10:50 INFO common.AbstractJob: Command line arguments: {--convergenceDelta=[0], --dictionary=[output/dictionary.file-0], --doc_topic_output=[documents], --doc_topic_smoothing=[0.0001], --endPhase=[2147483647], --input=[output/tf-vectors], --iteration_block_size=[10], --maxIter=[10], --max_doc_topic_iters=[10], --num_reduce_tasks=[7], --num_topics=[100], --num_train_threads=[4], --num_update_threads=[1], --output=[topics], --overwrite=null, --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[0.0001], --test_set_fraction=[0], --topic_model_temp_dir=[states]}
> > 12/11/11 16:10:52 INFO mapred.JobClient: Running job: job_201211111553_0005
> > 12/11/11 16:10:53 INFO mapred.JobClient:  map 0% reduce 0%
> > 12/11/11 17:11:16 INFO mapred.JobClient: Task Id : attempt_201211111553_0005_m_000003_0, Status : FAILED
> > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
> >         at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:396)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:249)"
> >
> > Thank you again for your help!
> >
> > Nick
> >
> >
> >> From: [email protected]
> >> Date: Thu, 1 Nov 2012 01:07:29 +0100
> >> Subject: Re: Converting one large text file with multiple documents to
> >> SequenceFile format
> >> To: [email protected]
> >>
> >> Hei Nick,
> >> I had exactly the same problem ;)
> >> I wrote a simple command-line utility to create a sequence file where
> >> each line of the input document is an entry (the key is the line number).
> >>
> >> https://dl.dropbox.com/u/4663256/tmp/lda-helper.jar
> >>
> >> java -cp lda-helper.jar it.cnr.isti.hpc.lda.cli.LinesToSequenceFileCLI
> >> -input tweets -output tweets.seq
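> >>
> >> The core of the utility is just something like this (a rough sketch from
> >> memory, not the exact code in the jar; Hadoop and java.io imports
> >> omitted). It writes the line number as a Text key:
> >>
> >> Configuration conf = new Configuration();
> >> FileSystem fs = FileSystem.get(conf);
> >> SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
> >>     new Path("tweets.seq"), Text.class, Text.class);
> >> BufferedReader in = new BufferedReader(new FileReader("tweets"));
> >> long lineNo = 0;
> >> // one entry per input line: key = line number, value = line text
> >> for (String line = in.readLine(); line != null; line = in.readLine()) {
> >>   writer.append(new Text(Long.toString(lineNo++)), new Text(line));
> >> }
> >> in.close();
> >> writer.close();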
> >>
> >> enjoy ;)
> >> Diego
> >>
> >> On Wed, Oct 31, 2012 at 9:30 PM, Charly Lizarralde
> >> <[email protected]> wrote:
> >> > I don't think you need that. Just a simple mapper.
> >> >
> >> > static class IdentityMapper extends Mapper<LongWritable, Text, Text, Text> {
> >> >
> >> >   @Override
> >> >   protected void map(LongWritable key, Text value, Context context)
> >> >       throws IOException, InterruptedException {
> >> >     // input line is "docId<TAB>document text"; emit docId -> text
> >> >     // (limit the split to 2 so tabs inside the text are preserved)
> >> >     String[] fields = value.toString().split("\t", 2);
> >> >     if (fields.length >= 2) {
> >> >       context.write(new Text(fields[0]), new Text(fields[1]));
> >> >     }
> >> >   }
> >> > }
> >> >
> >> > and then run a simple job..
> >> >
> >> > Job text2SequenceFileJob = this.prepareJob(this.getInputPath(),
> >> >     this.getOutputPath(), TextInputFormat.class, IdentityMapper.class,
> >> >     Text.class, Text.class, SequenceFileOutputFormat.class);
> >> >
> >> > text2SequenceFileJob.setOutputKeyClass(Text.class);
> >> > text2SequenceFileJob.setOutputValueClass(Text.class);
> >> > text2SequenceFileJob.setNumReduceTasks(0);
> >> >
> >> > text2SequenceFileJob.waitForCompletion(true);
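> >> >
> >> > (prepareJob() and getInputPath()/getOutputPath() come from Mahout's
> >> > AbstractJob, so this assumes the driver class extends it and is run
> >> > through ToolRunner.)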
> >> >
> >> > Cheers!
> >> > Charly
> >> >
> >> > On Wed, Oct 31, 2012 at 4:57 PM, Nick Woodward <[email protected]>
> >> > wrote:
> >> >
> >> >>
> >> >> Yeah, I've looked at filter classes, but nothing worked. I guess I'll
> >> >> do something similar: save each line into its own file and then run
> >> >> seqdirectory. The running time won't look good, but at least it should
> >> >> work. Thanks for the response.
> >> >>
> >> >> Nick
> >> >>
> >> >> > From: [email protected]
> >> >> > Date: Tue, 30 Oct 2012 18:07:58 -0300
> >> >> > Subject: Re: Converting one large text file with multiple documents
> >> >> > to SequenceFile format
> >> >> > To: [email protected]
> >> >> >
> >> >> > I had the exact same issue. I tried to use the seqdirectory command
> >> >> > with a different filter class, but it did not work; it seems there's
> >> >> > a bug in the mahout-0.6 code.
> >> >> >
> >> >> > I ended up writing a custom map-reduce program that does just that.
> >> >> >
> >> >> > Greetings!
> >> >> > Charly
> >> >> >
> >> >> > On Tue, Oct 30, 2012 at 5:00 PM, Nick Woodward <[email protected]>
> >> >> > wrote:
> >> >> >
> >> >> > >
> >> >> > > I have done a lot of searching on the web for this, but I've found
> >> >> > > nothing, even though I feel like it has to be a fairly common need.
> >> >> > > In the past I have used Mahout's 'seqdirectory' command to convert
> >> >> > > a folder containing text files (each file a separate document). But
> >> >> > > in this case there are so many documents (in the 100,000s) that I
> >> >> > > have one very large text file in which each line is a document. How
> >> >> > > can I convert this large file to SequenceFile format so that Mahout
> >> >> > > understands that each line should be considered a separate
> >> >> > > document? Would it be better if the file were structured like so:
> >> >> > >
> >> >> > > docId1 {tab} document text
> >> >> > > docId2 {tab} document text
> >> >> > > docId3 {tab} document text
> >> >> > > ...
> >> >> > >
> >> >> > > Thank you very much for any help.
> >> >> > >
> >> >> > > Nick
> >> >> > >
> >> >>
> >> >>
> >>
> >>
> >>
> >> --
> >> Computers are useless. They can only give you answers.
> >> (Pablo Picasso)
> >> _______________
> >> Diego Ceccarelli
> >> High Performance Computing Laboratory
> >> Information Science and Technologies Institute (ISTI)
> >> Italian National Research Council (CNR)
> >> Via Moruzzi, 1
> >> 56124 - Pisa - Italy
> >>
> >> Phone: +39 050 315 3055
> >> Fax: +39 050 315 2040
> >> ________________________________________
> >
>
>
>
> --
> Computers are useless. They can only give you answers.
> (Pablo Picasso)
> _______________
> Diego Ceccarelli
> High Performance Computing Laboratory
> Information Science and Technologies Institute (ISTI)
> Italian National Research Council (CNR)
> Via Moruzzi, 1
> 56124 - Pisa - Italy
>
> Phone: +39 050 315 3055
> Fax: +39 050 315 2040
> ________________________________________