CVB was added to cluster-reuters.sh in 0.8; you wouldn't see it in 0.7. I suggest that you work off of 0.8.
________________________________
From: Marco <[email protected]>
To: "[email protected]" <[email protected]>; Suneel Marthi <[email protected]>
Sent: Wednesday, July 31, 2013 11:05 AM
Subject: Re: Latent Dirichlet Allocatio (cvb)

I already looked there. There is no cvb example or vectordump :(

________________________________
From: Suneel Marthi <[email protected]>
To: "[email protected]" <[email protected]>; Marco <[email protected]>
Sent: Wednesday, July 31, 2013 16:55
Subject: Re: Latent Dirichlet Allocatio (cvb)

@Marco, look at examples/bin/cluster-reuters.sh for reference on how to run cvb (or any other clustering algorithm in Mahout) and also on how to invoke vectordump with the option flags.

________________________________
From: Jake Mannix <[email protected]>
To: "[email protected]" <[email protected]>; Marco <[email protected]>
Sent: Wednesday, July 31, 2013 10:51 AM
Subject: Re: Latent Dirichlet Allocatio (cvb)

On Wed, Jul 31, 2013 at 7:44 AM, Marco <[email protected]> wrote:

> OK. I'll re-run it without that -nt (which I supposed was NOT optional).

Well, it's not optional if you don't supply a dictionary (which is optional): one of the two is necessary, or else the system doesn't know how big to make the model.

> Meanwhile I've re-run it on a smaller dataset and, though it ran
> successfully (and faster!), when I run vectordump I always get a heap space
> error even though we've updated MAHOUT_HEAPSIZE to 10000m.

When you use vectordump, what flags are you giving it? There may be a bug here. Also, what version of Mahout are you using?
>
> ________________________________
> From: Jake Mannix <[email protected]>
> To: "[email protected]" <[email protected]>; Marco <[email protected]>
> Cc: Suneel Marthi <[email protected]>
> Sent: Wednesday, July 31, 2013 16:34
> Subject: Re: Latent Dirichlet Allocatio (cvb)
>
> If you're supplying a dictionary file (as you are), I'd suggest not
> specifying the "-nt 90000" option: you're apparently specifying a numTerms
> less than the actual number of terms in some of your vectors. If you
> supply the -dict option, it'll infer the number of terms from reading the
> dictionary, and you don't need to specify it.
>
> On Wed, Jul 31, 2013 at 7:02 AM, Marco <[email protected]> wrote:
>
> > Oops! That did the trick.
> >
> > Nonetheless, I think the fact that you have to run "rowid" and generate the
> > matrix should be added to the wiki.
> >
> > After waiting for more than an hour I got an error on
> > "Writing final document/topic inference from lda/matrix/matrix to
> > jojoba/do-output".
> >
> > The error is: org.apache.mahout.math.IndexException: Index 90011 is
> > outside allowable range of [0,90000)
> >
> > Here is how I launched it:
> > mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0
> > -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt
> > --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
> >
> > The weird thing is also that the job described as "Writing final topic/term
> > distributions from jojoba/mt/model-2 to jojoba/to-output" ran successfully,
> > but if I now do a vectordump I always get a Java heap space error.
> >
> > ________________________________
> > From: Suneel Marthi <[email protected]>
> > To: "[email protected]" <[email protected]>; Marco <[email protected]>
> > Sent: Wednesday, July 31, 2013 11:01
> > Subject: Re: Latent Dirichlet Allocatio (cvb)
> >
> > The RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex
> > (IntWritable, Text).
> >
> > So you should be seeing 2 files generated: jojoba/matrix/matrix and
> > jojoba/matrix/docIndex.
> >
> > It seems you have been feeding docIndex as input to cvb, which would
> > cause this exception; it's the matrix that needs to be fed as input to cvb.
> >
> > So the input to cvb needs to be "jojoba/matrix/matrix".
> >
> > Give that a try and let us know.
> >
> > ________________________________
> > From: Marco <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Wednesday, July 31, 2013 4:20 AM
> > Subject: Latent Dirichlet Allocatio (cvb)
> >
> > Hi, I'm new here so forgive my little experience with Mahout.
> >
> > We're trying to use Mahout (on our Hadoop cluster) for calculating topics
> > on almost 14000 documents.
> >
> > I've been following this wiki page (http://goo.gl/DcPVjB) but I'm still
> > getting errors.
> >
> > Here's what I'm doing:
> >
> > 1) creating sequence files from text files (mahout seqdirectory -i
> > jojoba/text-files -o jojoba/seqfiles)
> > 2) creating vectors from sequence files (mahout seq2sparse -i
> > jojoba/seqfiles -o jojoba/vectors -wt tf -nv)
> > 3) launching CVB like this:
> > mahout cvb -i jojoba/vectors/tf-vectors/ -dict
> > jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output
> > -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37
> > -block 1
> >
> > and I get: Exception in thread "main" java.lang.InterruptedException:
> > Failed to complete iteration 1 stage 1
> >
> > I later learned here
> > (http://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8/) that
> > I should actually feed cvb a matrix and not the vectors (shouldn't it be
> > clearly stated in the wiki?).
> > So then I ran:
> > mahout rowid -i jojoba/vectors/tf-vectors/ -o jojoba/matrix
> >
> > 3bis) I re-ran CVB giving jojoba/matrix as input and I now get
> > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> > org.apache.mahout.math.VectorWritable
> >
> > What am I missing?
> >
> > Thanks a lot for your help
> >
> > --
> > -jake
>

--
-jake
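For anyone reading this thread later, the fixes suggested above can be put together roughly as in the shell sketch below. It is only a sketch under stated assumptions: the commands, paths and option values are copied from the messages in this thread, while the BASE and DRY_RUN variables and the run/pipeline helper functions are hypothetical names introduced here for illustration. It feeds cvb the matrix file produced by rowid (not the tf-vectors directory or docIndex) and leaves out -nt because -dict is supplied. By default it only prints the commands; nothing here has been verified against a particular Mahout release.

```shell
#!/bin/sh
# Sketch of the pipeline discussed in this thread.
# BASE and DRY_RUN are hypothetical helper variables, not Mahout settings.
BASE="${BASE:-jojoba}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

pipeline() {
  # 1) text files -> sequence files
  run mahout seqdirectory -i "$BASE/text-files" -o "$BASE/seqfiles" || return 1
  # 2) sequence files -> term-frequency vectors (plus dictionary.file-0)
  run mahout seq2sparse -i "$BASE/seqfiles" -o "$BASE/vectors" -wt tf -nv || return 1
  # 3) vectors -> matrix (IntWritable, VectorWritable) and docIndex (IntWritable, Text)
  run mahout rowid -i "$BASE/vectors/tf-vectors/" -o "$BASE/matrix" || return 1
  # 4) cvb on the matrix file itself; -nt is omitted because -dict is given
  run mahout cvb -i "$BASE/matrix/matrix" -dict "$BASE/vectors/dictionary.file-0" \
    -o "$BASE/to-output" -dt "$BASE/do-output" -k 190 -mt "$BASE/mt" \
    --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
}

pipeline
```

Setting DRY_RUN=0 in the environment would execute the commands instead of printing them, assuming mahout is on the PATH.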
