On Wed, Jul 31, 2013 at 8:01 AM, Marco <[email protected]> wrote:

> running:
> mahout vectordump -i jojoba/to-output -d jojoba/vectors/dictionary.file-0
> -dt sequencefile --vectorSize 10 -sort jojoba/to-output
>

Yeah, that looks right.


> it's mahout 0.7 (we're using cloudera CDH4.2)
>

Ah, that's a vectordump bug in 0.7, fixed in 0.8, sorry about that.
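If upgrading to 0.8 isn't an option right away, one possible workaround (just a sketch, reusing the paths from this thread and assuming the output is an ordinary SequenceFile of (IntWritable, VectorWritable)) is to dump it with seqdumper instead, which avoids the vectordump code path entirely. Exact flag names can vary between Mahout versions:

```shell
# Sketch of a 0.7 workaround -- paths taken from this thread.
# seqdumper reads any SequenceFile directly, so it sidesteps the 0.7
# vectordump bug; redirect to a local file since the dump can be large.
export MAHOUT_HEAPSIZE=10000          # same heap bump mentioned in this thread (MB)
mahout seqdumper -i jojoba/to-output > to-output-dump.txt
head to-output-dump.txt
```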


>
>
>
>
> ________________________________
> From: Jake Mannix <[email protected]>
> To: "[email protected]" <[email protected]>; Marco <[email protected]>
> Sent: Wednesday, July 31, 2013 4:51 PM
> Subject: Re: Latent Dirichlet Allocation (cvb)
>
>
> On Wed, Jul 31, 2013 at 7:44 AM, Marco <[email protected]> wrote:
>
> > OK, I'll re-run it without that -nt flag (which I assumed was NOT
> > optional).
> >
>
> Well, it's not optional if you don't supply a dictionary (which is
> optional) - one of the two is necessary, or else the system doesn't know
> how big to make the model.
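If you do pass -nt explicitly, it has to be at least (largest term index + 1). One way to find a safe value (a sketch; it assumes the dictionary is a readable SequenceFile, and seqdumper's count reporting varies a bit between Mahout versions) is to count the dictionary entries:

```shell
# Count the entries in the dictionary file; that count is the number of
# terms, so any -nt >= that value is safe. If your seqdumper version
# lacks a count option, dump without it and count the records yourself.
mahout seqdumper -i jojoba/vectors/dictionary.file-0 -c
```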
>
>
> > Meanwhile I've re-run it on a smaller dataset, and though it ran
> > successfully (and faster!), when I run vectordump I always get a Heap
> > space issue, even though we've updated MAHOUT_HEAPSIZE to 10000m.
> >
>
> When you use vectordump, what flags are you giving it?  There may be a bug
> here.  Also, which version of Mahout are you using?
>
>
> >
> >
> >
> >
> > ________________________________
> > From: Jake Mannix <[email protected]>
> > To: "[email protected]" <[email protected]>; Marco <[email protected]>
> > Cc: Suneel Marthi <[email protected]>
> > Sent: Wednesday, July 31, 2013 4:34 PM
> > Subject: Re: Latent Dirichlet Allocation (cvb)
> >
> >
> > If you're supplying a dictionary file (as you are), I'd suggest not
> > specifying the "-nt 90000" option - you're apparently specifying a
> numTerms
> > less than the actual number of terms in some of your vectors.  If you
> > supply the -dict option, it'll infer the number of terms from reading the
> > dictionary, and you don't need to specify it.
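Concretely, the corrected invocation would just drop the -nt flag and keep -dict (a sketch, reusing the exact paths and parameters already posted in this thread):

```shell
# Same cvb run as before, minus "-nt 90000": with -dict supplied, cvb
# infers numTerms from the dictionary, so no term index can fall outside
# the model's [0, numTerms) range.
mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0 \
  -o jojoba/to-output -dt jojoba/do-output -k 190 -mt jojoba/mt \
  --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
```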
> >
> >
> > On Wed, Jul 31, 2013 at 7:02 AM, Marco <[email protected]> wrote:
> >
> > > Oops! That did the trick.
> > >
> > > Nonetheless, I think the fact that you have to run "rowid" and
> > > generate the matrix should be added to the wiki.
> > >
> > > After waiting for more than an hour I got an error during "Writing
> > > final document/topic inference from lda/matrix/matrix to
> > > jojoba/do-output".
> > >
> > > The error is: org.apache.mahout.math.IndexException: Index 90011 is
> > > outside the allowable range of [0,90000)
> > >
> > > Here is how I launched it:
> > > mahout cvb -i jojoba/matrix/matrix -dict jojoba/vectors/dictionary.file-0
> > > -o jojoba/to-output -dt jojoba/do-output -k 190 -nt 90000 -mt jojoba/mt
> > > --maxIter 2 -mipd 1 -a 0.01 -e 0.01 -seed 37 -block 1
> > >
> > > The weird thing is also that the job described as "Writing final
> > > topic/term distributions from jojoba/mt/model-2 to jojoba/to-output"
> > > ran successfully, but if I now do a vectordump I always get a Java
> > > Heap Space error.
> > >
> > >
> > >
> > > ________________________________
> > > From: Suneel Marthi <[email protected]>
> > > To: "[email protected]" <[email protected]>; Marco <[email protected]>
> > > Sent: Wednesday, July 31, 2013 11:01 AM
> > > Subject: Re: Latent Dirichlet Allocation (cvb)
> > >
> > >
> > > RowId job creates a matrix (IntWritable, VectorWritable) and a docIndex
> > > (IntWritable, Text).
> > >
> > > So you should be seeing 2 files generated -  jojoba/matrix/matrix and
> > > jojoba/matrix/docIndex.
> > >
> > > Seems like you have been feeding docIndex as input to cvb, which would
> > > cause this exception; it's the matrix that needs to be fed as input to
> > > cvb.
> > >
> > > So the input to cvb needs to be "jojoba/matrix/matrix".
> > >
> > > Give that a try and let us know.
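To double-check which file is which before feeding cvb, the two rowid outputs can be peeked at (a sketch; it assumes the standard Mahout seqdumper and the Hadoop CLI are on the path):

```shell
# rowid writes two SequenceFiles under jojoba/matrix:
#   matrix   -> (IntWritable key, VectorWritable value)  <- feed THIS to cvb
#   docIndex -> (IntWritable key, Text value)            <- row-id to doc-id map
hadoop fs -ls jojoba/matrix

# Dump the first few records of each; the printed value class confirms
# which one holds the VectorWritable rows.
mahout seqdumper -i jojoba/matrix/matrix   | head -n 20
mahout seqdumper -i jojoba/matrix/docIndex | head -n 20
```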
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: Marco <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Sent: Wednesday, July 31, 2013 4:20 AM
> > > Subject: Latent Dirichlet Allocation (cvb)
> > >
> > >
> > > Hi, I'm new here, so forgive my limited experience with Mahout.
> > >
> > > We're trying to use Mahout (on our Hadoop cluster) to compute topics
> > > for almost 14,000 documents.
> > >
> > > I've been following this wiki page (http://goo.gl/DcPVjB) but still
> > > getting errors.
> > >
> > > Here's what I'm doing:
> > >
> > > 1) creating a sequence file from text files (mahout seqdirectory -i
> > > jojoba/text-files -o jojoba/seqfiles)
> > > 2) creating vectors from the sequence files (mahout seq2sparse -i
> > > jojoba/seqfiles -o jojoba/vectors -wt tf -nv)
> > > 3) launching CVB like this:
> > > mahout cvb -i jojoba/vectors/tf-vectors/ -dict
> > > jojoba/vectors/dictionary.file-0 -o jojoba/to-output -dt jojoba/do-output
> > > -k 190 -nt 90000 -mt jojoba/mt --maxIter 2 -mipd 1 -a 0.01 -e 0.01
> > > -seed 37 -block 1
> > >
> > > and I get Exception in thread "main" java.lang.InterruptedException:
> > > Failed to complete iteration 1 stage 1
> > >
> > > I later learned here (
> > > http://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8/)
> > > that I should actually feed cvb a matrix and not the vectors
> > > (shouldn't this be clearly stated in the wiki?).
> > > So then I run:
> > > mahout rowid -i jojoba/vectors/tf-vectors/ -o jojoba/matrix
> > >
> > > 3bis) I re-run CVB giving jojoba/matrix as input, and I now get
> > > java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
> > > to org.apache.mahout.math.VectorWritable
> > >
> > > What am I missing?
> > >
> > > Thanks a lot for your help
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>
>
>
> --
>
>   -jake
>



-- 

  -jake
