Ah! Good to know; it would have been really sad if this hadn't worked at all, given that Mahout 0.8 was *just* released.

Please do report back with the results you get from your LDA run - I think you're possibly the first person to do a lucene2seq -> cvb0 roundtrip, so it would be great to hear how it goes (including any issues you have with either the topic/term distributions or the document/topic distributions).
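In the meantime, in case you (or the JIRA) end up needing the key-agnostic RowId change I describe in the quoted thread below, here is a rough sketch of what I mean. To be clear: this is untested, the class name is made up, and it is not the shipped RowIdJob; it just shows the idea of discovering the key class at runtime instead of hard-coding Text, and it assumes the input is a single sequence file of (key, VectorWritable) pairs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

// Hypothetical stand-in for a key-agnostic RowIdJob; not the shipped code.
public class AnyKeyRowId {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);   // a single vector sequence file
    Path outDir = new Path(args[1]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    // Discover the key type at runtime: Text, LongWritable, whatever.
    WritableComparable<?> key = (WritableComparable<?>)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    VectorWritable vector = new VectorWritable();

    // Same two outputs rowid produces: the re-keyed matrix, plus the
    // mapping from int row id back to the original key.
    SequenceFile.Writer matrix = SequenceFile.createWriter(fs, conf,
        new Path(outDir, "matrix"), IntWritable.class, VectorWritable.class);
    SequenceFile.Writer docIndex = SequenceFile.createWriter(fs, conf,
        new Path(outDir, "docIndex"), IntWritable.class, reader.getKeyClass());

    IntWritable rowId = new IntWritable();
    int rows = 0;
    int cols = 0;
    while (reader.next(key, vector)) {
      rowId.set(rows++);
      cols = vector.get().size();
      matrix.append(rowId, vector);   // row id -> term vector
      docIndex.append(rowId, key);    // row id -> original key
    }
    System.out.println("Wrote out matrix with " + rows + " rows and "
        + cols + " columns to " + outDir);

    reader.close();
    matrix.close();
    docIndex.close();
  }
}

The only real change from what RowIdJob does today is that nothing here mentions Text: whatever WritableComparable the reader reports is what gets written to the docIndex.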
On Thu, Jul 25, 2013 at 1:28 PM, Liz Merkhofer <[email protected]> wrote:

> Thanks for your help again. Logged a JIRA, then made a last-ditch effort
> to check my commands and realized that the field I wanted to use for -id
> was typed wrong (camel case instead of lowercase). Fixed it and was able
> to get the appropriate number of rows in my matrix. Still waiting for cvb
> output from that, but I'll wrap up this thread since the problem with
> lucene2seq boils down to user error.
>
> So the problem was on my end: what I entered as -id did not exist in my
> Solr schema, and so my documents were not delimited.
>
> Sorry for the false alarm; thanks for your helpfulness.
>
> On Thu, Jul 25, 2013 at 12:46 PM, Suneel Marthi <[email protected]> wrote:
>
> > Agree with Jake that this is definitely an issue with lucene2seq.
> >
> > RowId should have created a matrix with 70000 rows (= the number of
> > documents in your input corpus), but it seems lucene2seq is creating
> > one single document out of all of them.
> >
> > Could you log a JIRA for this?
> >
> > Thanks again for reporting this.
> >
> > ________________________________
> > From: Jake Mannix <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Thursday, July 25, 2013 12:39 PM
> > Subject: Re: NaN in cvb topic models after lucene2seq
> >
> > On Thu, Jul 25, 2013 at 9:07 AM, Liz Merkhofer <[email protected]> wrote:
> >
> > > Thanks so much for your response, Suneel.
> > >
> > > Unfortunately, the Solr index is not mine to post. But short of that,
> > > are there any useful answers I can provide? At the time I ran this, it
> > > contained 70,000 documents... I'm adding several times that today, though.
> > >
> > > I tried lucene2seq again.
> > >
> > > Running with the MapReduce default, the directory it creates contains
> > >
> > > _SUCCESS
> > > part-m-00000  part-m-00001  part-m-00002  part-m-00003  part-m-00004
> > > part-m-00005  part-m-00006  part-m-00007  part-m-00008  part-m-00009
> > > part-m-00010  part-m-00011  part-m-00012  part-m-00013  part-m-00014
> > >
> > > With -xm sequential, however, it creates only "index."
> > >
> > > Looking at part-m-00014 or index, I see about the same thing: a header like
> > >
> > > SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@ua<80>yäØQõ-ãe<93>n5<9d>¡^@^@^C)^@^@^@^A^@<8e>^C%(
> > >
> > > and then the concatenated text of (all?) my documents.
> >
> > This is definitely the problem:
> >
> > > When I run "rowid," I get
> > >
> > > 13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix with 1 rows and 465540 columns to /tmp/cvb/rowidout/matrix
> > >
> > > In comparison, I'm working off the closest example I could find, from
> > > the book Hadoop MapReduce Cookbook (page in Safari Books Online:
> > > http://goo.gl/n3YVCz). Running seqdirectory on their sample, a
> > > directory containing data from 20 newsgroups, my output is called
> > > part-m-00000 and looks like
> > >
> > > SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@<8a>FA4ëÇ"Fª>þ^H^_-¯^@^@^WÇ^@^@^@^S^R/alt.atheism/49960x<9c><8d>Z]W"K²}¯_<91><87>
> > >
> > > etc.
> > > When that gets to the point of running rowid, I get
> > >
> > > 13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix with 19997 rows and 193659 columns to tmp/20news/int/matrix
> > >
> > > where those approx. 20,000 rows are plausibly each a document in the
> > > 20news dataset.
> > >
> > > It seems to me, then, that lucene2seq is the culprit.
> >
> > Yep, that looks to be the case.
> >
> > > Maybe the best solution will be falling back on lucene.vector:
> > >
> > > ./mahout lucene.vector --dir <path to solr data>/index --output /tmp/lv-cvb/luceneout --field textbody_en --dictOut /tmp/lv-cvb/lucenedict --idField docid --norm 2 --weight TF --seqDictOut /tmp/lv-cvb/seqDictOut --norm 2 -x 70
> > >
> > > The output did look appropriately garbled.
> > >
> > > However, rowid doesn't like the output from lucene.vector:
> > > "java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
> > > cast to org.apache.hadoop.io.IntWritable". And crossing my fingers
> > > and skipping rowid also had a problem, with the LongWritable:
> > > "java.lang.ClassCastException: org.apache.hadoop.io.LongWritable
> > > cannot be cast to org.apache.hadoop.io.IntWritable."
> >
> > That's very sad to see. lucene.vector is spitting out sequence files
> > whose keys are LongWritable, seq2sparse is spitting out sequence files
> > with Text keys, and LDA wants inputs with IntWritable keys. RowId
> > alleviates only one problem: taking Text keys and turning them into
> > IntWritable keys.
> >
> > I would be very sad if it turns out your only current option is to
> > write a trivially changed version of RowId (it's a really simple job)
> > which can handle LongWritable keys as well as Text. In fact, it would
> > be a great modification for that job to take *any* key type. It
> > currently doesn't care what its keys are, so it should be pretty easy
> > to change all instances of "Text" in RowIdJob to "WritableComparable"
> > (or ? extends WritableComparable) and it should "just work". Lame!
> >
> > > My commands:
> > >
> > > ./mahout rowid -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/matrix
> > >
> > > ./mahout cvb -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/out -k 20 -x 10 -dict /tmp/lv-cvb/seqDictOut -dt /tmp/lv-cvb/topics -mt /tmp/lv-cvb/model
> > >
> > > Is there something I'm missing?
> > >
> > > Thank you,
> > > Liz
> > >
> > > On Thu, Jul 25, 2013 at 12:20 AM, Suneel Marthi <[email protected]> wrote:
> > >
> > > > Liz,
> > > >
> > > > lucene2seq was a recent addition to Mahout 0.8, and it's good that
> > > > you are taking it for a test drive and reporting issues. In order
> > > > to troubleshoot this:
> > > >
> > > > a) Could you try running lucene2seq with a '-xm sequential' option
> > > > and verify the output? The default option is now MapReduce, and I
> > > > am trying to determine whether the issue is with the MapReduce
> > > > version or something more basic.
> > > >
> > > > b) Is it possible for you to post your Solr index to these forums?
> > > > I can take a stab at it to see what's wrong.
> > > >
> > > > Suneel
> > > >
> > > > ________________________________
> > > > From: Liz Merkhofer <[email protected]>
> > > > To: [email protected]
> > > > Sent: Wednesday, July 24, 2013 5:07 PM
> > > > Subject: NaN in cvb topic models after lucene2seq
> > > >
> > > > Hello list,
> > > >
> > > > I'm having some problems using cvb (now that lda is deprecated) on
> > > > my Lucene (or Solr, if you will) index. I am using Mahout 0.8.
> > > >
> > > > My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything
> > > > seems to be working, until all my topics come out, with seqdumper,
> > > > as NaN, like:
> > > >
> > > > Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
> > > > Key: 0: Value: {0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,
> > > >
> > > > ...etc. I suspect my problem is in the output of lucene2seq, which
> > > > is a folder of 14 files called part-m-000xx that look very much like
> > > > the text in my Lucene index and nothing like the unreadable jumble I
> > > > would get from 'seqdirectory' on an actual directory of text files.
> > > >
> > > > If it helps, here's how I'm doing this:
> > > >
> > > > ./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr data>/index -id docId -f textbody_en
> > > >
> > > > ./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout --namedVector --maxDFPercent 70 --weight TF -n 2 -a org.apache.lucene.analysis.core.WhitespaceAnalyzer
> > > >
> > > > ./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout
> > > >
> > > > ./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict /tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt /tmp/cvb/model
> > > >
> > > > Any thoughts?
> > > >
> > > > Thank you,
> > > > Liz
> >
> > --
> > -jake

--
-jake
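P.S. One debugging aid that might have shortened this thread: rather than inferring key types from ClassCastExceptions and row counts from rowid's log line, you can ask a sequence file directly. A minimal sketch (again untested, and the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper, not part of Mahout: prints a sequence file's
// key/value classes and its record count.
public class SeqFileStats {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);

    // Instantiate whatever key/value types the file header declares.
    Writable key = (Writable)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable)
        ReflectionUtils.newInstance(reader.getValueClass(), conf);

    long records = 0;
    while (reader.next(key, value)) {
      records++;
    }
    System.out.println(reader.getKeyClass().getName() + " / "
        + reader.getValueClass().getName() + " : " + records + " records");
    reader.close();
  }
}

Pointed at each part-m-* file from lucene2seq, the record counts should sum to your 70,000 documents; the broken output would have reported a single record.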
