Agree with Ted, the code should have thrown some kind of exception -
something like 'invalid field specified' or 'field does not exist'.
You shouldn't have had to wait for RowIdJob to report that there was only one row
when you were expecting 'x' rows.

What was happening is that in SequenceFilesFromLuceneStorage.java, in the
collect() method (line 109), we return an empty String as the key value when
the requested field is not found in the Lucene/Solr index, so lucene2seq
completes with no errors.
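To make the intent concrete, here is a rough sketch of the fail-fast check I have in mind. This is not the actual Mahout source; the Map below just stands in for a document's stored fields, and the names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only (not the real collect() code): instead of
// silently returning an empty String when the requested field is missing,
// fail fast with a descriptive exception.
public class FieldLookup {

    // Returns the value of the requested field, or throws if the field
    // does not exist in the (simulated) indexed document.
    static String requireField(Map<String, String> doc, String fieldName) {
        String value = doc.get(fieldName);
        if (value == null) {
            // Surface the user error immediately rather than emitting
            // an empty key and letting the job "succeed".
            throw new IllegalArgumentException(
                "Field '" + fieldName + "' does not exist in the index");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("docid", "42");

        System.out.println(requireField(doc, "docid")); // prints 42

        // A wrong-case field name ("docId" vs "docid") now surfaces
        // right away instead of producing one concatenated document.
        try {
            requireField(doc, "docId");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```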

I am going to reopen this JIRA and work on this; thanks again for reporting it.
As Jake mentioned, you are the first person to actually use lucene2seq.




________________________________
 From: Ted Dunning <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Thursday, July 25, 2013 4:42 PM
Subject: Re: NaN in cvb topic models after lucene2seq
 

Liz,

I don't see this as a false alarm.  I think that the conclusion is about
type of bug, not existence.

Can you add a comment to your JIRA with suggestions about how the code
could have informed you of your user error?

(I view user errors as extremely valuable tests .... too rare to make it
easy to correct software and too common to make correction unnecessary.
Please record your now fleeting innocent and naive state.)




On Thu, Jul 25, 2013 at 1:28 PM, Liz Merkhofer <
[email protected]> wrote:

> Thanks for your help again. Logged a JIRA, then made a last-ditch effort to
> check my commands and realized that the field I wanted to use for -id was
> typed wrong (camel case instead of lower). Fixed it and was able to get the
> appropriate number of rows in my matrix. Still waiting for cvb output from
> that, but I'll wrap up this thread since the problem with lucene2seq boils
> down to user error.
>
> So the problem was on my end: what I entered as -id did not exist in my
> Solr schema and so my documents were not delimited.
>
> Sorry for the false alarm; thanks for your helpfulness.
>
>
>
> On Thu, Jul 25, 2013 at 12:46 PM, Suneel Marthi <[email protected]
> >wrote:
>
> > Agree with Jake that this is definitely an issue with lucene2seq.
> >
> > RowId should have created a matrix with 70000 rows (= no. of documents
> > from your input corpus), but it seems like lucene2seq is creating a single
> > document for all of them.
> >
> > Could you log a JIRA for this?
> >
> > Thanks again for reporting this.
> >
> >
> >
> > ________________________________
> >  From: Jake Mannix <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Thursday, July 25, 2013 12:39 PM
> > Subject: Re: NaN in cvb topic models after lucene2seq
> >
> >
> > On Thu, Jul 25, 2013 at 9:07 AM, Liz Merkhofer <
> > [email protected]> wrote:
> >
> > > Thanks so much for your response, Suneel.
> > >
> > > Unfortunately, the Solr index is not mine to post. But short of that, are
> > > there any useful answers I can provide? At the time I ran this, it
> > > contained 70,000 documents... I'm adding several times that today, though.
> > >
> > > I tried lucene2seq again.
> > >
> > > Running with the MapReduce default, the directory it creates contains
> > > _SUCCESS part-m-00003 part-m-00007 part-m-00011
> > > part-m-00000 part-m-00004 part-m-00008 part-m-00012
> > > part-m-00001 part-m-00005 part-m-00009 part-m-00013
> > > part-m-00002 part-m-00006 part-m-00010 part-m-00014
> > >
> > > With -xm sequential, however, it creates only "index."
> > >
> > > Looking at part-m-00014 or index, I see about the same thing: a header like
> > >
> > > SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@ua<80>yäØQõ-ãe<93>n5<9d>¡^@^@^C)^@^@^@^A^@<8e>^C%(
> > >
> > > And then the concatenated text of (all?) my documents
> > >
> > >
> > This is definitely the problem:
> >
> >
> > > When I run "rowid," I get
> > >
> > > 13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix with 1 rows and
> > > 465540 columns to /tmp/cvb/rowidout/matrix
> > >
> > >
> > > In comparison, I'm working off the closest example I could find, from the
> > > book Hadoop MapReduce Cookbook (page in Safari Books Online:
> > > http://goo.gl/n3YVCz). Running seqdirectory on their sample, a directory
> > > containing data from 20 newsgroups, my output is called part-m-00000 and
> > > looks like
> > >
> > >
> > > SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@<8a>FA4ëÇ"Fª>þ^H^_-¯^@^@^WÇ^@^@^@^S^R/alt.atheism/49960x<9c><8d>Z]W"K²}¯_<91><87>
> > >
> > > etc. When that gets to the point of running rowid, I get
> > >
> > > 13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix with 19997 rows
> > > and 193659 columns to tmp/20news/int/matrix
> > >
> > > where those approx. 20,000 rows are plausibly each a document in the 20news
> > > dataset.
> > >
> > > It seems then, to me, that lucene2seq is the culprit.
> >
> >
> > Yep, that looks to be the case.
> >
> >
> > > Maybe the best
> > > solution will be to fall back on lucene.vector:
> > >
> > > ./mahout lucene.vector --dir <path to solr data>/index --output
> > > /tmp/lv-cvb/luceneout --field textbody_en --dictOut /tmp/lv-cvb/lucenedict
> > > --idField docid --norm 2 --weight TF --seqDictOut /tmp/lv-cvb/seqDictOut
> > > --norm 2 -x 70
> > >
> > > The output did look appropriately garbled.
> > >
> > > However, rowid doesn't like the output from lucene.vector,
> > > "java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> > > org.apache.hadoop.io.IntWritable", and crossing my fingers and skipping
> > > rowid also had a problem with the LongWritable,
> > > "java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> > > cast to org.apache.hadoop.io.IntWritable."
> > >
> >
> > That's very sad to see.  lucene.vector is spitting out sequence files which
> > have keys being LongWritable, seq2sparse is spitting out sequence files
> > which have Text keys, and LDA wants inputs which are IntWritable keys.
> > RowId alleviates only one problem: taking Text keys and turning them into
> > IntWritable keys.
> >
> > I would be very sad if it turns out your only current option is to write a
> > trivially changed version of RowId (it's a really simple job) which can
> > handle LongWritable keys as well as Text.  In fact, it would be a great
> > modification for that job to be changed to take *any* key type.  It
> > currently doesn't care what its keys are, so it should be pretty easy to
> > change all instances of "Text" in RowIdJob to "WritableComparable" (or
> > ? extends WritableComparable) and it should "just work".  Lame!
> >
> >
> > >
> > > My commands:
> > > ./mahout rowid -i /tmp/lv-cvb/luceneout  -o /tmp/lv-cvb/matrix
> > >
> > > ./mahout cvb -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/out -k 20 -x 10 -dict
> > > /tmp/lv-cvb/seqDictOut -dt /tmp/lv-cvb/topics -mt /tmp/lv-cvb/model
> > >
> > > Is there something I'm missing?
> > >
> > > Thank you,
> > > Liz
> > >
> > >
> > > On Thu, Jul 25, 2013 at 12:20 AM, Suneel Marthi <[email protected]>
> > > wrote:
> > >
> > > > Liz,
> > > >
> > > > lucene2seq was a recent addition to Mahout 0.8 and it's good that you are
> > > > taking this for a test drive and reporting issues.
> > > > In order to troubleshoot this:
> > > >
> > > > a) Could you try running lucene2seq with a '-xm sequential' option and
> > > > verify the output?  The default option now is MapReduce and I am trying
> > > > to determine if the issue could be with the MapReduce version or if it's
> > > > something more basic.
> > > > b) Is it possible for you to post your Solr index to these forums? I can
> > > > take a stab at it to see what's wrong.
> > > >
> > > > Suneel
> > > >
> > > >
> > > >
> > > >
> > > > ________________________________
> > > >  From: Liz Merkhofer <[email protected]>
> > > > To: [email protected]
> > > > Sent: Wednesday, July 24, 2013 5:07 PM
> > > > Subject: NaN in cvb topic models after lucene2seq
> > > >
> > > >
> > > > Hello list,
> > > >
> > > > I'm having some problems using cvb (now that lda is deprecated) on my
> > > > Lucene (or Solr, if you will) index. I am using Mahout 0.8.
> > > >
> > > > My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything seems
> > > > to be working, until all my topics come out, with seqdumper, as NaN, like:
> > > >
> > > > Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> > > > org.apache.mahout.math.VectorWritable
> > > > Key: 0: Value:
> > > >
> > > >
> > > > {0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,
> > > >
> > > > ... etc. I suspect my problem is in the output of lucene2seq, which is a
> > > > folder of 14 files called part-m-000xx that look very much like the
> > > > text in my Lucene index and nothing like the unreadable jumble I would get
> > > > from 'seqdirectory' on an actual directory of text files.
> > > >
> > > > If it helps, here's how I'm doing this:
> > > >
> > > > ./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr
> > > > data>index -id docId -f textbody_en
> > > >
> > > > ./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout
> > > > --namedVector --maxDFPercent 70 --weight TF -n 2 -a
> > > > org.apache.lucene.analysis.core.WhitespaceAnalyzer
> > > >
> > > > ./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout
> > > >
> > > > ./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict
> > > > /tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt
> > > > /tmp/cvb/model
> > > >
> > > > Any thoughts?
> > > >
> > > > Thank you,
> > > > Liz
> > > >
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>
