Ah! Good to know; it would have been really sad if this hadn't worked at all, given that Mahout 0.8 was *just* released.

Please do report back with the results you get from your LDA run - I think you're possibly the first person to do a lucene2seq -> cvb0 roundtrip, so it would be great to hear how it goes (including any issues you have with either the topic/term distributions or the document/topic distributions).
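In the meantime, in case you (or the JIRA) end up needing the key-agnostic RowId change I describe in the quoted thread below, here is a rough sketch of what I mean. To be clear: this is untested, the class name is made up, and it is not the shipped RowIdJob; it just shows the idea of discovering the key class at runtime instead of hard-coding Text, and it assumes the input is a single sequence file of (key, VectorWritable) pairs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.VectorWritable;

// Hypothetical stand-in for a key-agnostic RowIdJob; not the shipped code.
public class AnyKeyRowId {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);   // a single vector sequence file
    Path outDir = new Path(args[1]);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    // Discover the key type at runtime: Text, LongWritable, whatever.
    WritableComparable<?> key = (WritableComparable<?>)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    VectorWritable vector = new VectorWritable();

    // Same two outputs rowid produces: the re-keyed matrix, plus the
    // mapping from int row id back to the original key.
    SequenceFile.Writer matrix = SequenceFile.createWriter(fs, conf,
        new Path(outDir, "matrix"), IntWritable.class, VectorWritable.class);
    SequenceFile.Writer docIndex = SequenceFile.createWriter(fs, conf,
        new Path(outDir, "docIndex"), IntWritable.class, reader.getKeyClass());

    IntWritable rowId = new IntWritable();
    int rows = 0;
    int cols = 0;
    while (reader.next(key, vector)) {
      rowId.set(rows++);
      cols = vector.get().size();
      matrix.append(rowId, vector);   // row id -> term vector
      docIndex.append(rowId, key);    // row id -> original key
    }
    System.out.println("Wrote out matrix with " + rows + " rows and "
        + cols + " columns to " + outDir);

    reader.close();
    matrix.close();
    docIndex.close();
  }
}

The only real change from what RowIdJob does today is that nothing here mentions Text: whatever WritableComparable the reader reports is what gets written to the docIndex.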
On Thu, Jul 25, 2013 at 1:28 PM, Liz Merkhofer <[email protected]> wrote:

> Thanks for your help again. Logged a JIRA, then made a last-ditch effort
> to check my commands and realized that the field I wanted to use for -id
> was typed wrong (camel case instead of lowercase). Fixed it and was able
> to get the appropriate number of rows in my matrix. Still waiting for cvb
> output from that, but I'll wrap up this thread since the problem with
> lucene2seq boils down to user error.
>
> So the problem was on my end: what I entered as -id did not exist in my
> Solr schema, and so my documents were not delimited.
>
> Sorry for the false alarm; thanks for your helpfulness.
>
> On Thu, Jul 25, 2013 at 12:46 PM, Suneel Marthi <[email protected]> wrote:
>
> > Agree with Jake that this is definitely an issue with lucene2seq.
> >
> > RowId should have created a matrix with 70000 rows (= the number of
> > documents in your input corpus), but it seems lucene2seq is creating
> > one single document out of all of them.
> >
> > Could you log a JIRA for this?
> >
> > Thanks again for reporting this.
> >
> > ________________________________
> > From: Jake Mannix <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Thursday, July 25, 2013 12:39 PM
> > Subject: Re: NaN in cvb topic models after lucene2seq
> >
> > On Thu, Jul 25, 2013 at 9:07 AM, Liz Merkhofer <[email protected]> wrote:
> >
> > > Thanks so much for your response, Suneel.
> > >
> > > Unfortunately, the Solr index is not mine to post. But short of that,
> > > are there any useful answers I can provide? At the time I ran this, it
> > > contained 70,000 documents... I'm adding several times that today, though.
> > >
> > > I tried lucene2seq again.
> > >
> > > Running with the MapReduce default, the directory it creates contains
> > >
> > > _SUCCESS
> > > part-m-00000  part-m-00001  part-m-00002  part-m-00003  part-m-00004
> > > part-m-00005  part-m-00006  part-m-00007  part-m-00008  part-m-00009
> > > part-m-00010  part-m-00011  part-m-00012  part-m-00013  part-m-00014
> > >
> > > With -xm sequential, however, it creates only "index."
> > >
> > > Looking at part-m-00014 or index, I see about the same thing: a header like
> > >
> > > SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@ua<80>yäØQõ-ãe<93>n5<9d>¡^@^@^C)^@^@^@^A^@<8e>^C%(
> > >
> > > and then the concatenated text of (all?) my documents.
> >
> > This is definitely the problem:
> >
> > > When I run "rowid," I get
> > >
> > > 13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix with 1 rows and 465540 columns to /tmp/cvb/rowidout/matrix
> > >
> > > In comparison, I'm working off the closest example I could find, from
> > > the book Hadoop MapReduce Cookbook (page in Safari Books Online:
> > > http://goo.gl/n3YVCz). Running seqdirectory on their sample, a
> > > directory containing data from 20 newsgroups, my output is called
> > > part-m-00000 and looks like
> > >
> > > SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@<8a>FA4ëÇ"Fª>þ^H^_-¯^@^@^WÇ^@^@^@^S^R/alt.atheism/49960x<9c><8d>Z]W"K²}¯_<91><87>
> > >
> > > etc.
> > > When that gets to the point of running rowid, I get
> > >
> > > 13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix with 19997 rows and 193659 columns to tmp/20news/int/matrix
> > >
> > > where those approx. 20,000 rows are plausibly each a document in the
> > > 20news dataset.
> > >
> > > It seems to me, then, that lucene2seq is the culprit.
> >
> > Yep, that looks to be the case.
> >
> > > Maybe the best solution will be falling back on lucene.vector:
> > >
> > > ./mahout lucene.vector --dir <path to solr data>/index --output /tmp/lv-cvb/luceneout --field textbody_en --dictOut /tmp/lv-cvb/lucenedict --idField docid --norm 2 --weight TF --seqDictOut /tmp/lv-cvb/seqDictOut --norm 2 -x 70
> > >
> > > The output did look appropriately garbled.
> > >
> > > However, rowid doesn't like the output from lucene.vector:
> > > "java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be
> > > cast to org.apache.hadoop.io.IntWritable". And crossing my fingers
> > > and skipping rowid also had a problem, with the LongWritable:
> > > "java.lang.ClassCastException: org.apache.hadoop.io.LongWritable
> > > cannot be cast to org.apache.hadoop.io.IntWritable."
> >
> > That's very sad to see. lucene.vector is spitting out sequence files
> > whose keys are LongWritable, seq2sparse is spitting out sequence files
> > with Text keys, and LDA wants inputs with IntWritable keys. RowId
> > alleviates only one problem: taking Text keys and turning them into
> > IntWritable keys.
> >
> > I would be very sad if it turns out your only current option is to
> > write a trivially changed version of RowId (it's a really simple job)
> > which can handle LongWritable keys as well as Text. In fact, it would
> > be a great modification for that job to take *any* key type. It
> > currently doesn't care what its keys are, so it should be pretty easy
> > to change all instances of "Text" in RowIdJob to "WritableComparable"
> > (or ? extends WritableComparable) and it should "just work". Lame!
> >
> > > My commands:
> > >
> > > ./mahout rowid -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/matrix
> > >
> > > ./mahout cvb -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/out -k 20 -x 10 -dict /tmp/lv-cvb/seqDictOut -dt /tmp/lv-cvb/topics -mt /tmp/lv-cvb/model
> > >
> > > Is there something I'm missing?
> > >
> > > Thank you,
> > > Liz
> > >
> > > On Thu, Jul 25, 2013 at 12:20 AM, Suneel Marthi <[email protected]> wrote:
> > >
> > > > Liz,
> > > >
> > > > lucene2seq was a recent addition to Mahout 0.8, and it's good that
> > > > you are taking it for a test drive and reporting issues. In order
> > > > to troubleshoot this:
> > > >
> > > > a) Could you try running lucene2seq with a '-xm sequential' option
> > > > and verify the output? The default option is now MapReduce, and I
> > > > am trying to determine whether the issue is with the MapReduce
> > > > version or something more basic.
> > > >
> > > > b) Is it possible for you to post your Solr index to these forums?
> > > > I can take a stab at it to see what's wrong.
> > > >
> > > > Suneel
> > > >
> > > > ________________________________
> > > > From: Liz Merkhofer <[email protected]>
> > > > To: [email protected]
> > > > Sent: Wednesday, July 24, 2013 5:07 PM
> > > > Subject: NaN in cvb topic models after lucene2seq
> > > >
> > > > Hello list,
> > > >
> > > > I'm having some problems using cvb (now that lda is deprecated) on
> > > > my Lucene (or Solr, if you will) index. I am using Mahout 0.8.
> > > >
> > > > My workflow is lucene2seq -> seq2sparse -> rowid -> cvb. Everything
> > > > seems to be working, until all my topics come out, with seqdumper,
> > > > as NaN, like:
> > > >
> > > > Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
> > > > Key: 0: Value: {0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,
> > > >
> > > > ...etc. I suspect my problem is in the output of lucene2seq, which
> > > > is a folder of 14 files called part-m-000xx that look very much like
> > > > the text in my Lucene index and nothing like the unreadable jumble I
> > > > would get from 'seqdirectory' on an actual directory of text files.
> > > >
> > > > If it helps, here's how I'm doing this:
> > > >
> > > > ./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr data>/index -id docId -f textbody_en
> > > >
> > > > ./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout --namedVector --maxDFPercent 70 --weight TF -n 2 -a org.apache.lucene.analysis.core.WhitespaceAnalyzer
> > > >
> > > > ./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout
> > > >
> > > > ./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict /tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt /tmp/cvb/model
> > > >
> > > > Any thoughts?
> > > >
> > > > Thank you,
> > > > Liz
> >
> > --
> > -jake

--
-jake
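P.S. One debugging aid that might have shortened this thread: rather than inferring key types from ClassCastExceptions and row counts from rowid's log line, you can ask a sequence file directly. A minimal sketch (again untested, and the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical helper, not part of Mahout: prints a sequence file's
// key/value classes and its record count.
public class SeqFileStats {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);

    // Instantiate whatever key/value types the file header declares.
    Writable key = (Writable)
        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable)
        ReflectionUtils.newInstance(reader.getValueClass(), conf);

    long records = 0;
    while (reader.next(key, value)) {
      records++;
    }
    System.out.println(reader.getKeyClass().getName() + " / "
        + reader.getValueClass().getName() + " : " + records + " records");
    reader.close();
  }
}

Pointed at each part-m-* file from lucene2seq, the record counts should sum to your 70,000 documents; the broken output would have reported a single record.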
