Thanks so much for your response, Suneel. Unfortunately, the Solr index is not mine to post. But short of that, are there any useful answers I can provide? At the time I ran this, it contained 70,000 documents... I'm adding several times that today, though.
I tried lucene2seq again. Running with the MapReduce default, the directory it creates contains

    _SUCCESS      part-m-00003  part-m-00007  part-m-00011
    part-m-00000  part-m-00004  part-m-00008  part-m-00012
    part-m-00001  part-m-00005  part-m-00009  part-m-00013
    part-m-00002  part-m-00006  part-m-00010  part-m-00014

With -xm sequential, however, it creates only "index."

Looking at part-m-00014 or index, I see about the same thing: a header like

    SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@ua<80>yäØQõ-ãe<93>n5<9d>¡^@^@^C)^@^@^@^A^@<8e>^C%(

and then the concatenated text of (all?) my documents.

When I run "rowid," I get

    13/07/25 09:45:19 INFO vectors.RowIdJob: Wrote out matrix with 1 rows and 465540 columns to /tmp/cvb/rowidout/matrix

In comparison, I'm working off the closest example I could find, from the book Hadoop MapReduce Cookbook (page in Safari Books Online: http://goo.gl/n3YVCz). Running seqdirectory on their sample, a directory containing data from 20 newsgroups, my output is called part-m-00000 and looks like

    SEQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^A^@*org.apache.hadoop.io.compress.DefaultCodec^@^@^@^@<8a>FA4ëÇ"Fª>þ^H^_-¯^@^@^WÇ^@^@^@^S^R/alt.atheism/49960x<9c><8d>Z]W"K²}¯_<91><87>

etc. When that gets to the point of running rowid, I get

    13/07/25 10:44:45 INFO vectors.RowIdJob: Wrote out matrix with 19997 rows and 193659 columns to tmp/20news/int/matrix

where those approx. 20,000 rows are plausibly each a document in the 20news dataset.

It seems to me, then, that lucene2seq is the culprit. Maybe the best solution will be falling back on lucene.vector:

    ./mahout lucene.vector --dir <path to solr data>/index --output /tmp/lv-cvb/luceneout --field textbody_en --dictOut /tmp/lv-cvb/lucenedict --idField docid --norm 2 --weight TF --seqDictOut /tmp/lv-cvb/seqDictOut --norm 2 -x 70

The output did look appropriately garbled.
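As an aside on the two headers quoted above: a rough way to compare them without reading raw bytes by eye is sketched below. It assumes a local copy of the file, a version 6 SequenceFile header (the ^F after SEQ), and class names shorter than 128 bytes so that each length prefix is a single byte. The function names seq_header, byte_at and text_at are made up for illustration and are not part of Mahout or Hadoop.

```shell
#!/bin/sh
# Sketch: print the class names and compression flags from a Hadoop
# SequenceFile header. Assumes a version 6 header and length prefixes
# that fit in one byte (class names shorter than 128 bytes).

# byte_at FILE OFFSET: decimal value of the single byte at OFFSET (0-based)
byte_at() {
    od -An -tu1 -j "$2" -N 1 "$1" | tr -d ' \n'
}

# text_at FILE OFFSET LENGTH: LENGTH raw bytes starting at OFFSET
text_at() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null
}

seq_header() {
    f=$1
    # Bytes 0-2 are the magic "SEQ", byte 3 is the format version.
    if [ "$(text_at "$f" 0 3)" != "SEQ" ]; then
        echo "not a SequenceFile: $f" >&2
        return 1
    fi
    echo "version: $(byte_at "$f" 3)"

    # Key class name: one length byte, then the name itself.
    pos=4
    len=$(byte_at "$f" "$pos")
    echo "key class: $(text_at "$f" $((pos + 1)) "$len")"
    pos=$((pos + 1 + len))

    # Value class name, same layout.
    len=$(byte_at "$f" "$pos")
    echo "value class: $(text_at "$f" $((pos + 1)) "$len")"
    pos=$((pos + 1 + len))

    # Two flag bytes: record compression, then block compression.
    compressed=$(byte_at "$f" "$pos")
    block=$(byte_at "$f" $((pos + 1)))
    echo "compressed: $compressed"
    echo "block compressed: $block"
    pos=$((pos + 2))

    # The codec class name is only present when compression is on.
    if [ "$compressed" = "1" ]; then
        len=$(byte_at "$f" "$pos")
        echo "codec: $(text_at "$f" $((pos + 1)) "$len")"
    fi
}
```

It would be called as seq_header part-m-00014. Read this way, the two bytes after the class names in the lucene2seq header are both ^@ (no compression), while the seqdirectory header has ^A^@ followed by org.apache.hadoop.io.compress.DefaultCodec. That difference alone may be why one file shows readable text and the other a jumble; it would not explain the single-row matrix from rowid.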
However, rowid doesn't like the output from lucene.vector:

    java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

Crossing my fingers and skipping rowid also failed, this time on the LongWritable:

    java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable

My commands:

    ./mahout rowid -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/matrix

    ./mahout cvb -i /tmp/lv-cvb/luceneout -o /tmp/lv-cvb/out -k 20 -x 10 -dict /tmp/lv-cvb/seqDictOut -dt /tmp/lv-cvb/topics -mt /tmp/lv-cvb/model

Is there something I'm missing?

Thank you,
Liz

On Thu, Jul 25, 2013 at 12:20 AM, Suneel Marthi <[email protected]> wrote:

> Liz,
>
> lucene2seq was a recent addition to Mahout 0.8 and its good that you are
> taking this for a test drive and reporting issues.
> In order to troubleshoot this:
>
> a) Could you try running lucene2seq with a '-xm sequential' option and
> verify the output? The default option now is MapReduce and I am trying to
> determine
> if the issue could be with the MapReduce version or if its something more
> basic.
> b) Is it possible for you to post your Solr index to these forums, I can
> take a stab at this to see as to what's wrong.
>
> Suneel
>
>
> ________________________________
> From: Liz Merkhofer <[email protected]>
> To: [email protected]
> Sent: Wednesday, July 24, 2013 5:07 PM
> Subject: NaN in cvb topic models after lucene2seq
>
>
> Hello list,
>
> I'm having some problems using cvb (now that lda is deprecated) on my
> Lucene (or Solr, if you will) index. I am using Mahout 0.8.
>
> My workflow is lucene2seq -> seq2sparse-> rowid -> cvb.
> Everything seems to be working, until all my topics come out, with
> seqdumper, as NaN, like:
>
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.math.VectorWritable
> Key: 0: Value:
>
> {0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN,20:NaN,21:NaN,22:NaN,
>
> ... etc. I suspect my problem is in the output of lucene2seq, which is a
> folder of files 14 files called /part-m-000xx that look very much like the
> text in my Lucene index and nothing like the unreadable jumble I would get
> from 'seqdirectory' on an actual directory of text files.
>
> If it helps, here's how I'm doing this:
>
> ./mahout lucene2seq -o /tmp/cvb/lucene2seqout -i <path to my solr
> data>index -id docId -f textbody_en
>
> ./mahout seq2sparse -i /tmp/cvb/lucene2seqout -o /tmp/cvb/seq2sparseout
> --namedVector --maxDFPercent 70 --weight TF -n 2 -a
> org.apache.lucene.analysis.core.WhitespaceAnalyzer
>
> ./mahout rowid -i /tmp/cvb/seq2sparseout/tf-vectors -o /tmp/cvb/rowidout
>
> ./mahout cvb -i /tmp/cvb/rowidout/matrix -o /tmp/cvb/out -k 200 -x 30 -dict
> /tmp/cvb/seq2sparseout/dictionary.file-0 -dt /tmp/cvb/topics -mt
> /tmp/cvb/model
>
> Any thoughts?
>
> Thank you,
> Liz
>
