Lucene -> LDA experiments... some confusion.

Paul Rudin Sat, 03 Dec 2011 03:40:41 -0800

I'm new to Mahout (and indeed to Hadoop). I'm trying a couple of
experiments with some documents from a Lucene index, but I'm struggling
to get LDA to produce anything useful. Maybe there's something I don't get.


First I ran:

$ mahout lucene.vector --dir index --output /10000/vec --field description \
    --dictOut 10000.dict --norm 2 --maxPercentErrorDocs 1 --max 10000

This seems to be extracting data:

$ hadoop fs -ls /10000
Found 1 items
-rw-r--r--   1 hduser supergroup    2409404 2011-12-03 11:05 /10000/vec

I'm not quite sure about the format here. Presumably this is really a
representation of a matrix, with a column for each document and rows
holding the word frequencies therein (or the transpose)?
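
To make the picture I have in mind concrete, here is a small Python
sketch. It is not Mahout code and says nothing about the actual on-disk
format of /10000/vec: the toy documents, the dictionary and the
doc_vector helper are all made up for illustration, and I'm assuming
that --norm 2 means each document vector is scaled to unit Euclidean
length.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the "description" field.
docs = ["red apple red", "green apple"]

# Term -> index mapping, playing the role I assume the --dictOut file plays.
terms = sorted({word for doc in docs for word in doc.split()})
dictionary = {term: index for index, term in enumerate(terms)}

def doc_vector(text):
    """Sparse term-frequency vector for one document, scaled to unit L2 length."""
    counts = Counter(text.split())
    length = math.sqrt(sum(c * c for c in counts.values()))
    return {dictionary[term]: c / length for term, c in counts.items()}

# One sparse vector per document; stacked together they form the matrix.
vectors = [doc_vector(doc) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", sorted(vec.items()))
```

If that picture is right, each entry in the file would be one such
vector, keyed by document, rather than a single dense matrix.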

Then I invoke lda - I understand it takes a directory and uses the
contents of the directory as input.

$ mahout lda -i /10000 -o /10000-out -k 20 -ow

This whirs away for a bit, but stops after a few iterations with a log
likelihood of around -430000, so something is presumably wrong. There
is some output in /10000-out, but ldatopics doesn't give any output when
I run it. Maybe I've misunderstood what lda expects as input?

I have a feeling I'm missing something obvious here... TIA for any
hints.
