I'm new to Mahout (and indeed Hadoop). I'm trying a couple of experiments with some documents from Lucene, but I'm struggling to get lda to produce anything useful. Maybe there's something I don't get.
First I ran:

$ mahout lucene.vector --dir index --output /10000/vec --field description --dictOut 10000.dict --norm 2 --maxPercentErrorDocs 1 --max 10000

This seems to extract data:

$ hadoop fs -ls /10000
Found 1 items
-rw-r--r-- 1 hduser supergroup 2409404 2011-12-03 11:05 /10000/vec

I'm not quite sure about the format here. Presumably this is really a representation of a matrix, with a column for each document and rows holding the word frequencies therein (or the transpose)?

Then I invoke lda. As I understand it, it takes a directory and uses the contents of that directory as input:

$ mahout lda -i /10000 -o /10000-out -k 20 -ow

This whirs away for a bit, but stops after a few iterations with a log likelihood of around -430000 (so something is presumably wrong). There is some output in /10000-out, but ldatopics doesn't give any output when I run it. Maybe I've misunderstood what it's expecting as input?

I have a feeling I'm missing something obvious here...

TIA for any hints.
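
P.S. To make the question about the format concrete, here is a minimal Python sketch of the document-by-term structure I am picturing: a dictionary mapping each term to an index, plus one sparse vector of term counts per document. This is only my mental model, not Mahout's actual on-disk format, and the function names (build_dictionary, to_sparse_vector) and sample documents are made up for illustration.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign each distinct term an integer index, in first-seen order."""
    dictionary = {}
    for doc in docs:
        for term in doc.split():
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def to_sparse_vector(doc, dictionary):
    """Represent one document as {term index: raw term count}."""
    counts = Counter(doc.split())
    return {dictionary[term]: count for term, count in counts.items()}


# Hypothetical stand-ins for the 'description' field of three documents.
docs = [
    "lda finds topics",
    "lucene stores documents",
    "topics describe documents documents",
]

dictionary = build_dictionary(docs)
# One sparse row per document; stacking the rows gives the matrix.
vectors = [to_sparse_vector(doc, dictionary) for doc in docs]

for doc_id, vector in enumerate(vectors):
    print(doc_id, sorted(vector.items()))
```

If this picture is roughly right, each document is one row keyed by dictionary index, and what I am unsure about is whether lda expects raw counts like these or the normalized weights that my --norm 2 option presumably produces.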
