hi all,

i'm kicking the tyres of lda by running it against the 2009 portion of
the westbury usenet corpus http://bit.ly/eUejPa

here's what i'm doing, based heavily on the build-reuters example

1) download the 2009 section of the corpus to hdfs as 'corpus.raw'.
it's about 4.5e6 posts over 880e6 lines in 50 bzipped files
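
for reference, loading it looked roughly like the below -- the archive
filenames here are illustrative, not the real ones:
hadoop fs -mkdir corpus.raw
hadoop fs -put usenet.2009-*.bz2 corpus.raw/  # illustrative filenames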

2) pack into sequence files where each key is 0 and each value is a
single usenet post
hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
 -D mapred.reduce.tasks=0 \
 -input corpus.raw \
 -output corpus.seq \
 -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
 -mapper 'ruby one_article_per_line.rb' \
 -file one_article_per_line.rb

(the one_article_per_line.rb script can be seen at
https://gist.github.com/923435)
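
to sanity check the packing i eyeball the first few records ('hadoop
fs -text' deserializes sequence files, so each line should be key 0, a
tab, then a whole post):
hadoop fs -text 'corpus.seq/part-*' | head -3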

3) convert to sparse sequence format
./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 45
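
(a quick listing to confirm the outputs used below are all there)
hadoop fs -ls corpus.seq-sparse  # should include dictionary.file-0 and tf-vectors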

4) check number of tokens in dictionary
hadoop fs -text corpus.seq-sparse/dictionary.file-0 | wc -l
1654229
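
a few entries can be spot-checked the same way; if i'm reading the
format right, -text prints each term followed by its integer index:
hadoop fs -text corpus.seq-sparse/dictionary.file-0 | head -5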

5) run lda, using the number of terms in the dictionary (plus a bit of headroom) as the -v (number of words) param
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 -v 1700000 -ow -x 100
(converges after only 4 iterations)
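
to find the final state dir i just list the run's output and take the
highest-numbered state-* dir:
hadoop fs -ls corpus-lda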

6) dump the topics
./bin/mahout ldatopics -i corpus-lda/state-4 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

the topics i end up with are pretty much all the same (some crazy rants)
topic0: do our from have you like who i murder he would war zionist alex nazi
topic1: god jews death what all war murder you know can america zionist our
topic2: american alex murder he all like have i our us against justice america death
topic3: alex your our i all murder 911 against who innocent can humanity have what

if i rerun from step 3 to convergence with a slightly different number of reducers

./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 40  # will this give different results?
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 -v 1700000 -ow -x 100
./bin/mahout ldatopics -i corpus-lda/state-5 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

i get the topics
topic0: us what cost budgetary costs war total iraq so we do budget comes would
topic1: account loss had were cost iraq trillion have should costs than execution do
topic2: effort have item what cost war execution total us too also iraq billion difficult
topic3: income what have trillion iraq execution costs we were victory up budgetary

and if i try yet another number of reducers i get yet another set of
topics (though this run takes a lot longer to converge)

./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 41
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 -v 1700000 -ow -x 100
./bin/mahout ldatopics -i corpus-lda/state-13 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

topic0: sex nude mature free men women sexy beautiful
topic1: nude sexy free men pics videos photos naked hot
topic2: videos naked asian pictures women free beautiful nude
topic3: sexy older hot naked mature pictures women photos

i expected the topics within a run to be distinct from one another,
and i also expected that changing the number of reducers would have no
impact on which topics were found. is that expectation wrong?

is my packing of the sequence file wrong for lda? i was following the
reuters example, which puts an entire document in as a single value in
the sequence file.

is my number of topics, in this case 10, reasonable?

is my approach of using the number of terms in the dictionary as the
-v param to lda correct? (there is only one dictionary.file)
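
for what it's worth, assuming -text prints term<TAB>index, the largest
index in the dictionary ought to be below whatever -v is set to:
hadoop fs -text corpus.seq-sparse/dictionary.file-0 | \
 awk -F'\t' '$2 > m { m = $2 } END { print m }'  # assumes column 2 is the index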

finally, here's the contents of the seq-sparse directory; not sure if
the file sizes suggest anything. the contents of the files look sane
https://gist.github.com/4eb5d5a3a90a064dd612

any thoughts are most welcome; i'm happy to rerun with whatever
suggestions people might have.

cheers!
mat
