hi all, i'm kicking the tyres of lda by running it against the 2009 portion of the westbury usenet corpus http://bit.ly/eUejPa
here's what i'm doing, based heavily on the build-reuters example:

1) download the 2009 section of the corpus to hdfs as 'corpus.raw'; it's about 4.5e6 posts over 880e6 lines in 50 bzipped files

2) pack into sequence files where each key is 0 and each value is a single usenet post

hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -input corpus.raw \
  -output corpus.seq \
  -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
  -mapper 'ruby one_article_per_line.rb' \
  -file one_article_per_line.rb

( the one_article_per_line.rb script can be seen at https://gist.github.com/923435 ; a rough sketch of its logic is in the p.s. below )

3) convert to sparse vector format

./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 45

4) check the number of tokens in the dictionary

hadoop fs -text corpus.seq-sparse/dictionary.file-0 | wc -l
1654229

5) run lda, using the number of terms in the dictionary (plus a bit) as the number of terms

./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 -v 1700000 -ow -x 100

(converges after only 4 iterations)

6) dump the topics

./bin/mahout ldatopics -i corpus-lda/state-4 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

the topics i end up with are pretty much all the same (some crazy rants):

topic0: do our from have you like who i murder he would war zionist alex nazi
topic1: god jews death what all war murder you know can america zionist our
topic2: american alex murder he all like have i our us against justice america death
topic3: alex your our i all murder 911 against who innocent can humanity have what

if i rerun from step 3 to convergence, using a subtly different number of reducers,

./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 40  # will this give different results?
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 -v 1700000 -ow -x 100
./bin/mahout ldatopics -i corpus-lda/state-5 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

i get these topics instead:

topic0: us what cost budgetary costs war total iraq so we do budget comes would
topic1: account loss had were cost iraq trillion have should costs than execution do
topic2: effort have item what cost war execution total us too also iraq billion difficult
topic3: income what have trillion iraq execution costs we were victory up budgetary and

and if i try yet another number of reducers i get yet another result (though it takes a lot longer to converge):

./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 41
./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 -v 1700000 -ow -x 100
./bin/mahout ldatopics -i corpus-lda/state-13 -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile

topic0: sex nude mature free men women sexy beautiful
topic1: nude sexy free men pics videos photos naked hot
topic2: videos naked asian pictures women free beautiful nude
topic3: sexy older hot naked mature pictures women photos

i expected each topic to be different, and also expected that modifying the number of reducers would have no impact on what topics were found (?)

some questions:

- is my packing of the sequence file wrong for lda? i was following the reuters example of an entire email as a single value in the sequence file.
- is my number of topics (10 in this case) reasonable?
- is my approach of using the number of terms in the dictionary as the -v param to lda correct? (there is only one dictionary.file)
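in case it helps with the first question, the packed records can be spot-checked straight off hdfs ('hadoop fs -text' understands sequence files, so each output line should be the 0 key, a tab, then one whole post):

hadoop fs -text corpus.seq | head -n 3 | cut -c 1-200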
finally, here's the contents of the seq-sparse directory ( https://gist.github.com/4eb5d5a3a90a064dd612 ); not sure if the file sizes suggest anything, but the contents of the files look sane to me.

any thoughts most welcome; i'm happy to rerun using whatever suggestions people might have.

cheers!
mat
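p.s. for anyone who doesn't want to chase the gist link, here's a minimal sketch of what a mapper like one_article_per_line.rb boils down to (this is a simplified version assuming the corpus's ---END.OF.DOCUMENT--- delimiter between posts; the real script is at the gist above):

#!/usr/bin/env ruby
# hadoop streaming mapper: emit each multi-line usenet post as a single
# record of the form "0<TAB>post text", one record per post
post = []
emit = lambda { puts "0\t#{post.join(' ')}" unless post.empty? }
STDIN.each_line do |line|
  line = line.chomp
  if line == '---END.OF.DOCUMENT---'  # delimiter between posts in the corpus
    emit.call
    post = []
  else
    post << line.gsub(/\t/, ' ')      # strip tabs so streaming keeps the whole post as the value
  end
end
emit.call  # flush the last post if the file doesn't end with a delimiter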
