On Apr 17, 2011, at 8:06 AM, Mat Kelcey wrote:

> hi all,
>
> i'm kicking the tyres of lda by running it against the 2009 portion of
> the westbury usenet corpus http://bit.ly/eUejPa
>
> here's what i'm doing, based heavily on the build-reuters example
>
> 1) download the 2009 section of the corpus to hdfs 'corpus.raw'
>    it's about 4.5e6 posts over 880e6 lines in 50 bzipped files
>
> 2) pack into sequence files where each key is 0 and each value is a
>    single usenet post
>
> hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
>   -D mapred.reduce.tasks=0 \
>   -input corpus.raw \
>   -output corpus.seq \
>   -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
>   -mapper 'ruby one_article_per_line.rb' \
>   -file one_article_per_line.rb
>
> (the one_article_per_line.rb script can be seen at
> https://gist.github.com/923435)
>
> 3) convert to sparse sequence format
>
> ./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 45
>
> 4) check the number of tokens in the dictionary
>
> hadoop fs -text corpus.seq-sparse/dictionary.file-0 | wc -l
> 1654229
>
> 5) run lda using the number of terms in the dictionary (plus a bit) as
>    the number of terms
>
> ./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 \
>   -v 1700000 -ow -x 100
>
> (converges after only 4 iterations)
>
> 6) dump the topics
>
> ./bin/mahout ldatopics -i corpus-lda/state-4 \
>   -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile
>
> the topics i end up with are pretty much all the same (some crazy rants)
>
> topic0: do our from have you like who i murder he would war zionist alex nazi
> topic1: god jews death what all war murder you know can america zionist our
> topic2: american alex murder he all like have i our us against justice america death
> topic3: alex your our i all murder 911 against who innocent can humanity have what
>
> if i run again to convergence from step 3, using a subtly different
> number of reducers
>
> ./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq \
>   -nr 40   # will this give different results?
> ./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 \
>   -v 1700000 -ow -x 100
> ./bin/mahout ldatopics -i corpus-lda/state-5 \
>   -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile
>
> i get the topics
>
> topic0: us what cost budgetary costs war total iraq so we do budget comes would
> topic1: account loss had were cost iraq trillion have should costs than execution do
> topic2: effort have item what cost war execution total us too also iraq billion difficult
> topic3: income what have trillion iraq execution costs we were victory up budgetary
>
> and if i try yet another number of reducers i get yet another result
> (though it takes a lot longer to converge)
>
> ./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 41
> ./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10 \
>   -v 1700000 -ow -x 100
> ./bin/mahout ldatopics -i corpus-lda/state-13 \
>   -d corpus.seq-sparse/dictionary.file-0 -dt sequencefile
>
> topic0: sex nude mature free men women sexy beautiful
> topic1: nude sexy free men pics videos photos naked hot
> topic2: videos naked asian pictures women free beautiful nude
> topic3: sexy older hot naked mature pictures women photos
>
> i expected each topic to be different, and also expected that changing
> the number of reducers would have no impact on which topics were
> found (?)
>
> is my packing of the sequence file wrong for lda? i was following the
> reuters example of an entire email as a single value in the sequence
> file.
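[Editor's note: the run-to-run variation Mat describes is what you would expect from a randomly initialised topic model — each run starts from random topic assignments and settles into whichever local optimum it finds first. The toy collapsed Gibbs sampler below is an illustrative sketch only, not Mahout's LDA implementation; every name in it is made up for the example. It shows that fixing the random seed makes such a run exactly reproducible.]

```ruby
# A toy collapsed Gibbs sampler for LDA. Illustrative sketch only --
# NOT Mahout's implementation -- showing that the outcome is a function
# of the random seed: same seed, same topics.

def toy_lda(docs, num_topics, vocab_size, iters, seed)
  rng   = Random.new(seed)   # fixing the seed makes the run reproducible
  alpha = 0.1                # doc-topic smoothing
  beta  = 0.01               # topic-word smoothing

  # random initial topic assignment for every token
  z  = docs.map { |doc| doc.map { rng.rand(num_topics) } }
  nd = Array.new(docs.size)  { Array.new(num_topics, 0) } # doc-topic counts
  nw = Array.new(num_topics) { Array.new(vocab_size, 0) } # topic-word counts
  nt = Array.new(num_topics, 0)                           # tokens per topic
  docs.each_with_index do |doc, d|
    doc.each_with_index do |w, i|
      k = z[d][i]
      nd[d][k] += 1; nw[k][w] += 1; nt[k] += 1
    end
  end

  iters.times do
    docs.each_with_index do |doc, d|
      doc.each_with_index do |w, i|
        k = z[d][i]
        nd[d][k] -= 1; nw[k][w] -= 1; nt[k] -= 1
        # unnormalised conditional p(topic = t | everything else)
        probs = (0...num_topics).map do |t|
          (nd[d][t] + alpha) * (nw[t][w] + beta) / (nt[t] + vocab_size * beta)
        end
        # draw the token's new topic from that distribution
        u   = rng.rand * probs.sum
        acc = 0.0
        k   = num_topics - 1
        (0...num_topics).each do |t|
          acc += probs[t]
          if u < acc
            k = t
            break
          end
        end
        z[d][i] = k
        nd[d][k] += 1; nw[k][w] += 1; nt[k] += 1
      end
    end
  end
  nw # the learned topic-word counts, i.e. the "topics"
end

# four tiny "documents" over a 6-term vocabulary
docs = [[0, 0, 1, 1, 2], [3, 3, 4, 4, 5], [0, 1, 2, 0], [3, 4, 5, 5]]

run_a = toy_lda(docs, 2, 6, 50, 42)
run_b = toy_lda(docs, 2, 6, 50, 42) # same seed
puts(run_a == run_b)                # prints "true": identical topics
```

Different seeds generally land in different local optima, which would explain why repeated runs of the same pipeline can produce unrelated topic lists.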
I don't see anything wrong offhand. You might look at MAHOUT-399. I think we
are trying to review how LDA performs at the moment. From what I understand,
you aren't guaranteed the same results each time (I wonder if there is a way
to at least provide some sort of seed value so that one can reproduce a set
of results). At any rate, it's good that you put up detailed instructions of
what you did, so that we can compare them.

> is my number of topics, in this case 10, reasonable?
>
> is my approach of using the number of terms in the dictionary as the -v
> param to lda correct? (there is only one dictionary.file)
>
> finally, here's the contents of the seq-sparse directory; not sure if
> the file sizes suggest anything. the contents of the files look sane
> https://gist.github.com/4eb5d5a3a90a064dd612
>
> any thoughts most welcome, i'm happy to rerun using whatever
> suggestions people might have
>
> cheers!
> mat

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
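[Editor's note: one possible mechanism behind the -nr sensitivity, sketched under a loudly-stated assumption: if seq2sparse assembles its dictionary from per-reducer partitions (it does write dictionary.file-* chunks), then a term's integer id depends on which partition it falls into, which changes with the partition count. The `assign_ids` helper and its byte-sum partitioner below are hypothetical stand-ins, not Mahout's code.]

```ruby
# Hypothetical sketch: assign dictionary ids by concatenating per-reducer
# partitions. The partitioning function (byte sum mod reducer count) is
# made up for illustration -- it is not what Mahout actually does.

def assign_ids(terms, num_reducers)
  partitions = Array.new(num_reducers) { [] }
  terms.each { |t| partitions[t.bytes.sum % num_reducers] << t }
  ids = {}
  next_id = 0
  partitions.each do |part|       # concatenate partitions in index order
    part.sort.each do |t|
      ids[t] = next_id
      next_id += 1
    end
  end
  ids
end

terms  = %w[war cost iraq budget nude jews alex account income effort]
ids_40 = assign_ids(terms, 40)
ids_41 = assign_ids(terms, 41)
puts(ids_40 == ids_41) # prints "false": same terms, different ids
```

If something like this holds, the same term gets a different integer id for each -nr value, so even an identically-seeded LDA run would effectively start from a different initialisation — consistent with the different topics Mat sees.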
