On Apr 17, 2011, at 8:06 AM, Mat Kelcey wrote:

> hi all,
> 
> i'm kicking the tyres of lda by running it against the 2009 portion of
> the westbury usenet corpus http://bit.ly/eUejPa
> 
> here's what i'm doing, based heavily on the build-reuters example
> 
> 1) download the 2009 section of the corpus to hdfs 'corpus.raw'
> it's about 4.5e6 posts over 880e6 lines in 50 bzipped files
> 
> 2) pack into sequence files where each key is 0 and each value is a
> single usenet post
> hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
> -D mapred.reduce.tasks=0 \
> -input corpus.raw \
> -output corpus.seq \
> -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
> -mapper 'ruby one_article_per_line.rb' \
> -file one_article_per_line.rb
> 
> ( the one_article_per_line.rb script can be seen at
> https://gist.github.com/923435 )
> 
> 3) convert to sparse sequence format
> ./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 45
> 
> 4) check number of tokens in dictionary
> hadoop fs -text corpus.seq-sparse/dictionary.file-0 | wc -l
> 1654229
> 
> 5) run lda using the number of terms in the dictionary (plus a bit) as the number of terms
> ./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10
> -v 1700000 -ow -x 100
> (converges after only 4 iterations)
> 
> 6) dump the topics
> ./bin/mahout ldatopics -i corpus-lda/state-4 -d
> corpus.seq-sparse/dictionary.file-0 -dt sequencefile
> 
> the topics i end up with are pretty much all the same (some crazy rants)
> topic0: do our from have you like who i murder he would war zionist alex nazi
> topic1: god jews death what all war murder you know can america zionist our
> topic2: american alex murder he all like have i our us against justice america death
> topic3: alex your our i all murder 911 against who innocent can humanity have what
> 
> if i run again to convergence from step 3 using a subtly different number
> of reducers
> 
> ./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq
> -nr 40 # will this give different results?
> ./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10
> -v 1700000 -ow -x 100
> ./bin/mahout ldatopics -i corpus-lda/state-5 -d
> corpus.seq-sparse/dictionary.file-0 -dt sequencefile
> 
> i get the topics
> topic0: us what cost budgetary costs war total iraq so we do budget comes would
> topic1: account loss had were cost iraq trillion have should costs than execution do
> topic2: effort have item what cost war execution total us too also iraq billion difficult
> topic3: income what have trillion iraq execution costs we were victory up budgetary
> 
> and if i try yet another number of reducers i get yet another result
> (though it takes a lot longer to converge)
> 
> ./bin/mahout seq2sparse -i corpus.seq -o corpus.seq-sparse -wt tf -seq -nr 41
> ./bin/mahout lda -i corpus.seq-sparse/tf-vectors -o corpus-lda -k 10
> -v 1700000 -ow -x 100
> ./bin/mahout ldatopics -i corpus-lda/state-13 -d
> corpus.seq-sparse/dictionary.file-0 -dt sequencefile
> 
> topic0: sex nude mature free men women sexy beautiful
> topic1: nude sexy free men pics videos photos naked hot
> topic2: videos naked asian pictures women free beautiful nude
> topic3: sexy older hot naked mature pictures women photos
> 
> i expected each topic to be different & also expected that modifying
> the number of reducers would have no impact on what topics were
> found (?)
> 
> is my packing of the sequence file wrong for lda? i was following the
> reuters example of an entire email as a single value in the sequence
> file.

I don't see anything wrong offhand.  You might look at MAHOUT-399.  I think we 
are trying to review how LDA performs at the moment.  From what I understand, 
you aren't guaranteed the same results each time.  (I wonder if there is a way 
to at least provide some sort of seed value so that one can reproduce a set of 
results.)
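As a general illustration of what a seed would buy us (this is not a Mahout option, just the standard trick of seeding the PRNG behind a random initialization, sketched in Ruby; random_topic_init is a hypothetical stand-in, not a Mahout API):

```ruby
# Illustration only: a seeded PRNG makes an otherwise random topic
# initialization repeatable from run to run.  random_topic_init is a
# hypothetical stand-in, not a Mahout API.
def random_topic_init(num_terms, num_topics, seed)
  rng = Random.new(seed)
  # One row of random weights per topic, one column per term.
  Array.new(num_topics) { Array.new(num_terms) { rng.rand } }
end

# Same seed => identical starting state; a different seed => a
# different (but still reproducible) starting state.
a = random_topic_init(4, 2, 42)
b = random_topic_init(4, 2, 42)
```

With something like that in the driver, two runs with the same seed would start from identical states, which is what would make a set of results reproducible.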

At any rate, it's good that you put up detailed instructions of what you did, 
so that we can compare them.

> 
> is my number of topics, in this case 10, reasonable?
> 
> is my approach of using the number of terms in the dictionary as the
> -v param to lda correct? (there is only one dictionary.file)
> 
> finally here's the contents of the seq-sparse directory; not sure if
> the file sizes suggest anything. the contents of the files look sane
> https://gist.github.com/4eb5d5a3a90a064dd612
> 
> any thoughts most welcome, i'm happy to rerun using whatever
> suggestions people might have
> 
> cheers!
> mat

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
