I've been experimenting with Mahout's LDA algorithm. My corpus has around 80,000 small documents and roughly 45,000 terms. I was getting good results, but the algorithm takes too long to run: on every iteration the map phase takes around an hour, so with 10 iterations the job takes a little over 10 hours. I noticed that even though I'm running on a large HDFS cluster, each map phase runs in only a single mapper. The reduce phase runs on a large number of reducers, but even on a single reducer it finishes in under a minute, so in my case that part doesn't need to scale.
I'm running LDA through the CVB0Driver class. My parameters:

- numTopics = 50
- numTerms = number of unique terms seen across all documents
- alpha = 1 (originally I tried the default of 0.0001 for both alpha and eta)
- eta = 1

For everything else I'm just using the defaults. Is there any way to get the job to run faster (other than lowering the number of topics or terms)? Would the algorithm not work if it used more than one mapper?

Thanks for any help!
Vishnu
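P.S. In case it helps, this is essentially how I'm launching the job: CVB0Driver is run through Hadoop's ToolRunner with the arguments below. The wrapper class and the HDFS paths are just placeholders for illustration, and I'm writing the option names from memory of the cvb options, so they may be slightly off or differ between Mahout versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.lda.cvb.CVB0Driver;

public class RunCvb0 {
  public static void main(String[] args) throws Exception {
    // Placeholder HDFS paths; the real ones point at my vectorized corpus.
    String[] cvbArgs = {
        "-i",    "/tmp/corpus-matrix",         // term-count vectors, one row per document
        "-o",    "/tmp/lda-topic-term-model",  // topic/term model output
        "-dict", "/tmp/dictionary.file-0",     // term dictionary from vectorization
        "-dt",   "/tmp/lda-doc-topic-output",  // per-document topic distributions
        "-mt",   "/tmp/lda-model-temp",        // temp dir for intermediate models
        "-k",    "50",                         // numTopics
        "-nt",   "45000",                      // numTerms (unique terms across all documents)
        "-a",    "1",                          // alpha (default is 0.0001)
        "-e",    "1",                          // eta (default is 0.0001)
        "-x",    "10"                          // number of iterations
    };
    ToolRunner.run(new Configuration(), new CVB0Driver(), cvbArgs);
  }
}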
