I've been experimenting with Mahout's LDA algorithm. My corpus has around 80,000 small documents and roughly 45,000 terms. I was getting good results, but the algorithm takes too long to run: on every iteration the map phase takes around an hour, so with 10 iterations the job takes a little over 10 hours. I noticed that even though I'm running on a large HDFS cluster, each map phase runs in only a single mapper. The reduce phase runs on a large number of reducers, but even on a single reducer it finishes in under a minute, so in my case that part doesn't need to scale.
I'm running LDA through the CVB0Driver class. My parameters:

- numTopics = 50
- numTerms = number of unique terms seen across all documents
- alpha = 1 (originally I tried the default of 0.0001 for both alpha and eta)
- eta = 1

For everything else I'm just using the defaults. Is there any way to get the job to run faster (other than lowering the number of topics or terms)? Would the algorithm not work if it used more than one mapper?

Thanks for any help!
Vishnu
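P.S. In case it helps, this is essentially how I'm launching the job: CVB0Driver is run through Hadoop's ToolRunner with the arguments below. The wrapper class and the HDFS paths are just placeholders for illustration, and I'm writing the option names from memory of the cvb options, so they may be slightly off or differ between Mahout versions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.lda.cvb.CVB0Driver;

public class RunCvb0 {
  public static void main(String[] args) throws Exception {
    // Placeholder HDFS paths; the real ones point at my vectorized corpus.
    String[] cvbArgs = {
        "-i",    "/tmp/corpus-matrix",         // term-count vectors, one row per document
        "-o",    "/tmp/lda-topic-term-model",  // topic/term model output
        "-dict", "/tmp/dictionary.file-0",     // term dictionary from vectorization
        "-dt",   "/tmp/lda-doc-topic-output",  // per-document topic distributions
        "-mt",   "/tmp/lda-model-temp",        // temp dir for intermediate models
        "-k",    "50",                         // numTopics
        "-nt",   "45000",                      // numTerms (unique terms across all documents)
        "-a",    "1",                          // alpha (default is 0.0001)
        "-e",    "1",                          // eta (default is 0.0001)
        "-x",    "10"                          // number of iterations
    };
    ToolRunner.run(new Configuration(), new CVB0Driver(), cvbArgs);
  }
}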
