Hi Vishnu,

You can reduce the split size by setting Hadoop's mapred.max.split.size configuration parameter.
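A rough sketch of the arithmetic (the 1 GB corpus size and 64 MB split cap below are illustrative assumptions, not numbers from this thread):

```python
import math

# Assumed, for illustration only: a ~1 GB vectorized corpus,
# with mapred.max.split.size capped at 64 MB (67108864 bytes).
input_size = 1 * 1024**3   # total input bytes (assumption)
split_size = 64 * 1024**2  # mapred.max.split.size in bytes (assumption)

# Each split gets its own map task, so the mapper count is
# ceil(input size / split size):
num_mappers = math.ceil(input_size / split_size)
print(num_mappers)  # → 16
```

The parameter can usually be passed on the job command line with Hadoop's generic -D option, e.g. -Dmapred.max.split.size=67108864, assuming the driver is run through ToolRunner (as Mahout's job drivers are).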
The number of map tasks will then equal the number of splits (input size / split size).

Best

Sent from my iPhone

> On Dec 13, 2013, at 21:08, Vishnu Modi <[email protected]> wrote:
>
> I was experimenting with using Mahout's LDA algorithm. My corpus has around
> 80,000 small documents, and roughly 45,000 terms. I was getting good
> results, but the algorithm takes too long to run. On every iteration the
> mapper takes around an hour, so with 10 iterations it takes a little over
> 10 hours to run. I notice that even though I'm running on a large HDFS
> cluster, each mapper stage is run in only a single mapper. The reducer
> stage is run on a large number of reducers, but even if run on only one
> reducer it takes less than a minute, so in my case this part doesn't need
> scalability.
>
> I'm running LDA through the CVB0Driver class. My parameters:
>
> numTopics = 50
> numTerms = number of unique terms seen across all documents
> alpha = 1 (originally I tried the default, .0001, for alpha and eta)
> eta = 1
>
> For everything else I'm just using the defaults. Is it possible somehow to
> get the job to run faster (other than lowering the number of topics or
> terms)? Would the algorithm not work if it used more than 1 mapper?
>
> Thanks for any help!
> Vishnu
