Hi Vishnu,

You can reduce the split size by setting Hadoop's
mapred.max.split.size configuration parameter.

The number of map tasks will then equal the number of splits
(input size / split size), so a smaller split size yields more mappers.
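To illustrate the arithmetic (the sizes below are arbitrary example values,
not figures from your job):

```python
import math

# Sketch of how Hadoop derives the map-task count from the split size.
# One mapper is launched per input split.
input_size = 10 * 1024**3       # e.g. a 10 GiB input
max_split_size = 64 * 1024**2   # mapred.max.split.size set to 64 MiB

num_splits = math.ceil(input_size / max_split_size)
num_map_tasks = num_splits
print(num_map_tasks)  # 160
```

So lowering mapred.max.split.size from the HDFS block-size default to a
smaller value should spread the per-iteration mapper work across the cluster.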

Best
Sent from my iPhone

> On Dec 13, 2013, at 21:08, Vishnu Modi <[email protected]> wrote:
>
> I was experimenting with using Mahout's LDA algorithm. My corpus has around
> 80,000 small documents, and roughly 45,000 terms. I was getting good
> results, but the algorithm takes too long to run. On every iteration the
> mapper takes around an hour, so with 10 iterations it takes a little over
> 10 hours to run. I notice that even though I'm running on a large hdfs
> cluster, each mapper stage is run in only a single mapper. The reducer
> stage is run on a large number of reducers, but even if run on only one
> reducer it takes less than a minute so in my case this part doesn't need
> scalability.
>
> I'm running LDA through the CVB0Driver class. My parameters:
>
> numTopics = 50
> numTerms = number of unique terms seen across all documents
> alpha = 1 (originally I tried the default, .0001 for alpha and eta)
> eta = 1
>
> For everything else I'm just using the defaults. Is it possible somehow to
> get the job to run faster (other than lowering the number of topics or
> terms)? Would the algorithm not work if it used more than 1 mapper?
>
> Thanks for any help!
> Vishnu
