I suggest taking a look at my blog post:
http://bickson.blogspot.com/2011/03/tunning-hadoop-configuration-for-high.html

There could be many potential reasons, among them:
* Premature timeouts, which make tasks fail just before they would have finished.
* A bad setting for the number of mappers/reducers: too few or too many may slow things down significantly.
* Compression: using it may speed up disk access.

I think you will have to "get your hands dirty" by analyzing the logs and finding out what slows you down.
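As a rough starting point for the three items above (timeouts, mapper/reducer counts, compression), here is a minimal sketch that renders a few Hadoop 0.20-era properties as -D arguments. The property names are standard ones from that release line; the values are placeholders I made up, not recommendations, and the helper to_d_flags is hypothetical. It also assumes your job driver accepts generic -D options, which is worth checking for your setup.

```python
# Placeholder values only: tune each one against what your own job logs show.
hadoop_props = {
    "mapred.task.timeout": 1800000,        # ms; assumed 30 min instead of the 10 min default
    "mapred.reduce.tasks": 20,             # assumed: roughly one reducer per machine
    "mapred.compress.map.output": "true",  # compress intermediate map output
}


def to_d_flags(props):
    """Render properties as -Dkey=value arguments for a Hadoop job command line."""
    return " ".join(f"-D{key}={value}" for key, value in props.items())


flags = to_d_flags(hadoop_props)
print(flags)
```

Change one setting at a time and compare the per-stage time, otherwise it is hard to tell which change helped.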
On Tue, Sep 6, 2011 at 10:35 AM, Chris Lu <[email protected]> wrote:
> Hi,
>
> I am running LDA on 18k documents, each document has 5k terms. total 300k
> terms. Topics is set to 100.
>
> Running LDA on Hadoop single node configuration takes about 5 hours per
> stage. And 20 stages would take 100 hours.
>
> However, given 20 machines, running on Amazon EMR is actually much much
> slower. It takes 1000 minutes per stage. (It takes about 10 minutes for 1%
> mapping progress.) Reducing is much faster is counted in seconds, almost
> neglect-able.
>
> Does anyone has similar experience or my setup is wrong?
>
> Chris
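
A quick sanity check on the numbers quoted above, as a small sketch. It uses only the figures from the message (5 hours per stage on one node, 10 minutes per 1% of map progress on EMR, 20 stages) and ignores reduce time, which was reported as negligible. The variable names are mine.

```python
STAGES = 20

single_node_min = 5 * 60   # 5 hours per stage on the single node, in minutes
emr_min = 10 * 100         # 10 minutes per 1% of map progress, times 100%

slowdown = emr_min / single_node_min           # how much slower one EMR stage is
single_total_h = single_node_min * STAGES / 60
emr_total_h = emr_min * STAGES / 60

print(f"per stage: {single_node_min} min (single node) vs {emr_min} min (EMR)")
print(f"EMR is about {slowdown:.1f}x slower per stage")
print(f"all {STAGES} stages: {single_total_h:.0f} h vs about {emr_total_h:.0f} h")
```

So the 20-machine cluster is not merely failing to speed things up; each stage is more than three times slower than on one node, which points at configuration or task failures rather than at the algorithm itself.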
