I think that Sean and Danny are right. You need to give a bit more information.

- what kinds of machines are you using in the single-node case and the cluster case?
- did you actually complete a stage with the EMR cluster?
- did you have any task failures?
- were your machines swapping?
- what was CPU usage?
- what was network usage?
- how much data was registered as having been read? Was that reasonable?

On Tue, Sep 6, 2011 at 3:11 AM, Sean Owen <[email protected]> wrote:

> Running on a real cluster increases the amount of work done, and
> significantly, as compared to one node: now, data actually has to be
> transferred on/off the machine!
>
> Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not sure
> what instance type you are using but I got better mileage when I used larger
> instances (and more of my own workers per instance, of course; it does that
> for you too).
>
> You may have trouble correctly extrapolating from the time it takes to hit
> 1% as there are setup costs as the instances spin up. Try letting it run a
> bit more to see how fast it really seems to go.
>
> Are you saying you extrapolate that it would take 1 EMR machine 1000 minutes
> to finish? That sounds quite reasonable compared to 300 minutes locally. If
> you mean the whole 20 machines are taking 1000 minutes to finish, that sounds
> quite bad.
>
> On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu <[email protected]> wrote:
>
> > Hi,
> >
> > I am running LDA on 18k documents; each document has 5k terms, with 300k
> > terms in total. The number of topics is set to 100.
> >
> > Running LDA on a Hadoop single-node configuration takes about 5 hours per
> > stage, so 20 stages would take 100 hours.
> >
> > However, given 20 machines, running on Amazon EMR is actually much, much
> > slower. It takes 1000 minutes per stage. (It takes about 10 minutes for 1%
> > mapping progress.) Reducing is much faster; it is counted in seconds and is
> > almost negligible.
> >
> > Does anyone have similar experience, or is my setup wrong?
> >
> > Chris
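
For reference, the figures quoted in Chris's message work out as in the rough sketch below. It only restates the arithmetic from the thread. It assumes map progress is linear in time, which Sean notes may not hold because of instance setup costs, and the variable and function names are illustrative, not taken from Hadoop or Mahout.

```python
# Figures quoted in the thread; all names here are illustrative only.
LOCAL_MIN_PER_STAGE = 5 * 60   # "about 5 hours per stage" on a single node
EMR_MIN_PER_PERCENT = 10       # "about 10 minutes for 1% mapping progress"
STAGES = 20                    # number of stages mentioned in the thread


def extrapolate_stage_minutes(minutes_per_percent: float) -> float:
    """Linearly extrapolate from time-per-1% to a full map phase.

    Assumption: progress is linear, so early setup costs will make
    this an overestimate if only the first 1% was measured.
    """
    return minutes_per_percent * 100


emr_min_per_stage = extrapolate_stage_minutes(EMR_MIN_PER_PERCENT)
local_total_hours = LOCAL_MIN_PER_STAGE * STAGES / 60
emr_total_hours = emr_min_per_stage * STAGES / 60
slowdown = emr_min_per_stage / LOCAL_MIN_PER_STAGE

print(f"EMR, extrapolated: {emr_min_per_stage:.0f} min per stage")
print(f"Local total:       {local_total_hours:.0f} hours for {STAGES} stages")
print(f"EMR total:         {emr_total_hours:.0f} hours for {STAGES} stages")
print(f"EMR / local:       {slowdown:.1f}x slower per stage")
```

Measuring the time for a later slice of progress (say from 5% to 10%) and feeding that into the same extrapolation would show how much of the first 1% was startup overhead.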
