Re: LDA on single node is much faster than 20 nodes

Ted Dunning Tue, 06 Sep 2011 08:40:01 -0700

I think that Sean and Danny are right.  You need to give a bit more
information.


- What kinds of machines did you use in the single-node case and in the
  cluster case?

- Did you actually complete a stage with the EMR cluster?

- Did you have any task failures?

- Were your machines swapping?

- What was the CPU usage?

- What was the network usage?

- How much data was registered as having been read?  Was that reasonable?
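
For reference, here is the arithmetic behind the figures quoted below, as a
minimal sketch. The numbers come from Chris's message; the variable names
are only illustrative, and the linear extrapolation from 1% of map progress
is an assumption that ignores instance startup costs.

```python
# Figures taken from the quoted messages; names are illustrative only.
single_node_stage_min = 5 * 60   # about 5 hours per stage on a single node
emr_min_per_percent = 10         # about 10 minutes per 1% of map progress
emr_nodes = 20
stages = 20

# Linear extrapolation from 1% to 100% (assumes no one-off setup cost).
emr_stage_min = emr_min_per_percent * 100

# How much slower the 20-node cluster appears to be per stage.
slowdown = emr_stage_min / single_node_stage_min

# Idealized per-stage time with perfect linear scaling, a lower bound
# that a real cluster will not reach because data has to move over the network.
ideal_stage_min = single_node_stage_min / emr_nodes

# Whole-job totals in hours.
single_node_total_h = single_node_stage_min * stages / 60
emr_total_h = emr_stage_min * stages / 60

print(f"EMR stage (extrapolated): {emr_stage_min} min")
print(f"Apparent slowdown vs. single node: {slowdown:.1f}x")
print(f"Ideal 20-node stage time: {ideal_stage_min:.0f} min")
print(f"Total: single node {single_node_total_h:.0f} h, EMR {emr_total_h:.0f} h")
```

The gap between the extrapolated stage time and the idealized one is what
the questions above are meant to help explain.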

On Tue, Sep 6, 2011 at 3:11 AM, Sean Owen <[email protected]> wrote:

> Running on a real cluster increases the amount of work done, and
> significantly, as compared to one node: now, data actually has to be
> transferred on/off the machine!
>
> Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not sure
> what instance type you are using but I got better mileage when I used larger
> instances (and more of my own workers per instance, of course; it does that
> for you too).
>
> You may have trouble correctly extrapolating from the time it takes to hit
> 1%, as there are setup costs as the instances spin up. Try letting it run a
> bit more to see how fast it really seems to go.
>
> Are you saying you extrapolate that it would take 1 EMR machine 1000 minutes
> to finish? That sounds quite reasonable compared to 300 minutes locally. If
> you mean the whole 20 machines is taking 1000 minutes to finish, that sounds
> quite bad.
>
>
> On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu <[email protected]> wrote:
>
> > Hi,
> >
> > I am running LDA on 18k documents. Each document has 5k terms, with 300k
> > terms in total. The number of topics is set to 100.
> >
> > Running LDA on Hadoop single node configuration takes about 5 hours per
> > stage. And 20 stages would take 100 hours.
> >
> > However, given 20 machines, running on Amazon EMR is actually much, much
> > slower. It takes 1000 minutes per stage. (It takes about 10 minutes for 1%
> > of mapping progress.) Reducing is much faster: it is counted in seconds,
> > almost negligible.
> >
> > Does anyone have similar experience, or is my setup wrong?
> >
> > Chris
> >
> >
>
