Thanks for all the suggestions!
All the inputs are the same. It took 85 hours to finish 4 iterations on 20
Amazon Small machines, while on my local single node it reached iteration 19
in the same 85 hours.
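That works out to roughly 85/4 ≈ 21 hours per iteration on the 20-machine
cluster versus 85/19 ≈ 4.5 hours per iteration locally, so the cluster is
about 5x slower per iteration.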
Here is a section of the Amazon log output. It covers the start of
iteration 1 and the transition between iterations 4 and 5.
The number of map tasks is set to 2. Should it be larger, or tied to the
number of CPU cores?
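If it helps, here is the kind of change I am considering to get more
mappers. This is only a sketch based on my understanding of the 0.20-era
split settings; the class and method names are just illustrative, and I
have not verified the property name against the Hadoop version EMR runs:

import org.apache.hadoop.conf.Configuration;

public class MoreMapsSketch {
    // With the new-API FileInputFormat shown in the log
    // (org.apache.hadoop.mapreduce.lib.input.FileInputFormat), the real
    // number of map tasks comes from the input splits; the "Default
    // number of map tasks: 2" line is only a hint, and the iteration-4
    // counters below show Launched map tasks=1. Capping the split size
    // should make a single 47M input file produce many splits, and
    // therefore more map tasks to spread across the cluster.
    public static Configuration smallerSplits() {
        Configuration conf = new Configuration();
        // ~4 MB per split -> about 12 map tasks for a 47M file. The
        // same effect should be possible from the command line with
        // -Dmapred.max.split.size=4194304, assuming the driver goes
        // through ToolRunner.
        conf.setLong("mapred.max.split.size", 4L * 1024 * 1024);
        return conf;
    }
}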
- what kinds of machines in the single case and the cluster case?
Amazon Small type.
- did you actually complete a stage with the EMR cluster?
Yes.
- did you have any task failures?
Everything seems ok.
- were your machines swapping?
When run on my local machine with a 1G max heap size, there is no memory
swapping. Amazon's Small type has 1.7G of RAM.
- what was CPU usage?
I cannot tell myself; I have not been monitoring the nodes (see my note
after this list).
- what was network usage?
I cannot tell either. The mapping process should not use much network;
however, it is really slow.
- how much data was registered as having been read? Was that reasonable?
The input size is 47M, stored on S3. (The counters at the end of iteration
4 below show S3N_BYTES_READ=732,125,802, which is far more than 47M;
presumably the model state from the previous iteration is read as well.)
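To get real CPU, swap, and network numbers I will try SSHing into one of
the task nodes and watching vmstat and top while the mappers run. If I
understand correctly, EMR also offers a Ganglia bootstrap action that
graphs these cluster-wide, though I have not tried it.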
2011-09-02 21:40:42,905 INFO org.apache.mahout.clustering.lda.LDADriver
(main): LDA Iteration 1
2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of map tasks: 2
2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of reduce tasks: 33
2011-09-02 21:40:46,081 INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total
input paths to process : 1
2011-09-02 21:40:50,199 INFO org.apache.hadoop.mapred.JobClient (main):
Running job: job_201109022131_0001
2011-09-02 21:40:51,208 INFO org.apache.hadoop.mapred.JobClient (main):
map 0% reduce 0%
2011-09-02 22:00:25,591 INFO org.apache.hadoop.mapred.JobClient (main):
map 1% reduce 0%
2011-09-02 22:17:37,315 INFO org.apache.hadoop.mapred.JobClient (main):
map 2% reduce 0%
2011-09-02 22:30:36,622 INFO org.apache.hadoop.mapred.JobClient (main):
map 3% reduce 0%
2011-09-02 22:42:44,128 INFO org.apache.hadoop.mapred.JobClient (main):
map 4% reduce 0%
2011-09-02 22:57:20,817 INFO org.apache.hadoop.mapred.JobClient (main):
map 5% reduce 0%
2011-09-02 23:11:20,329 INFO org.apache.hadoop.mapred.JobClient (main):
map 6% reduce 0%
2011-09-02 23:24:54,832 INFO org.apache.hadoop.mapred.JobClient (main):
map 7% reduce 0%
2011-09-02 23:37:50,205 INFO org.apache.hadoop.mapred.JobClient (main):
map 8% reduce 0%
[... similar map/reduce progress lines, from iteration 1 through iteration
4, elided ...]
2011-09-06 16:50:35,471 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 91%
2011-09-06 16:50:36,499 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 92%
2011-09-06 16:50:37,503 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 94%
2011-09-06 16:50:39,511 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 96%
2011-09-06 16:50:53,576 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 97%
2011-09-06 16:51:15,665 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 99%
2011-09-06 16:51:58,848 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 100%
2011-09-06 16:52:42,039 INFO org.apache.hadoop.mapred.JobClient (main):
Job complete: job_201109022131_0004
2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient (main):
Counters: 17
2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient
(main): Job Counters
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Launched reduce tasks=33
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Rack-local map tasks=1
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Launched map tasks=1
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): FileSystemCounters
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): S3N_BYTES_READ=732125802
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): FILE_BYTES_READ=27958301147
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): S3N_BYTES_WRITTEN=684537903
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): FILE_BYTES_WRITTEN=26164696939
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map-Reduce Framework
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce input groups=28242101
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Combine output records=851254845
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map input records=18285
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce shuffle bytes=334153034
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce output records=28242101
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Spilled Records=2093587887
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map output bytes=19571033360
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Combine input records=2046202329
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map output records=1223189585
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce input records=28242101
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): Iteration 4 finished. Log Likelihood: -2.738323998517175E8
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): (Old LL: -2.810158757091537E8)
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): (Rel Change: 0.02556252681208305)
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): LDA Iteration 5
2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of map tasks: 2
2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of reduce tasks: 33
2011-09-06 16:54:04,751 INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total
input paths to process : 1
2011-09-06 16:54:05,829 INFO org.apache.hadoop.mapred.JobClient (main):
Running job: job_201109022131_0005
2011-09-06 16:54:06,835 INFO org.apache.hadoop.mapred.JobClient (main):
map 0% reduce 0%
2011-09-06 17:08:40,517 INFO org.apache.hadoop.mapred.JobClient (main):
map 1% reduce 0%
2011-09-06 17:19:20,825 INFO org.apache.hadoop.mapred.JobClient (main):
map 2% reduce 0%
On 09/06/2011 08:39 AM, Ted Dunning wrote:
I think that Sean and Danny are right. You need to give a bit more
information.
- what kinds of machines in the single case and the cluster case?
- did you actually complete a stage with the EMR cluster?
- did you have any task failures?
- were your machines swapping?
- what was CPU usage?
- what was network usage?
- how much data was registered as having been read? Was that reasonable?
On Tue, Sep 6, 2011 at 3:11 AM, Sean Owen <[email protected]> wrote:
Running on a real cluster significantly increases the amount of work done
compared to one node: now data actually has to be transferred on and off
the machines!
Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not
sure what instance type you are using, but I got better mileage when I
used larger instances (and more of my own workers per instance, of course;
it does that for you too).
You may have trouble correctly extrapolating from the time it takes to hit
1%, as there are setup costs while the instances spin up. Try letting it
run a bit longer to see how fast it really goes.
Are you saying you extrapolate that it would take 1 EMR machine 1000
minutes to finish? That sounds quite reasonable compared to 300 minutes
locally. If you mean the whole 20-machine cluster is taking 1000 minutes
to finish, that sounds quite bad.
On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu <[email protected]> wrote:
Hi,
I am running LDA on 18k documents; each document has about 5k terms, and
there are 300k terms in total. The number of topics is set to 100.
Running LDA in a single-node Hadoop configuration takes about 5 hours per
stage, so 20 stages would take 100 hours.
However, given 20 machines, running on Amazon EMR is actually much, much
slower: it takes 1000 minutes per stage (about 10 minutes for each 1% of
mapping progress). Reducing is much faster, measured in seconds, almost
negligible.
Does anyone have similar experience, or is my setup wrong?
Chris