Thanks for all the suggestions!
All the inputs are the same. It took 85 hours to finish 4 iterations on 20
Amazon Small machines, while on my local single node it reached iteration 19
in the same 85 hours.
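That works out to roughly 85/4 ≈ 21 hours per iteration on the 20-machine
cluster versus 85/19 ≈ 4.5 hours per iteration locally, so the cluster is
about 5x slower per iteration.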
Here is a section of the Amazon log output. It covers the start of
iteration 1 and the transition between iterations 4 and 5.
The number of map tasks is set to 2. Should it be larger, or tied to the
number of CPU cores?
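If it helps, here is the kind of change I am considering to get more
mappers. This is only a sketch based on my understanding of the 0.20-era
split settings; the class and method names are just illustrative, and I
have not verified the property name against the Hadoop version EMR runs:

import org.apache.hadoop.conf.Configuration;

public class MoreMapsSketch {
    // With the new-API FileInputFormat shown in the log
    // (org.apache.hadoop.mapreduce.lib.input.FileInputFormat), the real
    // number of map tasks comes from the input splits; the "Default
    // number of map tasks: 2" line is only a hint, and the iteration-4
    // counters below show Launched map tasks=1. Capping the split size
    // should make a single 47M input file produce many splits, and
    // therefore more map tasks to spread across the cluster.
    public static Configuration smallerSplits() {
        Configuration conf = new Configuration();
        // ~4 MB per split -> about 12 map tasks for a 47M file. The
        // same effect should be possible from the command line with
        // -Dmapred.max.split.size=4194304, assuming the driver goes
        // through ToolRunner.
        conf.setLong("mapred.max.split.size", 4L * 1024 * 1024);
        return conf;
    }
}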
- what kinds of machines in the single case and the cluster case?
Amazon Small type.
- did you actually complete a stage with the EMR cluster?
Yes.
- did you have any task failures?
Everything seems ok.
- were your machines swapping?
When run on my local machine with a 1G max heap size, there is no memory
swapping. Amazon's Small type has 1.7G of RAM.
- what was CPU usage?
I cannot tell myself; I have not been monitoring the nodes (see my note
after this list).
- what was network usage?
I cannot tell either. The mapping process should not use much network;
however, it is really slow.
- how much data was registered as having been read? Was that reasonable?
The input size is 47M, stored on S3. (The counters at the end of iteration
4 below show S3N_BYTES_READ=732,125,802, which is far more than 47M;
presumably the model state from the previous iteration is read as well.)
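To get real CPU, swap, and network numbers I will try SSHing into one of
the task nodes and watching vmstat and top while the mappers run. If I
understand correctly, EMR also offers a Ganglia bootstrap action that
graphs these cluster-wide, though I have not tried it.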
2011-09-02 21:40:42,905 INFO org.apache.mahout.clustering.lda.LDADriver
(main): LDA Iteration 1
2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of map tasks: 2
2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of reduce tasks: 33
2011-09-02 21:40:46,081 INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total
input paths to process : 1
2011-09-02 21:40:50,199 INFO org.apache.hadoop.mapred.JobClient (main):
Running job: job_201109022131_0001
2011-09-02 21:40:51,208 INFO org.apache.hadoop.mapred.JobClient (main):
map 0% reduce 0%
2011-09-02 22:00:25,591 INFO org.apache.hadoop.mapred.JobClient (main):
map 1% reduce 0%
2011-09-02 22:17:37,315 INFO org.apache.hadoop.mapred.JobClient (main):
map 2% reduce 0%
2011-09-02 22:30:36,622 INFO org.apache.hadoop.mapred.JobClient (main):
map 3% reduce 0%
2011-09-02 22:42:44,128 INFO org.apache.hadoop.mapred.JobClient (main):
map 4% reduce 0%
2011-09-02 22:57:20,817 INFO org.apache.hadoop.mapred.JobClient (main):
map 5% reduce 0%
2011-09-02 23:11:20,329 INFO org.apache.hadoop.mapred.JobClient (main):
map 6% reduce 0%
2011-09-02 23:24:54,832 INFO org.apache.hadoop.mapred.JobClient (main):
map 7% reduce 0%
2011-09-02 23:37:50,205 INFO org.apache.hadoop.mapred.JobClient (main):
map 8% reduce 0%
[... similar map/reduce progress lines, from iteration 1 through iteration
4, elided ...]
2011-09-06 16:50:35,471 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 91%
2011-09-06 16:50:36,499 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 92%
2011-09-06 16:50:37,503 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 94%
2011-09-06 16:50:39,511 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 96%
2011-09-06 16:50:53,576 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 97%
2011-09-06 16:51:15,665 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 99%
2011-09-06 16:51:58,848 INFO org.apache.hadoop.mapred.JobClient (main):
map 100% reduce 100%
2011-09-06 16:52:42,039 INFO org.apache.hadoop.mapred.JobClient (main):
Job complete: job_201109022131_0004
2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient (main):
Counters: 17
2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient
(main): Job Counters
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Launched reduce tasks=33
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Rack-local map tasks=1
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Launched map tasks=1
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): FileSystemCounters
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): S3N_BYTES_READ=732125802
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): FILE_BYTES_READ=27958301147
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): S3N_BYTES_WRITTEN=684537903
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): FILE_BYTES_WRITTEN=26164696939
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map-Reduce Framework
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce input groups=28242101
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Combine output records=851254845
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map input records=18285
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce shuffle bytes=334153034
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce output records=28242101
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Spilled Records=2093587887
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map output bytes=19571033360
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Combine input records=2046202329
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Map output records=1223189585
2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient
(main): Reduce input records=28242101
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): Iteration 4 finished. Log Likelihood: -2.738323998517175E8
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): (Old LL: -2.810158757091537E8)
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): (Rel Change: 0.02556252681208305)
2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver
(main): LDA Iteration 5
2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of map tasks: 2
2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main):
Default number of reduce tasks: 33
2011-09-06 16:54:04,751 INFO
org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total
input paths to process : 1
2011-09-06 16:54:05,829 INFO org.apache.hadoop.mapred.JobClient (main):
Running job: job_201109022131_0005
2011-09-06 16:54:06,835 INFO org.apache.hadoop.mapred.JobClient (main):
map 0% reduce 0%
2011-09-06 17:08:40,517 INFO org.apache.hadoop.mapred.JobClient (main):
map 1% reduce 0%
2011-09-06 17:19:20,825 INFO org.apache.hadoop.mapred.JobClient (main):
map 2% reduce 0%
On 09/06/2011 08:39 AM, Ted Dunning wrote:
I think that Sean and Danny are right. You need to give a bit more
information.
- what kinds of machines in the single case and the cluster case?
- did you actually complete a stage with the EMR cluster?
- did you have any task failures?
- were your machines swapping?
- what was CPU usage?
- what was network usage?
- how much data was registered as having been read? Was that reasonable?
On Tue, Sep 6, 2011 at 3:11 AM, Sean Owen <[email protected]> wrote:
Running on a real cluster significantly increases the amount of work done
compared to one node: now data actually has to be transferred on and off
the machines!
Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not
sure what instance type you are using, but I got better mileage when I
used larger instances (and more of my own workers per instance, of course;
it does that for you too).
You may have trouble correctly extrapolating from the time it takes to hit
1%, as there are setup costs while the instances spin up. Try letting it
run a bit longer to see how fast it really goes.
Are you saying you extrapolate that it would take 1 EMR machine 1000
minutes to finish? That sounds quite reasonable compared to 300 minutes
locally. If you mean the whole 20-machine cluster is taking 1000 minutes
to finish, that sounds quite bad.
On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu <[email protected]> wrote:
Hi,
I am running LDA on 18k documents; each document has about 5k terms, and
there are 300k terms in total. The number of topics is set to 100.
Running LDA in a single-node Hadoop configuration takes about 5 hours per
stage, so 20 stages would take 100 hours.
However, given 20 machines, running on Amazon EMR is actually much, much
slower: it takes 1000 minutes per stage (about 10 minutes for each 1% of
mapping progress). Reducing is much faster, measured in seconds, almost
negligible.
Does anyone have similar experience, or is my setup wrong?
Chris