Don't use EC2 small instances. A few medium or large machines give much better performance and end up costing less than a lot of small instances.
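A rough back-of-the-envelope comparison of aggregate resources illustrates the point. The instance specs below are assumptions from the 2011-era EC2 lineup (m1.small roughly 1 ECU / 1.7 GB, m1.large roughly 4 ECU / 7.5 GB), not figures from this thread:

```python
# Hypothetical fleet comparison; instance specs are assumptions
# (2011-era EC2): m1.small ~1 ECU / 1.7 GB, m1.large ~4 ECU / 7.5 GB.
small = {"ecu": 1, "ram_gb": 1.7}
large = {"ecu": 4, "ram_gb": 7.5}

fleet_small = {k: 20 * v for k, v in small.items()}  # 20 x m1.small
fleet_large = {k: 5 * v for k, v in large.items()}   # 5 x m1.large

# Same aggregate compute (~20 ECU), but the large fleet has more RAM
# per task slot and a quarter as many nodes shuffling over the network.
print(fleet_small)
print(fleet_large)
```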
--sebastian

On 06.09.2011 20:02, Chris Lu wrote:
> Thanks for all the suggestions!
>
> All the inputs are the same. It takes 85 hours for 4 iterations on 20
> Amazon small machines. On my local single node, it got to iteration 19
> in the same 85 hours.
>
> Here is a section of the Amazon log output. It covers the start of
> iteration 1, and the span between iteration 4 and iteration 5.
>
> The number of map tasks is set to 2. Should it be larger, or tied to
> the number of CPU cores?
>
> - what kinds of machines in the single case and the cluster case?
>
> Amazon Small type.
>
> - did you actually complete a stage with the EMR cluster?
>
> Yes.
>
> - did you have any task failures?
>
> Everything seems OK.
>
> - were your machines swapping?
>
> When run on the local machine with a 1G max heap size, there is no
> swapping. Amazon's Small type has 1.7G.
>
> - what was CPU usage?
>
> I cannot tell myself.
>
> - what was network usage?
>
> I cannot tell. The mapping process should not use much network.
> However, it's really slow.
>
> - how much data was registered as having been read? Was that reasonable?
>
> The input size is 47M, stored on S3.
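On the map-task question: in Hadoop 0.20-era MapReduce, `mapred.map.tasks` is only a hint; the actual number of mappers comes from the input splits. A single ~47 MB file under the default split size (assumed here to be 64 MB) yields exactly one mapper no matter how many machines are in the cluster, which matches the `Launched map tasks=1` counter in the log below. A minimal sketch of that arithmetic:

```python
import math

def expected_mappers(input_bytes, split_bytes=64 * 1024 * 1024):
    """Rough mapper count for one splittable input file: one mapper per
    input split (mapred.map.tasks is only a hint, not a hard setting)."""
    return max(1, math.ceil(input_bytes / split_bytes))

# Chris's input: a single ~47 MB file.
print(expected_mappers(47 * 1024 * 1024))                    # 1 mapper
# With a hypothetical 4 MB split size the work could spread out:
print(expected_mappers(47 * 1024 * 1024, 4 * 1024 * 1024))   # 12 mappers
```

So the cluster's 20 machines cannot help the map phase until the input is split more finely (or written as several files).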
> 2011-09-02 21:40:42,905 INFO org.apache.mahout.clustering.lda.LDADriver (main): LDA Iteration 1
> 2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
> 2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 33
> 2011-09-02 21:40:46,081 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 1
> 2011-09-02 21:40:50,199 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201109022131_0001
> 2011-09-02 21:40:51,208 INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
> 2011-09-02 22:00:25,591 INFO org.apache.hadoop.mapred.JobClient (main): map 1% reduce 0%
> 2011-09-02 22:17:37,315 INFO org.apache.hadoop.mapred.JobClient (main): map 2% reduce 0%
> 2011-09-02 22:30:36,622 INFO org.apache.hadoop.mapred.JobClient (main): map 3% reduce 0%
> 2011-09-02 22:42:44,128 INFO org.apache.hadoop.mapred.JobClient (main): map 4% reduce 0%
> 2011-09-02 22:57:20,817 INFO org.apache.hadoop.mapred.JobClient (main): map 5% reduce 0%
> 2011-09-02 23:11:20,329 INFO org.apache.hadoop.mapred.JobClient (main): map 6% reduce 0%
> 2011-09-02 23:24:54,832 INFO org.apache.hadoop.mapred.JobClient (main): map 7% reduce 0%
> 2011-09-02 23:37:50,205 INFO org.apache.hadoop.mapred.JobClient (main): map 8% reduce 0%
> ...............
> 2011-09-06 16:50:35,471 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 91%
> 2011-09-06 16:50:36,499 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 92%
> 2011-09-06 16:50:37,503 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 94%
> 2011-09-06 16:50:39,511 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 96%
> 2011-09-06 16:50:53,576 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 97%
> 2011-09-06 16:51:15,665 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 99%
> 2011-09-06 16:51:58,848 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 100%
> 2011-09-06 16:52:42,039 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201109022131_0004
> 2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 17
> 2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Launched reduce tasks=33
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Rack-local map tasks=1
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Launched map tasks=1
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     S3N_BYTES_READ=732125802
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_READ=27958301147
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     S3N_BYTES_WRITTEN=684537903
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_WRITTEN=26164696939
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input groups=28242101
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Combine output records=851254845
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Map input records=18285
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce shuffle bytes=334153034
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce output records=28242101
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Spilled Records=2093587887
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Map output bytes=19571033360
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Combine input records=2046202329
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=1223189585
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input records=28242101
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): Iteration 4 finished.
> Log Likelihood: -2.738323998517175E8
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): (Old LL: -2.810158757091537E8)
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): (Rel Change: 0.02556252681208305)
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): LDA Iteration 5
> 2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
> 2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 33
> 2011-09-06 16:54:04,751 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 1
> 2011-09-06 16:54:05,829 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201109022131_0005
> 2011-09-06 16:54:06,835 INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
> 2011-09-06 17:08:40,517 INFO org.apache.hadoop.mapred.JobClient (main): map 1% reduce 0%
> 2011-09-06 17:19:20,825 INFO org.apache.hadoop.mapred.JobClient (main): map 2% reduce 0%
>
>
> On 09/06/2011 08:39 AM, Ted Dunning wrote:
>> I think that Sean and Danny are right. You need to give a bit more
>> information.
>>
>> - what kinds of machines in the single case and the cluster case?
>>
>> - did you actually complete a stage with the EMR cluster?
>>
>> - did you have any task failures?
>>
>> - were your machines swapping?
>>
>> - what was CPU usage?
>>
>> - what was network usage?
>>
>> - how much data was registered as having been read? Was that reasonable?
>>
>> On Tue, Sep 6, 2011 at 3:11 AM, Sean Owen <[email protected]> wrote:
>>
>>> Running on a real cluster increases the amount of work done, and
>>> significantly so, compared to one node: now data actually has to be
>>> transferred on and off each machine!
>>>
>>> Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not
>>> sure what instance type you are using, but I got better mileage when I
>>> used larger instances (and more of my own workers per instance, of
>>> course; it does that for you too).
>>>
>>> You may have trouble extrapolating correctly from the time it takes to
>>> hit 1%, as there are setup costs while the instances spin up. Try
>>> letting it run a bit longer to see how fast it really goes.
>>>
>>> Are you saying you extrapolate that it would take 1 EMR machine 1000
>>> minutes to finish? That sounds quite reasonable compared to 300 minutes
>>> locally. If you mean the whole 20 machines are taking 1000 minutes to
>>> finish, that sounds quite bad.
>>>
>>>
>>> On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am running LDA on 18k documents; each document has 5k terms, and
>>>> there are 300k terms in total. The number of topics is set to 100.
>>>>
>>>> Running LDA on a single-node Hadoop configuration takes about 5 hours
>>>> per stage, so 20 stages would take 100 hours.
>>>>
>>>> However, given 20 machines, running on Amazon EMR is actually much,
>>>> much slower: it takes 1000 minutes per stage. (It takes about 10
>>>> minutes for each 1% of mapping progress.) Reducing is much faster; it
>>>> is counted in seconds, almost negligible.
>>>>
>>>> Does anyone have similar experience, or is my setup wrong?
>>>>
>>>> Chris
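Two figures quoted in this thread can be sanity-checked in a few lines. The relative-change formula below is an assumption, (oldLL - LL) / oldLL, chosen because it reproduces the logged value; the extrapolation just restates Chris's "10 minutes per 1% of mapping" observation:

```python
# Log-likelihood values from the iteration-4 log above.
old_ll = -2.810158757091537e8
new_ll = -2.738323998517175e8

# Assumed relative-change formula; matches the logged "Rel Change".
rel_change = (old_ll - new_ll) / old_ll
print(rel_change)  # ~0.0255625

# Stage-time extrapolation: ~10 minutes per 1% of mapping progress.
minutes_per_percent = 10
stage_minutes = minutes_per_percent * 100
print(stage_minutes)  # 1000 minutes per EMR stage, vs ~300 minutes (5 h) locally
```

The 1000-minute figure is for the whole 20-machine cluster, which is why the thread treats it as a red flag rather than a reasonable per-machine cost.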
