Don't use EC2 small instances. A few medium or large machines give much better performance and end up costing less than a lot of small instances.
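A rough back-of-the-envelope comparison of aggregate resources illustrates the point. The instance specs below are assumptions from the 2011-era EC2 lineup (m1.small roughly 1 ECU / 1.7 GB, m1.large roughly 4 ECU / 7.5 GB), not figures from this thread:

```python
# Hypothetical fleet comparison; instance specs are assumptions
# (2011-era EC2): m1.small ~1 ECU / 1.7 GB, m1.large ~4 ECU / 7.5 GB.
small = {"ecu": 1, "ram_gb": 1.7}
large = {"ecu": 4, "ram_gb": 7.5}

fleet_small = {k: 20 * v for k, v in small.items()}  # 20 x m1.small
fleet_large = {k: 5 * v for k, v in large.items()}   # 5 x m1.large

# Same aggregate compute (~20 ECU), but the large fleet has more RAM
# per task slot and a quarter as many nodes shuffling over the network.
print(fleet_small)
print(fleet_large)
```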
--sebastian

On 06.09.2011 20:02, Chris Lu wrote:
> Thanks for all the suggestions!
>
> All the inputs are the same. It takes 85 hours for 4 iterations on 20
> Amazon small machines. On my local single node, it got to iteration 19
> in the same 85 hours.
>
> Here is a section of the Amazon log output. It covers the start of
> iteration 1, and the span between iteration 4 and iteration 5.
>
> The number of map tasks is set to 2. Should it be larger, or tied to
> the number of CPU cores?
>
> - what kinds of machines in the single case and the cluster case?
>
> Amazon Small type.
>
> - did you actually complete a stage with the EMR cluster?
>
> Yes.
>
> - did you have any task failures?
>
> Everything seems OK.
>
> - were your machines swapping?
>
> When run on the local machine with a 1G max heap size, there is no
> swapping. Amazon's Small type has 1.7G.
>
> - what was CPU usage?
>
> I cannot tell myself.
>
> - what was network usage?
>
> I cannot tell. The mapping process should not use much network.
> However, it's really slow.
>
> - how much data was registered as having been read? Was that reasonable?
>
> The input size is 47M, stored on S3.
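On the map-task question: in Hadoop 0.20-era MapReduce, `mapred.map.tasks` is only a hint; the actual number of mappers comes from the input splits. A single ~47 MB file under the default split size (assumed here to be 64 MB) yields exactly one mapper no matter how many machines are in the cluster, which matches the `Launched map tasks=1` counter in the log below. A minimal sketch of that arithmetic:

```python
import math

def expected_mappers(input_bytes, split_bytes=64 * 1024 * 1024):
    """Rough mapper count for one splittable input file: one mapper per
    input split (mapred.map.tasks is only a hint, not a hard setting)."""
    return max(1, math.ceil(input_bytes / split_bytes))

# Chris's input: a single ~47 MB file.
print(expected_mappers(47 * 1024 * 1024))                    # 1 mapper
# With a hypothetical 4 MB split size the work could spread out:
print(expected_mappers(47 * 1024 * 1024, 4 * 1024 * 1024))   # 12 mappers
```

So the cluster's 20 machines cannot help the map phase until the input is split more finely (or written as several files).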
> 2011-09-02 21:40:42,905 INFO org.apache.mahout.clustering.lda.LDADriver (main): LDA Iteration 1
> 2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
> 2011-09-02 21:40:42,989 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 33
> 2011-09-02 21:40:46,081 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 1
> 2011-09-02 21:40:50,199 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201109022131_0001
> 2011-09-02 21:40:51,208 INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
> 2011-09-02 22:00:25,591 INFO org.apache.hadoop.mapred.JobClient (main): map 1% reduce 0%
> 2011-09-02 22:17:37,315 INFO org.apache.hadoop.mapred.JobClient (main): map 2% reduce 0%
> 2011-09-02 22:30:36,622 INFO org.apache.hadoop.mapred.JobClient (main): map 3% reduce 0%
> 2011-09-02 22:42:44,128 INFO org.apache.hadoop.mapred.JobClient (main): map 4% reduce 0%
> 2011-09-02 22:57:20,817 INFO org.apache.hadoop.mapred.JobClient (main): map 5% reduce 0%
> 2011-09-02 23:11:20,329 INFO org.apache.hadoop.mapred.JobClient (main): map 6% reduce 0%
> 2011-09-02 23:24:54,832 INFO org.apache.hadoop.mapred.JobClient (main): map 7% reduce 0%
> 2011-09-02 23:37:50,205 INFO org.apache.hadoop.mapred.JobClient (main): map 8% reduce 0%
> ...............
> 2011-09-06 16:50:35,471 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 91%
> 2011-09-06 16:50:36,499 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 92%
> 2011-09-06 16:50:37,503 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 94%
> 2011-09-06 16:50:39,511 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 96%
> 2011-09-06 16:50:53,576 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 97%
> 2011-09-06 16:51:15,665 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 99%
> 2011-09-06 16:51:58,848 INFO org.apache.hadoop.mapred.JobClient (main): map 100% reduce 100%
> 2011-09-06 16:52:42,039 INFO org.apache.hadoop.mapred.JobClient (main): Job complete: job_201109022131_0004
> 2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient (main): Counters: 17
> 2011-09-06 16:52:42,041 INFO org.apache.hadoop.mapred.JobClient (main):   Job Counters
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Launched reduce tasks=33
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Rack-local map tasks=1
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Launched map tasks=1
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):   FileSystemCounters
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     S3N_BYTES_READ=732125802
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_READ=27958301147
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     S3N_BYTES_WRITTEN=684537903
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     FILE_BYTES_WRITTEN=26164696939
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):   Map-Reduce Framework
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input groups=28242101
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Combine output records=851254845
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Map input records=18285
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce shuffle bytes=334153034
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce output records=28242101
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Spilled Records=2093587887
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Map output bytes=19571033360
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Combine input records=2046202329
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Map output records=1223189585
> 2011-09-06 16:52:42,042 INFO org.apache.hadoop.mapred.JobClient (main):     Reduce input records=28242101
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): Iteration 4 finished.
> Log Likelihood: -2.738323998517175E8
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): (Old LL: -2.810158757091537E8)
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): (Rel Change: 0.02556252681208305)
> 2011-09-06 16:54:03,998 INFO org.apache.mahout.clustering.lda.LDADriver (main): LDA Iteration 5
> 2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: 2
> 2011-09-06 16:54:04,024 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 33
> 2011-09-06 16:54:04,751 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 1
> 2011-09-06 16:54:05,829 INFO org.apache.hadoop.mapred.JobClient (main): Running job: job_201109022131_0005
> 2011-09-06 16:54:06,835 INFO org.apache.hadoop.mapred.JobClient (main): map 0% reduce 0%
> 2011-09-06 17:08:40,517 INFO org.apache.hadoop.mapred.JobClient (main): map 1% reduce 0%
> 2011-09-06 17:19:20,825 INFO org.apache.hadoop.mapred.JobClient (main): map 2% reduce 0%
>
>
> On 09/06/2011 08:39 AM, Ted Dunning wrote:
>> I think that Sean and Danny are right. You need to give a bit more
>> information.
>>
>> - what kinds of machines in the single case and the cluster case?
>>
>> - did you actually complete a stage with the EMR cluster?
>>
>> - did you have any task failures?
>>
>> - were your machines swapping?
>>
>> - what was CPU usage?
>>
>> - what was network usage?
>>
>> - how much data was registered as having been read? Was that reasonable?
>>
>> On Tue, Sep 6, 2011 at 3:11 AM, Sean Owen <[email protected]> wrote:
>>
>>> Running on a real cluster increases the amount of work done, and
>>> significantly so, compared to one node: now data actually has to be
>>> transferred on and off each machine!
>>>
>>> Amazon EMR workers, in my experience, are bottlenecked on I/O. I am not
>>> sure what instance type you are using, but I got better mileage when I
>>> used larger instances (and more of my own workers per instance, of
>>> course; it does that for you too).
>>>
>>> You may have trouble extrapolating correctly from the time it takes to
>>> hit 1%, as there are setup costs while the instances spin up. Try
>>> letting it run a bit longer to see how fast it really goes.
>>>
>>> Are you saying you extrapolate that it would take 1 EMR machine 1000
>>> minutes to finish? That sounds quite reasonable compared to 300 minutes
>>> locally. If you mean the whole 20 machines are taking 1000 minutes to
>>> finish, that sounds quite bad.
>>>
>>>
>>> On Tue, Sep 6, 2011 at 8:35 AM, Chris Lu <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am running LDA on 18k documents; each document has 5k terms, and
>>>> there are 300k terms in total. The number of topics is set to 100.
>>>>
>>>> Running LDA on a single-node Hadoop configuration takes about 5 hours
>>>> per stage, so 20 stages would take 100 hours.
>>>>
>>>> However, given 20 machines, running on Amazon EMR is actually much,
>>>> much slower: it takes 1000 minutes per stage. (It takes about 10
>>>> minutes for each 1% of mapping progress.) Reducing is much faster; it
>>>> is counted in seconds, almost negligible.
>>>>
>>>> Does anyone have similar experience, or is my setup wrong?
>>>>
>>>> Chris
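Two figures quoted in this thread can be sanity-checked in a few lines. The relative-change formula below is an assumption, (oldLL - LL) / oldLL, chosen because it reproduces the logged value; the extrapolation just restates Chris's "10 minutes per 1% of mapping" observation:

```python
# Log-likelihood values from the iteration-4 log above.
old_ll = -2.810158757091537e8
new_ll = -2.738323998517175e8

# Assumed relative-change formula; matches the logged "Rel Change".
rel_change = (old_ll - new_ll) / old_ll
print(rel_change)  # ~0.0255625

# Stage-time extrapolation: ~10 minutes per 1% of mapping progress.
minutes_per_percent = 10
stage_minutes = minutes_per_percent * 100
print(stage_minutes)  # 1000 minutes per EMR stage, vs ~300 minutes (5 h) locally
```

The 1000-minute figure is for the whole 20-machine cluster, which is why the thread treats it as a red flag rather than a reasonable per-machine cost.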
