Split size can be set through mapred.min.split.size; e.g., to set the split size to 1 MB: -Dmapred.min.split.size=1048576

--Konstantin
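For context on how that interacts with the number of mappers: in the 0.20-era old-API FileInputFormat, the split size is roughly max(minSize, min(totalSize / mapred.map.tasks, blockSize)), so mapred.min.split.size only sets a floor, and the split count is driven mainly by mapred.map.tasks. Below is a minimal sketch of that computation, not the actual Hadoop source; the class and variable names are illustrative, the 64 MB block size is an assumption, and the 46 MB input size comes from the thread below.

    // Sketch of how the 0.20-era org.apache.hadoop.mapred.FileInputFormat
    // chooses a split size (illustrative names, not the real source).
    public final class SplitSizeSketch {

        // totalSize: bytes of input; numSplits: mapred.map.tasks;
        // minSize: mapred.min.split.size; blockSize: dfs.block.size
        static long splitSize(long totalSize, int numSplits, long minSize, long blockSize) {
            long goalSize = totalSize / Math.max(numSplits, 1);
            return Math.max(minSize, Math.min(goalSize, blockSize));
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024L;
            long input = 46 * mb;   // input size from the thread
            long block = 64 * mb;   // assumed HDFS block size
            // Default mapred.map.tasks=2: 23 MB splits -> 2 map tasks.
            System.out.println(splitSize(input, 2, 1, block) / mb);
            // -Dmapred.map.tasks=20: ~2 MB splits -> ~20 map tasks.
            System.out.println(splitSize(input, 20, 1, block) / mb);
            // mapred.min.split.size=1048576 alone: still 23 MB splits.
            System.out.println(splitSize(input, 2, 1 * mb, block) / mb);
        }
    }

With the numbers from this thread, the default of two map tasks is exactly what the formula predicts, and raising mapred.map.tasks is what actually shrinks the splits.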
On Wed, Sep 7, 2011 at 1:39 AM, Sean Owen <[email protected]> wrote:
> I see. On EMR, I think the setting you need is
> mapred.tasktracker.map.tasks.minimum. At least that's what I see digging
> through my old EMR code.
>
> Dhruv, yes, a lot of these settings are just suggestions to the framework. I
> am not entirely clear on the heuristics used, but I do know that Jake is
> right, that it's driven primarily off the input size and how much input it
> thinks should go to a worker. You can override these things, but do beware:
> you're probably incurring more overhead than is sensible. It might still
> make sense if you're running on a dedicated cluster where those resources
> are otherwise completely idle, but it's not in general a good idea on a
> shared cluster.
>
> Chris, are you sure one mapper was running in your last example? I don't see
> an indication of that from the log output one way or the other.
>
> I don't know LDA well. It sounds like you are saying that LDA mappers take a
> long time on a little input, which would suggest that's a bottleneck. I
> don't know one way or the other there... but if that's true, you are right
> that we could bake in settings to force an unusually small input split size.
>
> And Jake's last point echoes Sebastian's and Ted's: on EMR, fewer big
> machines are better. One of their biggest instances is probably more
> economical than 20 small ones. And, as a bonus, all of the data and
> processing will stay on one machine. (Of course, the master is still a
> separate instance. I use one small machine for the master and make it a
> reserved instance, not a spot instance, so it's really unlikely to die.) Of
> course you're vulnerable to that one machine dying, but for all practical
> purposes it's going to be a win for you.
>
> Definitely use the spot instance market! Ted's right that the pricing is
> crazy good.
>
> On Tue, Sep 6, 2011 at 11:57 PM, Chris Lu <[email protected]> wrote:
>
>> Thanks. Very helpful to me!
>>
>> I tried to change the setting of "mapred.map.tasks". However, the number
>> of map tasks is still just one, on one of the 20 machines.
>>
>> ./elastic-mapreduce --create --alive \
>>   --num-instances 20 --name "LDA" \
>>   --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
>>   --bootstrap-name "Configuring number of map tasks per job" \
>>   --args "-m,mapred.map.tasks=40"
>>
>> Does anyone know how to configure the number of mappers?
>> Again, the input size is only 46M.
>>
>> Chris
>>
>> On 09/06/2011 12:09 PM, Ted Dunning wrote:
>>
>>> Well, I think that using small instances is a disaster in general. The
>>> performance that you get from them can easily vary by an order of
>>> magnitude. My own preference for real work is either m2xl or cc14xl. The
>>> latter machines give you nearly bare-metal performance and no noisy
>>> neighbors. The m2xl is typically very much underpriced on the spot market.
>>>
>>> Sean is right about your job being misconfigured. The Hadoop overhead is
>>> considerable, and you have only given it two threads to overcome that
>>> overhead.
>>>
>>> On Tue, Sep 6, 2011 at 6:12 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> That's your biggest issue, certainly. Only 2 mappers are running, even
>>>> though you have 20 machines available. Hadoop determines the number of
>>>> mappers based on input size, and your input isn't so big that it thinks
>>>> you need 20 workers. It's launching 33 reducers, so your cluster is put
>>>> to use there. But it's no wonder you're not seeing anything like a 20x
>>>> speedup in the mapper.
>>>>
>>>> You can of course force it to use more mappers, and that's probably a
>>>> good idea here: -Dmapred.map.tasks=20, perhaps. More mappers means more
>>>> overhead of spinning up mappers to process less data, and Hadoop's guess
>>>> indicates that it thinks it's not efficient to use 20 workers. If you
>>>> know that those other 18 are otherwise idle, my guess is you'd benefit
>>>> from just making it use 20.
>>>>
>>>> If this were a general large cluster where many people are taking
>>>> advantage of the workers, then I'd trust Hadoop's guesses until you are
>>>> sure you want to do otherwise.
>>>>
>>>> On Tue, Sep 6, 2011 at 7:02 PM, Chris Lu <[email protected]> wrote:
>>>>
>>>>> Thanks for all the suggestions!
>>>>>
>>>>> All the inputs are the same. It takes 85 hours for 4 iterations on 20
>>>>> Amazon small machines. On my local single node, it got to iteration 19
>>>>> in the same 85 hours.
>>>>>
>>>>> Here is a section of the Amazon log output. It covers the start of
>>>>> iteration 1 and the gap between iteration 4 and iteration 5.
>>>>>
>>>>> The number of map tasks is set to 2. Should it be larger, or related to
>>>>> the number of CPU cores?
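For anyone who wants to set these knobs per job rather than cluster-wide through the configure-hadoop bootstrap action quoted above, here is a minimal sketch of the programmatic equivalents, assuming the old "mapred" API; the class and method names are hypothetical, and the values simply mirror the suggestions in this thread.

    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical helper: per-job equivalents of -Dmapred.map.tasks=20 and
    // -Dmapred.min.split.size=1048576, set on the job's own configuration.
    public final class LdaJobTuning {
        public static JobConf tune(JobConf conf) {
            conf.setNumMapTasks(20);                         // a hint only; actual count comes from the splits
            conf.setLong("mapred.min.split.size", 1048576L); // floor on the split size (1 MB)
            return conf;
        }
    }

As Sean notes above, mapred.map.tasks is only a hint to the InputFormat, so the resulting number of map tasks can still differ from what you ask for.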
