Re: Question about the task assignment strategy

Hiroyuki Yamada Tue, 11 Sep 2012 19:19:57 -0700

I figured out the cause.
The HDFS block size is 128MB, but
I set mapred.min.split.size to 512MB,
and for some reason tasks are then no longer assigned data-locally.
When I remove the mapred.min.split.size setting,
tasktrackers pick data-local tasks.
Why does this happen?


It seems like a bug.
A split is just a logical container of blocks,
so logically nothing should go wrong.
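
To make the numbers above concrete, here is a minimal sketch of how the split size relates to the block size. It is a standalone class, not Hadoop code: computeSplitSize mirrors the max(minSize, min(goalSize, blockSize)) formula that, as far as I recall, the old-API FileInputFormat uses, and the goalSize value and the blocksPerSplit helper are assumptions made up for illustration. It only shows that each split spans four blocks with this setting; it does not explain why locality is lost.

```java
public class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Assumed to mirror the old-API FileInputFormat formula:
    // max(minSize, min(goalSize, blockSize)).
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Hypothetical helper: how many HDFS blocks one split covers (rounded up).
    static long blocksPerSplit(long splitSize, long blockSize) {
        return (splitSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        // Hypothetical goal size (total input size / requested number of maps).
        long goalSize = 1024 * MB;

        // mapred.min.split.size = 512MB, as in the message above.
        long withMin = computeSplitSize(goalSize, 512 * MB, blockSize);
        // Minimum split size left at 1 byte, so the block size wins.
        long withoutMin = computeSplitSize(goalSize, 1, blockSize);

        System.out.println(withMin / MB + "MB split, "
                + blocksPerSplit(withMin, blockSize) + " blocks");    // 512MB split, 4 blocks
        System.out.println(withoutMin / MB + "MB split, "
                + blocksPerSplit(withoutMin, blockSize) + " blocks"); // 128MB split, 1 blocks
    }
}
```

With one block per split, a split has exactly one set of hosts; with four blocks per split, the hosts reported for the split have to be derived from several blocks, which may be where the behavior differs.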

On Wed, Sep 12, 2012 at 1:20 AM, Hiroyuki Yamada <[email protected]> wrote:
> Hi, thank you for the comment.
>
>> Task assignment takes data locality into account first and not block 
>> sequence.
>
> Does it work like that when the replication factor is set to 1?
>
> I just ran an experiment to check the behavior.
> There are 14 nodes (node01 to node14), with 14 datanodes and
> 14 tasktrackers running.
> I first created a data file on each node (say data01 to data14),
> and put each file into HDFS from its own node (under the /data
> directory: /data/data01, ... /data/data14).
> The replication factor is set to 1, so according to the default block placement
> policy,
> each file is stored on its local node. (data01 is stored on node01, data02
> is stored on node02, and so on.)
> In that setting, I launched a job that processes /data, and
> what happened is that tasktrackers read from data01 to data14 sequentially,
> which means tasktrackers first take all data from node01 and then
> node02 and then node03 and so on.
>
> If tasktrackers take data locality into account as you say,
> each tasktracker should take its local tasks (data). (The tasktracker on
> node02 should take data02 blocks if there are any.)
> But it didn't work like that.
> Why is this happening?
>
> Are there any documents about this?
> Which part of the source code does this?
>
> Regards,
> Hiroyuki
>
> On Tue, Sep 11, 2012 at 11:27 PM, Hemanth Yamijala
> <[email protected]> wrote:
>> Hi,
>>
>> Task assignment takes data locality into account first and not block
>> sequence. In hadoop, tasktrackers ask the jobtracker to be assigned tasks.
>> When such a request comes to the jobtracker, it will try to look for an
>> unassigned task which needs data that is close to the tasktracker and will
>> assign it.
>>
>> Thanks
>> Hemanth
>>
>>
>> On Tue, Sep 11, 2012 at 6:31 PM, Hiroyuki Yamada <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I want to check whether my understanding of task assignment in Hadoop
>>> is correct.
>>>
>>> When scanning a file with multiple tasktrackers,
>>> I am wondering how tasks are assigned to each tasktracker.
>>> Is it based on the block sequence or on data locality?
>>>
>>> Let me explain my question with an example.
>>> There is a file composed of 10 blocks (block1 to block10), where
>>> block1 is the beginning of the file and block10 is the tail of the file.
>>> When scanning the file with 3 tasktrackers (tt1 to tt3),
>>> I am wondering whether
>>> task assignment is based on the block sequence, like
>>> first tt1 takes block1 and tt2 takes block2 and tt3 takes block3 and
>>> tt1 takes block4 and so on,
>>> or
>>> task assignment is based on task (data) locality, like
>>> first tt1 takes block2 (because it is stored locally) and tt2
>>> takes block1 (because it is stored locally) and
>>> tt3 takes block4 (because it is stored locally) and so on.
>>>
>>> From my experience and from what the Definitive Guide says,
>>> I think that the first case is the task assignment strategy.
>>> (And if there are multiple replicas, the closest one is picked.)
>>>
>>> Is this right?
>>>
>>> If this is right, is there any way to get the second behavior
>>> with the current implementation?
>>>
>>> Thanks,
>>>
>>> Hiroyuki
>>
>>
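
For reference, the behavior Hemanth describes earlier in the thread (a tasktracker asks for work and the jobtracker looks for an unassigned task whose data is close to it) can be sketched roughly as follows. This is a toy model, not the JobTracker source: the class, the pending-task map, and the fall-back to the first pending task are all assumptions for illustration, and rack-level locality is ignored.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LocalityAssignmentSketch {

    // Hypothetical state: pending task name -> hosts that store its input.
    // With replication factor 1, each task has exactly one host.
    static Map<String, Set<String>> examplePending() {
        Map<String, Set<String>> pending = new LinkedHashMap<>();
        pending.put("data01", Collections.singleton("node01"));
        pending.put("data02", Collections.singleton("node02"));
        pending.put("data03", Collections.singleton("node03"));
        return pending;
    }

    // Returns a task for the requesting host, preferring one whose input is
    // stored on that host. Falls back to the first pending task (non-local).
    // Returns null when nothing is pending.
    static String pickTask(Map<String, Set<String>> pending, String host) {
        String fallback = null;
        String chosen = null;
        for (Map.Entry<String, Set<String>> e : pending.entrySet()) {
            if (fallback == null) {
                fallback = e.getKey();
            }
            if (e.getValue().contains(host)) {
                chosen = e.getKey();
                break;
            }
        }
        if (chosen == null) {
            chosen = fallback;
        }
        if (chosen != null) {
            pending.remove(chosen);
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pending = examplePending();
        System.out.println(pickTask(pending, "node02")); // data02 (local)
        System.out.println(pickTask(pending, "node02")); // data01 (no local task left)
        System.out.println(pickTask(pending, "node03")); // data03 (local)
        System.out.println(pickTask(pending, "node01")); // null (nothing pending)
    }
}
```

Under this model, the sequential data01-to-data14 pattern reported above would only appear if the host sets attached to the tasks were empty or did not match the requesting hosts, so every request fell through to the first pending task.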
