I figured out the cause. The HDFS block size is 128MB, but I had set mapred.min.split.size to 512MB, and with that setting data-local processing goes wrong for some reason. When I remove the mapred.min.split.size configuration, the tasktrackers pick data-local tasks. Why does this happen?
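For reference, here is a minimal sketch of how I understand the split size to be derived in this setup. It assumes the old-API FileInputFormat rule, splitSize = max(minSize, min(goalSize, blockSize)). The class name, the helper names, and the 1GB goal size are made up for illustration and are not copied from the Hadoop source.

```java
public class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // Assumed rule: a split is at least minSize, and otherwise capped by the block size.
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Number of HDFS blocks a single split covers, rounded up.
    public static long blocksPerSplit(long splitSize, long blockSize) {
        return (splitSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;   // HDFS block size from this thread
        long goalSize = 1024 * MB;   // hypothetical: total input size / requested map count

        // Without mapred.min.split.size (a minimum of 1 byte is assumed here).
        long defaultSplit = computeSplitSize(goalSize, 1L, blockSize);
        // With mapred.min.split.size set to 512MB, as in my configuration.
        long largeSplit = computeSplitSize(goalSize, 512 * MB, blockSize);

        System.out.println("default split: " + defaultSplit / MB + " MB, blocks per split: "
                + blocksPerSplit(defaultSplit, blockSize));
        System.out.println("512MB minimum: " + largeSplit / MB + " MB, blocks per split: "
                + blocksPerSplit(largeSplit, blockSize));
    }
}
```

Under these assumptions each split covers 4 blocks instead of 1 once the 512MB minimum is set, so the scheduler has to pick locations for a multi-block split rather than for a single block.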
It seems like a bug. A split is a logical container of blocks, so nothing is wrong logically.

On Wed, Sep 12, 2012 at 1:20 AM, Hiroyuki Yamada <[email protected]> wrote:
> Hi, thank you for the comment.
>
>> Task assignment takes data locality into account first and not block
>> sequence.
>
> Does it work like that when the replication factor is set to 1?
>
> I just ran an experiment to check the behavior.
> There are 14 nodes (node01 to node14), with 14 datanodes and
> 14 tasktrackers running.
> I first created the data to be processed on each node (say data01 to data14),
> and I put each data file into HDFS from its own node (in the /data
> directory: /data/data01, ... /data/data14).
> The replication factor is set to 1, so according to the default block
> placement policy, each data file is stored on its local node (data01 is
> stored at node01, data02 is stored at node02, and so on).
> In that setting, I launched a job that processes /data, and
> what happened is that the tasktrackers read from data01 to data14 sequentially,
> which means the tasktrackers first take all data from node01 and then
> node02 and then node03 and so on.
>
> If a tasktracker takes data locality into account as you say,
> each tasktracker should take its local task (data). (The tasktracker at
> node02 should take data02 blocks if there are any.)
> But it didn't work like that.
> Why is this happening?
>
> Is there any documentation about this?
> Which part of the source code does that?
>
> Regards,
> Hiroyuki
>
> On Tue, Sep 11, 2012 at 11:27 PM, Hemanth Yamijala
> <[email protected]> wrote:
>> Hi,
>>
>> Task assignment takes data locality into account first and not block
>> sequence. In hadoop, tasktrackers ask the jobtracker to be assigned tasks.
>> When such a request comes to the jobtracker, it will try to look for an
>> unassigned task which needs data that is close to the tasktracker and will
>> assign it.
>>
>> Thanks
>> Hemanth
>>
>>
>> On Tue, Sep 11, 2012 at 6:31 PM, Hiroyuki Yamada <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I want to check whether my understanding of task assignment in hadoop
>>> is correct.
>>>
>>> When scanning a file with multiple tasktrackers,
>>> I am wondering how a task is assigned to each tasktracker.
>>> Is it based on the block sequence or on data locality?
>>>
>>> Let me explain my question by example.
>>> There is a file composed of 10 blocks (block1 to block10), where
>>> block1 is the beginning of the file and block10 is the tail of the file.
>>> When scanning the file with 3 tasktrackers (tt1 to tt3),
>>> I am wondering if
>>> task assignment is based on the block sequence, like
>>> first tt1 takes block1 and tt2 takes block2 and tt3 takes block3 and
>>> tt1 takes block4 and so on,
>>> or
>>> task assignment is based on task (data) locality, like
>>> first tt1 takes block2 (because it's stored locally) and tt2
>>> takes block1 (because it's stored locally) and
>>> tt3 takes block4 (because it's stored locally) and so on.
>>>
>>> As far as I have experienced, and as the Definitive Guide book says,
>>> I think that the first case is the task assignment strategy.
>>> (And if there are many replicas, the closest one is picked.)
>>>
>>> Is this right?
>>>
>>> If this is right, is there any way to get the second behavior
>>> with the current implementation?
>>>
>>> Thanks,
>>>
>>> Hiroyuki
