So... Data locality only works when you actually have data on the cluster itself. Otherwise, how can the data be local?
Assuming 3X replication, no custom split, and a splittable input file, your input will be split along block boundaries. So if your input file has 5 blocks, you will have 5 map tasks. Since there are 3 copies of each block, it's possible for each map task to run on a DN which has a copy of its block. So it's pretty straightforward, up to a point: once your cluster starts to get a lot of jobs, your task takes whatever slot opens up first, which may not be on a node holding its block, so it won't be data local.

With HBase... YMMV.

With S3 the data isn't local to begin with, so it doesn't matter which Data Node gets the task.

There's a quick sketch at the bottom of this mail if you want to look at the block/host mapping yourself.

HTH

-Mike

On Oct 24, 2012, at 1:10 AM, David Parks <[email protected]> wrote:

> Even after reading O’Reilly’s book on Hadoop I don’t feel like I have a clear
> vision of how the map tasks get assigned.
> 
> They depend on splits, right?
> 
> But I have 3 jobs running, and splits will come from various sources: HDFS,
> S3, and slow HTTP sources.
> 
> So I’ve got some concern as to how the map tasks will be distributed to
> handle the data acquisition.
> 
> Can I do anything to ensure that I don’t let the cluster go idle processing
> slow HTTP downloads when the boxes could simultaneously be doing HTTP
> downloads for one job and reading large files off HDFS for another job?
> 
> I’m imagining a scenario where the only map tasks that are running are all
> blocking on splits requiring HTTP downloads, and the splits coming from HDFS
> are all queuing up behind it, when they’d run more efficiently in parallel
> per node.
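P.S. If you want to see the block-to-host mapping the scheduler works from, here's a minimal sketch using the plain FileSystem API. The input path is just a placeholder -- point it at any file on your HDFS.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    // Placeholder path -- point it at any file sitting on your HDFS.
    Path input = new Path(args.length > 0 ? args[0] : "/user/hadoop/input/part-00000");

    Configuration conf = new Configuration();
    FileSystem fs = input.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(input);

    // One BlockLocation per block; a splittable file with N blocks
    // normally ends up with N map tasks, one per block.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    System.out.println(input + " -> " + blocks.length + " block(s)");

    for (BlockLocation b : blocks) {
      // With 3X replication each block lists up to 3 hosts; the scheduler
      // tries to place the map task on one of them (data local), and
      // falls back to rack local / any node when no slot is free there.
      System.out.println("offset=" + b.getOffset()
          + " len=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
  }
}

Run it against an HDFS file and you'll see the replica hosts per block. Run it against an S3 path and the reported hosts won't correspond to any of your Data Nodes, which is the point above about locality not coming into play there.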
