So... 

Data locality only works when you actually have data on the cluster itself. 
Otherwise, how can the data be local? 

Assuming 3x replication, that you're not doing a custom split, and that your 
input file is splittable...

You will split along the block boundaries.  So if your input file has 5 
blocks, you will have 5 mappers.

Since there are 3 copies of each block, it's possible for that map task to 
run on a DN which has a copy of that block. 
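
To make that concrete, here's a rough sketch that prints the computed splits 
and the hosts holding a replica of each one. It assumes the new mapreduce API 
and a Hadoop 2-style client (Job.getInstance); the class name is just for 
illustration:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Rough sketch: with default settings the split size equals the HDFS block
// size, so a splittable 5-block file yields 5 splits, i.e. 5 map tasks.
public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-inspector");
        FileInputFormat.addInputPath(job, new Path(args[0])); // e.g. an HDFS file

        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Splits (== map tasks): " + splits.size());
        for (InputSplit split : splits) {
            // getLocations() lists the DNs holding a copy of this split's data;
            // the scheduler prefers to run the map task on one of those hosts.
            System.out.println(split + " on " + Arrays.toString(split.getLocations()));
        }
    }
}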

So it's pretty straightforward, up to a point. 

When your cluster starts to get a lot of jobs, the slot that opens up for your 
task may not be on a node that holds a copy of the data, so your map may not 
be data local. 
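
If you want to see how much locality you're actually getting once the cluster 
is busy, the job counters tell you after the fact. Rough sketch using the 
Hadoop 2 JobCounter names (data-local / rack-local / other); the helper class 
is just for illustration:

import java.io.IOException;

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

// Rough sketch: call this after job.waitForCompletion(true) to see how many
// map tasks ran on a node with the data, on the same rack, or elsewhere.
public class LocalityReport {
    public static void report(Job completedJob) throws IOException {
        Counters c = completedJob.getCounters();
        long dataLocal = c.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = c.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
        long other     = c.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue();
        System.out.println("Data-local maps: " + dataLocal);
        System.out.println("Rack-local maps: " + rackLocal);
        System.out.println("Other maps:      " + other);
    }
}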

With HBase... YMMV. 
With S3 the data isn't local to the cluster, so it doesn't matter which 
DataNode gets the job. 

HTH

-Mike

On Oct 24, 2012, at 1:10 AM, David Parks <[email protected]> wrote:

> Even after reading O’Reilly’s book on Hadoop I don’t feel like I have a clear 
> vision of how the map tasks get assigned.
>  
> They depend on splits, right?
>  
> But I have 3 jobs running. And splits will come from various sources: HDFS, 
> S3, and slow HTTP sources.
>  
> So I’ve got some concern as to how the map tasks will be distributed to 
> handle the data acquisition.
>  
> Can I do anything to ensure that I don’t let the cluster go idle processing 
> slow HTTP downloads when the boxes could simultaneously be doing HTTP 
> downloads for one job and reading large files off HDFS for another job?
>  
> I’m imagining a scenario where the only map tasks that are running are all 
> blocking on splits requiring HTTP downloads, and the splits coming from HDFS 
> are all queuing up behind them, when they’d run more efficiently in parallel 
> per node.
>  
>  
