Re: streamed splitting

Hitesh Shah Thu, 12 Mar 2015 07:44:46 -0700

Hello Johannes, 

This is something we have discussed quite often but have not gotten around to 
implementing. There might be an open JIRA related to “pipelining” of 
splits. If you cannot find it, please go ahead and create one.


The general issues with this are:
   - how to handle dynamic creation of tasks as splits get created
   - how to decide how many splits and which splits a single task should handle
   - it involves some facet of grouping to do optimal allocation of newly created 
splits based on available containers. Group sizes could differ, e.g. a 
single group split could consist of either 5 data-local splits or 2 rack-local 
splits or 1 off-rack split when assigning dynamically to a given container.
   - the single task limit also plays into how you handle fault tolerance and 
recovery 
   - given that split creation is now dynamic, if the AM crashes in a scenario 
where not all splits were created but some were already processed, the next 
attempt, when it recovers, needs to handle this in such a way as to ensure 
correctness of data processing.
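
To make the grouping point concrete, here is a rough sketch in plain Java of 
how a group could be picked for a container based on locality. It is not Tez 
code and does not use any Tez API: the class LocalityGrouping, the Split type, 
and the methods nextGroup and take are all hypothetical names, and the group 
sizes (5 / 2 / 1) are just the example numbers from above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names come from Tez.
class LocalityGrouping {
    // Minimal stand-in for a split that knows where its data lives.
    static final class Split {
        final String id, host, rack;
        Split(String id, String host, String rack) {
            this.id = id; this.host = host; this.rack = rack;
        }
    }

    // Example group sizes from the discussion above (assumptions, not tuned values).
    static final int DATA_LOCAL_GROUP = 5;
    static final int RACK_LOCAL_GROUP = 2;
    static final int OFF_RACK_GROUP = 1;

    // Picks the next group for a container on the given host and rack,
    // preferring data-local splits, then rack-local, then anything left.
    // The chosen splits are removed from the pending list.
    static List<Split> nextGroup(List<Split> pending, String host, String rack) {
        List<Split> group = take(pending, DATA_LOCAL_GROUP, s -> s.host.equals(host));
        if (group.isEmpty()) {
            group = take(pending, RACK_LOCAL_GROUP, s -> s.rack.equals(rack));
        }
        if (group.isEmpty()) {
            group = take(pending, OFF_RACK_GROUP, s -> true);
        }
        return group;
    }

    // Removes and returns up to 'limit' pending splits that satisfy 'match'.
    static List<Split> take(List<Split> pending, int limit, Predicate<Split> match) {
        List<Split> taken = new ArrayList<>();
        Iterator<Split> it = pending.iterator();
        while (it.hasNext() && taken.size() < limit) {
            Split s = it.next();
            if (match.test(s)) {
                taken.add(s);
                it.remove();
            }
        }
        return taken;
    }

    public static void main(String[] args) {
        List<Split> pending = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            pending.add(new Split("a" + i, "h1", "r1"));
        }
        pending.add(new Split("b0", "h2", "r1"));
        // A container on host h1 gets a data-local group first.
        System.out.println(nextGroup(pending, "h1", "r1").size());
        // A container on another host in rack r1 falls back to rack-local splits.
        System.out.println(nextGroup(pending, "h9", "r1").size());
    }
}
```

With streamed splitting, the pending list would keep growing while groups are 
handed out, which is where the task creation and recovery questions above come 
in; the sketch ignores both.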

thanks
— Hitesh

On Mar 12, 2015, at 2:38 AM, Johannes Zillmann <[email protected]> wrote:

> Hey guys,
> 
> dumb question. With Tez, can I have an input initializer which doesn’t 
> require creating every split before starting the processing of already 
> created splits?
> That is, if I have a lot of splits and my splitting process takes a long time, 
> can the workers start working while the splitting is still in progress?
> 
> Johannes
