Re: streamed splitting

Johannes Zillmann Thu, 12 Mar 2015 13:16:48 -0700

So... it's complex ;)
Regarding the JIRA, the closest thing I found is 
https://issues.apache.org/jira/browse/TEZ-1166
Should I add to this or create a new one?


Johannes

> On 12 Mar 2015, at 15:44, Hitesh Shah <[email protected]> wrote:
> 
> Hello Johannes, 
> 
> This is something we have discussed quite often but have not gotten around 
> to implementing. There might be an open JIRA related to “pipelining” of 
> splits. If you cannot find it, please go ahead and create one.
> 
> The general issues with this are:
>   - how to handle dynamic creation of tasks as splits get created
>   - how to decide how many splits and which splits a single task should handle
>   - it involves some facet of grouping to do optimal allocations of newly 
> created splits based on available containers. The size of groups could 
> differ, e.g. a single grouped split could consist of either 5 data-local 
> splits, 2 rack-local splits, or 1 off-rack split when assigning dynamically 
> to a given container.
>   - the single task limit also plays into how you handle fault tolerance and 
> recovery
>   - given that split creation is now dynamic, if the AM crashes in a scenario 
> where not all splits were created but some were already processed, the next 
> attempt needs to handle this on recovery in such a way as to ensure 
> correctness of data processing.
> 
> thanks
> — Hitesh
> 
> On Mar 12, 2015, at 2:38 AM, Johannes Zillmann <[email protected]> 
> wrote:
> 
>> Hey guys,
>> 
>> dumb question. With Tez, can I have an input initializer which doesn't 
>> require creating every split before starting the processing of already 
>> created splits?
>> Meaning, if I have a lot of splits and my splitting process takes a long 
>> time, can the workers already start working while the splitting is still 
>> going on?
>> 
>> Johannes
>
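
For illustration only, the behaviour asked about in the original question (workers consuming splits while the splitter is still producing them) can be sketched in plain Java with a blocking queue. This is not Tez code and uses no Tez APIs: the class StreamedSplitting, the method run, the END_OF_SPLITS marker and the "split-N" names are all hypothetical, and the sketch ignores the grouping, fault-tolerance and AM-recovery issues listed in the reply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: not a Tez class, just a producer/consumer illustration.
class StreamedSplitting {
    // Hypothetical marker telling a worker that no more splits will arrive.
    private static final String END_OF_SPLITS = "__END__";

    /**
     * Creates numSplits splits one at a time while numWorkers threads
     * process whatever is already available. Returns the processed splits.
     */
    public static List<String> run(int numSplits, int numWorkers)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ConcurrentLinkedQueue<String> processed = new ConcurrentLinkedQueue<>();
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);

        // Workers start before any split exists and block until one arrives.
        for (int w = 0; w < numWorkers; w++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        String split = queue.take();
                        if (END_OF_SPLITS.equals(split)) {
                            return; // splitter is done, stop this worker
                        }
                        processed.add(split); // stand-in for real processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The "splitter": each split is handed over as soon as it is created.
        for (int i = 0; i < numSplits; i++) {
            queue.put("split-" + i);
        }
        // One end marker per worker so that every worker terminates.
        for (int w = 0; w < numWorkers; w++) {
            queue.put(END_OF_SPLITS);
        }

        workers.shutdown();
        workers.awaitTermination(30, TimeUnit.SECONDS);
        return new ArrayList<>(processed);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> done = run(20, 3);
        System.out.println("processed " + done.size() + " splits");
    }
}
```

The end marker is queued once per worker so that each worker sees exactly one and exits. A real implementation would also need to decide how splits are grouped per task and what happens to already-processed splits if the splitter dies midway, which is where the issues raised in the reply come in.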
