Johannes,
Couple of questions - do you happen to know why the splits are actually taking 
21 minutes to generate? Is the namenode overloaded, or is it just the large 
number of files ? Is the input format (assuming you're using an inputFormat for 
splits) going beyond analyzing block boundaries to generate splits ? - maybe 
looking into file contents.

Some things you may want to look at
- ensure tcp no delay is setup correctly. This can cause a fairly large lag in 
nn communication.
- if your inputFormat is FileInputFormat based, you could try setting #threads 
for fif split generation.

There's definitely complexity in pipelining splits. Tez-1076, iirc, has some 
references to this. Beyond input initializers supporting this, the inputFormat 
api itself is restrictive and that's not something that can be changed easily.

Have been thinking a bit about handling splits - especially from the 
perspective of dynamic grouping. Will create jiras over the next few days, some 
of which could help here - but require work and time.

Thanks
- Sid



> On Mar 12, 2015, at 13:13, Johannes Zillmann <[email protected]> wrote:
> 
> So.. its complex ;)
> Regarding the jira, closest thing i found is 
> https://issues.apache.org/jira/browse/TEZ-1166
> Should i add to this or create a new one ?
> 
> Johannes
> 
>> On 12 Mar 2015, at 15:44, Hitesh Shah <[email protected]> wrote:
>> 
>> Hello Johannes, 
>> 
>> This is something we have discussed quite often but have not got around to 
>> implementing this. There might be an open jira related to “pipelining” of 
>> splits. If you cannot find it, please go ahead and create one.
>> 
>> The general issues with these are:
>>  - how to handle dynamic creation of tasks as splits get created
>>  - how to decide how many splits and which splits a single task should handle
>>  - involves some facet of grouping to do optimal allocations of newly 
>> created splits based on available containers. Size of groups could be 
>> different e.g a single group slit consist of either 5 data local splits or 2 
>> rack-local splits or 1 off-rack split when assigning dynamically to a given 
>> container.
>>  - the single task limit also plays into how you handle fault tolerance and 
>> recovery 
>>  - given that split creation is now dynamic, if the AM crashes in a scenario 
>> when not all splits were created but some were already processed, the next 
>> attempt when it recovers needs to handle it in a such way to ensure 
>> correctness of data processing.
>> 
>> thanks
>> — Hitesh
>> 
>>> On Mar 12, 2015, at 2:38 AM, Johannes Zillmann <[email protected]> 
>>> wrote:
>>> 
>>> Hey guys,
>>> 
>>> dump question. With Tez can i have a input-initializaer which don’t require 
>>> to create every split before starting the processing of already created 
>>> splits ?
>>> Means if i have a lot of splits and my splitting process takes a long time, 
>>> can the workers start working already while still doing the splitting ?
>>> 
>>> Johannes
> 

Reply via email to