Hello Johannes,

This is something we have discussed quite often but have not gotten around to implementing. There might be an open JIRA related to “pipelining” of splits. If you cannot find one, please go ahead and create it.
The general issues with this are:

- How to handle dynamic creation of tasks as splits get created.
- How to decide how many splits, and which splits, a single task should handle.
- This involves some form of grouping, so that newly created splits are allocated optimally based on the available containers. Group sizes could differ: e.g. when assigning dynamically to a given container, a single group could consist of either 5 data-local splits, 2 rack-local splits, or 1 off-rack split.
- The single-task limit also plays into how you handle fault tolerance and recovery.
- Given that split creation is now dynamic, if the AM crashes when not all splits had been created but some had already been processed, the next AM attempt must handle recovery in a way that ensures correctness of data processing.

thanks
-- Hitesh

On Mar 12, 2015, at 2:38 AM, Johannes Zillmann <[email protected]> wrote:

> Hey guys,
>
> dumb question. With Tez, can I have an input-initializer which doesn't require
> creating every split before starting the processing of already created
> splits?
> That is, if I have a lot of splits and my splitting process takes a long time,
> can the workers start working already while the splitting is still going on?
>
> Johannes
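
As a rough illustration of the grouping idea in the reply above (a group of 5 data-local, 2 rack-local, or 1 off-rack split per container), here is a minimal Python sketch. It is not Tez code and does not use any Tez API: `Split`, `locality`, `next_group`, and the `rack_of` host-to-rack map are hypothetical names, the group sizes are just the example numbers from the reply, and it assumes every host appears in `rack_of`.

```python
from dataclasses import dataclass

# Example group sizes from the reply; assumed values, not Tez defaults.
# Dict order doubles as the preference order (best locality first).
GROUP_SIZE = {"data_local": 5, "rack_local": 2, "off_rack": 1}


@dataclass(frozen=True)
class Split:
    name: str
    hosts: frozenset  # hosts that hold a replica of this split's data


def locality(split, host, rack_of):
    """Classify a split relative to the host a container runs on."""
    if host in split.hosts:
        return "data_local"
    if rack_of[host] in {rack_of[h] for h in split.hosts}:
        return "rack_local"
    return "off_rack"


def next_group(pending, host, rack_of):
    """Remove and return (level, splits) for a container on `host`.

    Picks the best locality level that has any pending split and takes
    at most GROUP_SIZE[level] splits from it. Returns (None, []) when
    nothing is pending.
    """
    buckets = {level: [] for level in GROUP_SIZE}
    for split in pending:
        buckets[locality(split, host, rack_of)].append(split)
    for level, size in GROUP_SIZE.items():
        if buckets[level]:
            group = buckets[level][:size]
            for split in group:
                pending.remove(split)
            return level, group
    return None, []
```

Because `next_group` only looks at the splits that exist at the moment a container becomes free, it could be called repeatedly while the splitting process is still appending to `pending`. It does not address the recovery question raised above; that would need the set of already-processed splits to be persisted somewhere.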
