That's not it. Please open a new one.

Thanks!

-----Original Message-----
From: Johannes Zillmann [mailto:[email protected]]
Sent: Thursday, March 12, 2015 1:14 PM
To: [email protected]
Subject: Re: streamed splitting
So.. it's complex ;)
Regarding the jira, the closest thing I found is https://issues.apache.org/jira/browse/TEZ-1166
Should I add to this or create a new one?

Johannes

> On 12 Mar 2015, at 15:44, Hitesh Shah <[email protected]> wrote:
>
> Hello Johannes,
>
> This is something we have discussed quite often but have not got around to
> implementing. There might be an open jira related to "pipelining" of
> splits. If you cannot find it, please go ahead and create one.
>
> The general issues with these are:
> - how to handle dynamic creation of tasks as splits get created
> - how to decide how many splits and which splits a single task should handle
> - involves some facet of grouping to do optimal allocations of newly
> created splits based on available containers. Size of groups could be
> different, e.g. a single group split consists of either 5 data-local splits or 2
> rack-local splits or 1 off-rack split when assigning dynamically to a given
> container.
> - the single task limit also plays into how you handle fault tolerance and
> recovery
> - given that split creation is now dynamic, if the AM crashes in a scenario
> when not all splits were created but some were already processed, the next
> attempt when it recovers needs to handle it in such a way as to ensure
> correctness of data processing.
>
> thanks
> - Hitesh
>
> On Mar 12, 2015, at 2:38 AM, Johannes Zillmann <[email protected]>
> wrote:
>
>> Hey guys,
>>
>> dumb question. With Tez, can I have an input initializer which doesn't require
>> creating every split before starting the processing of already created
>> splits?
>> That is, if I have a lot of splits and my splitting process takes a long time,
>> can the workers start working already while the splitting is still going on?
>>
>> Johannes
>
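
For reference, the locality-dependent group sizes Hitesh mentions (5 data-local, 2 rack-local, or 1 off-rack split per group) could be sketched roughly as below. This is only an illustration under assumptions: StreamedSplitGrouper, Split and nextGroup are hypothetical names, not Tez classes or APIs, and the sketch ignores task creation, fault tolerance and AM recovery entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch, not part of Tez: picks a group of pending splits for one container. */
public class StreamedSplitGrouper {

    /** Hypothetical split descriptor holding only an id and its location. */
    public static final class Split {
        final String id;
        final String host;
        final String rack;

        public Split(String id, String host, String rack) {
            this.id = id;
            this.host = host;
            this.rack = rack;
        }
    }

    // Group sizes taken from the example in the thread.
    static final int DATA_LOCAL_GROUP = 5;
    static final int RACK_LOCAL_GROUP = 2;
    static final int OFF_RACK_GROUP = 1;

    /**
     * Removes and returns the next group from the pending splits for a container
     * on the given host and rack. Prefers data-local splits, then rack-local ones,
     * and finally falls back to a single split from anywhere.
     * The pending list must be mutable; it may keep growing while splits are created.
     */
    public static List<Split> nextGroup(List<Split> pending, String host, String rack) {
        List<Split> group = take(pending, s -> s.host.equals(host), DATA_LOCAL_GROUP);
        if (group.isEmpty()) {
            group = take(pending, s -> s.rack.equals(rack), RACK_LOCAL_GROUP);
        }
        if (group.isEmpty()) {
            group = take(pending, s -> true, OFF_RACK_GROUP);
        }
        return group;
    }

    /** Moves up to limit matching splits out of pending, in arrival order. */
    private static List<Split> take(List<Split> pending, Predicate<Split> match, int limit) {
        List<Split> out = new ArrayList<>();
        Iterator<Split> it = pending.iterator();
        while (it.hasNext() && out.size() < limit) {
            Split s = it.next();
            if (match.test(s)) {
                out.add(s);
                it.remove();
            }
        }
        return out;
    }
}
```

Whenever a container becomes free, the caller would invoke nextGroup with whatever splits have been produced so far, so processing can begin before splitting finishes. An empty result means no splits are pending at that moment.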
