Re: streamed splitting
Johannes, Couple of questions - do you happen to know why the splits are actually taking 21 minutes to generate? Is the namenode overloaded, or is it just the large number of files ? Is the input format (assuming you're using an inputFormat for splits) going beyond analyzing block boundaries to generate splits ? - maybe looking into file contents. Some things you may want to look at - ensure tcp no delay is setup correctly. This can cause a fairly large lag in nn communication. - if your inputFormat is FileInputFormat based, you could try setting #threads for fif split generation. There's definitely complexity in pipelining splits. Tez-1076, iirc, has some references to this. Beyond input initializers supporting this, the inputFormat api itself is restrictive and that's not something that can be changed easily. Have been thinking a bit about handling splits - especially from the perspective of dynamic grouping. Will create jiras over the next few days, some of which could help here - but require work and time. Thanks - Sid > On Mar 12, 2015, at 13:13, Johannes Zillmann wrote: > > So.. its complex ;) > Regarding the jira, closest thing i found is > https://issues.apache.org/jira/browse/TEZ-1166 > Should i add to this or create a new one ? > > Johannes > >> On 12 Mar 2015, at 15:44, Hitesh Shah wrote: >> >> Hello Johannes, >> >> This is something we have discussed quite often but have not got around to >> implementing this. There might be an open jira related to “pipelining” of >> splits. If you cannot find it, please go ahead and create one. >> >> The general issues with these are: >> - how to handle dynamic creation of tasks as splits get created >> - how to decide how many splits and which splits a single task should handle >> - involves some facet of grouping to do optimal allocations of newly >> created splits based on available containers. Size of groups could be >> different e.g a single group slit consist of either 5 data local splits or 2 >> rack-local splits or 1 off-rack split when assigning dynamically to a given >> container. >> - the single task limit also plays into how you handle fault tolerance and >> recovery >> - given that split creation is now dynamic, if the AM crashes in a scenario >> when not all splits were created but some were already processed, the next >> attempt when it recovers needs to handle it in a such way to ensure >> correctness of data processing. >> >> thanks >> — Hitesh >> >>> On Mar 12, 2015, at 2:38 AM, Johannes Zillmann >>> wrote: >>> >>> Hey guys, >>> >>> dump question. With Tez can i have a input-initializaer which don’t require >>> to create every split before starting the processing of already created >>> splits ? >>> Means if i have a lot of splits and my splitting process takes a long time, >>> can the workers start working already while still doing the splitting ? >>> >>> Johannes >
RE: streamed splitting
That's not it. Please open a new one. Thanks! -Original Message- From: Johannes Zillmann [mailto:jzillm...@googlemail.com] Sent: Thursday, March 12, 2015 1:14 PM To: user@tez.apache.org Subject: Re: streamed splitting So.. its complex ;) Regarding the jira, closest thing i found is https://issues.apache.org/jira/browse/TEZ-1166 Should i add to this or create a new one ? Johannes > On 12 Mar 2015, at 15:44, Hitesh Shah wrote: > > Hello Johannes, > > This is something we have discussed quite often but have not got around to > implementing this. There might be an open jira related to "pipelining" of > splits. If you cannot find it, please go ahead and create one. > > The general issues with these are: > - how to handle dynamic creation of tasks as splits get created > - how to decide how many splits and which splits a single task should handle > - involves some facet of grouping to do optimal allocations of newly > created splits based on available containers. Size of groups could be > different e.g a single group slit consist of either 5 data local splits or 2 > rack-local splits or 1 off-rack split when assigning dynamically to a given > container. > - the single task limit also plays into how you handle fault tolerance and > recovery > - given that split creation is now dynamic, if the AM crashes in a scenario > when not all splits were created but some were already processed, the next > attempt when it recovers needs to handle it in a such way to ensure > correctness of data processing. > > thanks > - Hitesh > > On Mar 12, 2015, at 2:38 AM, Johannes Zillmann > wrote: > >> Hey guys, >> >> dump question. With Tez can i have a input-initializaer which don't require >> to create every split before starting the processing of already created >> splits ? >> Means if i have a lot of splits and my splitting process takes a long time, >> can the workers start working already while still doing the splitting ? >> >> Johannes >
Re: streamed splitting
So.. its complex ;) Regarding the jira, closest thing i found is https://issues.apache.org/jira/browse/TEZ-1166 Should i add to this or create a new one ? Johannes > On 12 Mar 2015, at 15:44, Hitesh Shah wrote: > > Hello Johannes, > > This is something we have discussed quite often but have not got around to > implementing this. There might be an open jira related to “pipelining” of > splits. If you cannot find it, please go ahead and create one. > > The general issues with these are: > - how to handle dynamic creation of tasks as splits get created > - how to decide how many splits and which splits a single task should handle > - involves some facet of grouping to do optimal allocations of newly > created splits based on available containers. Size of groups could be > different e.g a single group slit consist of either 5 data local splits or 2 > rack-local splits or 1 off-rack split when assigning dynamically to a given > container. > - the single task limit also plays into how you handle fault tolerance and > recovery > - given that split creation is now dynamic, if the AM crashes in a scenario > when not all splits were created but some were already processed, the next > attempt when it recovers needs to handle it in a such way to ensure > correctness of data processing. > > thanks > — Hitesh > > On Mar 12, 2015, at 2:38 AM, Johannes Zillmann > wrote: > >> Hey guys, >> >> dump question. With Tez can i have a input-initializaer which don’t require >> to create every split before starting the processing of already created >> splits ? >> Means if i have a lot of splits and my splitting process takes a long time, >> can the workers start working already while still doing the splitting ? >> >> Johannes >
Re: streamed splitting
Hello Johannes, This is something we have discussed quite often but have not got around to implementing this. There might be an open jira related to “pipelining” of splits. If you cannot find it, please go ahead and create one. The general issues with these are: - how to handle dynamic creation of tasks as splits get created - how to decide how many splits and which splits a single task should handle - involves some facet of grouping to do optimal allocations of newly created splits based on available containers. Size of groups could be different e.g a single group slit consist of either 5 data local splits or 2 rack-local splits or 1 off-rack split when assigning dynamically to a given container. - the single task limit also plays into how you handle fault tolerance and recovery - given that split creation is now dynamic, if the AM crashes in a scenario when not all splits were created but some were already processed, the next attempt when it recovers needs to handle it in a such way to ensure correctness of data processing. thanks — Hitesh On Mar 12, 2015, at 2:38 AM, Johannes Zillmann wrote: > Hey guys, > > dump question. With Tez can i have a input-initializaer which don’t require > to create every split before starting the processing of already created > splits ? > Means if i have a lot of splits and my splitting process takes a long time, > can the workers start working already while still doing the splitting ? > > Johannes
Re: streamed splitting
Hey Jeff, so one scenario i recently encountered was an job on about 300.000 files in hdfs. The splitting alone took 21 minutes. So i thought until the splitting is completed completely the a lot of splits could have already been processed… thanks for you answer! Johannes > On 12 Mar 2015, at 10:51, Jianfeng (Jeff) Zhang > wrote: > > > HI Johannes, > > If the input-initlizeer is not done, workers can not be started. > What¹s your scenario ? Why do you want to start the workers before > splitting is generated ? Just save the launch time or let the worker to do > other stuff ? > > > Best Regard, > Jeff Zhang > > > > > > On 3/12/15, 5:38 PM, "Johannes Zillmann" wrote: > >> Hey guys, >> >> dump question. With Tez can i have a input-initializaer which don¹t >> require to create every split before starting the processing of already >> created splits ? >> Means if i have a lot of splits and my splitting process takes a long >> time, can the workers start working already while still doing the >> splitting ? >> >> Johannes >
Re: streamed splitting
HI Johannes, If the input-initlizeer is not done, workers can not be started. What¹s your scenario ? Why do you want to start the workers before splitting is generated ? Just save the launch time or let the worker to do other stuff ? Best Regard, Jeff Zhang On 3/12/15, 5:38 PM, "Johannes Zillmann" wrote: >Hey guys, > >dump question. With Tez can i have a input-initializaer which don¹t >require to create every split before starting the processing of already >created splits ? >Means if i have a lot of splits and my splitting process takes a long >time, can the workers start working already while still doing the >splitting ? > >Johannes
streamed splitting
Hey guys, dump question. With Tez can i have a input-initializaer which don’t require to create every split before starting the processing of already created splits ? Means if i have a lot of splits and my splitting process takes a long time, can the workers start working already while still doing the splitting ? Johannes