Re: streamed splitting

2015-03-13 Thread Siddharth Seth
Johannes,
A couple of questions: do you happen to know why the splits are actually taking 
21 minutes to generate? Is the NameNode overloaded, or is it just the large 
number of files? Is the input format (assuming you're using an InputFormat for 
splits) going beyond analyzing block boundaries to generate splits, for example 
by looking into file contents?

Some things you may want to look at:
- Ensure TCP no-delay is set up correctly. If it isn't, this can cause a fairly 
  large lag in NameNode communication.
- If your InputFormat is FileInputFormat-based, you could try setting the number 
  of threads used for FileInputFormat split generation.

There's definitely complexity in pipelining splits. TEZ-1076, IIRC, has some 
references to this. Beyond input initializers supporting this, the InputFormat 
API itself is restrictive, and that's not something that can be changed easily.

I have been thinking a bit about handling splits, especially from the 
perspective of dynamic grouping. I will create JIRAs over the next few days, 
some of which could help here, but they require work and time.

Thanks
- Sid
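
Editorial sketch for the two suggestions above. The message does not name any
settings; the property names below (ipc.client.tcpnodelay and
mapreduce.input.fileinputformat.list-status.num-threads) are an assumption
about what is meant and should be checked against the Hadoop version in use.
The thread count of 8 is an arbitrary example value, and to_d_options is a
hypothetical helper that only renders the properties as -D style options.

```python
# Assumed mapping of the two suggestions to Hadoop property names.
# Verify names and defaults against the documentation of your Hadoop version.
SPLIT_TUNING = {
    # Disable Nagle's algorithm for client-side Hadoop IPC (e.g. NameNode calls).
    "ipc.client.tcpnodelay": "true",
    # Threads FileInputFormat uses to list input files; 8 is an arbitrary example.
    "mapreduce.input.fileinputformat.list-status.num-threads": "8",
}


def to_d_options(props):
    """Render properties as -Dkey=value arguments, sorted by key."""
    return ["-D{}={}".format(key, value) for key, value in sorted(props.items())]


if __name__ == "__main__":
    print(" ".join(to_d_options(SPLIT_TUNING)))
```

The same two properties could equally be set in the job configuration; the
command-line form is used here only because it is easy to try out.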





RE: streamed splitting

2015-03-12 Thread Bikas Saha
That's not it. Please open a new one. Thanks!

Johannes Zillmann wrote:
> Regarding the JIRA, the closest thing I found is
> https://issues.apache.org/jira/browse/TEZ-1166
> Should I add to this or create a new one?



Re: streamed splitting

2015-03-12 Thread Johannes Zillmann
So.. it's complex ;)
Regarding the JIRA, the closest thing I found is 
https://issues.apache.org/jira/browse/TEZ-1166
Should I add to this or create a new one?

Johannes




Re: streamed splitting

2015-03-12 Thread Hitesh Shah
Hello Johannes, 

This is something we have discussed quite often but have not gotten around to 
implementing. There might be an open JIRA related to “pipelining” of 
splits. If you cannot find it, please go ahead and create one.

The general issues with this are:
   - how to handle dynamic creation of tasks as splits get created
   - how to decide how many splits and which splits a single task should handle
   - it involves some facet of grouping to do optimal allocation of newly
     created splits based on available containers. The size of groups could
     differ, e.g. a single grouped split could consist of either 5 data-local
     splits, 2 rack-local splits, or 1 off-rack split when assigning
     dynamically to a given container.
   - the single-task limit also plays into how you handle fault tolerance and
     recovery
   - given that split creation is now dynamic, if the AM crashes in a scenario
     where not all splits were created but some were already processed, the
     next attempt, when it recovers, needs to handle this in such a way as to
     ensure correctness of data processing.

thanks
— Hitesh
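
Editorial sketch of the grouping point in the list above: when a container
becomes available, hand it a group whose size depends on how local the pending
splits are. The sizes 5, 2 and 1 are the example numbers from the message. The
names Split, GROUP_SIZE and pick_group are hypothetical, and this is not how
Tez implements grouping, only a minimal illustration of the idea.

```python
from collections import namedtuple

# A pending split with the host and rack where its data lives (hypothetical model).
Split = namedtuple("Split", ["name", "host", "rack"])

# Group sizes per locality tier, taken from the example in the message.
GROUP_SIZE = {"node": 5, "rack": 2, "off_rack": 1}


def pick_group(pending, host, rack):
    """Remove and return (tier, splits) for a container running on (host, rack).

    Prefers data-local splits, then rack-local ones, then any remaining split.
    Returns (None, []) when nothing is pending.
    """
    tiers = [
        ("node", lambda s: s.host == host),
        ("rack", lambda s: s.rack == rack),
        ("off_rack", lambda s: True),
    ]
    for tier, matches in tiers:
        group = [s for s in pending if matches(s)][:GROUP_SIZE[tier]]
        if group:
            for s in group:
                pending.remove(s)
            return tier, group
    return None, []
```

With streamed splitting, pending would keep growing while containers are
being served, which is part of why deciding the group size up front is hard.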




Re: streamed splitting

2015-03-12 Thread Johannes Zillmann
Hey Jeff,

so one scenario I recently encountered was a job on about 300,000 files in 
HDFS.
The splitting alone took 21 minutes. So I thought that by the time the splitting 
had completed, a lot of splits could already have been processed… 

Thanks for your answer!
Johannes




Re: streamed splitting

2015-03-12 Thread Jianfeng (Jeff) Zhang

Hi Johannes,

If the input initializer is not done, workers cannot be started.
What's your scenario? Why do you want to start the workers before
split generation is finished? Just to save launch time, or to let the
workers do other stuff?


Best Regards,
Jeff Zhang








streamed splitting

2015-03-12 Thread Johannes Zillmann
Hey guys,

dumb question. With Tez, can I have an input initializer that doesn't require 
creating every split before the processing of already created splits starts?
That is, if I have a lot of splits and my splitting process takes a long time, 
can the workers start working while the splitting is still going on?

Johannes
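
Editorial sketch of what the question asks for, written as a plain
producer/consumer pipeline: splits are handed to workers through a bounded
queue while split generation is still running. This is not Tez API. All names
(generate_splits, worker, run, DONE) are hypothetical, and the "processing"
step is a placeholder that just upper-cases the split name.

```python
import queue
import threading

DONE = object()  # sentinel: the producer will not emit any more splits


def generate_splits(files, out):
    """Producer: emit one split per file as soon as it is known."""
    for name in files:
        out.put("split:" + name)  # blocks while the queue is full
    out.put(DONE)


def worker(inp, results, lock):
    """Consumer: process splits as they arrive, until the sentinel shows up."""
    while True:
        item = inp.get()
        if item is DONE:
            inp.put(DONE)  # pass the sentinel on so the other workers stop too
            return
        with lock:
            results.append(item.upper())  # placeholder for real processing


def run(files, num_workers=3):
    splits = queue.Queue(maxsize=4)  # small buffer, so producer and workers overlap
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(splits, results, lock))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    generate_splits(files, splits)  # workers are already consuming at this point
    for w in workers:
        w.join()
    return sorted(results)
```

The sketch leaves out the hard parts of doing this in a real engine, such as
deciding how many tasks to create and recovering after a failure when only
some splits were generated and processed.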