Re: Improving Apex relaunch time.

2016-09-21 Thread Sandesh Hegde
Relaunching from the same location can be one of the options.

On Tue, Sep 20, 2016, 10:17 PM Tushar Gosavi  wrote:

> In case of application failure, we would like to have the ability to
> quickly restart the application while keeping the old state for
> failure analysis. The problem also remains when we want to start
> from a savepoint, where we would need to copy state from the
> savepoint to the application.
>
> -Tushar.
>
>
>
> On Tue, Sep 20, 2016 at 8:34 PM, Sandesh Hegde 
> wrote:
> > How about re-launching the app from the same location?
> >
> > If they want to keep the state, we can provide a savepoint feature.
> >
> > On Tue, Sep 20, 2016 at 4:39 AM Tushar Gosavi 
> > wrote:
> >
> >> We have observed that application relaunch takes a long time.
> >> One major reason for the delay in application startup during relaunch
> >> is the time taken to copy the state of the existing application to the
> >> new application. This state can grow to gigabytes, and the copy is
> >> performed in a single thread before the new application is submitted
> >> to YARN.
> >>
> >> The state of the previous application consists of:
> >> - jars
> >> - stram checkpoint/recovery file.
> >> - events
> >> - container file
> >> - stats recording if enabled.
> >> - operator checkpoints
> >> - operator data.
> >>
> >> We could avoid copying debugging data such as the stats recording,
> >> which can run into terabytes for a long-running application and is
> >> not required for the new application to function.
> >>
> >> Similarly, operator checkpoints could be read in parallel when the
> >> operators are launched for the first time. This would also mean that
> >> only the required checkpoints are copied, and that the copy is done
> >> in parallel by multiple containers/threads.
> >>
> >> For operator data stored in the application directory, we could copy
> >> it completely for now, but in the future we could provide a callback
> >> that allows an operator partition to read only the required state
> >> from the previous location.
> >>
> >> Let me know your thoughts on this.
> >>
> >> Regards,
> >> - Tushar.
> >>
>


Re: Improving Apex relaunch time.

2016-09-20 Thread Sandesh Hegde
How about re-launching the app from the same location?

If they want to keep the state, we can provide a savepoint feature.

On Tue, Sep 20, 2016 at 4:39 AM Tushar Gosavi 
wrote:

> We have observed that application relaunch takes a long time.
> One major reason for the delay in application startup during relaunch
> is the time taken to copy the state of the existing application to the
> new application. This state can grow to gigabytes, and the copy is
> performed in a single thread before the new application is submitted
> to YARN.
>
> The state of the previous application consists of:
> - jars
> - stram checkpoint/recovery file.
> - events
> - container file
> - stats recording if enabled.
> - operator checkpoints
> - operator data.
>
> We could avoid copying debugging data such as the stats recording,
> which can run into terabytes for a long-running application and is not
> required for the new application to function.
>
> Similarly, operator checkpoints could be read in parallel when the
> operators are launched for the first time. This would also mean that
> only the required checkpoints are copied, and that the copy is done in
> parallel by multiple containers/threads.
>
> For operator data stored in the application directory, we could copy
> it completely for now, but in the future we could provide a callback
> that allows an operator partition to read only the required state
> from the previous location.
>
> Let me know your thoughts on this.
>
> Regards,
> - Tushar.
>


Improving Apex relaunch time.

2016-09-20 Thread Tushar Gosavi
We have observed that application relaunch takes a long time.
One major reason for the delay in application startup during relaunch
is the time taken to copy the state of the existing application to the
new application. This state can grow to gigabytes, and the copy is
performed in a single thread before the new application is submitted
to YARN.

The state of the previous application consists of:
- jars
- stram checkpoint/recovery file.
- events
- container file
- stats recording if enabled.
- operator checkpoints
- operator data.

We could avoid copying debugging data such as the stats recording,
which can run into terabytes for a long-running application and is not
required for the new application to function.
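A rough sketch of what skipping the debugging data during the copy
could look like. It uses plain java.nio on a local file system only to
stay self-contained; the real copy would presumably go through the
Hadoop file system API. The class name, the method name and the "stats"
directory name are all assumptions made for illustration, not the
actual layout or code.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Set;

public class SelectiveStateCopy {

    /**
     * Copies the directory tree under src into dst, skipping every
     * top-level directory whose name is listed in skipNames
     * (for example a hypothetical "stats" directory).
     */
    public static void copyState(Path src, Path dst, Set<String> skipNames)
            throws IOException {
        Files.walkFileTree(src, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                Path rel = src.relativize(dir);
                // Only directories directly under src are candidates for skipping.
                if (rel.getNameCount() == 1 && skipNames.contains(rel.toString())) {
                    return FileVisitResult.SKIP_SUBTREE;
                }
                Files.createDirectories(dst.resolve(rel));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.copy(file, dst.resolve(src.relativize(file)),
                        StandardCopyOption.REPLACE_EXISTING);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

Because the skipped subtree is never walked, the cost of the copy no
longer depends on how much stats data the old application accumulated.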

Similarly, operator checkpoints could be read in parallel when the
operators are launched for the first time. This would also mean that
only the required checkpoints are copied, and that the copy is done in
parallel by multiple containers/threads.
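To illustrate the idea of copying only the required checkpoints in
parallel, here is a minimal sketch. Threads in a single JVM stand in
for the containers, and java.nio stands in for the distributed file
system. The class name, the method name and the relative paths passed
in are hypothetical; how the list of required checkpoints would be
determined is not covered here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCheckpointCopy {

    /**
     * Copies only the listed files (paths relative to oldRoot) into
     * newRoot, using a fixed pool of worker threads.
     * Returns the number of files copied.
     */
    public static int copyRequired(Path oldRoot, Path newRoot,
                                   List<String> requiredFiles, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String rel : requiredFiles) {
                // Each file is an independent task, so copies can overlap.
                pending.add(pool.submit(() -> {
                    Path target = newRoot.resolve(rel);
                    Files.createDirectories(target.getParent());
                    Files.copy(oldRoot.resolve(rel), target,
                            StandardCopyOption.REPLACE_EXISTING);
                    return null;
                }));
            }
            // Wait for all copies; a failed copy surfaces as ExecutionException.
            for (Future<?> f : pending) {
                f.get();
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

Checkpoints that are not in the list are never touched, so stale
checkpoint windows from the old application are left behind.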

For operator data stored in the application directory, we could copy
it completely for now, but in the future we could provide a callback
that allows an operator partition to read only the required state
from the previous location.
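One possible shape for such a callback, purely as a discussion aid.
Nothing here is an existing Apex interface: PreviousStateAware,
restoreFromPrevious and the "data/partition-N" layout are invented for
this sketch, and java.nio again stands in for the real file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RestoreCallbackSketch {

    /** Hypothetical callback; this is NOT an existing Apex interface. */
    public interface PreviousStateAware {
        /**
         * Called once when the partition is first launched after a
         * relaunch, with the directory of the previous application.
         */
        void restoreFromPrevious(Path previousAppDir, int partitionId)
                throws IOException;
    }

    /** Toy operator partition that reads only its own file. */
    public static class CounterPartition implements PreviousStateAware {
        public String restored = "";

        @Override
        public void restoreFromPrevious(Path previousAppDir, int partitionId)
                throws IOException {
            // Assumed layout: one file per partition under "data/".
            Path mine = previousAppDir.resolve("data")
                    .resolve("partition-" + partitionId);
            restored = new String(Files.readAllBytes(mine), StandardCharsets.UTF_8);
        }
    }
}
```

With something like this, data belonging to other partitions is never
read or copied by the partition being restored.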

Let me know your thoughts on this.

Regards,
- Tushar.