[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-12 Thread Arun Suresh (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.012.patch

Done. Thanks, [~jianhe].

> Core changes in NodeManager to support for upgrade and rollback of Containers
> -
>
> Key: YARN-5620
> URL: https://issues.apache.org/jira/browse/YARN-5620
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Arun Suresh
>Assignee: Arun Suresh
> Attachments: YARN-5620.001.patch, YARN-5620.002.patch, 
> YARN-5620.003.patch, YARN-5620.004.patch, YARN-5620.005.patch, 
> YARN-5620.006.patch, YARN-5620.007.patch, YARN-5620.008.patch, 
> YARN-5620.009.patch, YARN-5620.010.patch, YARN-5620.011.patch, 
> YARN-5620.012.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to 
> support upgrade of a running container with a new {{ContainerLaunchContext}} 
> as well as the ability to rollback the upgrade if the container is not able 
> to restart using the new launch Context. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-12 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.011.patch

Thanks, [~jianhe]. Uploading patch (v011) with the changes.

I left CLEANUP_CONTAINER_FOR_REINIT in, even though it does the same thing as 
CLEANUP_CONTAINER. Since it is sent by a different source, it can be used for 
debugging etc.
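A toy illustration of keeping two event types that trigger identical cleanup. Only the two event names come from the comment. The `Launcher` class and its audit log are hypothetical and are not the NM's actual launcher code. The point is that the event type changes nothing about the work done, only what gets recorded, so a debugger can tell a reinit-driven cleanup from an ordinary one.

```python
from enum import Enum, auto


class LauncherEvent(Enum):
    # Event names from the comment above; everything else is hypothetical.
    CLEANUP_CONTAINER = auto()
    CLEANUP_CONTAINER_FOR_REINIT = auto()


class Launcher:
    """Toy launcher: both cleanup events run the same cleanup code."""

    def __init__(self):
        self.audit_log = []

    def handle(self, container_id, event):
        # The event type does not change the work done, only what is logged,
        # so a debugger can see whether a cleanup came from a reinit.
        self.audit_log.append(f"{container_id}: {event.name}")
        return self._cleanup(container_id)

    def _cleanup(self, container_id):
        return f"cleaned up {container_id}"
```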


[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-12 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.010.patch

Fixing failed tests (the _TestDefaultContainerExecutor_ error seems to be 
unrelated) and some more checkstyle issues.


[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-12 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.009.patch

Updating patch.
* Addressing [~jianhe]'s latest comments
* some javadoc, checkstyle and javac fixes

bq. IIUC, in this case, the ContainerImpl will receive the KILL event first and 
move to the KILLING state, and the CONTAINER_KILLED_ON_REQUEST will be sent to 
the container at KILLING state..
It goes to the KILLING state only if the AM explicitly sends a kill signal or 
the RM asks the NM to kill it. It is also possible that an admin logs into the 
NM and does a 'kill -9', which will also cause the ContainerLaunch to send 
CONTAINER_KILLED_ON_REQUEST, but the container won't be in the KILLING state. 
Right?
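The two kill paths described above can be shown as a toy transition table. The reply itself ends with a question, so this models the author's understanding rather than verified NM behaviour. KILL, KILLING and CONTAINER_KILLED_ON_REQUEST are the names used in the discussion. KILLED and KILLED_EXTERNALLY are hypothetical placeholders, not the real ContainerImpl states.

```python
# Simplified transition table. KILL, KILLING and CONTAINER_KILLED_ON_REQUEST
# come from the discussion above; the other state names are hypothetical.
TRANSITIONS = {
    # The AM or RM asked for the kill: the container passes through KILLING.
    ("RUNNING", "KILL"): "KILLING",
    ("KILLING", "CONTAINER_KILLED_ON_REQUEST"): "KILLED",
    # An admin ran `kill -9` on the process: ContainerLaunch still reports
    # CONTAINER_KILLED_ON_REQUEST, but the container is still RUNNING.
    ("RUNNING", "CONTAINER_KILLED_ON_REQUEST"): "KILLED_EXTERNALLY",
}


def replay(state, events):
    """Apply events in order; an unlisted (state, event) pair raises KeyError."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state
```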

bq. ..In testContainerUpgradeSuccess, could you make newStartFile a new upgrade 
resource, and verify the output is written into it, this verifies the part 
about the localization part as well.
Actually, if you look at the _prepareContainerUpgrade()_ function, we create a 
new script file *scriptFile_new*, which is passed into the 
_prepareContainerLaunchContext()_ function, which associates the new file with 
a new *dest_file_new* location. This should verify that the upgrade needed a new 
localized resource. The output of the script is also written to a new 
*start_file_n.txt*, which we read to verify that the new process has actually 
started.

Also by the way:

bq. We can use the ResourceSet#getAllResourcesByVisibility method instead, and 
so the getLocalPendingRequests method and the new constructor in 
ContainerLocalizationRequestEvent is not needed
The problem with getAllResourcesByVisibility is that it gets all resources; I 
only need the pending resources. So if you are OK with it, I'd like to keep it 
as is.




[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-10 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.008.patch

Uploading patch addressing [~jianhe]'s suggestions.
* Refactored to use a new REINITIALIZING state
* Handled race conditions to properly disallow relocalization and 
reinitialization while a container is undergoing reinitialization.
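The REINITIALIZING guard can be pictured with a toy container model. The real change lives in the NM's ContainerImpl state machine, which is not reproduced here. `ToyContainer`, `ReinitRejected` and the method names are hypothetical. The sketch only shows the rule: while a container is REINITIALIZING, both a second reinit and a relocalization are refused.

```python
class ReinitRejected(Exception):
    """Raised when a request conflicts with an in-flight reinitialization."""


class ToyContainer:
    """Hypothetical container model; only the REINITIALIZING guard is shown."""

    def __init__(self, launch_context):
        self.state = "RUNNING"
        self.launch_context = launch_context
        self.next_context = None

    def reinitialize(self, new_context):
        # A second reinit while one is in flight is rejected.
        if self.state != "RUNNING":
            raise ReinitRejected(f"cannot reinitialize in state {self.state}")
        self.state = "REINITIALIZING"
        self.next_context = new_context

    def relocalize(self):
        # Relocalization is also disallowed during reinitialization.
        if self.state == "REINITIALIZING":
            raise ReinitRejected("cannot relocalize while reinitializing")
        return "relocalizing"

    def on_new_resources_localized(self):
        # Switch to the new launch context and leave REINITIALIZING.
        if self.state == "REINITIALIZING":
            self.launch_context, self.next_context = self.next_context, None
            self.state = "RUNNING"
```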



[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-08 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.007.patch

Fixing checkstyle, javadoc and javac issues



[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-08 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.006.patch

Uploading a patch addressing most of [~vvasudev]'s and [~jianhe]'s suggestions. 
Thanks for the comments!

[~vvasudev],

bq. Should there be a guard against calling reinit if a reinit is already in 
progress? Could we end up with the ReInitContext in odd state?
There is already a guard in the ContainerManager API, but as per your 
suggestion I have included an additional check in the transition in the new 
patch.

bq. Instead of a launch event we should send a relaunch event - the relaunch 
takes care of trying to run in same work dir as the earlier attempt, etc
I actually tried using relaunch initially, but it looks like the pid has to be 
running for the relaunch to work correctly. It also looks like we would need an 
intermediate state there too, which would result in the same (or a larger) 
amount of code change. I would prefer to use launch itself, since I am more 
confident of how it works. I have also updated the test case to verify that the 
upgraded container has access to, and is able to read, files created by the 
previous process in the working directory.
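The property the updated test checks can be demonstrated with two OS processes that share a working directory. A fresh launch of the new version must still see what the old process wrote. This is a minimal sketch using Python's `subprocess` module, with short Python snippets standing in for the container processes. It is not the NM's launch code and not the actual test. The file name `state.txt` is hypothetical.

```python
import subprocess
import sys
import tempfile


def launch(work_dir, code):
    """Run a Python snippet as a stand-in 'container process' in work_dir."""
    done = subprocess.run(
        [sys.executable, "-c", code],
        cwd=work_dir, capture_output=True, text=True, check=True,
    )
    return done.stdout


def upgrade_keeps_work_dir():
    with tempfile.TemporaryDirectory() as work_dir:
        # The old process leaves a file behind in its working directory.
        launch(work_dir, "with open('state.txt', 'w') as f: f.write('from v1')")
        # The upgraded process is started with a fresh launch (not a relaunch)
        # in the same directory and can read what the old process wrote.
        return launch(work_dir, "print(open('state.txt').read())")
```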

bq. I think an explicit commit API (with auto-commit option being the default 
option) should satisfy both use cases.
Thanks. I will update the patch with it once we agree that the reinit flow is 
fine.
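The suggested explicit commit API, with auto-commit as the default, can be illustrated with a toy model. It is a minimal sketch, assuming that auto-commit means "drop the rollback point as soon as the upgraded process starts". That trigger is an assumption, not something the discussion specifies. `ToyContainer` and its methods are hypothetical.

```python
class ToyContainer:
    """Hypothetical model of upgrade + commit, with auto-commit as the default."""

    def __init__(self, launch_context):
        self.launch_context = launch_context
        self.rollback_context = None  # set while an upgrade is uncommitted
        self.auto_commit = True

    def upgrade(self, new_context, auto_commit=True):
        self.rollback_context = self.launch_context
        self.launch_context = new_context
        self.auto_commit = auto_commit

    def on_upgraded_process_started(self):
        # With auto-commit, the rollback point is discarded immediately;
        # otherwise it is kept until the AM calls commit() explicitly.
        if self.auto_commit:
            self.rollback_context = None

    def commit(self):
        self.rollback_context = None

    def can_rollback(self):
        return self.rollback_context is not None
```

With the default, callers that never think about commits get the simple behaviour. An AM that wants a probation period passes `auto_commit=False` and commits when it is satisfied.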

[~jianhe],

bq. While AM issues the upgrade command, the container could exit with success 
or failure. In this case, should we still continue the upgrade process?
I am nullifying the reInitContext in the event of an explicit kill, or if the 
process completes successfully during the reInit, so the upgrade should be 
cancelled. Do take a look at the latest patch and let me know if you think I've 
covered all cases.
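The cancellation rule can be modelled with a toy container that tracks the reInitContext. Only the two cases named in the reply are covered: an explicit kill, and the process completing successfully during the reinit. What happens on a failed exit is deliberately left out, since the reply does not say. All names besides reInitContext are hypothetical.

```python
class ToyContainer:
    """Hypothetical model: only the reInitContext bookkeeping is shown."""

    def __init__(self):
        self.state = "RUNNING"
        self.reinit_context = None

    def start_reinit(self, new_context):
        self.reinit_context = {"new_context": new_context}
        self.state = "REINITIALIZING"

    def on_explicit_kill(self):
        # An explicit kill during the reinit cancels the pending upgrade.
        self.reinit_context = None
        self.state = "KILLING"

    def on_process_succeeded(self):
        # The process finished successfully on its own during the reinit,
        # so there is nothing left to upgrade: cancel it as well.
        self.reinit_context = None
        self.state = "EXITED_WITH_SUCCESS"

    def upgrade_pending(self):
        return self.reinit_context is not None
```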
 


[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-08 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.005.patch

Uploading an updated patch with minor test case fixes


[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-08 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.004.patch

[~jianhe], as per your suggestion, I am uploading a patch with just the 
container restart, for your review convenience. I renamed it *reInitialize* to 
signify that the restart depends on the container being re-initialized with new 
bits.

But, as per my previous comments, I do believe that we should not expose an 
upgrade without a rollback to the previous launch context (both an implicit 
rollback based on the failure policy and an explicit rollback API).

I would thus prefer to update the same JIRA with the rollback and commit calls 
(once you are satisfied with the restart flow) rather than open separate JIRAs.

bq. the slider AM (also Yarn code) will have the prior context and call the 
upgradeContainer with the corresponding context, and so NM does not need to 
remember prior context.
Hmm... I still believe rollback to just the prior version should be supported by 
the NM. For rolling upgrades, at least in the production environments I have had 
experience with, it is an absolute requirement. The AM (Slider in our case) can 
subsequently _reinitialize_ to any version it chooses.
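The position argued above can be illustrated with a toy container model. The NM keeps a single slot holding the immediately previous launch context, enough for a one-step rollback. Any further version choices are left to the AM through another reinitialize. This is a minimal sketch with hypothetical names. It is not the NM implementation.

```python
class ToyContainer:
    """Hypothetical model: the NM remembers exactly one previous context."""

    def __init__(self, launch_context):
        self.launch_context = launch_context
        self.previous_context = None  # a single slot, not a version history

    def reinitialize(self, new_context):
        # The AM may move to any version; only the one being replaced is kept.
        self.previous_context = self.launch_context
        self.launch_context = new_context

    def rollback(self):
        if self.previous_context is None:
            raise RuntimeError("no previous launch context to roll back to")
        self.launch_context = self.previous_context
        self.previous_context = None
```

Two reinitializes followed by one rollback land on the middle version, not the original one. Going further back is the AM's job.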


[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-07 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.003.patch

Updating patch
* Adding more test coverage
* Fixing some javadoc and checkstyle issues
* Fixing the failed test cases (the {{TestDefaultContainerExecutor}} failures 
don't seem to be related to this patch though)


[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-06 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.002.patch

Uploading updated patch:
* Added support for explicit rollback, if the upgrade has not been committed.
* Some minor code cleanup




[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers

2016-09-06 Thread Arun Suresh (JIRA)

Arun Suresh updated YARN-5620:
--
Attachment: YARN-5620.001.patch

Attaching initial patch based on some offline ideas from [~jianhe], [~vinodkv] 
etc.

I haven't included the API changes with this patch. I have just added 
{{upgradeContainer}} and {{commitUpgrade}} methods to the 
{{ContainerManagerImpl}} to test the end-to-end flow via test cases.

The patch assumes the following:
* The container is restarted only after ALL the required resources are 
localized.
* If the relaunch of the container with the new bits fails, the container will 
be rolled back.
* Rollback involves reverting to the old launch context and restarting.
* It is up to the AM to call {{commitUpgrade}} once the container upgrade has 
completed, to ensure that if the container fails after the upgrade, it is not 
rolled back. This is required because if the container fails for some reason 
after the upgrade, there is no way to tell whether the failure was caused by the 
upgrade or by something else.
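The assumptions listed above can be strung together into one flow. It is a minimal sketch, assuming callbacks for "which resources are still missing" and "did the launch succeed". The function names, the dict-based container and the status strings are hypothetical, and none of this is the code in the patch.

```python
def upgrade(container, new_context, missing_resources, launch):
    """Hypothetical upgrade flow following the assumptions listed above.

    missing_resources(ctx) returns resources not yet localized for ctx;
    launch(ctx) returns True if the process started successfully.
    """
    # 1. Restart only after ALL resources for the new context are localized.
    if missing_resources(new_context):
        return "WAITING_FOR_RESOURCES", container["context"]
    old_context = container["context"]
    container["context"] = new_context
    if launch(new_context):
        # 2. Keep the old context so that, until the AM commits, a failure
        #    can still be treated as caused by the upgrade and rolled back.
        container["rollback_to"] = old_context
        return "UPGRADED", new_context
    # 3. Relaunch with the new bits failed: revert to the old context and restart.
    container["context"] = old_context
    launch(old_context)
    return "ROLLED_BACK", old_context


def commit_upgrade(container):
    # After the commit, a failure is no longer treated as caused by the upgrade.
    container.pop("rollback_to", None)
```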
