[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
[ https://issues.apache.org/jira/browse/YARN-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.012.patch

Done. Thanks [~jianhe].

> Core changes in NodeManager to support for upgrade and rollback of Containers
> ------------------------------------------------------------------------------
>
>                 Key: YARN-5620
>                 URL: https://issues.apache.org/jira/browse/YARN-5620
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-5620.001.patch, YARN-5620.002.patch,
> YARN-5620.003.patch, YARN-5620.004.patch, YARN-5620.005.patch,
> YARN-5620.006.patch, YARN-5620.007.patch, YARN-5620.008.patch,
> YARN-5620.009.patch, YARN-5620.010.patch, YARN-5620.011.patch,
> YARN-5620.012.patch
>
>
> JIRA proposes to modify the ContainerManager (and other core classes) to
> support upgrade of a running container with a new {{ContainerLaunchContext}}
> as well as the ability to rollback the upgrade if the container is not able
> to restart using the new launch Context.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.011.patch

Thanks [~jianhe]. Uploading patch (v011) with the changes.

I left CLEANUP_CONTAINER_FOR_REINIT in place, even though it does the same thing as CLEANUP_CONTAINER. Since it is sent by a different source, it can be used for debugging etc.
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.010.patch

Fixing failed tests (the _TestDefaultContainerExecutor_ error seems to be unrelated) and some more checkstyle issues.
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.009.patch

Updating patch.
* Addressing [~jianhe]'s latest comments
* Some javadoc, checkstyle and javac fixes

bq. IIUC, in this case, the ContainerImpl will receive the KILL event first and move to the KILLING state, and the CONTAINER_KILLED_ON_REQUEST will be sent to the container at KILLING state..
It goes to the KILLING state only if the AM explicitly sends a kill signal or the RM asks the NM to kill the container. It is also possible that an admin logs into the NM and does a 'kill -9', which will also cause the ContainerLaunch to send CONTAINER_KILLED_ON_REQUEST, but the container won't be in the KILLING state, right?

bq. ..In testContainerUpgradeSuccess, could you make newStartFile a new upgrade resource, and verify the output is written into it, this verifies the part about the localization part as well.
Actually, if you look at the _prepareContainerUpgrade()_ function, we create a new script file *scriptFile_new*, which is passed into the _prepareContainerLaunchContext()_ function, which in turn associates the new file with a new *dest_file_new* location. This should verify that the upgrade needed a new localized resource. The output of the script is also written to a new *start_file_n.txt*, which we read to verify that the new process has actually started.

Also, by the way:
bq. We can use the ResourceSet#getAllResourcesByVisibility method instead, and so the getLocalPendingRequests method and the new constructor in ContainerLocalizationRequestEvent is not needed
The problem with getAllResourcesByVisibility is that it gets all resources. I just need the pending resources. So if you are OK with it, I'd like to keep it as is.
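To illustrate the difference between "all resources" and "pending resources" discussed above, here is a minimal sketch. It is not the actual {{ResourceSet}} code: the class {{PendingResourcesSketch}}, its {{Visibility}} enum and every method name are hypothetical stand-ins, and resources are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a resource set; not the NodeManager's ResourceSet. */
final class PendingResourcesSketch {
    enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    // Every resource ever requested for the container, grouped by visibility.
    private final Map<Visibility, List<String>> all = new EnumMap<>(Visibility.class);
    // Resources that have already been localized.
    private final Set<String> localized = new HashSet<>();

    void add(Visibility visibility, String resource) {
        all.computeIfAbsent(visibility, k -> new ArrayList<>()).add(resource);
    }

    void markLocalized(String resource) {
        localized.add(resource);
    }

    /** Returns a copy of everything, whether or not it is already localized. */
    Map<Visibility, List<String>> allByVisibility() {
        Map<Visibility, List<String>> copy = new EnumMap<>(Visibility.class);
        all.forEach((v, resources) -> copy.put(v, new ArrayList<>(resources)));
        return copy;
    }

    /** Returns only the resources that still need to be localized. */
    Map<Visibility, List<String>> pendingByVisibility() {
        Map<Visibility, List<String>> pending = new EnumMap<>(Visibility.class);
        all.forEach((v, resources) -> {
            List<String> left = new ArrayList<>();
            for (String r : resources) {
                if (!localized.contains(r)) {
                    left.add(r);
                }
            }
            if (!left.isEmpty()) {
                pending.put(v, left);
            }
        });
        return pending;
    }

    public static void main(String[] args) {
        PendingResourcesSketch rs = new PendingResourcesSketch();
        rs.add(Visibility.PRIVATE, "dest_file");      // from the original launch
        rs.markLocalized("dest_file");
        rs.add(Visibility.PRIVATE, "dest_file_new");  // added by the upgrade
        System.out.println("all:     " + rs.allByVisibility());
        System.out.println("pending: " + rs.pendingByVisibility());
    }
}
```

In this model, a localization request built from {{pendingByVisibility()}} asks only for the resources the upgrade introduced, whereas one built from {{allByVisibility()}} would also include resources that were localized for the original launch.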
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.008.patch

Uploading patch addressing [~jianhe]'s suggestions.
* Refactored to use a new REINITIALIZING state
* Handle race conditions to properly disallow relocalization and reinitialization while a container is undergoing reinitialization.
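As a rough illustration of the second bullet, the sketch below rejects both relocalization and a second reinitialization while the container is in a REINITIALIZING state. Only the state name comes from the patch notes; the class {{ReInitStateSketch}} and its methods are hypothetical, and {{synchronized}} is used here merely as a simple stand-in for however the real container state machine serializes events.

```java
/** Hypothetical model of the state-based guard; not the real ContainerImpl. */
final class ReInitStateSketch {
    enum State { RUNNING, REINITIALIZING }

    private State state = State.RUNNING;

    /** Starts a reinitialization unless one is already in progress. */
    synchronized boolean tryReInitialize() {
        if (state != State.RUNNING) {
            return false; // a reinit is already under way: reject
        }
        state = State.REINITIALIZING;
        return true;
    }

    /** Relocalization is only allowed while the container is simply running. */
    synchronized boolean tryReLocalize() {
        return state == State.RUNNING;
    }

    /** Called once the container has been relaunched with the new bits. */
    synchronized void reInitCompleted() {
        state = State.RUNNING;
    }

    synchronized State getState() {
        return state;
    }

    public static void main(String[] args) {
        ReInitStateSketch container = new ReInitStateSketch();
        System.out.println("first reinit accepted:  " + container.tryReInitialize());
        System.out.println("second reinit accepted: " + container.tryReInitialize());
        System.out.println("relocalize accepted:    " + container.tryReLocalize());
        container.reInitCompleted();
        System.out.println("relocalize after done:  " + container.tryReLocalize());
    }
}
```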
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.007.patch

Fixing checkstyle, javadoc and javac issues.
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.006.patch

Uploading patch addressing most of [~vvasudev]'s and [~jianhe]'s suggestions. Thanks for the comments!

[~vvasudev],
bq. Should there be a guard against calling reinit if a reinit is already in progress? Could we end up with the ReInitContext in odd state?
There is already a guard in the ContainerManager API, but I have included an additional check in the transition in the new patch, as per your suggestion.

bq. Instead of a launch event we should send a relaunch event - the relaunch takes care of trying to run in same work dir as the earlier attempt, etc
I actually tried using relaunch initially, but it looks like the pid has to be running for the relaunch to work correctly. Also, it looks like we would need an intermediate state there too, which would result in the same (or a larger) amount of code change. I would actually prefer to use launch itself, since I am more confident of how it works. I have also updated the test case to verify that the upgraded container has access to, and is able to read, files created by the previous process in the working directory.

bq. think an explicit commit API(with auto-commit option being the default option) should satisfy both use cases.
Thanks. I will update the patch with it once we agree that the reinit flow is fine.

[~jianhe],
bq. While AM issues the upgrade command, the container could exit with success or failure. in this case, should we still continue the upgrade process ?
I am nullifying the reInitContext in the event of an explicit kill, or if the process completes successfully during the reinit. The upgrade should thus be cancelled. Do take a look at the latest patch and let me know if you think I've covered all cases.
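The guard and the cancellation described above can be sketched roughly as follows. This is a simplified model, not code from the patch: {{ReInitCancelSketch}} and its method names are hypothetical, launch contexts are modelled as plain strings, and only the idea of nulling out a pending reinit context is taken from the comment.

```java
/** Hypothetical model of "nullify the reInitContext to cancel"; not patch code. */
final class ReInitCancelSketch {
    private String launchContext;   // context the container currently runs with
    private String reInitContext;   // non-null only while a reinit is pending

    ReInitCancelSketch(String initialContext) {
        this.launchContext = initialContext;
    }

    /** Guard: refuse a new reinit while one is already pending. */
    boolean requestReInit(String newContext) {
        if (reInitContext != null) {
            return false;
        }
        reInitContext = newContext;
        return true;
    }

    /** An explicit kill or a successful exit cancels any pending reinit. */
    void onKilledOrExitedSuccessfully() {
        reInitContext = null;
    }

    /**
     * Called when the old process has been cleaned up. Returns true if the
     * container should be launched again with the new context.
     */
    boolean onOldProcessCleanedUp() {
        if (reInitContext == null) {
            return false; // the reinit was cancelled, or never requested
        }
        launchContext = reInitContext;
        reInitContext = null;
        return true;
    }

    String launchContext() {
        return launchContext;
    }

    public static void main(String[] args) {
        ReInitCancelSketch container = new ReInitCancelSketch("v1");
        container.requestReInit("v2");
        container.onKilledOrExitedSuccessfully(); // the AM killed it mid-upgrade
        System.out.println("relaunch: " + container.onOldProcessCleanedUp()
            + ", context: " + container.launchContext());
    }
}
```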
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.005.patch

Uploading an updated patch with minor test case fixes.
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.004.patch

[~jianhe],
As per your suggestion, I am uploading a patch with just the container restart, for your review convenience. I renamed it *reInitialize* to signify that the restart is dependent on the container being re-initialized with new bits.

But, as per my previous comments, I do believe that we should not expose an upgrade without a rollback to just the previous launch context (both implicit, based on the failure policy, as well as via an explicit rollback API). I would thus prefer to update the same JIRA with the rollback and commit calls (once you are satisfied with the restart flow) rather than open separate JIRAs.

bq. the slider AM (also Yarn code) will have the prior context and call the upgradeContainer with the corresponding context, and so NM does not need to remember prior context.
Hmm... I still believe rollback to just the prior version should be supported by the NM, and for rolling upgrades, at least in the production environments I have had experience with, it is an absolute requirement. The AM (Slider in our case) can subsequently _reinitialize_ to any version it chooses later on if it wants.
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.003.patch

Updating patch:
* Adding more test coverage
* Fixing some javadoc and checkstyle issues
* Fixing the failed test cases (the {{TestDefaultContainerExecutor}} failures don't seem to be related to this patch though)
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.002.patch

Uploading updated patch:
* Added support for explicit rollback, if the upgrade has not been committed.
* Some minor code cleanup
[jira] [Updated] (YARN-5620) Core changes in NodeManager to support for upgrade and rollback of Containers
Arun Suresh updated YARN-5620:
------------------------------
    Attachment: YARN-5620.001.patch

Attaching initial patch based on some offline ideas from [~jianhe], [~vinodkv] etc.

I haven't included the API changes with this patch. I have just added {{upgradeContainer}} and {{commitUpgrade}} methods to the {{ContainerManagerImpl}} to test the end-to-end flow via test cases.

The patch assumes the following:
* The container is restarted only after ALL the required resources are localized.
* If the relaunch of the container with the new bits fails, the container will be rolled back.
* Rollback involves reverting to the old launch context and restarting.
* It is up to the AM to call {{commitUpgrade}} once the container upgrade has completed, to ensure that if the container fails after the upgrade, it is not rolled back. This is required because, if the container fails for some reason after the upgrade, there is no way to distinguish whether the failure was caused by the upgrade or by something else.
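The assumptions listed above can be sketched as a small model of the upgrade, rollback and commit flow. This is only an illustration and not the patch itself: the method names {{upgradeContainer}} and {{commitUpgrade}} are borrowed from the message, but the class {{UpgradeFlowSketch}}, the {{onContainerFailure}} callback and the use of plain strings for launch contexts are hypothetical.

```java
import java.util.Objects;

/** Hypothetical model of the upgrade / rollback / commit flow; not NodeManager code. */
final class UpgradeFlowSketch {
    private String currentContext;   // launch context the container runs with
    private String previousContext;  // retained until commit so rollback is possible

    UpgradeFlowSketch(String initialContext) {
        this.currentContext = Objects.requireNonNull(initialContext);
    }

    /** Assumed to be called only after all resources for newContext are localized. */
    void upgradeContainer(String newContext) {
        Objects.requireNonNull(newContext);
        if (previousContext != null) {
            throw new IllegalStateException("an uncommitted upgrade is already pending");
        }
        previousContext = currentContext;
        currentContext = newContext;
    }

    /** The AM accepts the upgrade; later failures are no longer rolled back. */
    void commitUpgrade() {
        previousContext = null;
    }

    /**
     * Called when the container fails. Returns true if the container was
     * reverted to the old launch context and should be restarted with it.
     */
    boolean onContainerFailure() {
        if (previousContext == null) {
            return false; // nothing to roll back to (no upgrade, or already committed)
        }
        currentContext = previousContext;
        previousContext = null;
        return true;
    }

    String currentContext() {
        return currentContext;
    }

    public static void main(String[] args) {
        UpgradeFlowSketch container = new UpgradeFlowSketch("v1");
        container.upgradeContainer("v2");
        // The relaunch with the new bits fails before the AM commits.
        System.out.println(container.onContainerFailure() + " " + container.currentContext());
        // prints "true v1"
    }
}
```

Had {{commitUpgrade()}} been called before the failure, {{onContainerFailure()}} would return false and the container would stay on the new context, which mirrors the last bullet above.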