[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430973#comment-15430973 ] Jason Lowe commented on YARN-3998: -- bq. it looks like we throw an exception when we encounter a field we don't know of. Yeah, I misremembered how unexpected keys were handled. We should fix that to make it easier to do rolling downgrades. Bonus points for having it clean up all unrecognized keys underneath a container "directory" when the container completes so we don't leak unknown keys after a downgrade. In the meantime we'll have to treat it as an incompatible change with respect to downgrades. As I mentioned on YARN-5049 if we don't create these new store keys unless the feature is being used then at least we have the benefit of not breaking downgrades until the feature is enabled. Users that don't turn on the feature can still downgrade if needed. > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430896#comment-15430896 ] Varun Vasudev commented on YARN-3998: - Thanks for the comment Jason! bq. I believe the old software will ignore unrecognized keys in the state store, so we may be OK with a rolling downgrade as long as the resulting behavior is expected I think that's not the case. From {code}NMLeveldbStateStoreService#loadContainerState{code}, it looks like we throw an exception when we encounter a field we don't know of. {code} } else { throw new IOException("Unexpected container state key: " + key); } {code} Should we file a ticket to get this changed? The rest of your comment makes sense - I don't see any reason to flag this as an incompatible change. > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430822#comment-15430822 ] Jason Lowe commented on YARN-3998: -- I believe the old software will ignore unrecognized keys in the state store, so we may be OK with a rolling downgrade as long as the resulting behavior is expected. This feature adds the ability for the NM to re-launch containers, so if we downgrade we lose that ability. That means that containers will just fail, but I think we're OK there. We lose the optimization for a faster restart of the container, but AFAIK we don't outright lose containers like we could with the feature added in YARN-5049. One issue is that I'm not sure the old software will clean out the new keys when a container completes, so we might leak some keys in the state store during a rolling downgrade. We could bump the minor version although it's unused. The minor version carries no meaning currently, although it could potentially provide clues to code that needs to do a schema migration across major versions. In practice I suspect it still won't be used since any such migration will take into account all the keys it knows about, and at that point it doesn't need to look at the minor version. I can't think of a case where we would need the minor version info to properly implement the migration cases rather than have the migration code auto-detect based on what keys it finds in the store. It doesn't look to me that this needs to be flagged as an incompatible change unless I'm missing something with the semantics of the container re-launch. > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430517#comment-15430517 ] Varun Vasudev commented on YARN-3998: - My apologies for missing this [~asuresh]. What do you think the right fix is? > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430110#comment-15430110 ] Naganarasimha G R commented on YARN-3998: - [~asuresh], Do we need to flag it as *"Incompatible change"* ? > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430039#comment-15430039 ] Jun Gong commented on YARN-3998: Thanks [~asuresh] for pointing it out. I did not notice it. Yes, the CURRENT_VERSION_INFO need be updated, it might be enough to update minor version since it is compatible. I noticed that YARN-5049 which is committed after this issue has changed the CURRENT_VERSION_INFO, so we do not need change it in this issue again? > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428730#comment-15428730 ] Arun Suresh commented on YARN-3998: --- I was just going thru the edit log of {{NMLevelsbStateStoreService}}. It looks like this patch introduces a schema change (addition of /remainingRetryAttempts, /logdir etc.) In which case, the CURRENT_VERSION_INFO should be updated. > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265052#comment-15265052 ] Jun Gong commented on YARN-3998: Thanks [~vvasudev] for all the help and commit! Thanks [~vinodkv]'s suggestions. > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263912#comment-15263912 ] Hudson commented on YARN-3998: -- FAILURE: Integrated in Hadoop-trunk-Commit #9691 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9691/]) YARN-3998. Add support in the NodeManager to re-launch containers. (vvasudev: rev 0f25a1bb52bc56661fd020a6ba82df99f8c6ef1f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestPBImplRecords.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/Container.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerRetryContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerState.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ProtoUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerLaunchContext.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/MockContainer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerRetryPolicy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerLaunchContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerRetryContextPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainersLauncher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerRelaunch.java *
[jira] [Commented] (YARN-3998) Add support in the NodeManager to re-launch containers
[ https://issues.apache.org/jira/browse/YARN-3998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263894#comment-15263894 ] Varun Vasudev commented on YARN-3998: - Committed to trunk and branch-2. Thanks for all your work [~hex108]! > Add support in the NodeManager to re-launch containers > -- > > Key: YARN-3998 > URL: https://issues.apache.org/jira/browse/YARN-3998 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jun Gong >Assignee: Jun Gong > Fix For: 2.9.0 > > Attachments: YARN-3998.01.patch, YARN-3998.02.patch, > YARN-3998.03.patch, YARN-3998.04.patch, YARN-3998.05.patch, > YARN-3998.06.patch, YARN-3998.07.patch, YARN-3998.08.patch, YARN-3998.09.patch > > > I'd like to add a field(retry-times) in ContainerLaunchContext. When AM > launches containers, it could specify the value. Then NM will re-launch the > container 'retry-times' times when it fails to run(e.g.exit code is not 0). > It will save a lot of time. It avoids container localization. RM does not > need to re-schedule the container. And local files in container's working > directory will be left for re-use.(If container have downloaded some big > files, it does not need to re-download them when running again.) > We find it is useful in systems like Storm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org