[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1070: -- Attachment: YARN-1070.4.patch
Updated the patch against the latest trunk.
bq. Taking a step back, this approach will work, though the code is hard to read for me. A very simple state machine should make this code a lot cleaner.
IMHO, a state machine will not help much here, because the Callable runs on a separate thread and proceeds asynchronously with respect to ContainerImpl. The container state can change to KILLING at any time: before the Callable starts, while it is running, or after it has finished. We could check the state in many places, but the important one is the beginning of the Callable. If the container is already at KILLING, there is no need to go through the logic that follows. This effectively behaves like canceling the Callable.
bq. Also, as part of ContainerLaunch.cleanupContainer(), we should try to cancel the Callable.
Canceling is not necessary if we can terminate the Callable early, and it will cause the bug in YARN-906. When cleanupContainer() is invoked, the container state is already KILLING, so cancel would only cancel a Callable that has not started. On the other hand, if the Callable has not started while the container state is already KILLING, the Callable will terminate at the very beginning and emit a CONTAINER_KILLED_ON_REQUEST. If we did cancel the Callable, we would still need to check the container state there and decide whether to emit a CONTAINER_KILLED_ON_REQUEST there as well, which brings us back to the original problem of this ticket.
ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Hitesh Shah Assignee: Zhijie Shen Attachments: YARN-1070.1.patch, YARN-1070.2.patch, YARN-1070.3.patch, YARN-1070.4.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1070: -- Attachment: YARN-1070.5.patch Fix the findbugs warning. ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Hitesh Shah Assignee: Zhijie Shen Attachments: YARN-1070.1.patch, YARN-1070.2.patch, YARN-1070.3.patch, YARN-1070.4.patch, YARN-1070.5.patch
[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1070: -- Attachment: YARN-1070.3.patch
Thanks Vinod for your review. I've updated the patch accordingly.
The important change in this patch is that I removed the logic that cancels ContainerLaunch.call(). Instead, call() now checks the container state first, returns immediately if the container is not at LOCALIZED, and sends CONTAINER_KILLED_ON_REQUEST if necessary.
The rationale for checking the container state is that the thread running ContainerLaunch.call() is scheduled when the container enters LOCALIZED and should execute after that. Because this thread can run in parallel with the ContainerImpl thread, the container is free to move on to another state, which can be RUNNING, EXITED_WITH_FAILURE or KILLING. The first two should be triggered by events sent from ContainerLaunch.call(), while KILLING is caused by a kill event.
Therefore, when ContainerLaunch.call() starts, we check the container state. If it is KILLING, ContainerLaunch.call() can stop immediately, which is equivalent to the cancel operation that was removed from ContainersLauncher. It should actually be even better, because Future.cancel will not terminate call() immediately. On the other hand, if the container state is still LOCALIZED at this point, call() will move on. If the container state then changes to KILLING midway, we simply ignore it and let call() finish as usual. This does no harm, because once the container reaches KILLING, CLEANUP_CONTAINER has already been scheduled or started.
ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Hitesh Shah Assignee: Zhijie Shen Attachments: YARN-1070.1.patch, YARN-1070.2.patch, YARN-1070.3.patch
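The state check described in the comment above could look roughly like the following. This is only a minimal sketch, not the code from the patch: the class `StateCheckedLaunch`, its `State` enum, the `run` helper, and the use of a plain list in place of the NodeManager event dispatcher are all hypothetical. Emitting CONTAINER_KILLED_ON_REQUEST only when the state is KILLING is one reading of "if necessary", and the `CONTAINER_LAUNCHED` string merely marks that the launch path was taken.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for ContainerLaunch; not the actual YARN class.
class StateCheckedLaunch implements Callable<Integer> {
    // Simplified subset of container states, for illustration only.
    enum State { LOCALIZED, RUNNING, KILLING }

    private final AtomicReference<State> containerState;
    private final List<String> events; // stands in for the event dispatcher

    StateCheckedLaunch(AtomicReference<State> containerState, List<String> events) {
        this.containerState = containerState;
        this.events = events;
    }

    @Override
    public Integer call() {
        // Check the state once, at the very beginning of call().
        State state = containerState.get();
        if (state != State.LOCALIZED) {
            // The container has already moved on; skip the launch entirely.
            if (state == State.KILLING) {
                // Assumption: the kill notification is only needed for KILLING.
                events.add("CONTAINER_KILLED_ON_REQUEST");
            }
            return -1;
        }
        // Still LOCALIZED: proceed with the launch. A later change to KILLING
        // is deliberately not re-checked here; cleanup handles that case.
        events.add("CONTAINER_LAUNCHED");
        return 0;
    }

    // Helper: runs call() once for a given initial state and returns the events.
    static List<String> run(State initial) {
        List<String> events = new ArrayList<>();
        new StateCheckedLaunch(new AtomicReference<>(initial), events).call();
        return events;
    }
}
```

With this shape, a task that is dequeued after the kill request exits at its first statement, which gives the effect of a cancel without relying on the return value of Future.cancel().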
[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1070: -- Attachment: YARN-1070.2.patch Revert the change in YARN-906 to TestContainerLaunch to fix the test failure ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Hitesh Shah Assignee: Zhijie Shen Attachments: YARN-1070.1.patch, YARN-1070.2.patch
[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-1070: -- Component/s: nodemanager ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Hitesh Shah
[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1070: -- Issue Type: Sub-task (was: Bug) Parent: YARN-676 ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Hitesh Shah Assignee: Zhijie Shen
[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1070: -- Target Version/s: 2.1.1-beta ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Hitesh Shah Assignee: Zhijie Shen
[jira] [Updated] (YARN-1070) ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL
[ https://issues.apache.org/jira/browse/YARN-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1070: -- Attachment: YARN-1070.1.patch
Created the patch to fix the problem, which is that cancel() returns true whether or not call() has started. In fact, the event needs to be emitted from ContainersLauncher only when call() has not started.
In addition, this patch fixes the bug below.
{code}
localResources = container.getLocalizedResources();
if (localResources == null) {
  // !!!need throw here!!!
  RPCUtil.getRemoteException(
      "Unable to get local resources when Container " + containerID
      + " is at " + container.getContainerState());
}
{code}
Moreover, the patch adds a test case that simulates call() having started but !isDone().
ContainerImpl State Machine: Invalid event: CONTAINER_KILLED_ON_REQUEST at CONTAINER_CLEANEDUP_AFTER_KILL - Key: YARN-1070 URL: https://issues.apache.org/jira/browse/YARN-1070 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Hitesh Shah Assignee: Zhijie Shen Attachments: YARN-1070.1.patch
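The cancel() behavior described in this comment can be reproduced with plain java.util.concurrent, independent of YARN. The sketch below is illustrative only: the class `CancelDemo` and its `cancelResults` helper are hypothetical names, and a single-thread executor is used so that one task is guaranteed to be running while a second one is still queued.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; not part of the YARN code base.
class CancelDemo {
    // Returns { result of cancel() on a running task,
    //           result of cancel() on a task that has not started }.
    static boolean[] cancelResults() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // The first task occupies the only worker thread until released.
            Future<Integer> running = pool.submit(() -> {
                started.countDown();
                release.await();
                return 0;
            });
            // The second task sits in the queue behind the first one.
            Future<Integer> queued = pool.submit(() -> 0);

            started.await(); // the first call() is now in progress
            boolean cancelWhileRunning = running.cancel(false);
            boolean cancelBeforeStart = queued.cancel(false);
            return new boolean[] {cancelWhileRunning, cancelBeforeStart};
        } finally {
            release.countDown(); // let the first task finish
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = cancelResults();
        System.out.println("running=" + r[0] + " queued=" + r[1]);
    }
}
```

Both calls return true, so the return value of cancel() alone cannot tell the caller whether call() had already started, which is why a decision based only on that value can emit the event in the wrong case.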