[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513106#comment-14513106 ]

Karthik Kambatla commented on YARN-3464:

+1

> Race condition in LocalizerRunner causes container localization timeout.
>
> Key: YARN-3464
> URL: https://issues.apache.org/jira/browse/YARN-3464
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3464.000.patch, YARN-3464.001.patch
>
> Race condition in LocalizerRunner causes container localization timeout.
> Currently LocalizerRunner will kill the ContainerLocalizer when the pending list
> for LocalizerResourceRequestEvent is empty.
> {code}
> } else if (pending.isEmpty()) {
>   action = LocalizerAction.DIE;
> }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the
> ContainerLocalizer due to an empty pending list, this
> LocalizerResourceRequestEvent will never be handled.
> Without ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until the container is killed by
> the AM due to TASK_TIMEOUT.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
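The lost-request scenario in the issue description can be shown in miniature. The following is a minimal, single-threaded sketch and not the NodeManager code: `LostRequestSketch`, `addResource`, `heartbeat` and `isStuck` are hypothetical names introduced for illustration, and the unlucky interleaving is forced by call order rather than by real threads.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical, simplified model of the race; not the actual LocalizerRunner. */
final class LostRequestSketch {
    enum Action { LIVE, DIE }

    private final Queue<String> pending = new ArrayDeque<>();
    private boolean localizerAlive = true;

    /** Models a LocalizerResourceRequestEvent arriving for this container. */
    void addResource(String resource) {
        pending.add(resource);
    }

    /** Models one localizer heartbeat under the old "empty queue means DIE" rule. */
    Action heartbeat() {
        if (!localizerAlive) {
            throw new IllegalStateException("localizer was already told to die");
        }
        if (pending.poll() == null) {
            // Nothing pending right now, so the localizer is told to exit.
            localizerAlive = false;
            return Action.DIE;
        }
        return Action.LIVE;
    }

    /** True when work is queued but no localizer is left to fetch it. */
    boolean isStuck() {
        return !localizerAlive && !pending.isEmpty();
    }
}
```

Calling `addResource`, then `heartbeat` twice, then `addResource` again leaves `isStuck()` true: the second request sits in the queue with nobody heartbeating to pick it up, which corresponds to the container remaining in LOCALIZING.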
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512880#comment-14512880 ]

Hadoop QA commented on YARN-3464:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | pre-patch | 14m 58s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| +1 | javac | 7m 55s | There were no new javac warning messages. |
| +1 | javadoc | 9m 49s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| -1 | checkstyle | 5m 30s | The applied patch generated 3 additional checkstyle issues. |
| +1 | install | 1m 37s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 1m 7s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| +1 | yarn tests | 5m 51s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 47m 46s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12728198/YARN-3464.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a00e001 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7503/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7503/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7503/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7503/console |

This message was automatically generated.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512868#comment-14512868 ]

zhihai xu commented on YARN-3464:

[~kasha], thanks for the review. Your suggestions are very reasonable and make the code much easier to read. I uploaded a new patch, YARN-3464.001.patch, which addresses all your comments. Please review it.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511771#comment-14511771 ]

Karthik Kambatla commented on YARN-3464:

Thanks for uploading the patch, [~zxu]. Comments:
# ContainerImpl: unrelated to the patch, {{container.metrics.endInitingContainer()}} should likely move up to right after {{container.sendLaunchEvent()}}
# Nit: Simplify javadoc for {{ResourceLocalizationService#handleContainerResourcesLocalized}}. How about "Once a container's resources are localized, kill the corresponding {@link ContainerLocalizer}"
# ResourceLocalizationService.LocalizerRunner#update - mostly unrelated to this patch
## Rename to {{processHeartbeat}}?
## I don't see the point of a LocalizerAction variable initialized to DIE. How about changing that to a {{boolean fetchFailed}}?
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506441#comment-14506441 ]

Hadoop QA commented on YARN-3464:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | pre-patch | 14m 30s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| +1 | javac | 7m 33s | There were no new javac warning messages. |
| +1 | javadoc | 9m 31s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| -1 | checkstyle | 5m 27s | The applied patch generated 3 additional checkstyle issues. |
| +1 | install | 1m 40s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 1m 4s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| +1 | yarn tests | 6m 0s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 46m 46s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727117/YARN-3464.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 674c7ef |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7443/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7443/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7443/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7443//console |

This message was automatically generated.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506291#comment-14506291 ]

Hadoop QA commented on YARN-3464:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | pre-patch | 14m 48s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 1s | The patch appears to include 1 new or modified test files. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| +1 | javac | 7m 45s | There were no new javac warning messages. |
| +1 | javadoc | 9m 31s | There were no new javadoc warning messages. |
| +1 | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| -1 | checkstyle | 5m 29s | The applied patch generated 3 additional checkstyle issues. |
| +1 | install | 1m 34s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 1m 1s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| -1 | yarn tests | 5m 44s | Tests failed in hadoop-yarn-server-nodemanager. |
| | | 46m 51s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727091/YARN-3464.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / e71d0d8 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7442/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7442/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7442/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7442//console |

This message was automatically generated.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506169#comment-14506169 ]

Hadoop QA commented on YARN-3464:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | pre-patch | 14m 49s | Pre-patch trunk compilation is healthy. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| +1 | javac | 7m 38s | There were no new javac warning messages. |
| -1 | javadoc | 9m 52s | The applied patch generated 50 additional warning messages. |
| +1 | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| -1 | checkstyle | 4m 4s | The applied patch generated 3 additional checkstyle issues. |
| +1 | install | 1m 38s | mvn install still works. |
| +1 | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| +1 | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| -1 | yarn tests | 5m 44s | Tests failed in hadoop-yarn-server-nodemanager. |
| | | 45m 46s | |

|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727047/YARN-3464.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cfba355 |
| javadoc | https://builds.apache.org/job/PreCommit-YARN-Build/7439/artifact/patchprocess/diffJavadocWarnings.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7439/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7439/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7439/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7439//console |

This message was automatically generated.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506094#comment-14506094 ]

zhihai xu commented on YARN-3464:

I uploaded a patch YARN-3464.000.patch for review.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503994#comment-14503994 ]

zhihai xu commented on YARN-3464:

Thanks [~jlowe] and [~kasha],
bq. changing the logic from "kill me when there's no more work in my queue" to "kill me when my container is ready to be launched."
That is a fantastic summary. I will implement the patch based on this.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503588#comment-14503588 ]

Karthik Kambatla commented on YARN-3464:

bq. So I'd look into changing the logic from "kill me when there's no more work in my queue" to "kill me when my container is ready to be launched."
Thanks [~jlowe]. Yes, that would simplify the localizer-NM communication significantly.

We could add the kill-container-localizer logic to either {{LocalizedTransition}} or {{LaunchTransition}} in ContainerImpl. I prefer we do it in the {{LocalizedTransition}}.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503540#comment-14503540 ]

Jason Lowe commented on YARN-3464:

We shouldn't leave a ContainerLocalizer lingering around if the container is ready to be launched, as that's wasting node resources and adding extra localizer heartbeat processing on the NM we don't need to do. One exception to that would be if we want to support localizing new resources while a container is already running, but last I checked we don't support that.

IMHO it makes sense to kill the localizer when the container is ready to be launched. If it's not ready to be launched then we may need to (re)localize some resource and the localizer would have some utility to keep running. So I'd look into changing the logic from "kill me when there's no more work in my queue" to "kill me when my container is ready to be launched."
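The change of rule proposed in this comment can be contrasted with the old one in a few lines. This is only a sketch under stated assumptions: `KillRuleSketch`, `oldRule`, `newRule` and the `containerReadyToLaunch` flag are hypothetical names, not identifiers from the patch, and the fetch-failure path is left out for brevity.

```java
/** Hypothetical contrast of the two kill rules; not the code from the patch. */
final class KillRuleSketch {
    enum Action { LIVE, DIE }

    /** Old rule: "kill me when there's no more work in my queue". */
    static Action oldRule(int pendingCount) {
        // Racy: the queue may be empty only momentarily.
        return pendingCount == 0 ? Action.DIE : Action.LIVE;
    }

    /** Proposed rule: "kill me when my container is ready to be launched". */
    static Action newRule(boolean containerReadyToLaunch) {
        // The decision no longer depends on the instantaneous queue length.
        return containerReadyToLaunch ? Action.DIE : Action.LIVE;
    }
}
```

With the proposed rule, an empty queue on its own keeps the localizer alive, so a request that arrives late still finds a localizer to serve it.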
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503328#comment-14503328 ]

Karthik Kambatla commented on YARN-3464:

[~jlowe] - do you think any of my earlier suggestions are reasonable?
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503258#comment-14503258 ]

zhihai xu commented on YARN-3464:

Thanks [~kasha] for the detailed explanation.
With the first suggested solution, the ContainerLocalizer will be killed by ContainerImpl#cleanup when the container finishes.
With the second suggested solution, the ContainerLocalizer can be killed earlier, by a sentinel/event sent by ContainerImpl after localization completes and before the container is launched.
I want to check whether anyone has a better idea before I implement a patch for this issue.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486046#comment-14486046 ]

Karthik Kambatla commented on YARN-3464:

Discussed this with Zhihai offline.

Firstly, responding with a DIE when {{LocalizerRunner#pending}} is empty seems wrong. I don't think synchronizing that check is really going to help if someone can add a request to {{pending}} after that.

We see two alternatives:
# If we are going to leave the ContainerLocalizer around, why do we even bother issuing a DIE action? Can we just let it LIVE forever?
# If there is merit to killing the ContainerLocalizer, we could have a sentinel sent for each localizer. The DIE action is to be sent only when either of the two conditions is met:
## One of the localizations failed. Since the container can't be launched, kill the localizer and fail the container.
## The sentinel has been received *and* all previously scheduled localizations have been fetched successfully ({{LocalizerRunner#scheduled.isEmpty()}})

While we make these changes, I see how YARN-3465 could be useful in reducing the likelihood of this issue in the interim.
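The two DIE conditions of the second alternative can be written down as a single decision function. This is a minimal sketch, assuming the state is reduced to two flags and a set of scheduled resources: `SentinelRuleSketch`, `decide`, `fetchFailed` and `sentinelReceived` are hypothetical names for illustration, and only the `scheduled.isEmpty()` check is taken from the comment itself.

```java
import java.util.Set;

/** Hypothetical model of the sentinel-based rule; not the actual LocalizerRunner. */
final class SentinelRuleSketch {
    enum Action { LIVE, DIE }

    static Action decide(boolean fetchFailed,
                         boolean sentinelReceived,
                         Set<String> scheduled) {
        // Condition 1: a localization failed, so the container cannot launch.
        if (fetchFailed) {
            return Action.DIE;
        }
        // Condition 2: no more requests will arrive (sentinel seen) and
        // everything already scheduled has been fetched.
        if (sentinelReceived && scheduled.isEmpty()) {
            return Action.DIE;
        }
        // Otherwise keep the localizer alive, even if nothing is queued right now.
        return Action.LIVE;
    }
}
```

The key difference from the old check is that an empty set without the sentinel yields LIVE, so a momentarily idle localizer is not killed.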
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485818#comment-14485818 ]

Karthik Kambatla commented on YARN-3464:

We can maybe discuss this more on YARN-3465, but I don't think having it sorted is necessary. The container cannot be started until all the resources are localized; so, the order of their downloads shouldn't matter as long as they all get localized.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485794#comment-14485794 ]

zhihai xu commented on YARN-3464:

I also created another JIRA, YARN-3465, which can help with this issue and make sure localization follows the correct order: PUBLIC, PRIVATE, then APPLICATION. The issue in my case also arises because the PRIVATE LocalResourceRequest is reordered to first and the APPLICATION LocalResourceRequest is reordered to last. The PUBLIC LocalResourceRequest ends up in the middle, which adds delay for the APPLICATION LocalResourceRequest. Because the entrySet order of a HashMap is not fixed, a LinkedHashMap should be used instead.
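The iteration-order point in the comment above can be illustrated with a small standalone example. The `String` keys and the class `VisibilityOrderDemo` are hypothetical stand-ins chosen for illustration; they are not the types the NodeManager uses. The only library behavior relied on is that `LinkedHashMap` iterates in insertion order, while `HashMap` makes no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: String keys stand in for the visibility levels named in the thread.
class VisibilityOrderDemo {
  // Insert the three visibilities in the intended localization order.
  static Map<String, List<String>> fill(Map<String, List<String>> map) {
    map.put("PUBLIC", new ArrayList<>());
    map.put("PRIVATE", new ArrayList<>());
    map.put("APPLICATION", new ArrayList<>());
    return map;
  }

  // Collect the keys in the order entrySet() yields them.
  static List<String> iterationOrder(Map<String, List<String>> map) {
    List<String> order = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : map.entrySet()) {
      order.add(entry.getKey());
    }
    return order;
  }

  public static void main(String[] args) {
    // HashMap: the order depends on hash codes and table size, not on insertion order.
    System.out.println("HashMap:       " + iterationOrder(fill(new HashMap<>())));
    // LinkedHashMap: entrySet() follows insertion order.
    System.out.println("LinkedHashMap: " + iterationOrder(fill(new LinkedHashMap<>())));
  }
}
```

With the `LinkedHashMap`, the keys come back as PUBLIC, PRIVATE, APPLICATION because that is the order in which they were inserted. The `HashMap` line may print any permutation.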
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485775#comment-14485775 ]

zhihai xu commented on YARN-3464:

[~kasha], thanks for the information. I just looked at YARN-3024. Yes, it will make this issue happen more frequently. Before YARN-3024, private resources were localized one by one: the next one would not start until the current one finished localizing, so private resource localization took longer. With YARN-3024, localization is done in parallel, so multiple files can be localized at the same time. The chance of the ContainerLocalizer being killed by the time the last two PRIVATE LocalizerResourceRequestEvents are added is higher. Yes, your suggestion is also what I had in mind.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485687#comment-14485687 ]

Karthik Kambatla commented on YARN-3464:

bq. Looking at the code closely, I don't see any resources being removed from pending. So, pending shouldn't be empty after some of the resources have been downloaded.

Never mind. findNextResource has a call to iterator.remove().

In any case, I think the right approach is to send an explicit event to the localizer to indicate that we are done localizing all the resources. On receiving this, the localizer tracker sends the DIE action.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485566#comment-14485566 ]

Karthik Kambatla commented on YARN-3464:

I have been investigating a similar issue. Initially I thought of the same race, but I am not sure that fixing it alone solves the issue. Looking at the code closely, I don't see any resources being removed from pending. So, pending shouldn't be empty after some of the resources have been downloaded.

Related: YARN-3024 increases the frequency of this issue.
[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484920#comment-14484920 ]

zhihai xu commented on YARN-3464:

This issue only happens for PRIVATE/APPLICATION resource localization. We saw it happen when the PRIVATE LocalizerResourceRequestEvents were interleaved with PUBLIC LocalizerResourceRequestEvents in the following order:
PRIVATE1
PRIVATE2
..
PRIVATEm
PUBLIC1
PUBLIC2
..
PUBLICn
PRIVATEm+1
PRIVATEm+2
The last two PRIVATE LocalizerResourceRequestEvents are added after all of the previous m PRIVATE LocalizerResourceRequestEvents are LOCALIZED, because of the delay in processing the n PUBLIC LocalizerResourceRequestEvents. The container then stays in the LOCALIZING state until it is killed by the AM.
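The interleaving described in the comment above can be replayed deterministically in a few lines. This is a simplified, single-threaded sketch and not NodeManager code: the class `LocalizerRaceReplay` and its methods are hypothetical, m is fixed at 2, the PUBLIC requests appear only as a comment, and a resource counts as localized the moment it is handed out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical single-threaded replay of the interleaving; NOT NodeManager code.
class LocalizerRaceReplay {
  private final Deque<String> pending = new ArrayDeque<>();
  private final List<String> localized = new ArrayList<>();
  private boolean localizerAlive = true;

  // A PRIVATE request event reaches the runner.
  void addRequest(String resource) {
    pending.add(resource);
  }

  // One localizer heartbeat, answered with the check quoted in the issue description.
  void heartbeat() {
    if (!localizerAlive) {
      return;                       // a killed localizer never heartbeats again
    }
    if (pending.isEmpty()) {
      localizerAlive = false;       // action = DIE
      return;
    }
    localized.add(pending.poll());  // hand out and (for brevity) treat as localized
  }

  List<String> localized() {
    return new ArrayList<>(localized);
  }

  // Requests that can never be served because the localizer is gone.
  List<String> stranded() {
    List<String> result = new ArrayList<>();
    if (!localizerAlive) {
      result.addAll(pending);
    }
    return result;
  }

  static LocalizerRaceReplay replay() {
    LocalizerRaceReplay r = new LocalizerRaceReplay();
    r.addRequest("PRIVATE1");
    r.addRequest("PRIVATE2");
    r.heartbeat();
    r.heartbeat();
    r.heartbeat();                  // pending is empty here, so the localizer is told to DIE
    // PUBLIC1..PUBLICn are processed elsewhere and delay the remaining events.
    r.addRequest("PRIVATE3");
    r.addRequest("PRIVATE4");
    r.heartbeat();                  // no effect: nobody is left to ask for work
    return r;
  }

  public static void main(String[] args) {
    LocalizerRaceReplay r = replay();
    System.out.println("localized: " + r.localized());
    System.out.println("stranded:  " + r.stranded());
  }
}
```

In this replay the last two requests stay in `pending` with no localizer left to pick them up, which corresponds to the container staying in the LOCALIZING state.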