[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-26 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513106#comment-14513106
 ] 

Karthik Kambatla commented on YARN-3464:


+1

> Race condition in LocalizerRunner causes container localization timeout.
> 
>
> Key: YARN-3464
> URL: https://issues.apache.org/jira/browse/YARN-3464
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-3464.000.patch, YARN-3464.001.patch
>
>
> Race condition in LocalizerRunner causes container localization timeout.
> Currently, LocalizerRunner kills the ContainerLocalizer when the pending list 
> of LocalizerResourceRequestEvents is empty.
> {code}
>   } else if (pending.isEmpty()) {
>     action = LocalizerAction.DIE;
>   }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the 
> ContainerLocalizer due to an empty pending list, this 
> LocalizerResourceRequestEvent will never be handled.
> Without a ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until it is killed by the 
> AM due to TASK_TIMEOUT.
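
The sequence in the issue description can be replayed deterministically in a small model. This is only a sketch, not the NodeManager code: the class {{RaceModel}} and its methods {{addRequest}}, {{heartbeat}}, {{isLocalizerAlive}}, and {{pendingCount}} are hypothetical names, and the only rule taken from the description is "an empty pending list yields DIE".

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the race; not the real LocalizerRunner.
class RaceModel {
    enum Action { LIVE, DIE }

    private final Queue<String> pending = new ArrayDeque<>();
    private boolean localizerAlive = true;

    // Models a resource request event arriving for this localizer.
    void addRequest(String resource) {
        pending.add(resource);
    }

    // Models one localizer heartbeat using the rule quoted in the description:
    // an empty pending list means DIE.
    Action heartbeat() {
        if (!localizerAlive) {
            throw new IllegalStateException("no localizer left to heartbeat");
        }
        if (pending.isEmpty()) {
            localizerAlive = false;
            return Action.DIE;
        }
        pending.poll(); // hand one resource to the localizer
        return Action.LIVE;
    }

    boolean isLocalizerAlive() {
        return localizerAlive;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

If a request is added after the heartbeat that returned DIE, it stays in the pending queue and no further heartbeat arrives to drain it, which is the stuck LOCALIZING state described above.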



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512880#comment-14512880
 ] 

Hadoop QA commented on YARN-3464:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 58s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 49s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 5m 30s | The applied patch generated 3 additional checkstyle issues. |
| {color:green}+1{color} | install | 1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 7s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 5m 51s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 47m 46s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12728198/YARN-3464.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a00e001 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7503/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7503/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7503/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7503/console |


This message was automatically generated.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-25 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512868#comment-14512868
 ] 

zhihai xu commented on YARN-3464:
-

[~kasha], thanks for the review. Your suggestions are very reasonable and 
make the code much easier to read.
I uploaded a new patch, YARN-3464.001.patch, which addresses all your comments. 
Please review it.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511771#comment-14511771
 ] 

Karthik Kambatla commented on YARN-3464:


Thanks for uploading the patch, [~zxu]. 

Comments:
# ContainerImpl: unrelated to the patch, 
{{container.metrics.endInitingContainer()}} should likely move up to right 
after {{container.sendLaunchEvent()}}
# Nit: Simplify javadoc for 
{{ResourceLocalizationService#handleContainerResourcesLocalized}}. How about 
"Once a container's resources are localized, kill the corresponding {@link 
ContainerLocalizer}"
# ResourceLocalizationService.LocalizerRunner#update - mostly unrelated to this 
patch
## Rename to {{processHeartbeat}}?
## I don't see the point of a LocalizerAction variable initialized to DIE. How 
about changing that to a {{boolean fetchFailed}}? 
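
The last suggestion could look roughly like the following. This is a sketch under assumptions, not the patch: {{HeartbeatSketch}}, its nested enums, and the list-of-statuses parameter are hypothetical stand-ins for the real heartbeat types, and the method only illustrates replacing an action variable that defaults to DIE with a {{boolean fetchFailed}}.

```java
import java.util.List;

// Hypothetical sketch of the suggested refactoring; types are local stand-ins.
class HeartbeatSketch {
    enum Action { LIVE, DIE }
    enum FetchStatus { FETCH_SUCCESS, FETCH_PENDING, FETCH_FAILURE }

    // Instead of starting from an Action initialized to DIE and overwriting it,
    // track only whether any reported fetch failed and derive the action once.
    static Action processHeartbeat(List<FetchStatus> reportedStatuses) {
        boolean fetchFailed = false;
        for (FetchStatus status : reportedStatuses) {
            if (status == FetchStatus.FETCH_FAILURE) {
                fetchFailed = true;
                break;
            }
        }
        return fetchFailed ? Action.DIE : Action.LIVE;
    }
}
```

With this shape, the default outcome of a heartbeat is LIVE, and DIE has to be justified by an explicit condition.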



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506441#comment-14506441
 ] 

Hadoop QA commented on YARN-3464:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 30s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 5m 27s | The applied patch generated 3 additional checkstyle issues. |
| {color:green}+1{color} | install | 1m 40s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 4s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 6m 0s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 46m 46s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727117/YARN-3464.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 674c7ef |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7443/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7443/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7443/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7443//console |


This message was automatically generated.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506291#comment-14506291
 ] 

Hadoop QA commented on YARN-3464:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 48s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 1s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 45s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 31s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 5m 29s | The applied patch generated 3 additional checkstyle issues. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 1s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:red}-1{color} | yarn tests | 5m 44s | Tests failed in hadoop-yarn-server-nodemanager. |
| | | 46m 51s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727091/YARN-3464.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / e71d0d8 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7442/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7442/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7442/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7442//console |


This message was automatically generated.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506169#comment-14506169
 ] 

Hadoop QA commented on YARN-3464:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 49s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 38s | There were no new javac warning messages. |
| {color:red}-1{color} | javadoc | 9m 52s | The applied patch generated 50 additional warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 4m 4s | The applied patch generated 3 additional checkstyle issues. |
| {color:green}+1{color} | install | 1m 38s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:red}-1{color} | yarn tests | 5m 44s | Tests failed in hadoop-yarn-server-nodemanager. |
| | | 45m 46s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727047/YARN-3464.000.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cfba355 |
| javadoc | https://builds.apache.org/job/PreCommit-YARN-Build/7439/artifact/patchprocess/diffJavadocWarnings.txt |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7439/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7439/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7439/testReport/ |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7439//console |


This message was automatically generated.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-21 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14506094#comment-14506094
 ] 

zhihai xu commented on YARN-3464:
-

I uploaded a patch YARN-3464.000.patch for review.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503994#comment-14503994
 ] 

zhihai xu commented on YARN-3464:
-

Thanks [~jlowe] and [~kasha],
bq. changing the logic from "kill me when there's no more work in my queue" to 
"kill me when my container is ready to be launched."
That is a fantastic summary. I will implement the patch based on this.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503588#comment-14503588
 ] 

Karthik Kambatla commented on YARN-3464:


bq. So I'd look into changing the logic from "kill me when there's no more work 
in my queue" to "kill me when my container is ready to be launched."
Thanks [~jlowe]. Yes, that would simplify the localizer-NM communication 
significantly.

We could add the kill-container-localizer logic to either 
{{LocalizedTransition}} or {{LaunchTransition}} in ContainerImpl. I prefer we 
do it in the {{LocalizedTransition}}. 



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503540#comment-14503540
 ] 

Jason Lowe commented on YARN-3464:
--

We shouldn't leave a ContainerLocalizer lingering around if the container is 
ready to be launched, as that's wasting node resources and adding extra 
localizer heartbeat processing on the NM we don't need to do.  One exception to 
that would be if we want to support localizing new resources while a container 
is already running, but last I checked we don't support that.

IMHO it makes sense to kill the localizer when the container is ready to be 
launched.  If it's not ready to be launched then we may need to (re)localize 
some resource and the localizer would have some utility to keep running.  So 
I'd look into changing the logic from "kill me when there's no more work in my 
queue" to "kill me when my container is ready to be launched."
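
A minimal sketch of the proposed rule, assuming a container that knows up front which resources it needs. {{ReadyToLaunchModel}}, {{resourceLocalized}}, and {{isLocalizerKilled}} are hypothetical names for illustration and do not correspond to NodeManager classes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: the localizer is killed when the container is ready to
// launch, not when the localizer's request queue happens to be empty.
class ReadyToLaunchModel {
    private final Set<String> waitingFor = new HashSet<>();
    private boolean localizerKilled = false;

    ReadyToLaunchModel(Set<String> requiredResources) {
        waitingFor.addAll(requiredResources);
    }

    // Called when one resource finishes localizing.
    void resourceLocalized(String resource) {
        waitingFor.remove(resource);
        if (waitingFor.isEmpty()) {
            // Every required resource is in place, so the container can be
            // launched and the localizer has no further use.
            localizerKilled = true;
        }
    }

    boolean isLocalizerKilled() {
        return localizerKilled;
    }
}
```

Because the kill decision depends on the container's outstanding resources rather than on the queue contents at one instant, a request that arrives late cannot be stranded by an earlier DIE.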





[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503328#comment-14503328
 ] 

Karthik Kambatla commented on YARN-3464:


[~jlowe] - do you think any of my earlier suggestions are reasonable? 



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503258#comment-14503258
 ] 

zhihai xu commented on YARN-3464:
-

Thanks [~kasha] for the detailed explanation.
With the first suggested solution, 
the ContainerLocalizer will be killed by ContainerImpl#cleanup when the container 
is finished.
With the second suggested solution, 
the ContainerLocalizer can be killed earlier by a sentinel/event sent by 
ContainerImpl after localization and before the container is launched.
I want to check whether other people have a better idea before I 
implement a patch for this issue.




[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486046#comment-14486046
 ] 

Karthik Kambatla commented on YARN-3464:


Discussed this with Zhihai offline. 

Firstly, responding with a DIE when {{LocalizerRunner#pending}} is empty seems 
wrong. I don't think synchronizing that check is really going to help if 
someone can add a request to {{pending}} after that. We see two alternatives:
# If we are going to leave the ContainerLocalizer around, why do we even bother 
issuing a DIE action? Can we just let it LIVE forever?
# If there is merit to killing the ContainerLocalizer, we could have a sentinel 
sent for each localizer. The DIE action is to be sent only when either of these two 
conditions is met:
## One of the localizations failed. Since the container can't be launched, kill 
the localizer and fail the container.
## The sentinel has been received *and* all previously scheduled localizations 
have been fetched successfully ({{LocalizerRunner#scheduled.isEmpty()}}) 

While we make these changes, I see how YARN-3465 could be useful in reducing 
the likelihood of this issue in the interim. 
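
The second alternative reduces to a small decision rule, sketched here under assumptions: {{SentinelRule}} and {{decide}} are hypothetical names, and the three parameters stand in for whatever state the NodeManager would actually track for a localizer.

```java
// Hypothetical sketch of the sentinel-based DIE rule; names are illustrative.
class SentinelRule {
    enum Action { LIVE, DIE }

    static Action decide(boolean fetchFailed, boolean sentinelReceived, int scheduledCount) {
        // Condition 1: a localization failed, so the container cannot launch.
        if (fetchFailed) {
            return Action.DIE;
        }
        // Condition 2: the sentinel arrived and nothing scheduled is outstanding.
        if (sentinelReceived && scheduledCount == 0) {
            return Action.DIE;
        }
        // An empty queue alone is no longer a reason to kill the localizer.
        return Action.LIVE;
    }
}
```

The case that matters for this issue is {{decide(false, false, 0)}}: with no failure and no sentinel, an idle localizer stays alive and can still pick up a late request.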



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485818#comment-14485818
 ] 

Karthik Kambatla commented on YARN-3464:


We can maybe discuss this more on YARN-3465, but I don't think having it 
sorted is necessary. The container cannot be started until all the resources 
are localized; so, the order of their downloads shouldn't matter as long as 
they all get localized. 




[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485794#comment-14485794
 ] 

zhihai xu commented on YARN-3464:
-

I also created another JIRA, YARN-3465, which can help with this issue by 
making sure localization happens in the correct order:
PUBLIC, PRIVATE and APPLICATION.
In my case, the issue also arises because the PRIVATE LocalResourceRequest is 
reordered to first and the APPLICATION LocalResourceRequest is reordered to 
last. The PUBLIC LocalResourceRequest sits in the middle, which adds delay for 
the APPLICATION LocalResourceRequest.
Because the iteration order of a HashMap's entrySet is not fixed, a 
LinkedHashMap should be used instead.
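
A small sketch of the ordering guarantee, for illustration only. The class {{VisibilityOrder}} and its method are hypothetical; the only fact relied on is that {{java.util.LinkedHashMap}} iterates in insertion order, whereas {{java.util.HashMap}} makes no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual container launch code.
class VisibilityOrder {
  // Inserts the keys in the given order and returns the map's iteration order.
  static List<String> iterationOrder(String... keys) {
    Map<String, Integer> requests = new LinkedHashMap<>();
    for (String key : keys) {
      requests.put(key, requests.size());
    }
    return new ArrayList<>(requests.keySet());
  }

  public static void main(String[] args) {
    // prints [PUBLIC, PRIVATE, APPLICATION]
    System.out.println(iterationOrder("PUBLIC", "PRIVATE", "APPLICATION"));
  }
}
```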



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485775#comment-14485775
 ] 

zhihai xu commented on YARN-3464:
-

[~kasha], thanks for the information. I just looked at YARN-3024. Yes, it will 
make this issue happen more frequently.
Before YARN-3024, private resources were localized one by one: the next one 
wouldn't start until the current one finished localization, so private 
resource localization took longer.
With YARN-3024, localization is done in parallel, so multiple files can be 
localized at the same time.
This increases the chance that the ContainerLocalizer has already been killed 
when the last two PRIVATE LocalizerResourceRequestEvents are added.
Yes, your suggestion is also what I had in mind.



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485687#comment-14485687
 ] 

Karthik Kambatla commented on YARN-3464:


bq. Looking at the code closely, I don't see any resources being removed from 
pending. So, pending shouldn't be empty after some of the resources have been 
downloaded.
Never mind. findNextResource has a call to iterator.remove().

In any case, I think the right approach is to send an explicit event to the 
localizer to indicate we are done with localizing all the resources. On 
receiving this, the localizer tracker sends the DIE action.
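
For reference, a minimal sketch of why {{pending}} can become empty while more requests are still on their way. The class {{PendingDrain}} and its {{findNext}} method are hypothetical stand-ins; the only detail taken from the code is that findNextResource calls {{iterator.remove()}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the draining behaviour; not the real method.
class PendingDrain {
  // Hands out the next request and removes it from the pending list.
  static String findNext(List<String> pending) {
    Iterator<String> it = pending.iterator();
    if (it.hasNext()) {
      String next = it.next();
      it.remove(); // this is what lets pending become empty
      return next;
    }
    return null;
  }

  public static void main(String[] args) {
    List<String> pending = new ArrayList<>(Arrays.asList("r1", "r2"));
    findNext(pending);
    findNext(pending);
    // prints true: an isEmpty() check now passes, even if r3 is about to be added
    System.out.println(pending.isEmpty());
  }
}
```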



[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485566#comment-14485566
 ] 

Karthik Kambatla commented on YARN-3464:


I have been investigating a similar issue. Initially I suspected the same 
race, but I am not sure that fixing it alone solves the issue.

Looking at the code closely, I don't see any resources being removed from 
pending. So, pending shouldn't be empty after some of the resources have been 
downloaded. 

Related: YARN-3024 increases the frequency of this issue. 





[jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

2015-04-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484920#comment-14484920
 ] 

zhihai xu commented on YARN-3464:
-

This issue only happens for PRIVATE/APPLICATION resource localization.
We saw it happen when the PRIVATE LocalizerResourceRequestEvents were 
interleaved with PUBLIC LocalizerResourceRequestEvents in the following order:
PRIVATE1
PRIVATE2
...
PRIVATEm
PUBLIC1
PUBLIC2
...
PUBLICn
PRIVATEm+1
PRIVATEm+2
The last two PRIVATE LocalizerResourceRequestEvents are added after all 
previous m PRIVATE LocalizerResourceRequestEvents are LOCALIZED, due to the 
delay in processing the n PUBLIC LocalizerResourceRequestEvents.
The container then stays in the LOCALIZING state until it is killed by the AM.
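
The sequence above can be replayed with a simplified, single-threaded model. This is only a sketch: the class {{LocalizerRaceSketch}} and its methods are hypothetical, each heartbeat is assumed to localize exactly one resource, and the delay from the PUBLIC requests is modelled simply by adding the last two PRIVATE requests late.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, single-threaded model of the race; not the NodeManager code.
class LocalizerRaceSketch {
  private final Queue<String> pending = new ArrayDeque<>();
  private final List<String> localized = new ArrayList<>();
  private boolean localizerAlive = true;

  void addRequest(String resource) { pending.add(resource); }

  // Models one localizer heartbeat. A dead localizer never heartbeats again.
  void heartbeat() {
    if (!localizerAlive) {
      return;
    }
    if (pending.isEmpty()) {
      localizerAlive = false; // the DIE action from the issue description
    } else {
      localized.add(pending.poll());
    }
  }

  List<String> getLocalized() { return localized; }
  List<String> stranded() { return new ArrayList<>(pending); }

  static LocalizerRaceSketch replay() {
    LocalizerRaceSketch s = new LocalizerRaceSketch();
    s.addRequest("PRIVATE1");
    s.addRequest("PRIVATE2");
    s.heartbeat(); // localizes PRIVATE1
    s.heartbeat(); // localizes PRIVATE2
    s.heartbeat(); // pending is empty: the localizer is told to DIE
    s.addRequest("PRIVATE3"); // arrives late, after the PUBLIC requests
    s.addRequest("PRIVATE4");
    s.heartbeat(); // no effect: nobody is left to ask for work
    return s;
  }

  public static void main(String[] args) {
    LocalizerRaceSketch s = replay();
    System.out.println("localized: " + s.getLocalized()); // [PRIVATE1, PRIVATE2]
    System.out.println("stranded:  " + s.stranded());     // [PRIVATE3, PRIVATE4]
  }
}
```

The stranded requests are never picked up, which corresponds to the container staying in LOCALIZING.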
