[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562358#comment-16562358
 ] 

Wangda Tan commented on YARN-8545:
--

I think it is important to backport this to branch-3.1.1. I'm going to do 
this in a couple of hours; please let me know if you disagree.

cc: [~csingh], [~eyang]

> YARN native service should return container if launch failed
> 
>
> Key: YARN-8545
> URL: https://issues.apache.org/jira/browse/YARN-8545
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8545.001.patch
>
>
> In some cases, a container launch may fail without the container being 
> properly returned to the RM. 
> This can happen when the AM fails while preparing the container launch 
> context and therefore never sends it to the NM. (Once the container launch 
> context has been sent to the NM, the NM reports the failed container to the RM.)
> Example exception: 
> {code:java}
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
>   at 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
>   at 
> org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
>   at 
> org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
>   at 
> org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745){code}
> Even after preparing the container launch context has failed, the AM still 
> tries to monitor the container's readiness:
> {code:java}
> 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17 
> 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: primary-worker-0: IP is not 
> available yet"
> ...{code}
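
To make the expected behavior concrete, here is a minimal, self-contained sketch of the idea described above: if preparing or sending the launch context fails, the AM itself hands the container back and stops monitoring it, because the NM never learned about the container and so cannot report the failure. This is not the YARN-8545 patch. Every type and method name below (ResourceManagerClient, LaunchContextBuilder, Launcher, RecordingRm) is a hypothetical stand-in invented for illustration. In a real AM the hand-back would presumably go through something like AMRMClientAsync#releaseAssignedContainer.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LaunchFailureSketch {

  /** Hypothetical stand-in for the AM's channel to the RM. */
  interface ResourceManagerClient {
    void releaseContainer(String containerId);
  }

  /** Hypothetical stand-in for "build the launch context and send it to the NM". */
  interface LaunchContextBuilder {
    void buildAndSend(String containerId) throws IOException;
  }

  /** Test double that records which containers were handed back. */
  static final class RecordingRm implements ResourceManagerClient {
    final List<String> released = new ArrayList<>();

    @Override
    public void releaseContainer(String containerId) {
      released.add(containerId);
    }
  }

  static final class Launcher {
    private final ResourceManagerClient rm;
    private final Set<String> monitored = new HashSet<>();

    Launcher(ResourceManagerClient rm) {
      this.rm = rm;
    }

    /** Returns true if the launch context reached the NM, false otherwise. */
    boolean launch(String containerId, LaunchContextBuilder builder) {
      monitored.add(containerId);
      try {
        builder.buildAndSend(containerId);
        return true;
      } catch (IOException e) {
        // The NM never saw this container, so nobody else will report the
        // failure: stop readiness monitoring and return the container.
        monitored.remove(containerId);
        rm.releaseContainer(containerId);
        return false;
      }
    }

    boolean isMonitored(String containerId) {
      return monitored.contains(containerId);
    }
  }

  public static void main(String[] args) {
    RecordingRm rm = new RecordingRm();
    Launcher launcher = new Launcher(rm);

    launcher.launch("container_ok", id -> { });
    launcher.launch("container_bad", id -> {
      throw new FileNotFoundException("File does not exist: launch script");
    });

    System.out.println("released = " + rm.released);
    System.out.println("container_bad monitored = " + launcher.isMonitored("container_bad"));
  }
}
```

The point of the sketch is only the catch block: cleanup and release happen on the AM side exactly in the window before the NM is involved.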



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-26 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559026#comment-16559026
 ] 

Hudson commented on YARN-8545:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14649 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14649/])
YARN-8545.  Return allocated resource to RM for failed container.
(eyang: rev 40fad32824d2f8f960c779d78357e62103453da0)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/instance/ComponentInstanceEvent.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/TestServiceAM.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/ServiceScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/containerlaunch/ContainerLaunchService.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/MockServiceAM.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/component/TestComponent.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/component/instance/TestComponentInstance.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/instance/ComponentInstance.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/Component.java



[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-26 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558997#comment-16558997
 ] 

Eric Yang commented on YARN-8545:
-

+1 looks good to me.  Committing shortly.




[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-26 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558835#comment-16558835
 ] 

Chandni Singh commented on YARN-8545:
-

[~billie.rinaldi] [~eyang] Do you have any comments on patch 1?




[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-25 Thread Gour Saha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556267#comment-16556267
 ] 

Gour Saha commented on YARN-8545:
-

[~csingh] patch 001 looks good to me. +1.




[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-23 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553500#comment-16553500
 ] 

genericqa commented on YARN-8545:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 46s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m  8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 37s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 12s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 26s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 63m  9s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8545 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12932779/YARN-8545.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 56cd137fb41c 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 17e2616 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
|  Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21348/testReport/ |
| Max. process+thread count | 755 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21348/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (YARN-8545) YARN native service should return container if launch failed

2018-07-23 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553422#comment-16553422
 ] 

Chandni Singh commented on YARN-8545:
-

[~gsaha] [~billie.rinaldi] could you please review the patch?

