[
https://issues.apache.org/jira/browse/YARN-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16559026#comment-16559026
]
Hudson commented on YARN-8545:
------------------------------
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14649 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/14649/])
YARN-8545. Return allocated resource to RM for failed container.
(eyang: rev 40fad32824d2f8f960c779d78357e62103453da0)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/instance/ComponentInstanceEvent.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/TestServiceAM.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/ServiceScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/containerlaunch/ContainerLaunchService.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/MockServiceAM.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/component/TestComponent.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/component/instance/TestComponentInstance.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/instance/ComponentInstance.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/component/Component.java
> YARN native service should return container if launch failed
> ------------------------------------------------------------
>
> Key: YARN-8545
> URL: https://issues.apache.org/jira/browse/YARN-8545
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Wangda Tan
> Assignee: Chandni Singh
> Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8545.001.patch
>
>
> In some cases, a container launch may fail without the container being
> properly returned to the RM.
> This can happen when the AM fails while preparing the container launch
> context, before sending it to the NM (once the container launch context has
> been sent to the NM, the NM will report the failed container to the RM).
> An example exception:
> {code:java}
> java.io.FileNotFoundException: File does not exist: hdfs://ns1/user/wtan/.yarn/services/tf-job-001/components/1531852429056/primary-worker/primary-worker-0/run-PRIMARY_WORKER.sh
> at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1583)
> at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
> at org.apache.hadoop.yarn.service.utils.CoreFileSystem.createAmResource(CoreFileSystem.java:388)
> at org.apache.hadoop.yarn.service.provider.ProviderUtils.createConfigFileAndAddLocalResource(ProviderUtils.java:253)
> at org.apache.hadoop.yarn.service.provider.AbstractProviderService.buildContainerLaunchContext(AbstractProviderService.java:152)
> at org.apache.hadoop.yarn.service.containerlaunch.ContainerLaunchService$ContainerLauncher.run(ContainerLaunchService.java:105)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745){code}
> Even after preparing the container launch context has failed, the AM still
> tries to monitor the container's readiness:
> {code:java}
> 2018-07-17 18:42:57,518 [pool-7-thread-1] INFO monitor.ServiceMonitor -
> Readiness check failed for primary-worker-0: Probe Status, time="Tue Jul 17
> 18:42:57 UTC 2018", outcome="failure", message="Failure in Default probe: IP
> presence", exception="java.io.IOException: primary-worker-0: IP is not
> available yet"
> ...{code}
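
The failure mode described above can be illustrated with a minimal sketch. It
is not the code from the YARN-8545 patch: the class LaunchFailureSketch and
the interfaces ResourceManagerClient and NodeManagerClient are hypothetical
stand-ins invented for this example, and container IDs are plain strings. The
only point it makes is that when building the launch context throws before
anything is sent to the NM, the AM itself has to hand the container back to
the RM, because the NM never learns about it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class LaunchFailureSketch {

    /** Hypothetical stand-in for the AM's connection to the RM. */
    public interface ResourceManagerClient {
        void releaseContainer(String containerId);
    }

    /** Hypothetical stand-in for the client that sends a launch context to an NM. */
    public interface NodeManagerClient {
        void startContainer(String containerId, String launchContext);
    }

    /**
     * Tries to launch a container. Returns true if the launch context was
     * built and handed to the NM, false if preparation failed and the
     * container was released back to the RM instead.
     */
    public static boolean launch(String containerId,
                                 Callable<String> buildLaunchContext,
                                 NodeManagerClient nm,
                                 ResourceManagerClient rm) {
        String launchContext;
        try {
            // This step can fail, for example with a FileNotFoundException
            // when a resource that should be localized does not exist.
            launchContext = buildLaunchContext.call();
        } catch (Exception e) {
            // The NM was never contacted, so it cannot report the failure.
            // The AM returns the allocation to the RM itself.
            rm.releaseContainer(containerId);
            return false;
        }
        // From here on the NM knows the container and reports its outcome.
        nm.startContainer(containerId, launchContext);
        return true;
    }

    public static void main(String[] args) {
        List<String> released = new ArrayList<>();
        List<String> started = new ArrayList<>();

        // Simulate a launch-context preparation failure.
        boolean launched = launch(
            "container_01",
            () -> { throw new java.io.FileNotFoundException("File does not exist"); },
            (id, ctx) -> started.add(id),
            released::add);

        System.out.println("launched=" + launched
            + " started=" + started + " released=" + released);
    }
}
```

In a real AM the same failure path would presumably also need to stop
readiness monitoring for the affected component instance, since the issue
notes that the readiness probe keeps running after the launch context could
not be prepared.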