[
https://issues.apache.org/jira/browse/YARN-11251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054392#comment-18054392
]
ASF GitHub Bot commented on YARN-11251:
---------------------------------------
Samrat002 opened a new pull request, #8208:
URL: https://github.com/apache/hadoop/pull/8208
### Description of PR
When a Hadoop cluster runs in the cloud on spot instances, an AM can be
launched on one of those instances. When such instances are removed, we have
observed many AM launch failures caused by Token Expired or Container
Liveliness Expiry errors, because the AM launch threads are busy retrying
connections to AM hosts (spot instances) that are down. Using separate thread
pools for cleanup and launch will reduce these AM launch failures.
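The idea can be sketched as follows. This is a minimal illustration, not the code in this PR: the class `SplitPoolDispatcher`, its `EventType` enum, and the pool sizes are hypothetical names chosen for the example. It assumes the problematic case is a cleanup task that blocks on an unreachable host, and shows that a launch task still runs because it is served by a different pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical dispatcher; NOT the actual ApplicationMasterLauncher code. */
class SplitPoolDispatcher {
    enum EventType { LAUNCH, CLEANUP }

    // One pool per event type, so a stuck cleanup cannot starve launches.
    private final ExecutorService launchPool;
    private final ExecutorService cleanupPool;

    SplitPoolDispatcher(int launchThreads, int cleanupThreads) {
        launchPool = Executors.newFixedThreadPool(launchThreads);
        cleanupPool = Executors.newFixedThreadPool(cleanupThreads);
    }

    /** Routes the task to the pool that matches its event type. */
    void handle(EventType type, Runnable task) {
        (type == EventType.LAUNCH ? launchPool : cleanupPool).execute(task);
    }

    void shutdown() {
        launchPool.shutdownNow();
        cleanupPool.shutdownNow();
    }

    /** Returns true if a launch completes while the only cleanup thread is blocked. */
    static boolean launchSurvivesStuckCleanup() {
        SplitPoolDispatcher dispatcher = new SplitPoolDispatcher(1, 1);
        CountDownLatch hostNeverAnswers = new CountDownLatch(1); // simulates a dead host
        CountDownLatch launched = new CountDownLatch(1);
        try {
            dispatcher.handle(EventType.CLEANUP, () -> {
                try {
                    hostNeverAnswers.await(); // blocks until the pool is shut down
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            dispatcher.handle(EventType.LAUNCH, launched::countDown);
            return launched.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            dispatcher.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("launch completed: " + launchSurvivesStuckCleanup());
    }
}
```

With a single shared pool of the same total size, the launch task could instead sit in the queue behind blocked cleanup tasks until the container token or the liveliness timer expires.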
### Token Expired
```
2022-07-19 14:56:33,486 ERROR
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
(IPC Server handler 39 on 8041): Unauthorized request to start container.
This token is expired. current time is 1658242593486 found 1658242289457
Note: System times on machines may be out of sync. Check system time and
time zones.
```
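To make the delay in the log above concrete, the two timestamps can be subtracted. This is a small sketch that assumes both values are epoch milliseconds and that the `found` value is the token's expiry time; the class and method names are illustrative only.

```java
import java.time.Duration;

class TokenExpiryGap {
    /** How long after the token's expiry the start-container request arrived. */
    static long millisPastExpiry(long currentTimeMs, long tokenExpiryMs) {
        return currentTimeMs - tokenExpiryMs;
    }

    public static void main(String[] args) {
        // Values copied from the log line above.
        long gap = millisPastExpiry(1658242593486L, 1658242289457L);
        System.out.println(Duration.ofMillis(gap).getSeconds() + " s past expiry");
    }
}
```

The difference is 304029 ms, so the request reached the NodeManager roughly five minutes after the token had expired.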
### Container Liveliness Expiry
```
2022-07-19 16:06:48,663 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl
(ResourceManager Event Processor): container_xxxxxxxxxxxxx_xxxxxxx_xx_000001
Container Transitioned from ACQUIRED to EXPIRED
2022-07-19 16:10:08,663 INFO
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker):
Expired:<container=container_xxxxxxxxxxxxx_xxxxxxx_xx_000001, increase=false>
Timed out after 600 secs
```
Associated ticket:
[YARN-11251](https://issues.apache.org/jira/browse/YARN-11251)
### How was this patch tested?
This patch was tested on an EMR cluster with 1 master node, 1 core node,
and 2 task nodes, where the task nodes are spot instances. We launched an AM on
one of the task nodes and then brought that node down, which reproduces the
scenario described above.
TODO: unit tests need to be added.
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [x] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [x] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> Separate ThreadPool for AMLauncher Launch and Clean Events
> ----------------------------------------------------------
>
> Key: YARN-11251
> URL: https://issues.apache.org/jira/browse/YARN-11251
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.4.0
> Reporter: Prabhu Joseph
> Assignee: Samrat Deb
> Priority: Major
> Labels: pull-request-available
>
> Have seen too many AM Launch Failures due to Token Expired or Container
> Liveliness Expiry when AM Launch Threads are busy retrying to connect to AM
> Host (Spot Instances) which are down. Having Separate ThreadPools for both
> Cleanup and Launch will reduce the AM Launch failures.
> *Token Expired*
> {code}
> 2022-07-19 14:56:33,486 ERROR
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
> (IPC Server handler 39 on 8041): Unauthorized request to start container.
> This token is expired. current time is 1658242593486 found 1658242289457
> Note: System times on machines may be out of sync. Check system time and time
> zones.
> {code}
> *Container Liveliness Expiry*
> {code}
> 2022-07-19 16:06:48,663 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl
> (ResourceManager Event Processor): container_1656573205571_2357731_01_000001
> Container Transitioned from ACQUIRED to EXPIRED
> 2022-07-19 16:10:08,663 INFO
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker):
> Expired:<container=container_1656573205571_2357773_01_000001, increase=false>
> Timed out after 600 secs
> {code}