[ 
https://issues.apache.org/jira/browse/YARN-11251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18054392#comment-18054392
 ] 

ASF GitHub Bot commented on YARN-11251:
---------------------------------------

Samrat002 opened a new pull request, #8208:
URL: https://github.com/apache/hadoop/pull/8208

   
   ### Description of PR
   When a Hadoop cluster runs in the cloud on spot instances, an AM may be 
launched on one of those instances. When such instances are removed, we have 
observed many AM launch failures caused by Token Expired or Container 
Liveliness Expiry errors, because the AM launch threads are busy retrying 
connections to AM hosts (spot instances) that are down. Using separate thread 
pools for cleanup and launch will reduce these AM launch failures.
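
To illustrate the idea only, here is a minimal sketch of routing launch and cleanup events to two independent pools. It is not the code from this pull request: `SplitPoolDispatcher`, `EventType`, and the demo method are hypothetical names, and the blocked cleanup task merely stands in for a retry loop against an unreachable host (which event type actually stalls in practice is an assumption here).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names do not come from the Hadoop code base.
class SplitPoolDispatcher {
  enum EventType { LAUNCH, CLEANUP }

  // One pool per event type, so a stall in one cannot starve the other.
  private final ExecutorService launchPool;
  private final ExecutorService cleanupPool;

  SplitPoolDispatcher(int launchThreads, int cleanupThreads) {
    this.launchPool = Executors.newFixedThreadPool(launchThreads);
    this.cleanupPool = Executors.newFixedThreadPool(cleanupThreads);
  }

  // Route each event to the pool that matches its type.
  void handle(EventType type, Runnable work) {
    switch (type) {
      case LAUNCH:
        launchPool.execute(work);
        break;
      case CLEANUP:
        cleanupPool.execute(work);
        break;
    }
  }

  void shutdown() {
    launchPool.shutdownNow();
    cleanupPool.shutdownNow();
  }

  // Demo: a cleanup task blocks (standing in for retries against a dead
  // host) while a launch task is submitted. Returns true if the launch
  // task still completes within the timeout.
  static boolean launchProceedsWhileCleanupBlocked() {
    SplitPoolDispatcher dispatcher = new SplitPoolDispatcher(1, 1);
    CountDownLatch hostNeverAnswers = new CountDownLatch(1);
    CountDownLatch launched = new CountDownLatch(1);
    try {
      dispatcher.handle(EventType.CLEANUP, () -> {
        try {
          hostNeverAnswers.await(); // blocks until the pool is shut down
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      dispatcher.handle(EventType.LAUNCH, launched::countDown);
      return launched.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      dispatcher.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("launch completed: " + launchProceedsWhileCleanupBlocked());
  }
}
```

With a single shared one-thread pool, the launch task in this demo would wait behind the blocked cleanup task; with two pools it runs immediately.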
   
   ### Token Expired
   
   ```
   2022-07-19 14:56:33,486 ERROR 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl 
(IPC Server handler 39 on 8041): Unauthorized request to start container.
   This token is expired. current time is 1658242593486 found 1658242289457
   Note: System times on machines may be out of sync. Check system time and 
time zones.
   ```
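
As a quick check of the log above, the two epoch-millisecond values can be subtracted to see how late the start request arrived. This sketch assumes the `found` value is the token's expiry timestamp (an interpretation of the message, not something the log states); `TokenExpiryGap` is a hypothetical helper name.

```java
// Hypothetical helper; not part of Hadoop.
class TokenExpiryGap {
  // Milliseconds by which the request missed the token's expiry time.
  static long expiredForMillis(long currentTimeMs, long tokenExpiryMs) {
    return currentTimeMs - tokenExpiryMs;
  }

  public static void main(String[] args) {
    // Values copied from the NodeManager log line above.
    long gapMs = expiredForMillis(1658242593486L, 1658242289457L);
    System.out.println(gapMs + " ms"); // prints "304029 ms"
  }
}
```

Under that assumption, the start request reached the NodeManager roughly 304 seconds (about five minutes) after the token had expired, which is consistent with launch work being queued behind stalled threads.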
   
   ### Container Liveliness Expiry
   
   ```
   2022-07-19 16:06:48,663 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
(ResourceManager Event Processor): container_xxxxxxxxxxxxx_xxxxxxx_xx_000001 
Container Transitioned from ACQUIRED to EXPIRED
   
   2022-07-19 16:10:08,663 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker): 
Expired:<container=container_xxxxxxxxxxxxx_xxxxxxx_xx_000001, increase=false> 
Timed out after 600 secs
   ```
   
   Associated ticket: 
[YARN-11251](https://issues.apache.org/jira/browse/YARN-11251)
   
   
   ### How was this patch tested?
   This patch was tested on an EMR cluster with 1 master node, 1 core node, 
and 2 task nodes, where the task nodes are spot instances. We launched an AM 
on one of the task nodes and then brought that node down, which reproduces the 
scenario described above.
   
   TODO: unit tests need to be added.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [x] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [x] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [x] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> Separate ThreadPool for AMLauncher Launch and Clean Events
> ----------------------------------------------------------
>
>                 Key: YARN-11251
>                 URL: https://issues.apache.org/jira/browse/YARN-11251
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.4.0
>            Reporter: Prabhu Joseph
>            Assignee: Samrat Deb
>            Priority: Major
>              Labels: pull-request-available
>
> Have seen too many AM Launch Failures due to Token Expired or Container 
> Liveliness Expiry when AM Launch Threads are busy retrying to connect to AM 
> Host (Spot Instances) which are down. Having Separate ThreadPools for both 
> Cleanup and Launch will reduce the AM Launch failures.
> *Token Expired*
> {code}
> 2022-07-19 14:56:33,486 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  (IPC Server handler 39 on 8041): Unauthorized request to start container.
> This token is expired. current time is 1658242593486 found 1658242289457
> Note: System times on machines may be out of sync. Check system time and time 
> zones.
> {code}
> *Container Liveliness Expiry*
> {code}
> 2022-07-19 16:06:48,663 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl 
> (ResourceManager Event Processor): container_1656573205571_2357731_01_000001 
> Container Transitioned from ACQUIRED to EXPIRED
> 2022-07-19 16:10:08,663 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor (Ping Checker): 
> Expired:<container=container_1656573205571_2357773_01_000001, increase=false> 
> Timed out after 600 secs
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
