[ 
https://issues.apache.org/jira/browse/YARN-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681691#comment-14681691
 ] 

Rohith Sharma K S commented on YARN-3999:
-----------------------------------------

Thanks [~jianhe] for updating the patch.
One concern: SystemMetricsPublisher has been moved from RMActiveServices to 
ResourceManager, so this service will no longer be reinitialized on every RM 
switch. I think this could lead to stale events being processed even after the 
RM is in standby. If the same RM becomes active again, the 
SystemMetricsPublisher dispatcher publishes the stale events plus the recovered 
application events. In that case the events are still processed in sequential 
order. But an issue may occur when a different RM becomes active, i.e.:
# RM1 is active and publishing events.
# RM1 transitions to standby while some events are still in the queue, waiting 
to be written to the timeline server.
# RM2 becomes active and recovers the applications. When an application 
finishes, RM2's SystemMetricsPublisher publishes the app status as finished.
# RM1 is still processing events for that app, and they are processed late, 
i.e. after RM2's events have been processed.

Wouldn't this cause a problem? Any thoughts?
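
To make the ordering concern in the steps above concrete, here is a minimal sketch. It assumes a last-write-wins store keyed by application id, which is only a stand-in and not how the timeline server actually stores entities; the class, method, and field names (StaleEventSketch, AppEvent, publish, simulate) are hypothetical and do not come from the patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleEventSketch {

    // Hypothetical event: which RM published it, for which app, with what state.
    static final class AppEvent {
        final String rm;
        final String appId;
        final String state;

        AppEvent(String rm, String appId, String state) {
            this.rm = rm;
            this.appId = appId;
            this.state = state;
        }
    }

    // Assumption: the store keeps only the most recently written state per app.
    static void publish(Map<String, String> store, AppEvent e) {
        store.put(e.appId, e.state + " (from " + e.rm + ")");
    }

    // Replays the four steps and returns the state the store ends up with.
    static String simulate() {
        Map<String, String> store = new HashMap<>();

        // Steps 1-2: RM1 goes to standby with an event still sitting in its queue.
        List<AppEvent> rm1Queue = new ArrayList<>();
        rm1Queue.add(new AppEvent("RM1", "app_1", "RUNNING"));

        // Step 3: RM2 becomes active, recovers the app, and publishes FINISHED.
        publish(store, new AppEvent("RM2", "app_1", "FINISHED"));

        // Step 4: RM1's long-lived dispatcher drains its queue afterwards.
        for (AppEvent e : rm1Queue) {
            publish(store, e);
        }
        return store.get("app_1");
    }

    public static void main(String[] args) {
        // Prints "RUNNING (from RM1)": the stale event lands after the newer one.
        System.out.println(simulate());
    }
}
```

Whether this matters in practice depends on how the timeline server orders or merges events for the same entity, which this sketch does not model.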

> RM hangs on draining events
> ---------------------------
>
>                 Key: YARN-3999
>                 URL: https://issues.apache.org/jira/browse/YARN-3999
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, 
> YARN-3999.3.patch, YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, 
> YARN-3999.patch
>
>
> If external systems like ATS or ZK become very slow, draining all the 
> events takes a lot of time. If this time exceeds 10 minutes, all 
> applications will expire. Fixes include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move the ATS service out of the RM active services so that RM doesn't need 
> to wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that RM is stopping/transitioning.
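
Fix 1 in the description above (stop the dispatcher after a timeout even if events remain) can be sketched roughly as follows. This is an illustration under assumptions, not the code in the attached patches: DrainWithTimeoutSketch, post, and stop are hypothetical names, and the real dispatcher in YARN has a different structure.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DrainWithTimeoutSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker;
    private volatile boolean stopped = false;

    DrainWithTimeoutSketch() {
        // Single handler thread that processes queued events in order.
        worker = new Thread(() -> {
            while (!stopped && !Thread.currentThread().isInterrupted()) {
                try {
                    queue.take().run();
                } catch (InterruptedException e) {
                    return; // interrupted while waiting for the next event
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    void post(Runnable event) {
        queue.add(event);
    }

    /**
     * Waits up to timeoutMillis for the queue to empty, then stops the worker
     * regardless. Returns true only if the queue was empty when we stopped.
     * Note: an empty queue does not guarantee the last event has finished.
     */
    boolean stop(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!queue.isEmpty() && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        boolean drained = queue.isEmpty();
        stopped = true;
        worker.interrupt();   // give up on whatever is still pending
        worker.join(1000);
        return drained;
    }

    public static void main(String[] args) throws InterruptedException {
        DrainWithTimeoutSketch dispatcher = new DrainWithTimeoutSketch();
        // Simulate a slow external system: each event takes 200 ms.
        for (int i = 0; i < 5; i++) {
            dispatcher.post(() -> {
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // Stop after 100 ms instead of waiting for every event to be handled.
        System.out.println("drained = " + dispatcher.stop(100));
    }
}
```

The trade-off is that events still queued at the deadline are dropped, in exchange for a bounded transition time.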



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
