Jian He commented on YARN-3999:

I talked to [~zjshen] about this too. I think it's fine as the event processing 
order is not that critical. Also each timeline entity has a timestamp which 
itself indicates the order of the event too. IMO, this is similar to multiple 
containers writing to ATS at the same time. There's no guarantee that the 
earliest generated event gets published into ATS first.

> RM hangs on draining events
> ---------------------------
>                 Key: YARN-3999
>                 URL: https://issues.apache.org/jira/browse/YARN-3999
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-3999.1.patch, YARN-3999.2.patch, YARN-3999.2.patch, 
> YARN-3999.3.patch, YARN-3999.4.patch, YARN-3999.5.patch, YARN-3999.patch, 
> YARN-3999.patch
> If external systems like ATS or ZK become very slow, draining all the 
> events takes a lot of time. If this time exceeds 10 minutes, all 
> applications will expire. Fixes include:
> 1. Add a timeout and stop the dispatcher even if not all events are drained.
> 2. Move ATS service out from RM active service so that RM doesn't need to 
> wait for ATS to flush the events when transitioning to standby.
> 3. Stop client-facing services (ClientRMService etc.) first so that clients 
> get fast notification that RM is stopping/transitioning.
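
For illustration only, fix 1 (a bounded drain on stop) could be sketched roughly as below. This is not the code from the attached patches and not the real dispatcher class; the class name DrainWithTimeout, the drain method, and the use of plain Runnable events are all hypothetical stand-ins.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a bounded drain; not the actual YARN dispatcher. */
class DrainWithTimeout {

    /**
     * Handles queued events until the queue is empty or the timeout elapses.
     * Returns the number of events left unprocessed when draining stopped.
     */
    static int drain(BlockingQueue<Runnable> queue, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        // Compare via subtraction so the check stays correct if nanoTime wraps.
        while (System.nanoTime() - deadline < 0) {
            Runnable event = queue.poll();
            if (event == null) {
                return 0; // queue fully drained before the deadline
            }
            event.run(); // may be slow, e.g. a write to a slow external store
        }
        // Deadline passed: give up on whatever is still queued.
        return queue.size();
    }

    public static void main(String[] args) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 10; i++) {
            queue.add(() -> {
                try {
                    Thread.sleep(50); // simulate a slow event handler
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int left = drain(queue, 120);
        System.out.println("events left undrained: " + left);
    }
}
```

Note that this sketch only checks the deadline between events, so a single handler that blocks for a long time would still delay the stop; a real fix would also need to bound or interrupt the handler itself.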

This message was sent by Atlassian JIRA
