Sangjin Lee commented on YARN-3039:

Thanks [~djp] for the doc!

Some high level comments:
- I'm also thinking that option 2 might be more feasible, mostly from the 
standpoint of limiting the risk. Having said that, I haven't followed YARN-913 
closely enough to see how close it is...
- The service discovery needs to work across all these different modes: NM aux 
service, standalone per-node daemon, and standalone per-app daemon. That needs 
to be one of the primary considerations in this.
- The failure scenarios need more detail in their own right; for this JIRA, I 
think it is sufficient to look at how they may affect the service discovery, 
and to design just enough to cover that.

We need a per-application logical aggregator for ATS which provides aggregator 
service in form of REST API to: RM, AM and NMs,
The RM will likely not use the service discovery. For example, for RM to write 
the app started event, the timeline aggregator may not even be initialized yet.

However, AM container could be reschedule to other node for some reason 
(container failure, etc.), so we cannot guarantee the two always together.
If the AM fails and restarts on another node, the existing per-app aggregator 
should be shut down and a new one started on the new node. In fact, in the aux 
service setup, that comes most naturally. So I think we should try to keep that 
behavior as much as possible.
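
To make the relocation case concrete, here is a minimal sketch of the 
bookkeeping I have in mind. It is only an illustration: the class name, method 
names, and the string address format are all made up for this comment and are 
not part of YARN or of the attached doc. It assumes a single component tracks 
which node currently hosts each app's aggregator, and tells the caller which 
old instance to shut down when the AM shows up on a different node.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names exist in YARN.
public class AggregatorRegistry {
    // application id -> address of the aggregator currently serving that app
    private final Map<String, String> addressByApp = new ConcurrentHashMap<>();

    /**
     * Records the aggregator address for a (re)started AM.
     * Returns the previous address if the aggregator moved, so the caller
     * can shut the stale instance down; empty otherwise.
     */
    public Optional<String> onAmStarted(String appId, String aggregatorAddress) {
        String previous = addressByApp.put(appId, aggregatorAddress);
        if (previous != null && !previous.equals(aggregatorAddress)) {
            return Optional.of(previous);
        }
        return Optional.empty();
    }

    /** Lookup used by clients (AM, NMs) to find where to send writes. */
    public Optional<String> lookup(String appId) {
        return Optional.ofNullable(addressByApp.get(appId));
    }

    /** Drops the mapping once the application has finished. */
    public void onAppFinished(String appId) {
        addressByApp.remove(appId);
    }
}
```

With something like this, discovery always resolves to the node of the latest 
AM attempt, and the returned old address gives a single place to trigger 
shutdown of the stale aggregator.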

Failure Cases: 3. Aggregator failed (only):
We're talking about the aggregator failing as a standalone daemon, correct?

> [Aggregator wireup] Implement ATS writer service discovery
> ----------------------------------------------------------
>                 Key: YARN-3039
>                 URL: https://issues.apache.org/jira/browse/YARN-3039
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Robert Kanter
>         Attachments: Service Binding for applicationaggregator of ATS 
> (draft).pdf
> Per design in YARN-2928, implement ATS writer service discovery. This is 
> essential for off-node clients to send writes to the right ATS writer. This 
> should also handle the case of AM failures.
