Junping Du commented on YARN-3039:

Thanks [~sjlee0] for the comments!
bq. I'm also thinking that option 2 might be more feasible, mostly from the 
standpoint of limiting the risk. Having said that, I haven't followed YARN-913 
closely enough to see how close it is...
I was thinking the same. As discussed with [~vinodkv] offline, we prefer to 
start the work immediately, based on the features currently implemented in 
YARN. [~rkanter], please let us know if you have different ideas here.

bq. The service discovery needs to work across all these different modes: NM 
aux service, standalone per-node daemon, and standalone per-app daemon. That 
needs to be one of the primary considerations in this.
Agreed. What doesn't change across these modes is that there are still three 
parties (AM, NM and RM) that need to know the service info (the URL of the REST 
API), so we put the RM at the center as the point of registration. What could 
differ across the modes you mention is who does the registration and how. I 
would prefer that another JIRA, such as YARN-3033, address these differences. 
Thoughts?
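
To make the registration idea a bit more concrete, here is a rough sketch of 
the RM acting as the central registration point. It is only an illustration, 
not a proposed API: the class name, the method names, the app id and the URL 
are all hypothetical, and a plain in-memory map stands in for whatever the RM 
would really use.

```python
from typing import Dict, Optional


class AggregatorRegistry:
    """Hypothetical in-memory stand-in for the RM-side registration point."""

    def __init__(self) -> None:
        # app id -> REST URL of the aggregator currently serving that app
        self._addresses: Dict[str, str] = {}

    def register(self, app_id: str, rest_url: str) -> None:
        # A later registration for the same app replaces the earlier one,
        # e.g. when the aggregator is started again on another node.
        self._addresses[app_id] = rest_url

    def unregister(self, app_id: str) -> None:
        # Removing an unknown app is a no-op.
        self._addresses.pop(app_id, None)

    def lookup(self, app_id: str) -> Optional[str]:
        # AM and NM would ask here; None means nothing is registered yet.
        return self._addresses.get(app_id)


registry = AggregatorRegistry()
# Placeholder app id and URL, for illustration only.
registry.register("application_0001", "http://nm1.example.com:12345/timeline")
current_url = registry.lookup("application_0001")
```

Who calls register() (NM aux service, per-node daemon or per-app daemon) is 
exactly the part that may differ between the modes; the lookup side stays the 
same.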

bq. The RM will likely not use the service discovery. For example, for RM to 
write the app started event, the timeline aggregator may not even be 
initialized yet.
That's a very good point. We need the RM to write some initial app info on its 
own. However, do we expect the RM to write all app-specific info, or only at 
the beginning? We have a similar case in launching an app's containers: the 
first AM container gets launched by the RM, but the following containers get 
launched by the AM. Do we want to follow this pattern if we want to consolidate 
all app info in a single app aggregator?

bq. If the AM fails and starts in another node, the existing per-app aggregator 
should be shut down, and started on the new node. In fact, in the aux service 
setup, that comes most naturally. So I think we should try to keep that as much 
as possible.
As I said in the proposal, we should make a best effort to co-locate the two. 
However, I think we also want to decouple their life cycles, which could make 
things more robust. Besides the case where the aggregator is alive but the AM 
dies, another quick example: the AM container works fine, but the aggregator on 
this NM cannot bind or start (for some reason, e.g. the port is blocked). In 
those cases, we may not want to kill the AM container (or the aggregator 
service) just for the sake of aggregation locality. These cases are rare, so 
keeping things simple should be better.
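
A small sketch of what "best effort" placement with decoupled life cycles 
could look like. This is my assumption of one possible behavior, not something 
that exists in YARN: the function, its parameters and the node names are 
hypothetical, and try_start stands in for whatever actually brings the 
aggregator up on a node. Falling back to another node is also an assumption; 
the only point is that a failed local start does not touch the AM container.

```python
from typing import Callable, Iterable, Optional


def place_aggregator(am_node: str,
                     candidate_nodes: Iterable[str],
                     try_start: Callable[[str], bool]) -> Optional[str]:
    """Return the node where the aggregator started, or None if none worked."""
    # Best effort: prefer the AM's node for locality.
    if try_start(am_node):
        return am_node
    # Local start failed (e.g. the port is blocked). The AM container is
    # left alone; we only give up locality and try the other nodes.
    for node in candidate_nodes:
        if node != am_node and try_start(node):
            return node
    return None


# Example: the aggregator cannot start on nm1, where the AM runs.
chosen = place_aggregator("nm1", ["nm1", "nm2", "nm3"], lambda n: n != "nm1")
```

In this example the aggregator lands on another node while the AM keeps 
running on nm1.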

bq. We're talking about the aggregator failing as a standalone daemon, correct?
Yes and no. Even as an auxiliary service of the NM, the aggregator could fail 
on its own for some reason, e.g. the port is blocked. Am I missing anything 
here?

> [Aggregator wireup] Implement ATS writer service discovery
> ----------------------------------------------------------
>                 Key: YARN-3039
>                 URL: https://issues.apache.org/jira/browse/YARN-3039
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Robert Kanter
>         Attachments: Service Binding for applicationaggregator of ATS 
> (draft).pdf
> Per design in YARN-2928, implement ATS writer service discovery. This is 
> essential for off-node clients to send writes to the right ATS writer. This 
> should also handle the case of AM failures.
