Junping Du commented on YARN-3039:
Thanks [~sjlee0] for comments!
bq. I'm also thinking that option 2 might be more feasible, mostly from the
standpoint of limiting the risk. Having said that, I haven't followed YARN-913
closely enough to see how close it is...
I was thinking the same. As discussed with [~vinodkv] offline, we prefer to
start the work immediately, based on the features currently implemented in YARN.
[~rkanter], please let us know if you have different ideas here.
bq. The service discovery needs to work across all these different modes: NM
aux service, standalone per-node daemon, and standalone per-app daemon. That
needs to be one of the primary considerations in this.
Agreed. I think what doesn't change here is that there are still three
counterparts (AM, NM and RM) that need to know the service info (the URL for the
REST API), so we put the RM here as the central point for registration. What
could differ across the modes you mentioned above is who does the registration
and how. I would prefer that some other JIRA, like YARN-3033, address these
differences. Thoughts?
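To make the "RM as the central point for registration" idea concrete, here is a minimal sketch of what such a registry could look like. All names here (AggregatorRegistry, register, lookup, unregister, and the example addresses) are hypothetical and are not taken from YARN or the attached design doc; it only assumes that the registry maps an application id to the REST address of its aggregator.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: these names are illustrative, not YARN APIs.
final class AggregatorRegistry {
    // appId -> REST address of the aggregator currently serving that app
    private final Map<String, String> addresses = new ConcurrentHashMap<>();

    // Called by whoever hosts the aggregator (NM aux service or a standalone
    // daemon). A later registration for the same app replaces the earlier one,
    // which covers an aggregator that is restarted on a different node.
    void register(String appId, String restAddress) {
        addresses.put(appId, restAddress);
    }

    // Called by a client (e.g. the AM or an NM) to find where to send writes.
    Optional<String> lookup(String appId) {
        return Optional.ofNullable(addresses.get(appId));
    }

    // Called when the app finishes or its aggregator is shut down.
    void unregister(String appId) {
        addresses.remove(appId);
    }

    public static void main(String[] args) {
        AggregatorRegistry registry = new AggregatorRegistry();
        registry.register("app_1", "http://nm1.example.com:8188");
        // The aggregator moves to another node, so the address is replaced.
        registry.register("app_1", "http://nm2.example.com:8188");
        System.out.println(registry.lookup("app_1").orElse("not registered"));
        registry.unregister("app_1");
        System.out.println(registry.lookup("app_1").orElse("not registered"));
    }
}
```

In this sketch the registry does not care who calls register, so the aux service, per-node daemon and per-app daemon modes would differ only in the caller, which is the part I suggest leaving to another JIRA.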
bq. The RM will likely not use the service discovery. For example, for RM to
write the app started event, the timeline aggregator may not even be
That's a very good point. We need the RM to write some initial app info on its
own. However, do we expect the RM to write all app-specific info, or just the
info at the beginning? We have a similar case in launching an app's containers:
the first AM container gets launched by the RM, but the following containers get
launched by the AM. Do we want to follow this pattern if we want to consolidate
all app info with only one app aggregator?
bq. If the AM fails and starts in another node, the existing per-app aggregator
should be shut down, and started on the new node. In fact, in the aux service
setup, that comes most naturally. So I think we should try to keep that as much
As I said in the proposal, we should make a best effort to co-locate the two
things. However, I think we also want to decouple their life cycles, which
could make things more robust. Besides the case where the aggregator is alive
but the AM dies, another quick example is: the AM container works fine, but the
aggregator on this NM cannot be bound/started (for some reason, e.g. the port is
banned, etc.). In those cases, we may not want to kill the AM container (or the
aggregator service) for aggregation locality reasons, given these are rare
cases, so keep simple should
bq. We're talking about the aggregator failing as a standalone daemon, correct?
Yes and no. Even as an auxiliary service of the NM, the aggregator could fail on
its own for some reason, e.g. the port is blocked, etc. Am I missing anything here?
> [Aggregator wireup] Implement ATS writer service discovery
> Key: YARN-3039
> URL: https://issues.apache.org/jira/browse/YARN-3039
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Reporter: Sangjin Lee
> Assignee: Robert Kanter
> Attachments: Service Binding for applicationaggregator of ATS
> Per design in YARN-2928, implement ATS writer service discovery. This is
> essential for off-node clients to send writes to the right ATS writer. This
> should also handle the case of AM failures.