Junping Du commented on YARN-3039:

Thanks [~zjshen] for the review and comments!
bq. I think so, too. RM has its own builtin aggregator, and RM directly writes 
through it.
I have a very basic question here: didn't we want a single app-level aggregator 
for all app-related events, logs, etc.? Ideally, only this singleton aggregator 
has the magic to sort out app info during aggregation. If not, we could even give 
up the current flow "NM(s) -> app aggregator (deployed on one NM) -> backend" and 
let NMs talk to the backend directly to save a hop of traffic. Can you clarify 
more on this?

bq. in the heartbeat, instead of always sending the snapshot of the aggregator 
address info, can we send the incremental information upon any change that happens 
to the aggregator address table? Usually, the aggregator will not change its 
place often, such that we can avoid unnecessary additional traffic in most cases.
That's a very good point for discussion. 
The interesting thing here is that only by comparing against the info from the 
client (NM) can we know what has changed on the server (RM) since the last 
heartbeat. Take the token update as an example (populateKeys() in 
ResourceTrackerService): in the current implementation we encode the master keys 
known by the NM (ContainerTokenMasterKey and NMTokenMasterKey) into the request, 
and then in the response we filter out the old keys already known by the NM. 
IMO, this approach (put everything in the request, and put something/nothing in 
the response) is not an optimization over putting nothing in the request and 
everything in the response; it only turns outbound traffic into inbound traffic 
and moves the comparison logic to the server side. Isn't it?
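To make the comparison concrete, here is a minimal, self-contained sketch of the 
pattern being discussed (class and field names are made up for illustration; this 
is not the actual populateKeys() code):
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch only: the client reports the key ids it already holds,
 * and the server filters the response down to keys the client does not have yet.
 */
public class KnownKeyFilterSketch {

  /** Server-side view: current master keys indexed by key id. */
  private final Map<Integer, byte[]> currentMasterKeys = new HashMap<Integer, byte[]>();

  /** Return only the keys whose ids the NM did not report as known. */
  public Map<Integer, byte[]> filterResponse(Set<Integer> keyIdsKnownByNM) {
    Map<Integer, byte[]> newKeys = new HashMap<Integer, byte[]>();
    for (Map.Entry<Integer, byte[]> e : currentMasterKeys.entrySet()) {
      if (!keyIdsKnownByNM.contains(e.getKey())) {
        newKeys.put(e.getKey(), e.getValue());
      }
    }
    return newKeys;
  }
}
{code}
Either way, roughly the same amount of data still crosses the wire; the only 
difference is whether the comparison runs on the NM or the RM.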
Another optimization we can consider here is to let the client express the app 
aggregators it is interested in within the request (by adding them to a new 
optional field, e.g. InterestedApps) when it finds that this info is missing or 
stale, and have the server loop in only the related app aggregator info. The NM 
can maintain a list of interested app aggregators, which gets updated when an 
app's container is launched on the NM for the first time or when the app's 
aggregator info becomes stale (which may be reported by the writer/reader's retry 
logic), and items are removed from the list once their info is received in the 
heartbeat response.
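To illustrate the idea, a minimal self-contained sketch of the NM-side 
bookkeeping (class and method names here are hypothetical; the real interest 
list would ride on the node heartbeat request, e.g. in the proposed 
InterestedApps field):
{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical NM-side tracker for app aggregators whose address is missing or stale. */
public class InterestedAggregatorTracker {

  /** Aggregator addresses currently known to this NM, keyed by application id. */
  private final Map<String, String> aggregatorAddrs =
      new ConcurrentHashMap<String, String>();

  /** Apps whose aggregator address we want the RM to include in the next response. */
  private final Set<String> interestedApps =
      Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

  /** The app's first container just launched on this NM: we need its aggregator address. */
  public void onFirstContainerLaunched(String appId) {
    interestedApps.add(appId);
  }

  /** Writer/reader retry logic reported the cached address as stale: ask the RM again. */
  public void onAggregatorAddressStale(String appId) {
    aggregatorAddrs.remove(appId);
    interestedApps.add(appId);
  }

  /** Attached to the outgoing heartbeat request (the proposed InterestedApps field). */
  public Set<String> getInterestedApps() {
    return interestedApps;
  }

  /** The response only carries addresses for the apps we asked about. */
  public void onHeartbeatResponse(Map<String, String> appToAggregatorAddr) {
    for (Map.Entry<String, String> e : appToAggregatorAddr.entrySet()) {
      aggregatorAddrs.put(e.getKey(), e.getValue());
      interestedApps.remove(e.getKey()); // received, drop it from the interest list
    }
  }
}
{code}
With this, the common-case heartbeat carries an empty interest list and the 
response carries no aggregator info at all, which is where the traffic saving 
would come from.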

bq. One additional issue related to the RM state store: calling it in the update 
transition may break the app recovery. The current state instead of the final 
state will be written into the store. If RM stops and restarts at this moment, 
this app can't be recovered properly.
Thanks for the reminder on this. This is something I am not 100% sure about. 
However, from recoverApplication() in RMAppManager, I didn't see why we cannot 
recover an app in RUNNING or another non-final state (final states being killed, 
finished, etc.). Am I missing anything here? One piece of code that is indeed 
missing is that I forgot to repopulate aggregatorAddr from the store in 
RMAppImpl.recover(); I will add it back in the next patch.
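For reference, the missing piece would be roughly of this shape (a rough sketch 
only; the accessor on the persisted app state is a hypothetical stand-in, not the 
actual RMAppImpl/state-store API):
{code:java}
/**
 * Sketch only: repopulating the aggregator address when an app is recovered
 * after an RM restart. PersistedAppState and getAggregatorAddr() are
 * hypothetical stand-ins for whatever the RM state store keeps for the app.
 */
public class AppRecoverySketch {

  private String aggregatorAddr;
  private String recoveredState;

  public void recover(PersistedAppState appState) {
    this.recoveredState = appState.getState();
    // Without this line, a recovered running app would come back with a null
    // aggregator address and clients could not be redirected to its aggregator.
    this.aggregatorAddr = appState.getAggregatorAddr();
  }

  /** Hypothetical view of the persisted application state. */
  public interface PersistedAppState {
    String getState();
    String getAggregatorAddr();
  }
}
{code}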

> [Aggregator wireup] Implement ATS writer service discovery
> ----------------------------------------------------------
>                 Key: YARN-3039
>                 URL: https://issues.apache.org/jira/browse/YARN-3039
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Junping Du
>         Attachments: Service Binding for applicationaggregator of ATS 
> (draft).pdf, YARN-3039-no-test.patch
> Per design in YARN-2928, implement ATS writer service discovery. This is 
> essential for off-node clients to send writes to the right ATS writer. This 
> should also handle the case of AM failures.
