[
https://issues.apache.org/jira/browse/YARN-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996281#comment-15996281
]
Naganarasimha G R commented on YARN-6523:
-----------------------------------------
Thanks for the quick reply [~jlowe],
bq. I also think we can get the delta to work with some effort. Note however
that the delta is per node not some global delta, because nodes may be heart
beating at drastically different times. Therefore there isn't going to be a
good way to build a single, pre-computed
Actually, what I was trying to suggest here was not a delta, but to send the
tokens for all apps for which at least one of the tokens has been renewed
(assuming that the number of apps for which renewal happens will be small).
Based on what you mentioned and what I could understand from the code: if the
tokens have not expired, they are usually available in the
ContainerLaunchContext for the NM to localize the resources. So the RM needs to
send tokens to the NM only for the renewed ones. And as you mentioned
earlier, there are two issues to be addressed:
# A long-running job with a renewed token can get an allocation on a node which
has not launched any container for this app.
# Tokens are renewed for an app while the node is either down or having
connectivity issues.
Sending all tokens during registration might solve the latter issue, but having
a delta per node does not solve the first issue. Hence I was suggesting that we
send the tokens for all apps for which at least one of the tokens has been renewed.
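A rough Java sketch of the idea, only to make the suggestion concrete. RenewedTokenTracker and all of its methods are hypothetical names, not existing YARN classes, and a String stands in for the real serialized credentials:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: class and method names are illustrative, not YARN APIs.
class RenewedTokenTracker {
    // appId -> current credentials (a String stands in for the serialized tokens).
    private final Map<String, String> credentialsByApp = new TreeMap<>();
    // Apps for which at least one token has been renewed since submission.
    private final Set<String> renewedApps = new HashSet<>();

    // Newly submitted app: its unexpired tokens are assumed to reach the NM via
    // ContainerLaunchContext, so nothing is added to the heartbeat set yet.
    void addApp(String appId, String credentials) {
        credentialsByApp.put(appId, credentials);
    }

    // Called when any token of a known app is renewed.
    void onTokenRenewed(String appId, String newCredentials) {
        if (credentialsByApp.containsKey(appId)) {
            credentialsByApp.put(appId, newCredentials);
            renewedApps.add(appId);
        }
    }

    // Finished apps drop out of both the full set and the renewed set.
    void removeApp(String appId) {
        credentialsByApp.remove(appId);
        renewedApps.remove(appId);
    }

    // Registration: send everything, covering nodes that were down or unreachable.
    Map<String, String> credentialsForRegistration() {
        return Collections.unmodifiableMap(new TreeMap<>(credentialsByApp));
    }

    // Heartbeat: the same renewed-apps set goes to every node (not a per-node
    // delta), so a node that never ran the app still receives its renewed tokens.
    Map<String, String> credentialsForHeartbeat() {
        Map<String, String> out = new TreeMap<>();
        for (String appId : renewedApps) {
            out.put(appId, credentialsByApp.get(appId));
        }
        return Collections.unmodifiableMap(out);
    }
}
```

With this shape the heartbeat payload grows with the number of apps that have had a renewal rather than with the number of running apps, which only helps if renewals are comparatively rare, as assumed above.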
bq. If we suddenly start sending a delta in heartbeats instead of the full set
then that's an incompatible semantic change even though the technical signature
of the interface did not change. Old nodemanagers during a rolling upgrade will
not do the correct thing and apps could fail.
Oh, I missed this scenario. Thanks for pointing it out and also for helping
with the solution. IIUC there is no version concept between the RM and NM as of
now, and we need to introduce one now, right?
> RM requires large memory in sending out security tokens as part of Node
> Heartbeat in large cluster
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-6523
> URL: https://issues.apache.org/jira/browse/YARN-6523
> Project: Hadoop YARN
> Issue Type: Bug
> Components: RM
> Affects Versions: 2.8.0, 2.7.3
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Priority: Critical
>
> Currently, as part of the heartbeat response, the RM sets the tokens of all
> applications, even though not all applications might be active on the node.
> On top of that, NodeHeartbeatResponsePBImpl converts the tokens for each app
> into SystemCredentialsForAppsProto. Hence, for each node and each heartbeat,
> too many SystemCredentialsForAppsProto objects were getting created.
> We hit an OOM while testing 2000 concurrent apps on a 500-node cluster with
> 8GB RAM configured for the RM.