[
https://issues.apache.org/jira/browse/YARN-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995030#comment-15995030
]
Jason Lowe commented on YARN-6523:
----------------------------------
Sending the full list at registration time makes a lot of sense to me, and I
also think we can get the delta to work with some effort. Note, however, that
the delta is _per node_, not some global delta, because nodes may be
heartbeating at drastically different times. Therefore there isn't going to be
a good way to build a single, pre-computed SystemCredentialsForAppsProto for
deltas. Each node will have to receive the app tokens that have been renewed
since its last heartbeat, and that will be a different list than for other
nodes in the cluster. Many nodes will share the same delta, but it won't be
the same for all of them.
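
To make the per-node bookkeeping concrete, here is a minimal sketch of one way
it could be tracked. It assumes a monotonically increasing renewal sequence
number on the RM side; the class PerNodeCredentialDelta and all of its fields
and methods are hypothetical names for illustration, not existing YARN code,
and credentials are shown as plain strings rather than protos.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: per-node credential deltas keyed by a renewal sequence number. */
final class PerNodeCredentialDelta {
  // Bumped on every token renewal (assumed counter, not an existing RM field).
  private long currentSeq = 0;
  // appId -> sequence number of that app's most recent renewal.
  private final Map<String, Long> lastRenewalSeq = new HashMap<>();
  // appId -> latest credentials (a String stands in for the serialized tokens).
  private final Map<String, String> credentials = new HashMap<>();
  // nodeId -> sequence number the node was caught up to at its last contact.
  private final Map<String, Long> nodeSeenSeq = new HashMap<>();

  /** Records a renewal so that every node sees it on its next heartbeat. */
  synchronized void tokenRenewed(String appId, String creds) {
    currentSeq++;
    lastRenewalSeq.put(appId, currentSeq);
    credentials.put(appId, creds);
  }

  /** Registration: the node receives the full set and is marked as caught up. */
  synchronized Map<String, String> register(String nodeId) {
    nodeSeenSeq.put(nodeId, currentSeq);
    return new TreeMap<>(credentials);
  }

  /** Heartbeat: the node receives only what was renewed since its own last contact. */
  synchronized Map<String, String> heartbeat(String nodeId) {
    Long seen = nodeSeenSeq.get(nodeId);
    if (seen == null) {
      return register(nodeId); // unknown node: fall back to the full set
    }
    Map<String, String> delta = new TreeMap<>();
    for (Map.Entry<String, Long> e : lastRenewalSeq.entrySet()) {
      if (e.getValue() > seen) {
        delta.put(e.getKey(), credentials.get(e.getKey()));
      }
    }
    nodeSeenSeq.put(nodeId, currentSeq);
    return delta;
  }

  public static void main(String[] args) {
    PerNodeCredentialDelta rm = new PerNodeCredentialDelta();
    rm.tokenRenewed("app1", "c1");
    rm.register("nodeA");
    rm.register("nodeB");
    rm.tokenRenewed("app2", "c2");
    System.out.println("nodeA delta: " + rm.heartbeat("nodeA"));
    rm.tokenRenewed("app3", "c3");
    // nodeB heartbeats later than nodeA, so its delta covers more renewals.
    System.out.println("nodeA delta: " + rm.heartbeat("nodeA"));
    System.out.println("nodeB delta: " + rm.heartbeat("nodeB"));
  }
}
```

The sketch scans every app on each heartbeat and assumes the heartbeat
response always reaches the node; a real implementation would also have to
handle lost responses and drop finished apps.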
Also note that there is going to be an interface change even with your
proposal. The current code assumes that the system credentials received in a
heartbeat _replace_ the previous set of credentials. If we suddenly start
sending a delta in heartbeats instead of the full set then that's an
incompatible semantic change even though the technical signature of the
interface did not change. Old nodemanagers during a rolling upgrade will not
do the correct thing and apps could fail. So, at a minimum, the RM would need
to check the NM version and always send the full system credentials in each
heartbeat if the NM version is "old", and only use the delta when the NM is
beyond a certain version.
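
As a rough illustration of that version gate, the sketch below picks between
full and delta credentials from the version string the NM reports. The class
CredentialsModeSelector, the Mode enum, and the minDeltaVersion parameter are
hypothetical; the cut-over version is left to the caller because none has been
decided here, and the sketch treats it as the first version that understands
deltas. The comparison is a simplified dotted-number compare written for this
example.

```java
/** Hypothetical sketch: choose full vs. delta credentials based on the NM version. */
final class CredentialsModeSelector {
  enum Mode { FULL, DELTA }

  /** Drops a qualifier such as "-SNAPSHOT" so only the dotted numbers are compared. */
  private static String stripSuffix(String version) {
    int dash = version.indexOf('-');
    return dash < 0 ? version : version.substring(0, dash);
  }

  /** Compares dotted numeric versions; missing components count as 0. */
  static int compareVersions(String a, String b) {
    String[] x = stripSuffix(a).split("\\.");
    String[] y = stripSuffix(b).split("\\.");
    int n = Math.max(x.length, y.length);
    for (int i = 0; i < n; i++) {
      int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
      int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
      if (xi != yi) {
        return Integer.compare(xi, yi);
      }
    }
    return 0;
  }

  /**
   * Returns DELTA only when the NM is known to be at or past minDeltaVersion
   * (assumed to be the first version that understands deltas). Anything
   * unknown or unparseable is treated as "old" and gets the full set.
   */
  static Mode select(String nmVersion, String minDeltaVersion) {
    if (nmVersion == null || nmVersion.isEmpty()) {
      return Mode.FULL;
    }
    try {
      return compareVersions(nmVersion, minDeltaVersion) >= 0 ? Mode.DELTA : Mode.FULL;
    } catch (NumberFormatException e) {
      return Mode.FULL;
    }
  }

  public static void main(String[] args) {
    String cutOver = "3.1.0"; // placeholder value for the example only
    for (String v : new String[] {"2.7.3", "2.8.0", "3.1.0", "3.2.0-SNAPSHOT"}) {
      System.out.println(v + " -> " + select(v, cutOver));
    }
  }
}
```

Defaulting to the full set whenever the version cannot be determined keeps old
nodemanagers on the existing replace semantics during a rolling upgrade.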
> RM requires large memory in sending out security tokens as part of Node
> Heartbeat in large cluster
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-6523
> URL: https://issues.apache.org/jira/browse/YARN-6523
> Project: Hadoop YARN
> Issue Type: Bug
> Components: RM
> Affects Versions: 2.8.0, 2.7.3
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Priority: Critical
>
> Currently, as part of the heartbeat response, the RM sets all applications'
> tokens, even though not all applications may be active on the node. On top
> of that, NodeHeartbeatResponsePBImpl converts the tokens for each app into
> SystemCredentialsForAppsProto. Hence, for each node and each heartbeat, too
> many SystemCredentialsForAppsProto objects were getting created.
> We hit an OOM while testing 2000 concurrent apps on a 500-node cluster with
> 8GB RAM configured for the RM.