[ 
https://issues.apache.org/jira/browse/YARN-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996281#comment-15996281
 ] 

Naganarasimha G R commented on YARN-6523:
-----------------------------------------

Thanks for the quick reply [~jlowe],
bq. I also think we can get the delta to work with some effort. Note however 
that the delta is per node not some global delta, because nodes may be heart 
beating at drastically different times. Therefore there isn't going to be a 
good way to build a single, pre-computed 
Well, actually what I was trying to suggest here was not a delta, but to send 
the tokens for all apps for which at least one token gets renewed (assuming 
that renewal happens for only a small number of apps).
Based on what you mentioned and what I could understand from the code: if the 
tokens have not expired, they are usually available in the 
ContainerLaunchContext for the NM to localize the resources. So the RM needs to 
send tokens to the NM only for the renewed ones. And as you mentioned earlier, 
there are two issues to be addressed:
# A long-running job with a renewed token can get an allocation on a node which 
has not launched any container for this app.
# Tokens are renewed for an app while the node is either down or having 
connectivity issues.

Sending all tokens during registration might solve the latter issue, but having 
a per-node delta does not solve the first issue. Hence I was suggesting that we 
send the tokens for all apps for which at least one token gets renewed.
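
To make the suggestion concrete, here is a minimal sketch of the bookkeeping it 
implies. It is not YARN code: the class RenewedAppTokens, its methods, and the 
use of plain strings for tokens are all hypothetical, and it deliberately 
leaves open when an app can be dropped from the renewed set (which is the part 
that depends on node registration and rolling upgrades).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; none of these names exist in YARN.
class RenewedAppTokens {
    // All current tokens per application (a token is modelled as a String).
    private final Map<String, List<String>> tokensByApp = new HashMap<>();
    // Applications for which at least one token has been renewed.
    private final Set<String> renewedApps = new HashSet<>();

    // Record the initial tokens of an application; nothing is marked renewed.
    void addApp(String appId, List<String> tokens) {
        tokensByApp.put(appId, new ArrayList<>(tokens));
    }

    // Replace one token of an application and remember that the app was renewed.
    void renewToken(String appId, int index, String newToken) {
        tokensByApp.get(appId).set(index, newToken);
        renewedApps.add(appId);
    }

    // Payload for a heartbeat response: every token of each renewed app,
    // and nothing for apps whose tokens were never renewed.
    Map<String, List<String>> heartbeatPayload() {
        Map<String, List<String>> payload = new HashMap<>();
        for (String appId : renewedApps) {
            payload.put(appId, new ArrayList<>(tokensByApp.get(appId)));
        }
        return payload;
    }
}
```

With two apps registered and a single token of one of them renewed, the payload 
contains only that app, but with all of its tokens, which is the difference 
from both the full set and a per-node delta.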

bq. If we suddenly start sending a delta in heartbeats instead of the full set 
then that's an incompatible semantic change even though the technical signature 
of the interface did not change. Old nodemanagers during a rolling upgrade will 
not do the correct thing and apps could fail. 
Oh, I missed this scenario; thanks for pointing it out and for helping with the 
solution. IIUC there is no version concept between the RM and NM as of now, and 
we would need to bring one in now, right?

> RM requires large memory in sending out security tokens as part of Node 
> Heartbeat in large cluster
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6523
>                 URL: https://issues.apache.org/jira/browse/YARN-6523
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: RM
>    Affects Versions: 2.8.0, 2.7.3
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>            Priority: Critical
>
> Currently, as part of the heartbeat response, the RM sets all applications' 
> tokens even though not all applications may be active on the node. On top of 
> that, NodeHeartbeatResponsePBImpl converts the tokens for each app into 
> SystemCredentialsForAppsProto. Hence, for each node and each heartbeat, too 
> many SystemCredentialsForAppsProto objects were getting created.
> We hit an OOM while testing 2000 concurrent apps on a 500-node cluster with 
> 8GB RAM configured for the RM.


