[ 
https://issues.apache.org/jira/browse/YARN-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685353#comment-16685353
 ] 

Jason Lowe commented on YARN-6523:
----------------------------------

Thanks for updating the patch!  The unit test failure is unrelated and tracked 
by YARN-8937.

Changes are looking much better, although it would be good to implement one 
suggestion I made early on:
bq. we should precompute the SystemCredentialsForAppsProto once when the 
credentials change and re-send the same object to any node that needs the 
updated credentials rather than recreate the same object over and over and over.
The NodeHeartbeatResponseProto.Builder already has a 
setSystemCredentialsForApps method that takes a SystemCredentialsForAppsProto, 
but that would need to be exposed in the PBImpl wrapper so every time the 
system credentials change we can precompute the protobuf object that will be 
sent to every NM on their next heartbeat.  It's pretty expensive in terms of 
generated garbage to compute this, and it would be nice not to compute it on 
the next heartbeat from every node in the cluster when the token sequence 
number changes.

Nit: Checking for hasTokenSequenceNo and returning zero if not present is 
redundant.  getTokenSequenceNo already does this if the field is not present, 
since 0 is the default value for longs in a protobuf (can be changed in the 
protobuf definition if desired).

My concerns about the unit test duration have not been addressed.  This single 
unit test takes almost _two minutes_ to execute.  The unit testing could be 
done much faster if the tests were more like unit tests than integration tests. 
 One approach is to split the testing into two, separate unit tests for the two 
prime units requiring testing as part of this feature.  One test can verify the 
DelegationTokenRenewer updates the sequence number when new delegation tokens 
are requested, and another test can verify that ResourceTrackerService sends 
system credentials in the heartbeat response only when the sequence numbers 
mismatch.  Both of those unit tests can be implemented in a way that does not 
require waiting around for internal machinery timers to expire.


> Newly retrieved security Tokens are sent as part of each heartbeat to each 
> node from RM which is not desirable in large cluster
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6523
>                 URL: https://issues.apache.org/jira/browse/YARN-6523
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: RM
>    Affects Versions: 2.8.0, 2.7.3
>            Reporter: Naganarasimha G R
>            Assignee: Manikandan R
>            Priority: Major
>         Attachments: YARN-6523.001.patch, YARN-6523.002.patch, 
> YARN-6523.003.patch, YARN-6523.004.patch, YARN-6523.005.patch, 
> YARN-6523.006.patch
>
>
> Currently as part of heartbeat response RM sets all application's tokens 
> though all applications might not be active on the node. On top of it 
> NodeHeartbeatResponsePBImpl converts tokens for each app into 
> SystemCredentialsForAppsProto. Hence for each node and each heartbeat too 
> many SystemCredentialsForAppsProto objects were getting created.
> We hit a OOM while testing for 2000 concurrent apps on 500 nodes cluster with 
> 8GB RAM configured for RM



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to