[
https://issues.apache.org/jira/browse/YARN-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685353#comment-16685353
]
Jason Lowe commented on YARN-6523:
----------------------------------
Thanks for updating the patch! The unit test failure is unrelated and tracked
by YARN-8937.
Changes are looking much better, although it would be good to implement one
suggestion I made early on:
bq. we should precompute the SystemCredentialsForAppsProto once when the
credentials change and re-send the same object to any node that needs the
updated credentials rather than recreate the same object over and over and over.
The NodeHeartbeatResponseProto.Builder already has a
setSystemCredentialsForApps method that takes a SystemCredentialsForAppsProto,
but that would need to be exposed in the PBImpl wrapper so every time the
system credentials change we can precompute the protobuf object that will be
sent to every NM on their next heartbeat. It's pretty expensive in terms of
generated garbage to compute this, and it would be nice not to compute it on
the next heartbeat from every node in the cluster when the token sequence
number changes.
Nit: Checking for hasTokenSequenceNo and returning zero if not present is
redundant. getTokenSequenceNo already does this if the field is not present,
since 0 is the default value for longs in a protobuf (can be changed in the
protobuf definition if desired).
My concerns about the unit test duration have not been addressed. This single
unit test takes almost _two minutes_ to execute. The unit testing could be
done much faster if the tests were more like unit tests than integration tests.
One approach is to split the testing into two, separate unit tests for the two
prime units requiring testing as part of this feature. One test can verify the
DelegationTokenRenewer updates the sequence number when new delegation tokens
are requested, and another test can verify that ResourceTrackerService sends
system credentials in the heartbeat response only when the sequence numbers
mismatch. Both of those unit tests can be implemented in a way that does not
require waiting around for internal machinery timers to expire.
> Newly retrieved security Tokens are sent as part of each heartbeat to each
> node from RM which is not desirable in large cluster
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6523
> URL: https://issues.apache.org/jira/browse/YARN-6523
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: RM
> Affects Versions: 2.8.0, 2.7.3
> Reporter: Naganarasimha G R
> Assignee: Manikandan R
> Priority: Major
> Attachments: YARN-6523.001.patch, YARN-6523.002.patch,
> YARN-6523.003.patch, YARN-6523.004.patch, YARN-6523.005.patch,
> YARN-6523.006.patch
>
>
> Currently as part of heartbeat response RM sets all application's tokens
> though all applications might not be active on the node. On top of it
> NodeHeartbeatResponsePBImpl converts tokens for each app into
> SystemCredentialsForAppsProto. Hence for each node and each heartbeat too
> many SystemCredentialsForAppsProto objects were getting created.
> We hit a OOM while testing for 2000 concurrent apps on 500 nodes cluster with
> 8GB RAM configured for RM
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]