[ https://issues.apache.org/jira/browse/YARN-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703485#comment-16703485 ]

Jason Lowe commented on YARN-6523:
----------------------------------
Thanks for updating the patch!

NodeHeartbeatResponse should not take a map of AppId to
SystemCredentialsForAppsProto. It can take a map of AppId to ByteBuffer and/or
a collection of SystemCredentialsForAppsProto. If it gets a map of AppId to
the creds proto, it just ignores the AppId keys, so the map is overkill.

NodeHeartbeatResponsePBImpl has two fields that represent the same thing:
systemCredentials and systemCredentialsForAppsProto. It should only have one,
otherwise they are bound to conflict and cause bugs. Since we want to cache
the raw protobufs across multiple heartbeat responses, IMHO the interface
should have a getter/setter for a Collection of SystemCredentialsForAppsProto.
ProtoUtils or some other utility class can be used to convert a
Map<ApplicationId,ByteBuffer> to/from this collection outside of the PBImpl.

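
To illustrate the kind of conversion helper meant here, a minimal sketch
follows. It is not code from the patch: AppCreds is a hypothetical stand-in
for SystemCredentialsForAppsProto, String stands in for ApplicationId, and the
class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a map <-> collection converter living outside the PBImpl. */
final class CredsConversionSketch {

  /** Hypothetical stand-in for SystemCredentialsForAppsProto. */
  static final class AppCreds {
    final String appId;   // stands in for ApplicationId
    final byte[] tokens;  // serialized credentials for that app

    AppCreds(String appId, byte[] tokens) {
      this.appId = appId;
      this.tokens = tokens;
    }
  }

  /** Builds the collection form once so it can be cached and reused. */
  static List<AppCreds> toCollection(Map<String, ByteBuffer> map) {
    List<AppCreds> out = new ArrayList<>(map.size());
    for (Map.Entry<String, ByteBuffer> e : map.entrySet()) {
      // duplicate() so reading does not move the caller's buffer position
      ByteBuffer dup = e.getValue().duplicate();
      byte[] bytes = new byte[dup.remaining()];
      dup.get(bytes);
      out.add(new AppCreds(e.getKey(), bytes));
    }
    return out;
  }

  /** Converts the collection form back to a map keyed by app id. */
  static Map<String, ByteBuffer> toMap(Collection<AppCreds> creds) {
    Map<String, ByteBuffer> out = new LinkedHashMap<>();
    for (AppCreds c : creds) {
      out.put(c.appId, ByteBuffer.wrap(c.tokens));
    }
    return out;
  }
}
```

With something like this, the PBImpl would only need to hold the collection,
and callers that want the map view would convert on demand.
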
NodeHeartbeatResponsePBImpl should call clearSystemCredentialsForApps on the
builder before returning early or populating it, so that the setter has the
semantics of a set rather than an append. It should also use
addAllSystemCredentialsForApps on the builder rather than iterating over the
entries itself. That allows the underlying protobuf to do the add more
efficiently, since it can know up front how many are being added.

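
A rough sketch of the clear-then-addAll pattern is below. The Builder class
here is a hypothetical stand-in that only mimics the relevant methods of the
generated protobuf builder, and it stores plain strings instead of
SystemCredentialsForAppsProto, so treat it as an illustration rather than the
actual PBImpl code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class SetSemanticsSketch {

  /** Hypothetical stand-in for the generated builder's repeated field. */
  static final class Builder {
    private final List<String> creds = new ArrayList<>();

    Builder clearSystemCredentialsForApps() {
      creds.clear();
      return this;
    }

    Builder addAllSystemCredentialsForApps(Collection<String> values) {
      creds.addAll(values);  // one bulk add; the size is known up front
      return this;
    }

    List<String> getSystemCredentialsForAppsList() {
      return new ArrayList<>(creds);
    }
  }

  /** Setter with "set" semantics: old contents never survive a call. */
  static void setSystemCredentials(Builder builder, Collection<String> creds) {
    // Clear first, before any early return, so stale entries are dropped.
    builder.clearSystemCredentialsForApps();
    if (creds == null || creds.isEmpty()) {
      return;
    }
    builder.addAllSystemCredentialsForApps(creds);
  }
}
```

Calling the setter twice leaves only the second collection in the builder,
and calling it with null or an empty collection leaves the field empty.
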
I'd really rather see two separate tests that involve no sleeping/waiting
than a single test that does. The risk of using these arbitrary time
intervals is that things can fall apart if the test runs on a slow VM that
gets paused for a few seconds due to load and/or a long GC. If that's somehow
not going to be an issue, then lower all the timeouts to be as small as
possible (e.g., tokens expiring in 1 second and waitFor polling every 10 msec
rather than every 2 seconds). It will run even faster, but I suspect this,
along with the original test, will ultimately be a flaky test in practice.

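
One common way to avoid sleeping in this kind of test is to drive expiry from
a clock the test controls. The sketch below only illustrates that idea and is
an assumption on my part, not how the patch or its tests are structured;
Token and ManualClock are hypothetical names.

```java
import java.util.function.LongSupplier;

final class NoSleepExpirySketch {

  /** Hypothetical token that checks expiry against an injected clock. */
  static final class Token {
    final long expiryMillis;

    Token(long expiryMillis) {
      this.expiryMillis = expiryMillis;
    }

    boolean isExpired(LongSupplier clock) {
      return clock.getAsLong() >= expiryMillis;
    }
  }

  /** Clock that only moves when the test advances it. */
  static final class ManualClock implements LongSupplier {
    private long now;

    ManualClock(long startMillis) {
      this.now = startMillis;
    }

    void advance(long millis) {
      now += millis;
    }

    @Override
    public long getAsLong() {
      return now;
    }
  }
}
```

A test can then create a token, assert it is live, advance the clock past the
expiry, and assert it has expired, with no dependence on wall-clock time or
GC pauses.
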
> Newly retrieved security Tokens are sent as part of each heartbeat to each
> node from RM which is not desirable in large cluster
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-6523
> URL: https://issues.apache.org/jira/browse/YARN-6523
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: RM
> Affects Versions: 2.8.0, 2.7.3
> Reporter: Naganarasimha G R
> Assignee: Manikandan R
> Priority: Major
> Attachments: YARN-6523.001.patch, YARN-6523.002.patch,
> YARN-6523.003.patch, YARN-6523.004.patch, YARN-6523.005.patch,
> YARN-6523.006.patch, YARN-6523.007.patch, YARN-6523.008.patch
>
>
> Currently, as part of the heartbeat response, the RM sets all applications'
> tokens even though not all applications may be active on the node. On top of
> that, NodeHeartbeatResponsePBImpl converts the tokens for each app into
> SystemCredentialsForAppsProto. Hence, for each node and each heartbeat, too
> many SystemCredentialsForAppsProto objects were getting created.
> We hit an OOM while testing 2000 concurrent apps on a 500-node cluster with
> 8GB RAM configured for the RM.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)