Wangda Tan created YARN-6136:
--------------------------------
Summary: Registry should avoid scanning whole ZK tree for every
container/application finish
Key: YARN-6136
URL: https://issues.apache.org/jira/browse/YARN-6136
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Critical
In the existing registry service implementation, the purge operation is triggered by the container finish event:
{code}
public void onContainerFinished(ContainerId id) throws IOException {
LOG.info("Container {} finished, purging container-level records",
id);
purgeRecordsAsync("/",
id.toString(),
PersistencePolicies.CONTAINER);
}
{code}
Since this happens on every container finish, it essentially scans all (or
almost all) ZK nodes from the root.
We have a cluster which has hundreds of ZK nodes for the service registry, and
20K+ ZK nodes for other purposes. The existing implementation can generate
massive numbers of ZK operations and internal Java objects
(RegistryPathStatus instances) as well. The RM becomes very unstable when
there are batches of container finish events, because of full GC pauses and
ZK connection failures.
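One way to avoid the full-tree scan would be to anchor the purge at the subtree that can actually hold container-level records, instead of passing {{"/"}} as the purge root. Below is a minimal, illustrative sketch (not the actual Hadoop registry API): the helper name and the assumed path layout ({{/users/<user>/services/<serviceclass>/<instance>/components}}) are hypothetical, standing in for whatever per-service path the registry already tracks for the finishing container.

```java
/**
 * Sketch only: builds a narrowed purge root for container records,
 * assuming they live under a per-service "components" subtree.
 * The layout and helper are illustrative, not the real registry API.
 */
public class ScopedPurgeSketch {

  /**
   * Hypothetical helper: the subtree that can contain container-level
   * records for one service instance. A purge for a finished container
   * would only need to scan this subtree, not the whole ZK tree.
   */
  static String componentsPath(String user, String serviceClass,
      String instance) {
    return String.format("/users/%s/services/%s/%s/components",
        user, serviceClass, instance);
  }

  public static void main(String[] args) {
    // Instead of purgeRecordsAsync("/", id.toString(), ...), the purge
    // would start from the narrowed root, e.g.:
    String purgeRoot = componentsPath("hbase", "org-apache-hbase", "hbase1");
    System.out.println(purgeRoot);
    // -> /users/hbase/services/org-apache-hbase/hbase1/components
  }
}
```

The point of the sketch is only that the purge root shrinks from the entire registry tree to one service's subtree, so the number of ZK reads per container finish is bounded by that service's record count rather than the cluster-wide node count.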
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]