[ https://issues.apache.org/jira/browse/YARN-8010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wangda Tan updated YARN-8010:
-----------------------------
    Fix Version/s:     (was: 3.1.1)

Doing 3.1.0 RC1 now, moved all 3.1.1 (branch-3.1) fixes to 3.1.0 (branch-3.1.0)

> Add config in FederationRMFailoverProxy to not bypass facade cache when failing over
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-8010
>                 URL: https://issues.apache.org/jira/browse/YARN-8010
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Minor
>             Fix For: 3.1.0, 2.10.0, 2.9.1
>
>         Attachments: YARN-8010.v1.patch, YARN-8010.v1.patch,
>                      YARN-8010.v2.patch, YARN-8010.v3.patch
>
>
> Today, when YarnRM is failing over, the FederationRMFailoverProxy running in
> AMRMProxy performs failover, tries to get the latest subcluster info from
> FederationStateStore, and then retries connecting to the latest YarnRM master.
> When calling getSubCluster() on FederationStateStoreFacade, it bypasses the
> cache with a flush flag. When YarnRM is failing over, every AM heartbeat thread
> creates a different thread inside FederationInterceptor, each of which keeps
> performing failover several times. This leads to a big spike of getSubCluster
> calls to FederationStateStore.
>
> Depending on the cluster setup (e.g. putting a VIP before all YarnRMs), a
> YarnRM master/slave change might not result in an RM address change. In other
> cases, a small delay in getting the latest subcluster information may be
> acceptable. This patch thus creates a config option, so that it is possible
> to ask the FederationRMFailoverProxy to not flush the cache when calling
> getSubCluster().
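
The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patches: the class names CachedSubClusterFacade and FailoverProxySketch, the flushCacheOnFailover flag, and the use of a plain Function as a stand-in for FederationStateStore are all hypothetical, and the actual configuration property name is not given in this message. The sketch only shows why passing a flush flag on every failover attempt turns each attempt into a state-store read, and how a boolean option lets the cache absorb those reads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for a caching facade in front of the state store. */
final class CachedSubClusterFacade {
  // Hypothetical state-store lookup: subcluster id -> RM address.
  private final Function<String, String> stateStore;
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  CachedSubClusterFacade(Function<String, String> stateStore) {
    this.stateStore = stateStore;
  }

  /** With flushCache=true the cache is bypassed and the store is always read. */
  String getSubCluster(String subClusterId, boolean flushCache) {
    if (flushCache) {
      String fresh = stateStore.apply(subClusterId);
      cache.put(subClusterId, fresh);
      return fresh;
    }
    // Otherwise the store is read only on a cache miss.
    return cache.computeIfAbsent(subClusterId, stateStore);
  }
}

/** Hypothetical failover proxy whose flush behaviour is set by a config flag. */
final class FailoverProxySketch {
  private final CachedSubClusterFacade facade;
  // Assumed to be read once from configuration; the name is illustrative.
  private final boolean flushCacheOnFailover;

  FailoverProxySketch(CachedSubClusterFacade facade, boolean flushCacheOnFailover) {
    this.facade = facade;
    this.flushCacheOnFailover = flushCacheOnFailover;
  }

  /** Each failover attempt resolves the RM address of the subcluster again. */
  String performFailover(String subClusterId) {
    return facade.getSubCluster(subClusterId, flushCacheOnFailover);
  }
}
```

In this sketch, N failover attempts with the flag set to true cause N state-store reads, while with the flag set to false only the first attempt reaches the store. The trade-off is the one the description names: a cached entry may be briefly stale, which is harmless when the RM address does not change on failover (e.g. behind a VIP).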