Michael Stack created HBASE-25353:
-------------------------------------

             Summary: [Flakey Tests] branch-2 TestShutdownBackupMaster
                 Key: HBASE-25353
                 URL: https://issues.apache.org/jira/browse/HBASE-25353
             Project: HBase
          Issue Type: Sub-task
          Components: flakies
    Affects Versions: 2.4.0
            Reporter: Michael Stack
            Assignee: Michael Stack
             Fix For: 2.4.1


Making this a sub-issue of the parent issue, which fails in a way similar to 
how we are failing here.

Currently, I see that the TestShutdownBackupMaster test usually passes but it 
is warped in how it completes. It will do all its retries just before the test 
times out at 13 minutes max; e.g. you'll see this:

2020-12-02 22:07:34,200 DEBUG [master/stack:0:becomeActiveMaster] 
client.ConnectionImplementation(1009): locateRegionInMeta 
parentTable='hbase:meta', attempt=44 of 46 failed; retrying after sleep of 46

... so we'll do all the retries and then complete. The test looks like it 
'succeeded', but it actually ran for Total time: 12:41 min, and the log is 
full of thread dumps because the cluster won't go down (the time is spent in 
the test shutdown).

Often though, we won't complete the retries in time and the test fails. It is 
in the flakey list.

Rather, we are supposed to fail out fast when we are shutting down. Below is 
the type of retry we see.

 
{code:java}
2020-12-02 10:53:35,540 INFO [Listener at localhost/61609] util.JVMClusterUtil(348): Shutdown of 2 master(s) and 2 regionserver(s) complete
2020-12-02 10:53:35,548 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=2 of 46 failed; retrying after sleep of 46
org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1afa7f5b closed
 at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:630)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:815)
 at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:803)
 at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.relocateRegion(ConnectionUtils.java:138)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:933)
 at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:823)
 at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
 at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:64)
 at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:70)
 at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:59)
 at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
 at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:383)
 at org.apache.hadoop.hbase.client.HTable.get(HTable.java:357)
 at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
 at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
 at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
 at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
 at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
 at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1224)
 at org.apache.hadoop.hbase.master.TestShutdownBackupMaster$MockHMaster.initClusterSchemaService(TestShutdownBackupMaster.java:68)
 at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1021)
 at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2082)
 at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:506)
{code}
See how a master is trying to become active and it won't relent trying to 
become active master even though this cluster is shutting down? See how we 
retry but the check for close of the connection is coming back with a 
DoNotRetryIOException? The exception is being swallowed. We keep going.

Fix looks simple enough.
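
To illustrate the fail-fast behavior we want (this is only a sketch, not the actual patch and not HBase code): a retry loop should rethrow a DoNotRetryIOException immediately instead of treating it like any other failed attempt. Everything below is hypothetical and self-contained; FailFastRetry, callWithRetries and the local DoNotRetryIOException class are stand-ins made up for the example, and the 46/46 values are just borrowed from the log line above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class FailFastRetry {

  /** Local stand-in for org.apache.hadoop.hbase.DoNotRetryIOException (hypothetical copy). */
  static class DoNotRetryIOException extends IOException {
    DoNotRetryIOException(String msg) {
      super(msg);
    }
  }

  /**
   * Hypothetical retry helper: ordinary IOExceptions are retried up to
   * maxAttempts times, but a DoNotRetryIOException is rethrown at once.
   */
  static <T> T callWithRetries(Callable<T> call, int maxAttempts, long sleepMs)
      throws Exception {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (DoNotRetryIOException e) {
        // Fail out fast: do not swallow this and keep looping.
        throw e;
      } catch (IOException e) {
        // Retriable failure: remember it and sleep before the next attempt.
        last = e;
        Thread.sleep(sleepMs);
      }
    }
    throw last != null ? last : new IOException("no attempts made");
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger attempts = new AtomicInteger();
    try {
      // Simulate a closed connection: every call reports "do not retry".
      callWithRetries(() -> {
        attempts.incrementAndGet();
        throw new DoNotRetryIOException("hconnection closed");
      }, 46, 46);
    } catch (DoNotRetryIOException e) {
      System.out.println("gave up after " + attempts.get() + " attempt(s): "
          + e.getMessage());
    }
  }
}
```

With the DoNotRetryIOException branch in place the simulated call is made once and the caller sees the exception straight away; without it, the loop would burn through all 46 attempts the way the master does in the log above.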

 

 

 


