[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list
[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648268#comment-16648268 ]

Andrew Purtell edited comment on HBASE-21266 at 10/12/18 6:24 PM:
--
Updated patch with suggestion from [~stack]*, also fixed something dumb I did with logging

* - Logging only, no change to API. Ok?

was (Author: apurtell):
Updated patch with suggestion from [~stack], also fixed something dumb I did with logging

> Not running balancer because processing dead regionservers, but empty dead rs list
> --
>
> Key: HBASE-21266
> URL: https://issues.apache.org/jira/browse/HBASE-21266
> Project: HBase
> Issue Type: Bug
> Affects Versions: 1.4.8
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Priority: Major
> Fix For: 1.5.0, 1.4.9
>
> Attachments: HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch,
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch,
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch,
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch,
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch
>
> Found during ITBLL testing. AM in master gets into a state where manual
> attempts from the shell to run the balancer always return false and this is
> printed in the master log:
> 2018-10-03 19:17:14,892 DEBUG [RpcServer.default.FPBQ.Fifo.handler=21,queue=0,port=8100] master.HMaster: Not running balancer because processing dead regionserver(s):
> Note the empty list.
> This errant state did not recover without intervention by way of master
> restart, but the test environment was chaotic so needs investigation.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list
[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16647107#comment-16647107 ]

Andrew Purtell edited comment on HBASE-21266 at 10/11/18 10:22 PM:
---
TestZKLessSplitOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit does hand-rolled waits with 10 ms sleeps. Rewrote those to use Waiter#waitFor with the same timeout and period values as the other uses of Waiter#waitFor in this unit.

TestEndToEndSplitTransaction.testFromClientSideWhileSplitting utilizes a chore named RegionChecker, also with a 10 ms interval; increasing this to 100. This isn't necessary beyond the fact that sleep(10) is obnoxious. Might as well just be a yield() or a spin-wait.

There are three instances of sleep(10) in this unit, changed to sleep(100), which IMHO is the smallest reasonable value you want to use if doing short waits.

was (Author: apurtell):
TestZKLessSplitOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit does hand-rolled waits with 10 ms sleeps. Rewrote those to use Waiter#waitFor with the same timeout and period values as the other uses of Waiter#waitFor in this unit.

TestEndToEndSplitTransaction.testFromClientSideWhileSplitting utilizes a chore named RegionChecker, also with a 10 ms interval; increasing this to 100. This isn't necessary beyond the fact that sleep(10) is obnoxious. Might as well just be a yield() or a spin-wait.
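For readers unfamiliar with the pattern, the difference between a hand-rolled sleep loop and a waitFor-style helper can be sketched roughly as below. This is only an illustration and not HBase's Waiter utility: the class name PollingWait, its waitFor signature, and the timeout and period values are all hypothetical, chosen for the example.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not HBase's Waiter: polls a condition until it holds or a timeout expires. */
final class PollingWait {
  private PollingWait() {}

  /**
   * Checks the condition roughly every periodMs milliseconds for at most timeoutMs milliseconds.
   * Returns true if the condition became true, false on timeout or interruption.
   */
  static boolean waitFor(long timeoutMs, long periodMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false;
      }
      // Sleep for one period, but never past the deadline.
      long sleepMs = Math.min(periodMs, Math.max(1L, remainingNanos / 1_000_000L));
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException e) {
        // Preserve the interrupt status for the caller and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A test would then call something like PollingWait.waitFor(30000, 100, () -> someCondition()) instead of looping on sleep(10), which keeps the timeout and the polling period explicit in one place.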
[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list
[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16647007#comment-16647007 ]

Andrew Purtell edited comment on HBASE-21266 at 10/11/18 8:44 PM:
--
Those test failures in precommit might be flakes, let me see if I can reproduce them.

I ran split, merge, assignment, and balancer tests, including the tests in question, and am not seeing any issues.

{noformat}
[INFO] Running org.apache.hadoop.hbase.util.TestMergeTable
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 26.85 s - in org.apache.hadoop.hbase.util.TestMergeTable
[INFO] Running org.apache.hadoop.hbase.util.TestMergeTool
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.995 s - in org.apache.hadoop.hbase.util.TestMergeTool
[INFO] Running org.apache.hadoop.hbase.util.TestRegionSplitter
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 16.962 s - in org.apache.hadoop.hbase.util.TestRegionSplitter
[INFO] Running org.apache.hadoop.hbase.util.TestRegionSplitCalculator
[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.968 s - in org.apache.hadoop.hbase.util.TestRegionSplitCalculator
[INFO] Running org.apache.hadoop.hbase.wal.TestWALSplit
[INFO] Tests run: 33, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.067 s - in org.apache.hadoop.hbase.wal.TestWALSplit
[INFO] Running org.apache.hadoop.hbase.wal.TestWALSplitBoundedLogWriterCreation
[WARNING] Tests run: 33, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 43.419 s - in org.apache.hadoop.hbase.wal.TestWALSplitBoundedLogWriterCreation
[INFO] Running org.apache.hadoop.hbase.wal.TestWALSplitCompressed
[INFO] Tests run: 33, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.075 s - in org.apache.hadoop.hbase.wal.TestWALSplitCompressed
[INFO] Running org.apache.hadoop.hbase.mapred.TestSplitTable
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.774 s - in org.apache.hadoop.hbase.mapred.TestSplitTable
[INFO] Running org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.836 s - in org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactSplitThread
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.692 s - in org.apache.hadoop.hbase.regionserver.TestCompactSplitThread
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 87.588 s - in org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.796 s - in org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitTransaction
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.14 s - in org.apache.hadoop.hbase.regionserver.TestSplitTransaction
[INFO] Running org.apache.hadoop.hbase.regionserver.TestRegionMergeTransaction
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.715 s - in org.apache.hadoop.hbase.regionserver.TestRegionMergeTransaction
[INFO] Running org.apache.hadoop.hbase.regionserver.TestZKLessSplitOnCluster
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 71.721 s - in org.apache.hadoop.hbase.regionserver.TestZKLessSplitOnCluster
[INFO] Running org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 59.082 s - in org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitLogWorker
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 24.689 s - in org.apache.hadoop.hbase.regionserver.TestSplitLogWorker
[INFO] Running org.apache.hadoop.hbase.regionserver.TestZKLessMergeOnCluster
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 28.068 s - in org.apache.hadoop.hbase.regionserver.TestZKLessMergeOnCluster
[INFO] Running org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 35.386 s - in org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster
[INFO] Running org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
[INFO] Tests run: 23, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 185.692 s - in org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
[INFO] Running org.apache.hadoop.hbase.master.TestDistributedLogSplitting
[WARNING] Tests run: 18, Failures: 0, Errors: 0, Skipped: 15, Time elapsed: 90.971 s - in
[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list
[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644226#comment-16644226 ]

Xu Cang edited comment on HBASE-21266 at 10/10/18 2:51 AM:
---
{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need the {{AtomicInteger}}.
{quote}
-But ++ / - - is not thread-safe for an integer. It is still possible to get caught by a race condition, IMO.-

-e.g.-

-Two threads call #add and #finish respectively at the same time; the synchronized keyword does not help in this case.-

Edited.

was (Author: xucang):
{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need the {{AtomicInteger}}.
{quote}
But ++ / - - is not thread-safe for an integer. It is still possible to get caught by a race condition, IMO.

e.g.

Two threads call #add and #finish respectively at the same time; the synchronized keyword does not help in this case.
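To illustrate the quoted point (and why the objection was struck out): a plain int is safe to increment and decrement from several threads as long as every read and write happens while holding the same monitor, because the lock serializes each read-modify-write. The sketch below is a minimal demonstration under that assumption. ProcessingCounter, add, finish, get, and runConcurrently are hypothetical names for this example and are not taken from the HBase source.

```java
/** Hypothetical stand-in for the counter under discussion; not the actual DeadServer class. */
final class ProcessingCounter {
  // Plain int, guarded by "this": every access goes through a synchronized method.
  private int numProcessing = 0;

  synchronized void add() {
    numProcessing++;
  }

  synchronized void finish() {
    numProcessing--;
  }

  synchronized int get() {
    return numProcessing;
  }

  /** Runs add() and finish() concurrently, the same number of times each, and returns the final count. */
  static int runConcurrently(int iterations) {
    ProcessingCounter counter = new ProcessingCounter();
    Thread adder = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        counter.add();
      }
    });
    Thread finisher = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        counter.finish();
      }
    });
    adder.start();
    finisher.start();
    try {
      adder.join();
      finisher.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for worker threads", e);
    }
    return counter.get();
  }
}
```

Because both methods lock the same object, no increment or decrement is lost and the count returns to zero once both threads finish. Dropping synchronized from either method would reintroduce the lost-update race.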
[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list
[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644226#comment-16644226 ]

Xu Cang edited comment on HBASE-21266 at 10/10/18 12:18 AM:

{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need the {{AtomicInteger}}.
{quote}
But ++ / - - is not thread-safe for an integer. It is still possible to get caught by a race condition, IMO.

e.g.

Two threads call #add and #finish respectively at the same time; the synchronized keyword does not help in this case.

was (Author: xucang):
{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need the {{AtomicInteger}}.
{quote}
But ++ / - - is not thread-safe for an integer. It is still possible to get caught by a race condition, IMO.
[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list
[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643838#comment-16643838 ]

Andrew Purtell edited comment on HBASE-21266 at 10/9/18 9:39 PM:
-
I plan to commit this today unless there is an objection.

I have run a couple of ITBLL workloads with the stressAM policy, with a shell invoking the balancer every minute. No issues with dead server processing observed. The earlier observed problem does not reproduce.

was (Author: apurtell):
I plan to commit this today unless there is an objection.

I have run a couple of ITBLL workloads with the serverKilling and stressAM policies, with a shell invoking the balancer every minute. No issues with dead server processing observed. The earlier observed problem does not reproduce. This isn't a positive test for that change, though, because I think it was a race condition, but it is a negative test in the sense that it is very unlikely we broke the AM with this change.

All unit tests pass.
[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list
[ https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638611#comment-16638611 ]

Andrew Purtell edited comment on HBASE-21266 at 10/4/18 5:48 PM:
-
bq. "Number of dead servers in processing should always be non-negative"

You are looking at that assert in DeadServer#finish, right? Asserts aren't evaluated unless the JVM is started with the -ea command line flag, which I didn't do. The log line I did see shows that the dead server map was empty at the time, so I agree we should look at the accounting in DeadServer.java.

"Not running balancer because processing dead regionserver(s)" is printed from HMaster.java:1846 based on the result from ServerManager#areDeadServersInProgress, which passes through the result from DeadServer#areDeadServersInProgress, which is simply

{code}
public synchronized boolean areDeadServersInProgress() {
  return processing;
}
{code}

This boolean is cleared in DeadServer#finish when

{code}
if (numProcessing == 0) {
  processing = false;
}
{code}

So the first question I have is why do we even need this boolean field? It can be derived cheaply from other state. In areDeadServersInProgress just return the result of {{!(numProcessing == 0)}}.

That assert you observed should be replaced by use of Preconditions so we will get a RuntimeException that will get noticed.

was (Author: apurtell):
bq. "Number of dead servers in processing should always be non-negative"

You are looking at that assert in DeadServer#finish, right? Asserts aren't evaluated unless the JVM is started with the -ea command line flag, which I didn't do. The log line I did see shows that the dead server map was empty at the time, so I agree we should look at the accounting in DeadServer.java.

"Not running balancer because processing dead regionserver(s)" is printed from HMaster.java:1846 based on the result from ServerManager#areDeadServersInProgress, which passes through the result from DeadServer#areDeadServersInProgress, which is simply

{code}
public synchronized boolean areDeadServersInProgress() {
  return processing;
}
{code}

This boolean is cleared in DeadServer#finish when

{code}
if (numProcessing == 0) {
  processing = false;
}
{code}

So the first question I have is why do we even need this boolean field? It can be derived cheaply from other state. In areDeadServersInProgress just return the result of {{numProcessing == 0}}.

That assert you observed should be replaced by use of Preconditions so we will get a RuntimeException that will get noticed.
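The two suggestions in this comment (derive the in-progress answer from the counter, and fail loudly instead of relying on an assert) can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed patch and not the actual DeadServer class: DeadServerSketch and startProcessing are hypothetical names, and a plain IllegalStateException stands in for a Preconditions-style check so the example has no external dependencies.

```java
/** Hypothetical sketch of the accounting idea; this is not the HBase DeadServer class. */
final class DeadServerSketch {
  // Number of dead servers currently being processed; guarded by "this".
  private int numProcessing = 0;

  /** Hypothetical hook: called when processing of a dead server begins. */
  synchronized void startProcessing() {
    numProcessing++;
  }

  /** Called when processing of a dead server completes. */
  synchronized void finish() {
    // Stand-in for a Preconditions-style check: unlike an assert, this runs
    // whether or not the JVM was started with -ea.
    if (numProcessing <= 0) {
      throw new IllegalStateException(
          "Number of dead servers in processing should always be non-negative");
    }
    numProcessing--;
  }

  /** Derived from the counter, so there is no separate boolean to drift out of sync. */
  synchronized boolean areDeadServersInProgress() {
    return numProcessing != 0;
  }
}
```

With a single source of truth, an unbalanced call to finish() surfaces immediately as a RuntimeException rather than leaving a stale flag that blocks the balancer.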