[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list

2018-10-12 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648268#comment-16648268
 ] 

Andrew Purtell edited comment on HBASE-21266 at 10/12/18 6:24 PM:
--

Updated patch with suggestion from [~stack]*, also fixed something dumb I did 
with logging

* - Logging only, no change to API. Ok?


was (Author: apurtell):
Updated patch with suggestion from [~stack], also fixed something dumb I did 
with logging

> Not running balancer because processing dead regionservers, but empty dead rs 
> list
> --
>
> Key: HBASE-21266
> URL: https://issues.apache.org/jira/browse/HBASE-21266
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.4.8
>Reporter: Andrew Purtell
>Assignee: Andrew Purtell
>Priority: Major
> Fix For: 1.5.0, 1.4.9
>
> Attachments: HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch, 
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch, 
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch, 
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch, 
> HBASE-21266-branch-1.patch, HBASE-21266-branch-1.patch
>
>
> Found during ITBLL testing. AM in master gets into a state where manual 
> attempts from the shell to run the balancer always return false and this is 
> printed in the master log:
> 2018-10-03 19:17:14,892 DEBUG 
> [RpcServer.default.FPBQ.Fifo.handler=21,queue=0,port=8100] master.HMaster: 
> Not running balancer because processing dead regionserver(s): 
> Note the empty list. 
> This errant state did not recover without intervention by way of master 
> restart, but the test environment was chaotic so needs investigation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list

2018-10-11 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16647107#comment-16647107
 ] 

Andrew Purtell edited comment on HBASE-21266 at 10/11/18 10:22 PM:
---

TestZKLessSplitOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit does 
hand-rolled waits with 10 ms sleeps. Rewrote those to use Waiter#waitFor with 
the same timeout and period values of other uses of Waiter#waitFor in this 
unit. 

TestEndToEndSplitTransaction.testFromClientSideWhileSplitting utilizes a chore 
named RegionChecker, also with a 10 ms interval; I increased this to 100 ms. 
This isn't strictly necessary, beyond the fact that sleep(10) is obnoxious: it 
might as well be a yield() or a spin-wait. There are three instances of 
sleep(10) in this unit, all changed to sleep(100), which IMHO is the smallest 
reasonable value to use for short waits.
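
For readers who have not seen the pattern, here is a minimal sketch of what 
replacing a hand-rolled sleep loop with a bounded polling wait looks like. 
This is not the HBase Waiter API: PollingWait and its waitFor signature are 
hypothetical and only illustrate the idea of one timeout plus one polling 
period instead of ad hoc sleep(10) calls.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not the HBase Waiter class. */
final class PollingWait {
  private PollingWait() {}

  /**
   * Polls the condition every periodMs until it holds or timeoutMs elapses.
   * Returns true if the condition became true, false on timeout or interrupt.
   */
  static boolean waitFor(long timeoutMs, long periodMs, BooleanSupplier condition) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      long remainingNs = deadline - System.nanoTime();
      if (remainingNs <= 0) {
        return false;
      }
      try {
        // Never sleep past the deadline, and never sleep for less than 1 ms.
        Thread.sleep(Math.min(periodMs, Math.max(1L, remainingNs / 1_000_000L)));
      } catch (InterruptedException e) {
        // Restore the interrupt flag and give up waiting.
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

A test would then call something like PollingWait.waitFor(timeout, 100, 
() -> someCondition()) and assert on the boolean result, instead of looping 
on sleep(10).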


was (Author: apurtell):
TestZKLessSplitOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit does 
hand-rolled waits with 10 ms sleeps. Rewrote those to use Waiter#waitFor with 
the same timeout and period values of other uses of Waiter#waitFor in this 
unit. 

TestEndToEndSplitTransaction.testFromClientSideWhileSplitting utilizes a chore 
named RegionChecker also with a 10 ms interval, increasing this to 100. This 
isn't necessary beyond the fact that sleep(10) is obnoxious. Might as well just 
be a yield() or a spin-wait. 



[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list

2018-10-11 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16647007#comment-16647007
 ] 

Andrew Purtell edited comment on HBASE-21266 at 10/11/18 8:44 PM:
--

Those test failures in precommit might be flakes, let me see if I can reproduce 
them. 

I ran split, merge, assignment, and balancer tests, including the tests in 
question, and am not seeing any issues. 

{noformat}
[INFO] Running org.apache.hadoop.hbase.util.TestMergeTable
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 26.85 s - in org.apache.hadoop.hbase.util.TestMergeTable
[INFO] Running org.apache.hadoop.hbase.util.TestMergeTool
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.995 s - in org.apache.hadoop.hbase.util.TestMergeTool
[INFO] Running org.apache.hadoop.hbase.util.TestRegionSplitter
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 16.962 s - in org.apache.hadoop.hbase.util.TestRegionSplitter
[INFO] Running org.apache.hadoop.hbase.util.TestRegionSplitCalculator
[INFO] Tests run: 15, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.968 s - in org.apache.hadoop.hbase.util.TestRegionSplitCalculator
[INFO] Running org.apache.hadoop.hbase.wal.TestWALSplit
[INFO] Tests run: 33, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.067 s - in org.apache.hadoop.hbase.wal.TestWALSplit
[INFO] Running org.apache.hadoop.hbase.wal.TestWALSplitBoundedLogWriterCreation
[WARNING] Tests run: 33, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 43.419 s - in org.apache.hadoop.hbase.wal.TestWALSplitBoundedLogWriterCreation
[INFO] Running org.apache.hadoop.hbase.wal.TestWALSplitCompressed
[INFO] Tests run: 33, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 46.075 s - in org.apache.hadoop.hbase.wal.TestWALSplitCompressed
[INFO] Running org.apache.hadoop.hbase.mapred.TestSplitTable
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.774 s - in org.apache.hadoop.hbase.mapred.TestSplitTable
[INFO] Running org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.836 s - in org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy
[INFO] Running org.apache.hadoop.hbase.regionserver.TestCompactSplitThread
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.692 s - in org.apache.hadoop.hbase.regionserver.TestCompactSplitThread
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 87.588 s - in org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 13.796 s - in org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitTransaction
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.14 s - in org.apache.hadoop.hbase.regionserver.TestSplitTransaction
[INFO] Running org.apache.hadoop.hbase.regionserver.TestRegionMergeTransaction
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 9.715 s - in org.apache.hadoop.hbase.regionserver.TestRegionMergeTransaction
[INFO] Running org.apache.hadoop.hbase.regionserver.TestZKLessSplitOnCluster
[INFO] Tests run: 17, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 71.721 s - in org.apache.hadoop.hbase.regionserver.TestZKLessSplitOnCluster
[INFO] Running org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 59.082 s - in org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction
[INFO] Running org.apache.hadoop.hbase.regionserver.TestSplitLogWorker
[INFO] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 24.689 s - in org.apache.hadoop.hbase.regionserver.TestSplitLogWorker
[INFO] Running org.apache.hadoop.hbase.regionserver.TestZKLessMergeOnCluster
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 28.068 s - in org.apache.hadoop.hbase.regionserver.TestZKLessMergeOnCluster
[INFO] Running org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 35.386 s - in org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster
[INFO] Running org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
[INFO] Tests run: 23, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 185.692 s - in org.apache.hadoop.hbase.master.TestAssignmentManagerOnCluster
[INFO] Running org.apache.hadoop.hbase.master.TestDistributedLogSplitting
[WARNING] Tests run: 18, Failures: 0, Errors: 0, Skipped: 15, Time elapsed: 90.971 s - in 

[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list

2018-10-09 Thread Xu Cang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644226#comment-16644226
 ] 

Xu Cang edited comment on HBASE-21266 at 10/10/18 2:51 AM:
---

{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need 
the {{AtomicInteger}}.
{quote}
 -But ++ / - - is not thread-safe for an integer. It's still possible to get 
caught by a race condition, IMO.-

-e.g.- 

-Two threads call #add and #finish respectively at the same time, and the 
synchronized keyword does not help in this case.-

 

Edited.
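
As a small illustration of why the quoted statement holds: ++ and - - on a 
plain int are not atomic on their own, but when every access goes through a 
synchronized method on the same object, the monitor serializes them. The 
sketch below is hypothetical (SynchronizedCounterDemo is not HBase code); it 
only borrows the add/finish/numProcessing names from the discussion.

```java
/** Hypothetical demo class; only the member names echo DeadServer. */
final class SynchronizedCounterDemo {
  // Plain int, guarded by "this": every access is in a synchronized method.
  private int numProcessing = 0;

  synchronized void add() {
    numProcessing++;
  }

  synchronized void finish() {
    numProcessing--;
  }

  synchronized int get() {
    return numProcessing;
  }

  /** Runs one adder and one finisher thread concurrently; returns the final count. */
  static int run(int iterations) {
    SynchronizedCounterDemo counter = new SynchronizedCounterDemo();
    Thread adder = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        counter.add();
      }
    });
    Thread finisher = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        counter.finish();
      }
    });
    adder.start();
    finisher.start();
    try {
      adder.join();
      finisher.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for demo threads", e);
    }
    return counter.get();
  }
}
```

Because both threads perform the same number of increments and decrements 
under the same lock, the final count is 0 regardless of interleaving. Without 
synchronized (and without AtomicInteger) that would not be guaranteed.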


was (Author: xucang):
{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need 
the {{AtomicInteger}}.
{quote}
 But ++ / - - is not thread-safe for an integer. It's still possible to get 
caught by a race condition, IMO.

e.g. 

Two threads call #add and #finish respectively at the same time, and the 
synchronized keyword does not help in this case.



[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list

2018-10-09 Thread Xu Cang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644226#comment-16644226
 ] 

Xu Cang edited comment on HBASE-21266 at 10/10/18 12:18 AM:


{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need 
the {{AtomicInteger}}.
{quote}
 But ++ / - - is not thread-safe for an integer. It's still possible to get 
caught by a race condition, IMO.

e.g. 

Two threads call #add and #finish respectively at the same time, and the 
synchronized keyword does not help in this case.


was (Author: xucang):
{quote}If all access to {{numProcessing}} is {{synchronized}}, we don't need 
the {{AtomicInteger}}.
{quote}
 But ++ / - - is not thread-safe for an integer. It's still possible to get 
caught by a race condition, IMO.



[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list

2018-10-09 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643838#comment-16643838
 ] 

Andrew Purtell edited comment on HBASE-21266 at 10/9/18 9:39 PM:
-

I plan to commit this today unless there is an objection.

I have run a couple of ITBLL workloads with stressAM policy with a shell 
invoking the balancer every minute. No issues with dead server processing 
observed. The earlier observed problem does not reproduce. 


was (Author: apurtell):
I plan to commit this today unless there is an objection.

I have run a couple of ITBLL workloads with serverKilling and stressAM policy 
with a shell invoking the balancer every minute. No issues with dead server 
processing observed. The earlier observed problem does not reproduce. This 
isn't a positive test for that change, though, because I think it was a race 
condition, but it is a negative test in the sense that it is very unlikely we 
broke the AM with this change. All unit tests pass. 



[jira] [Comment Edited] (HBASE-21266) Not running balancer because processing dead regionservers, but empty dead rs list

2018-10-04 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638611#comment-16638611
 ] 

Andrew Purtell edited comment on HBASE-21266 at 10/4/18 5:48 PM:
-

bq.  "Number of dead servers in processing should always be non-negative"

You are looking at that assert in DeadServer#finish, right? Those aren't 
evaluated unless the JVM is started with the -ea command line flag, which I 
didn't do. 

The log line I saw shows that the dead server map was empty at the time, so I 
agree we should look at the accounting in DeadServer.java.

"Not running balancer because processing dead regionserver(s)" is printed from 
HMaster.java:1846 based on the result from 
ServerManager#areDeadServersInProgress, which passes through the result from 
DeadServer#areDeadServersInProgress, which is simply

{code}
  public synchronized boolean areDeadServersInProgress() { return processing; }
{code}

This boolean is cleared in DeadServer#finish when
{code}
if (numProcessing == 0) { processing = false; }
{code}

So the first question I have is why do we even need this boolean field? It can 
be derived cheaply from other state. In areDeadServersInProgress, just return 
the result of {{!(numProcessing == 0)}}. 

That assert you observed should be replaced by use of Preconditions so we will 
get a RuntimeException that will get noticed. 
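
To make the proposal concrete, here is a rough sketch of the shape I have in 
mind. This is not the patch: DeadServerCounter is a hypothetical stand-in for 
the accounting in DeadServer, and it uses a plain IllegalStateException where 
the real change would use Preconditions, so that the example has no 
dependencies.

```java
/** Hypothetical stand-in for the accounting in DeadServer; names are illustrative. */
final class DeadServerCounter {
  // Guarded by "this": every access goes through a synchronized method.
  private int numProcessing = 0;

  /** Called when processing of a dead server starts. */
  synchronized void add() {
    numProcessing++;
  }

  /** Called when processing of a dead server completes. */
  synchronized void finish() {
    // Unlike an assert, this check runs whether or not the JVM was started with -ea.
    if (numProcessing <= 0) {
      throw new IllegalStateException(
          "finish() called with no dead servers in processing");
    }
    numProcessing--;
  }

  /** Derived from the counter, so there is no separate boolean to fall out of sync. */
  synchronized boolean areDeadServersInProgress() {
    return numProcessing != 0;
  }
}
```

With this shape there is only one piece of state, so the "in progress" answer 
cannot disagree with the count, and an unbalanced finish() fails loudly 
instead of silently driving the count negative.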


was (Author: apurtell):
bq.  "Number of dead servers in processing should always be non-negative"

You are looking at that assert in DeadServer#finish, right? Those aren't 
evaluated unless the JVM is started with the -ea command line flag, which I 
didn't do. 

The log line I saw shows that the dead server map was empty at the time, so I 
agree we should look at the accounting in DeadServer.java.

"Not running balancer because processing dead regionserver(s)" is printed from 
HMaster.java:1846 based on the result from 
ServerManager#areDeadServersInProgress, which passes through the result from 
DeadServer#areDeadServersInProgress, which is simply

{code}
  public synchronized boolean areDeadServersInProgress() { return processing; }
{code}

This boolean is cleared in DeadServer#finish when
{code}
if (numProcessing == 0) { processing = false; }
{code}

So the first question I have is why do we even need this boolean field? It can 
easily be derived cheaply from other state. In areDeadServersInProgress just 
return the result of {{numProcessing == 0}}. 

That assert you observed should be replaced by use of Preconditions so we will 
get a RuntimeException that will get noticed. 
