[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685280#comment-16685280
 ] 

Hudson commented on HBASE-21463:


Results for branch master
[build #602 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/602/]: (x) *-1 overall*

details (if available):

(/) +1 general checks
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/602//General_Nightly_Build_Report/]

(x) -1 jdk8 hadoop2 checks
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/602//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) -1 jdk8 hadoop3 checks
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/602//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) +1 source release artifact
-- See build output for details.

(/) +1 client integration test


> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
> Issue Type: Sub-task
> Components: amv2
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch, HBASE-21463-v1.patch, 
> HBASE-21463-v2.patch, HBASE-21463.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built.
> 2. A TRSP is scheduled to reopen the region.
> 3. The region is closed on the RS side.
> 4. The OpenRegionProcedure is created.
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is reported as open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the 
> OPEN state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS but is not actually open there, 
> and cannot recover unless we use HBCK2.
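
The interleaving in steps 1 to 6 can be replayed with a toy model. The sketch below is not HBase code: the class, the `procedureInFlight` flag, and both report handlers are hypothetical names invented for illustration, and the guarded variant is only one conceivable way to keep a stale report from overwriting the state, not necessarily what the attached patch does.

```java
/** Toy model of the master's view of a single region; every name here is hypothetical. */
class StaleReportRaceSketch {
    enum State { OPEN, CLOSED, OPENING }

    static final class RegionNode {
        State state = State.OPEN;
        boolean procedureInFlight = false; // true while a TRSP owns the region
    }

    /** Unguarded handling: trust the report and force the state to OPEN. */
    static void applyReportUnguarded(RegionNode node) {
        node.state = State.OPEN;
    }

    /** Guarded handling: ignore the report while a procedure owns the region. */
    static void applyReportGuarded(RegionNode node) {
        if (node.procedureInFlight) {
            return; // the report may predate the procedure, so leave the state alone
        }
        node.state = State.OPEN;
    }

    /** Replays steps 1 to 6 and returns true if the open procedure would give up. */
    static boolean openSkipped(boolean guarded) {
        RegionNode node = new RegionNode();
        // Step 1: the report is built while the region is still open on the RS.
        // Step 2: a TRSP is scheduled to reopen the region.
        node.procedureInFlight = true;
        // Step 3: the region is closed on the RS side.
        node.state = State.CLOSED;
        // Step 4: the open procedure is created.
        node.state = State.OPENING;
        // Step 5: the stale report from step 1 is processed.
        if (guarded) {
            applyReportGuarded(node);
        } else {
            applyReportUnguarded(node);
        }
        // Step 6: the open procedure gives up if the state already reads OPEN.
        return node.state == State.OPEN;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: open skipped = " + openSkipped(false));
        System.out.println("guarded:   open skipped = " + openSkipped(true));
    }
}
```

In the unguarded run the state reads OPEN before the open request is ever sent, which is the situation described in step 8.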



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-12 Thread Guanghao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684670#comment-16684670
 ] 

Guanghao Zhang commented on HBASE-21463:


+1



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683315#comment-16683315
 ] 

Duo Zhang commented on HBASE-21463:
---

OK, all green. If there are no other objections, I will commit to master and 
branch-2.

I will file a new issue to track the possible race on branch-2.1 and 
branch-2.0, as the implementation there is a bit different.

Thanks.



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683303#comment-16683303
 ] 

Hadoop QA commented on HBASE-21463:
---

| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 4 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 23s | Maven dependency ordering for branch |
| +1 | mvninstall | 3m 57s | master passed |
| +1 | compile | 3m 9s | master passed |
| +1 | checkstyle | 1m 32s | master passed |
| +1 | shadedjars | 3m 46s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 4m 36s | master passed |
| +1 | javadoc | 0m 53s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 15s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 0s | the patch passed |
| +1 | compile | 3m 5s | the patch passed |
| +1 | cc | 3m 5s | the patch passed |
| +1 | javac | 3m 5s | the patch passed |
| +1 | checkstyle | 0m 10s | The patch passed checkstyle in hbase-protocol-shaded |
| +1 | checkstyle | 0m 14s | The patch passed checkstyle in hbase-procedure |
| +1 | checkstyle | 1m 8s | hbase-server: The patch generated 0 new + 175 unchanged - 1 fixed = 175 total (was 176) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 46s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 23s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | hbaseprotoc | 1m 23s | the patch passed |
| +1 | findbugs | 4m 55s | the patch passed |
| +1 | javadoc | 0m 56s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 0m 32s | hbase-protocol-shaded in the patch passed. |
| +1 | unit | 3m 28s | hbase-procedure in the patch passed. |
| +1 | unit | 128m 14s | hbase-server in the patch passed. |
| +1 | asflicense | 1m 15s | The patch does not generate ASF License warnings. |
|  |  | 181m 17s |

[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682897#comment-16682897
 ] 

Hadoop QA commented on HBASE-21463:
---

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 4 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 25s | Maven dependency ordering for branch |
| +1 | mvninstall | 4m 28s | master passed |
| +1 | compile | 3m 5s | master passed |
| +1 | checkstyle | 1m 38s | master passed |
| +1 | shadedjars | 3m 54s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 4m 31s | master passed |
| +1 | javadoc | 0m 52s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 14s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 15s | the patch passed |
| +1 | compile | 3m 4s | the patch passed |
| +1 | cc | 3m 4s | the patch passed |
| +1 | javac | 3m 4s | the patch passed |
| +1 | checkstyle | 0m 10s | The patch passed checkstyle in hbase-protocol-shaded |
| +1 | checkstyle | 0m 14s | The patch passed checkstyle in hbase-procedure |
| +1 | checkstyle | 1m 14s | hbase-server: The patch generated 0 new + 175 unchanged - 1 fixed = 175 total (was 176) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 58s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 20s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | hbaseprotoc | 1m 25s | the patch passed |
| +1 | findbugs | 5m 13s | the patch passed |
| -1 | javadoc | 0m 31s | hbase-server generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
|| || || || Other Tests ||
| +1 | unit | 0m 32s | hbase-protocol-shaded in the patch passed. |
| +1 | unit | 3m 26s | hbase-procedure in the patch passed. |
| +1 | unit | 129m 46s | hbase-server in the patch passed. |
| +1 | asflicense | 1m 7s | The patch does not generate ASF License warnings. |
|  |  |

[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682883#comment-16682883
 ] 

Duo Zhang commented on HBASE-21463:
---

So I think we can apply this for master and branch-2 first. For branch-2.1 and 
branch-2.0, we still need to dig more.



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682878#comment-16682878
 ] 

Allan Yang commented on HBASE-21463:


Yes. I just commented out the reportTransition call in checkOnlineRegionsReport 
on branch-2.0, and TestEnableTableProcedure failed there as well.



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682857#comment-16682857
 ] 

Duo Zhang commented on HBASE-21463:
---

So on branch-2.1 and branch-2.0 the AP will schedule the open request again? 
A TRSP won't schedule the open request again, because we finish the TRSP in the 
reportRegionStateTransition method; so if reportRegionStateTransition 
succeeds, we are done.



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682854#comment-16682854
 ] 

Allan Yang commented on HBASE-21463:


I think we need the reportTransition call inside checkOnlineRegionsReport, 
because of this scenario:
1. The master schedules an AP (TRSP in branch-2+) to the RS.
2. The RS opens the region and reports to the master successfully.
3. The master restarts and the AP schedules the open request again, but since 
the RS has already opened the region, it ignores the request.
4. The AP is then stuck.
Another solution is to stop ignoring open requests when the region is already 
online: the RS would send the report back to the master anyway.
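
The alternative in the last sentence above (answer every open request, even for a region that is already online) could look roughly like the following. This is a minimal sketch under stated assumptions, not region server code: `IdempotentOpenSketch`, `handleOpenRequest`, and the string-based report list are all made-up stand-ins for the real RPC path.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy region server that always reports back on an open request; names are hypothetical. */
class IdempotentOpenSketch {
    final Set<String> onlineRegions = new HashSet<>();
    final List<String> reportsToMaster = new ArrayList<>();

    /** Opens the region if needed, then reports OPENED in either case. */
    void handleOpenRequest(String regionName) {
        // add() returns false when the region was already online, so the open work runs once
        boolean newlyOpened = onlineRegions.add(regionName);
        if (newlyOpened) {
            // a real server would do the actual open work here
        }
        // Report even for a repeated request, so a restarted master can finish its procedure.
        reportsToMaster.add("OPENED:" + regionName);
    }

    public static void main(String[] args) {
        IdempotentOpenSketch rs = new IdempotentOpenSketch();
        rs.handleOpenRequest("region-a"); // first request: opens and reports
        rs.handleOpenRequest("region-a"); // repeated request after a master restart: reports again
        System.out.println(rs.onlineRegions.size() + " online, " + rs.reportsToMaster.size() + " reports");
    }
}
```

The point of the design is that the repeated request in step 3 gets an answer instead of silence, so the procedure in step 4 has something to complete on.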



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682845#comment-16682845
 ] 

Duo Zhang commented on HBASE-21463:
---

OK, the problem is that we skip the table state check and schedule a TRSP to 
assign a region that is already online. We then get stuck, because when the 
region server receives the OpenRegion request it finds that the region is 
already online and simply ignores the request. Previously the 
regionServerReport could help us recover from this, but the patch here 
removes that behavior.

Skipping the check does not make sense: we should always check the table 
state. If there are broken states or some other errors, we should use HBCK2 to 
fix them.

So I plan to remove the usage of this flag, and also remove the UT.
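
As an illustration of "always check the table state", here is a small sketch of a precondition that refuses to schedule assigns unless the table is disabled. It assumes the check in question guards an enable-style operation; `TableStateCheckSketch`, `TableState`, and `requireDisabled` are hypothetical names, not the identifiers used in HBase.

```java
/** Toy precondition for an enable-style operation; all names are hypothetical. */
class TableStateCheckSketch {
    enum TableState { ENABLED, DISABLED }

    /** Throws unless the table is DISABLED, so no assign is scheduled for an online region. */
    static void requireDisabled(TableState state) {
        if (state != TableState.DISABLED) {
            throw new IllegalStateException("table is " + state + "; refusing to schedule assigns");
        }
    }

    public static void main(String[] args) {
        requireDisabled(TableState.DISABLED); // passes silently
        try {
            requireDisabled(TableState.ENABLED);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With no way to bypass the check, an inconsistent table fails fast and is left for an operator tool such as HBCK2, as the comment suggests.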



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682843#comment-16682843
 ] 

Duo Zhang commented on HBASE-21463:
---

The failed UT is related. Let me dig.



[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682826#comment-16682826
 ] 

Hadoop QA commented on HBASE-21463:
---

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 25s | Maven dependency ordering for branch |
| +1 | mvninstall | 4m 0s | master passed |
| +1 | compile | 2m 7s | master passed |
| +1 | checkstyle | 1m 18s | master passed |
| +1 | shadedjars | 3m 45s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 19s | master passed |
| +1 | javadoc | 0m 43s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 15s | Maven dependency ordering for patch |
| +1 | mvninstall | 3m 59s | the patch passed |
| +1 | compile | 2m 9s | the patch passed |
| +1 | javac | 2m 9s | the patch passed |
| +1 | checkstyle | 0m 15s | The patch passed checkstyle in hbase-procedure |
| +1 | checkstyle | 1m 5s | hbase-server: The patch generated 0 new + 13 unchanged - 1 fixed = 13 total (was 14) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 48s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 22s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| -1 | findbugs | 2m 10s | hbase-server generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 | javadoc | 0m 43s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 30s | hbase-procedure in the patch passed. |
| -1 | unit | 131m 36s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 51s | The patch does not generate ASF License warnings. |
|  |  | 174m 52s |  |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hbase-server |
|  | Possible null pointer dereference of regionNode in org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(ServerStateNode, Set). Dereferenced at AssignmentManager.java:[line 1037] |
| Failed junit tests | hadoop.hbase.master.procedure.TestEnableTableProcedure |
\\
\\
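
The FindBugs warning above is the usual "lookup may return null, then dereference" pattern. A minimal sketch of the guard follows; it is not the code at AssignmentManager.java line 1037, and `NullGuardSketch`, `RegionNode`, and `recordReport` are hypothetical names used only to show the shape of the fix.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy version of a report check that skips regions the master does not know; names are hypothetical. */
class NullGuardSketch {
    static final class RegionNode {
        int timesReported = 0;
    }

    /** Updates known regions from a report and returns how many reported names matched. */
    static int recordReport(Map<String, RegionNode> knownRegions, Set<String> reportedNames) {
        int matched = 0;
        for (String name : reportedNames) {
            RegionNode node = knownRegions.get(name); // null when the region is unknown
            if (node == null) {
                continue; // guard: skip instead of dereferencing null
            }
            node.timesReported++;
            matched++;
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, RegionNode> known = new HashMap<>();
        known.put("region-a", new RegionNode());
        Set<String> reported = new LinkedHashSet<>();
        reported.add("region-a");
        reported.add("region-unknown");
        System.out.println(recordReport(known, reported)); // prints 1
    }
}
```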
|| Subsystem || 

[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682802#comment-16682802
 ] 

Allan Yang commented on HBASE-21463:


Great test case! This one should go back to branch-2.0 and branch-2.1, too. I 
also thought about pushing an addendum to HBASE-21421, since 
checkOnlineRegionsReportForMeta is not handled there. I think this 
one is good enough. +1 for the patch.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch, HBASE-21463.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built.
> 2. A TRSP is scheduled to reopen the region.
> 3. The region is closed on the RS side.
> 4. The OpenRegionProcedure is created.
> 5. The regionServerReport built at step 1 is processed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the 
> OPEN state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS but is actually not open there, 
> and it cannot recover unless we use HBCK2.
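
The eight steps quoted above can be replayed deterministically in a toy model. This is only a sketch to make the ordering problem concrete: StaleReportRace, its State enum and the replay() method are hypothetical names, not HBase classes, and the intermediate CLOSING/CLOSED/OPENING transitions are a simplifying assumption about what the master records between steps 2 and 4.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy replay of the race. None of these types are HBase classes. */
class StaleReportRace {
  enum State { OPEN, CLOSING, CLOSED, OPENING }

  /** Outcome of one replay: what the master believes vs. what the RS really hosts. */
  static final class Result {
    final State masterState;
    final boolean onlineOnServer;
    final boolean openProcedureDidWork;

    Result(State masterState, boolean onlineOnServer, boolean openProcedureDidWork) {
      this.masterState = masterState;
      this.onlineOnServer = onlineOnServer;
      this.openProcedureDidWork = openProcedureDidWork;
    }
  }

  static Result replay() {
    String region = "r1";
    Map<String, State> masterStates = new HashMap<>(); // master's view
    Set<String> onlineOnServer = new HashSet<>();      // ground truth on the RS
    masterStates.put(region, State.OPEN);
    onlineOnServer.add(region);

    // Step 1: the RS builds its report; this snapshot is already stale later on.
    Set<String> staleReport = new HashSet<>(onlineOnServer);

    // Step 2: a reopen is scheduled (assumed: master marks the region CLOSING).
    masterStates.put(region, State.CLOSING);

    // Step 3: the region is closed on the RS side.
    onlineOnServer.remove(region);
    masterStates.put(region, State.CLOSED);

    // Step 4: the open procedure is created (assumed: master marks OPENING).
    masterStates.put(region, State.OPENING);

    // Step 5: the stale report is processed and the master "fixes" the state.
    for (String reported : staleReport) {
      masterStates.put(reported, State.OPEN);
    }

    // Step 6: the open procedure sees OPEN and gives up without opening anything.
    boolean didWork = false;
    if (masterStates.get(region) != State.OPEN) {
      onlineOnServer.add(region);
      masterStates.put(region, State.OPEN);
      didWork = true;
    }

    // Steps 7-8: the reopen finishes; master and RS now disagree.
    return new Result(masterStates.get(region), onlineOnServer.contains(region), didWork);
  }

  public static void main(String[] args) {
    Result r = replay();
    System.out.println("master state: " + r.masterState);
    System.out.println("online on server: " + r.onlineOnServer);
    System.out.println("open procedure did work: " + r.openProcedureDidWork);
  }
}
```

In this model the master ends up recording OPEN while the region is not online on the server, and the open procedure never does any work, which matches step 8.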



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682784#comment-16682784
 ] 

Duo Zhang commented on HBASE-21463:
---

Review board link:

https://reviews.apache.org/r/69314/

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch, HBASE-21463.patch





[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682721#comment-16682721
 ] 

stack commented on HBASE-21463:
---

My bad.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch





[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682678#comment-16682678
 ] 

Duo Zhang commented on HBASE-21463:
---

Yes, it is only a UT (there is a 'UT' in the patch name...). I will provide a 
patch soon that, as said above, changes the behavior of checkOnlineRegions.

And I believe the race could also happen on branch-2.1 and branch-2.0; 
HBASE-21421 is one possible problem. The issue described here may not be a 
problem there, as the behavior of AssignProcedure/UnassignProcedure may be 
different from that of TRSP, but I'm afraid there could be other strange problems.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch





[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682554#comment-16682554
 ] 

stack commented on HBASE-21463:
---

Is there a change in the patch? I see a method made visible for testing, a 
reformatted log message, and a nice-looking UT, but where is the fix?

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch





[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Guanghao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682410#comment-16682410
 ] 

Guanghao Zhang commented on HBASE-21463:


Great UT. It was not easy to find this bug and reproduce the problem.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch





[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682388#comment-16682388
 ] 

Duo Zhang commented on HBASE-21463:
---

A UT to reproduce the problem. [~zghaobac] FYI. If there are no other opinions, 
I will completely change the behavior of checkOnlineRegions to only report 
possible inconsistencies instead of trying to fix them.
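
A report-only check of the kind proposed in this comment could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual patch: ReportOnlyCheck, findInconsistencies and the String-keyed state map are hypothetical stand-ins, and the real method works on HBase's own region state types. The point it illustrates is that the check reads the master's view and the reported set, returns warnings, and never mutates the region state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical report-only check. None of these types are HBase classes. */
class ReportOnlyCheck {
  enum State { OPEN, OPENING, CLOSING, CLOSED }

  /**
   * Compares a region server report with the master's view and returns one
   * warning per mismatch. The master's view is only read, never modified.
   */
  static List<String> findInconsistencies(
      Map<String, State> masterView, String server, Set<String> reportedOnline) {
    List<String> warnings = new ArrayList<>();
    // Sorted iteration keeps the warning order stable for logs and tests.
    for (String region : new TreeSet<>(reportedOnline)) {
      State state = masterView.get(region);
      if (state == null) {
        // Guard against unknown regions instead of dereferencing a missing entry.
        warnings.add(server + " reported unknown region " + region);
      } else if (state != State.OPEN) {
        warnings.add(server + " reported " + region
            + " as online but the master state is " + state);
      }
    }
    return warnings;
  }

  public static void main(String[] args) {
    Map<String, State> masterView = new HashMap<>();
    masterView.put("r1", State.OPENING);
    masterView.put("r2", State.OPEN);

    Set<String> report = new TreeSet<>();
    report.add("r1");
    report.add("r2");
    report.add("r3");

    for (String warning : findInconsistencies(masterView, "rs-1", report)) {
      System.out.println("WARN " + warning);
    }
  }
}
```

Because the method only returns warnings, a stale report can at worst produce a misleading log line; it can no longer move a region to OPEN behind the back of a running procedure.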

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch





[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-09 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681523#comment-16681523
 ] 

Duo Zhang commented on HBASE-21463:
---

HBASE-21421 is also a problem introduced by this method. So in general, I do 
not think we should actually do anything in this method; instead, we should 
just log the possible inconsistencies.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0


