[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698077#comment-16698077
 ] 

Duo Zhang commented on HBASE-21508:
---

Review board link:

https://reviews.apache.org/r/69443/

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508-v2.patch, 
> HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698078#comment-16698078
 ] 

Duo Zhang commented on HBASE-21508:
---

[~zghaobac] FYI.

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508-v2.patch, 
> HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698076#comment-16698076
 ] 

Duo Zhang commented on HBASE-21508:
---

The problem is that, we hold lock when doing reportRegionStateTransition, so if 
we are closing meta, and also closing another region from the same server, then 
there will be dead lock, that the reportRegionStateTransition for meta is block 
by another region, but the reportRegionStateTransition for this region can not 
be finished since meta is not online.

So I change to use a ReadWriteLock instead. In reportRegionStateTransition, we 
will use read lock, and in submitServerCrash, we will use write lock.

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508-v2.patch, 
> HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21508:
--
Attachment: HBASE-21508-v2.patch

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508-v2.patch, 
> HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v3.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt, 21511.v2.txt, 21511.v3.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for in 
> progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
> needed now that snapshot hfile cleaner and taking snapshot are mutually 
> exclusive.
> This issue is to address the review comment by removing the check for in 
> progress snapshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698063#comment-16698063
 ] 

Hadoop QA commented on HBASE-21511:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue}  0m  
2s{color} | {color:blue} The patch file was not named according to hbase's 
naming conventions. Please see 
https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for 
instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
57s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
46s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
49s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m  
5s{color} | {color:red} hbase-server: The patch generated 1 new + 6 unchanged - 
2 fixed = 7 total (was 8) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
50s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 18s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}130m 
40s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}166m 30s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21511 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12949396/21511.v2.txt |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 38566f0962d0 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 701526d19f |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | 

[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v2.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt, 21511.v2.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for in 
> progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
> needed now that snapshot hfile cleaner and taking snapshot are mutually 
> exclusive.
> This issue is to address the review comment by removing the check for in 
> progress snapshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698034#comment-16698034
 ] 

Duo Zhang commented on HBASE-21508:
---

Something wrong with decommission, let me dig.

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698022#comment-16698022
 ] 

Hudson commented on HBASE-21387:


Results for branch branch-2
[build #1521 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698021#comment-16698021
 ] 

Hudson commented on HBASE-21387:


Results for branch branch-1.4
[build #560 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//General_Nightly_Build_Report/]


(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698001#comment-16698001
 ] 

Hudson commented on HBASE-21387:


Results for branch branch-1.3
[build #552 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//General_Nightly_Build_Report/]


(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697975#comment-16697975
 ] 

Hudson commented on HBASE-21387:


Results for branch branch-1.2
[build #563 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//General_Nightly_Build_Report/]


(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697872#comment-16697872
 ] 

Hudson commented on HBASE-21387:


Results for branch branch-2.1
[build #632 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697864#comment-16697864
 ] 

Hudson commented on HBASE-21387:


Results for branch branch-2.0
[build #1110 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697852#comment-16697852
 ] 

Hadoop QA commented on HBASE-21508:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
48s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
48s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 9s{color} | {color:green} hbase-server: The patch generated 0 new + 189 
unchanged - 38 fixed = 189 total (was 227) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
50s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 18s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}132m 23s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}168m 22s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestAdmin2 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21508 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12949370/HBASE-21508-v1.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux e5f3fd8dcd8b 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 701526d19f |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15107/artifact/patchprocess/patch-unit-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15107/testReport/ |
| Max. process+thread count | 4978 (vs. ulimit of 1) |
| 

[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: 21511.v2.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697840#comment-16697840
 ] 

Hadoop QA commented on HBASE-21511:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
15s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue}  0m  
2s{color} | {color:blue} The patch file was not named according to hbase's 
naming conventions. Please see 
https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for 
instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
31s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  
2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
22s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
31s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
35s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 1s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m  2s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}167m 11s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
57s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}209m 13s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaSameHosts |
|   | hadoop.hbase.master.procedure.TestModifyNamespaceProcedure |
|   | 
hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaHighReplication
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21511 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12949359/21511.v1.txt |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 7879e353808a 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git revision | master / d9c773b0a5 |
| maven | version: Apache Maven 

[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697782#comment-16697782
 ] 

Duo Zhang commented on HBASE-21508:
---

The modification in ProcedureSyncWait may cause long overflow...

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21508:
--
Attachment: HBASE-21508-v1.patch

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697776#comment-16697776
 ] 

Hadoop QA commented on HBASE-21511:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  4m 
10s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue}  0m  
1s{color} | {color:blue} The patch file was not named according to hbase's 
naming conventions. Please see 
https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for 
instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 
30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
50s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
31s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  3m 
20s{color} | {color:red} root in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  2m 
14s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  2m 14s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
16s{color} | {color:red} hbase-server: The patch generated 2 new + 2 unchanged 
- 0 fixed = 4 total (was 2) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedjars {color} | {color:red}  3m 
46s{color} | {color:red} patch has 24 errors when building our shaded 
downstream artifacts. {color} |
| {color:red}-1{color} | {color:red} hadoopcheck {color} | {color:red}  2m 
20s{color} | {color:red} The patch causes 24 errors with Hadoop v2.7.4. {color} 
|
| {color:red}-1{color} | {color:red} hadoopcheck {color} | {color:red}  4m 
47s{color} | {color:red} The patch causes 24 errors with Hadoop v3.0.0. {color} 
|
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
32s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 58s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
11s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 39m 16s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21511 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12949361/21511.v1.txt |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 211c920a40f5 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / d9c773b0a5 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| 

[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697768#comment-16697768
 ] 

Hadoop QA commented on HBASE-21510:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
31s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
12s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 1s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
6s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
10s{color} | {color:red} hbase-server: The patch generated 1 new + 0 unchanged 
- 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
59s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
9m  1s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}137m 
20s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}175m 37s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21510 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12949348/HBASE-21510.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux f72767f1bc84 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 6d0dc960e6 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15104/artifact/patchprocess/diff-checkstyle-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/15104/testReport/ |
| Max. process+thread count | 4955 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output | 

[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697771#comment-16697771
 ] 

Duo Zhang commented on HBASE-21510:
---

Pushed to master after fixing the checkstyle issue. Let's see how it works.

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---
>
> Key: HBASE-21510
> URL: https://issues.apache.org/jira/browse/HBASE-21510
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, test
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21510.patch
>
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN  [Thread-1101] 
> client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result 
> procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate 
> exception received from 
> server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
>   at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at 
> org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at 
> org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException):
>  org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at 
> org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
>   at 
> 

[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v1.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for in 
> progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
> needed now that snapshot hfile cleaner and taking snapshot are mutually 
> exclusive.
> This issue is to address the review comment by removing the check for in 
> progress snapshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: (was: 21511.v1.txt)

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for in 
> progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
> needed now that snapshot hfile cleaner and taking snapshot are mutually 
> exclusive.
> This issue is to address the review comment by removing the check for in 
> progress snapshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)
Ted Yu created HBASE-21511:
--

 Summary: Remove in progress snapshot check in 
SnapshotFileCache#getUnreferencedFiles
 Key: HBASE-21511
 URL: https://issues.apache.org/jira/browse/HBASE-21511
 Project: HBase
  Issue Type: Improvement
Reporter: Ted Yu
 Attachments: 21511.v1.txt

During review of HBASE-21387, [~Apache9] mentioned that the check for in 
progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
needed now that snapshot hfile cleaner and taking snapshot are mutually 
exclusive.

This issue is to address the review comment by removing the check for in 
progress snapshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Status: Patch Available  (was: Open)

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for in 
> progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
> needed now that snapshot hfile cleaner and taking snapshot are mutually 
> exclusive.
> This issue is to address the review comment by removing the check for in 
> progress snapshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v1.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for in 
> progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
> needed now that snapshot hfile cleaner and taking snapshot are mutually 
> exclusive.
> This issue is to address the review comment by removing the check for in 
> progress snapshots.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly

2018-11-24 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697748#comment-16697748
 ] 

Allan Yang commented on HBASE-21509:


{quote}
but meta log is not empty 
{quote}
Yes, it is possible, the reason is that the meta region was once on this 
server, it was moved to another RS, but somehow, the log was not cleared after 
this server closed the meta region.
Since this server don't have the meta region when it crashed(restarted by you), 
the meta log won't be split, thus this directory will remain there forever.
But don't panic, it won't cause meta table lose any data. 

> /hbase/WALs directory has large number of hlogs, which cannot be deleted 
> correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Bo Cui
>Priority: Minor
>
> When HMaster is initializing, if getMetaRegionLocation() returns null value, 
> then some wal(including metaWAL) cannot be deleted.
> for example
> before restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> after restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21447) HBCK2 tool have questions on holes when HBCK2 checks region chain

2018-11-24 Thread Nicholas Jiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697747#comment-16697747
 ] 

Nicholas Jiang commented on HBASE-21447:


[~stack]I checked that start row and end row of multiple regions cannot be 
coherent.

>  HBCK2 tool have questions on holes when HBCK2 checks region chain 
> ---
>
> Key: HBASE-21447
> URL: https://issues.apache.org/jira/browse/HBASE-21447
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck2
>Affects Versions: 2.0.2
>Reporter: Nicholas Jiang
>Priority: Major
> Attachments: Hole.png
>
>
> [hbck2]https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2 
> This HBCK2 tool have some questions on holes when HBCK2 checks region chain 
> as follows. 
> {code:java}
> ERROR: There is a hole in the region chain between \x01F\x00\x00 and 
> \x02\x8C\x00\x00. You need to create a new .regioninfo and region dir in hdfs 
> to plug the hole. 
> ERROR: There is a hole in the region chain between \x05\x18\x00\x00 and 
> \x06^\x00\x00. You need to create a new .regioninfo and region dir in hdfs to 
> plug the hole. 
> ERROR: There is a hole in the region chain between \x07\x01\x00\x00 and 
> \x07\xA4\x00\x00. You need to create a new .regioninfo and region dir in hdfs 
> to plug the hole. 
> ERROR: There is a hole in the region chain between \x08G\x00\x00 and 
> \x09\x8D\x00\x00. You need to create a new .regioninfo and region dir in hdfs 
> to plug the hole. 
> ERROR: There is a hole in the region chain between \x0A0\x00\x00 and 
> \x0Bv\x00\x00. You need to create a new .regioninfo and region dir in hdfs to 
> plug the hole. 
> ERROR: There is a hole in the region chain between \x0C\x19\x00\x00 and 
> \x0C\xBC\x00\x00. You need to create a new .regioninfo and region dir in hdfs 
> to plug the hole. 
> ERROR: There is a hole in the region chain between \x0D_\x00\x00 and 
> \x0E\xA5\x00\x00. You need to create a new .regioninfo and region dir in hdfs 
> to plug the hole. 
> ERROR: There is a hole in the region chain between \x0F\xEB\x00\x00 and 
> \x111\x00\x00. You need to create a new .regioninfo and region dir in hdfs to 
> plug the hole. 
> ERROR: There is a hole in the region chain between \x16I\x00\x00 and 
> \x16\xEC\x00\x00. You need to create a new .regioninfo and region dir in hdfs 
> to plug the hole. 
> ERROR: There is a hole in the region chain between (\xC0\x00\x00 and 
> *\x06\x00\x00. You need to create a new .regioninfo and region dir in hdfs to 
> plug the hole.
> {code}
> !Hole.png!
> This hole problem can't be solved by HBCK2 tool.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697740#comment-16697740
 ] 

Hadoop QA commented on HBASE-21508:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
1s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 8s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
49s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
57s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
47s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m  
8s{color} | {color:red} hbase-server: The patch generated 1 new + 190 unchanged 
- 37 fixed = 191 total (was 227) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  3m 
50s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
8m 17s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 
or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}140m 31s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}176m 39s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestAsyncRegionAdminApi |
|   | hadoop.hbase.client.TestScannersFromClientSide |
|   | hadoop.hbase.coprocessor.TestMasterObserver |
|   | hadoop.hbase.client.TestSeparateClientZKCluster |
|   | hadoop.hbase.client.TestReplicaWithCluster |
|   | hadoop.hbase.master.TestWarmupRegion |
|   | hadoop.hbase.master.assignment.TestAMServerFailedOpen |
|   | hadoop.hbase.master.TestServerCrashProcedureCarryingMetaStuck |
|   | hadoop.hbase.coprocessor.TestCoprocessorMetrics |
|   | hadoop.hbase.master.assignment.TestReportOnlineRegionsRace |
|   | hadoop.hbase.master.assignment.TestAssignmentManager |
|   | hadoop.hbase.master.assignment.TestRegionBypass |
|   | hadoop.hbase.regionserver.TestRegionMove |
|   | hadoop.hbase.master.TestServerCrashProcedureStuck |
|   | hadoop.hbase.master.assignment.TestAMAssignWithRandExec |
|   | hadoop.hbase.quotas.TestQuotaObserverChoreRegionReports |
|   | hadoop.hbase.client.TestAdmin2 |
|   | hadoop.hbase.client.TestMetaWithReplicas |
|   | hadoop.hbase.master.assignment.TestReportRegionStateTransitionRetry |
|   | hadoop.hbase.client.TestZKAsyncRegistry |
|   | 

[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697728#comment-16697728
 ] 

Hudson commented on HBASE-21387:


SUCCESS: Integrated in Jenkins build HBase-1.2-IT #1185 (See 
[https://builds.apache.org/job/HBase-1.2-IT/1185/])
HBASE-21387 Addendum fix TestSnapshotFileCache (zhangduo: rev 
96240732bfb4dfa28dc5fe6d445b9551d5ed9814)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly

2018-11-24 Thread Bo Cui (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697725#comment-16697725
 ] 

Bo Cui edited comment on HBASE-21509 at 11/24/18 9:32 AM:
--

but meta log is not empty 
/hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.{color:red}1543048613941.meta{color}(old
 log)
/hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.{color:red}1543048907265.meta{color}(new
 log)


was (Author: bo cui):
but meta log is not empty 

> /hbase/WALs directory has large number of hlogs, which cannot be deleted 
> correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Bo Cui
>Priority: Minor
>
> When HMaster is initializing, if getMetaRegionLocation() returns null value, 
> then some wal(including metaWAL) cannot be deleted.
> for example
> before restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> after restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly

2018-11-24 Thread Bo Cui (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697725#comment-16697725
 ] 

Bo Cui commented on HBASE-21509:


but meta log is not empty 

> /hbase/WALs directory has large number of hlogs, which cannot be deleted 
> correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Bo Cui
>Priority: Minor
>
> When HMaster is initializing, if getMetaRegionLocation() returns null value, 
> then some wal(including metaWAL) cannot be deleted.
> for example
> before restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> after restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697724#comment-16697724
 ] 

Hudson commented on HBASE-21387:


SUCCESS: Integrated in Jenkins build HBase-1.3-IT #504 (See 
[https://builds.apache.org/job/HBase-1.3-IT/504/])
HBASE-21387 Addendum fix TestSnapshotFileCache (zhangduo: rev 
ec7461d2020b0a375eeb9b725eb3202aeed4fb13)
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly

2018-11-24 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697721#comment-16697721
 ] 

Allan Yang commented on HBASE-21509:


It is duplicated by HBASE-21413. I will fix it maybe next week

> /hbase/WALs directory has large number of hlogs, which cannot be deleted 
> correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Bo Cui
>Priority: Minor
>
> When HMaster is initializing, if getMetaRegionLocation() returns null value, 
> then some wal(including metaWAL) cannot be deleted.
> for example
> before restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> after restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly

2018-11-24 Thread Bo Cui (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697718#comment-16697718
 ] 

Bo Cui commented on HBASE-21509:


[~allan163] yes, META_FILTER

> /hbase/WALs directory has large number of hlogs, which cannot be deleted 
> correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Bo Cui
>Priority: Minor
>
> When HMaster is initializing, if getMetaRegionLocation() returns null value, 
> then some wal(including metaWAL) cannot be deleted.
> for example
> before restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> after restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server

2018-11-24 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697719#comment-16697719
 ] 

Allan Yang commented on HBASE-21508:


The fence is needed, +1 for the patch, pending for QA.

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508.patch
>
>
> In our ITBLL test we observer a race between the SCP and TRSP which causes a 
> region being assigned to two region servers.
> Not fully understand the scenario, but anyway, the we do not consider the 
> situation in the old code, that after SCP gets the region list of a dead 
> server, there could still be a reportRegionStateTransition call from dead 
> server and mess up things.
> In general, I think we should have a fence in the AssignmentManager to 
> prevent the reportRegionStateTransition from the dead servers to mess up the 
> states.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21387:
--
Component/s: snapshots

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21387:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Pushed the addendum to all branches.

Thanks [~tedyu].

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly

2018-11-24 Thread Bo Cui (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Cui updated HBASE-21509:
---
Affects Version/s: 1.3.1
 Priority: Minor  (was: Major)
  Description: 
When HMaster is initializing, if getMetaRegionLocation() returns null value, 
then some wal(including metaWAL) cannot be deleted.

for example
before restarts
/hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
after restarts
/hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
/hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta

> /hbase/WALs directory has large number of hlogs, which cannot be deleted 
> correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Bo Cui
>Priority: Minor
>
> When HMaster is initializing, if getMetaRegionLocation() returns null value, 
> then some wal(including metaWAL) cannot be deleted.
> for example
> before restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> after restarts
> /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
> /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697710#comment-16697710
 ] 

Duo Zhang commented on HBASE-21510:
---

Anyway it is easy to fix the UT itself, just wait until the state transition is 
finished. Maybe we could fix the above problem in another issue, as this may be 
an incompatible change.

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---
>
> Key: HBASE-21510
> URL: https://issues.apache.org/jira/browse/HBASE-21510
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, test
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21510.patch
>
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN  [Thread-1101] 
> client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result 
> procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate 
> exception received from 
> server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
>   at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at 
> org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at 
> org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException):
>  org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at 
> org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at 
> 

[jira] [Updated] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21510:
--
Attachment: HBASE-21510.patch

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---
>
> Key: HBASE-21510
> URL: https://issues.apache.org/jira/browse/HBASE-21510
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, test
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21510.patch
>
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN  [Thread-1101] 
> client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result 
> procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate 
> exception received from 
> server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
>   at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at 
> org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at 
> org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException):
>  org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at 
> org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at 
> 

[jira] [Updated] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21510:
--
Assignee: Duo Zhang
  Status: Patch Available  (was: Open)

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---
>
> Key: HBASE-21510
> URL: https://issues.apache.org/jira/browse/HBASE-21510
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, test
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21510.patch
>
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN  [Thread-1101] 
> client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result 
> procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate 
> exception received from 
> server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
>   at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at 
> org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at 
> org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException):
>  org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at 
> org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
>   at 
> 

[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697707#comment-16697707
 ] 

Duo Zhang commented on HBASE-21510:
---

And what's more, since these exception could also be returned to client, I 
think we should move these exceptions into hbase-common?

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---
>
> Key: HBASE-21510
> URL: https://issues.apache.org/jira/browse/HBASE-21510
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, test
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN  [Thread-1101] 
> client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result 
> procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate 
> exception received from 
> server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
>   at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at 
> org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at 
> org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException):
>  org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at 
> org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
>   at 
> 

[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697706#comment-16697706
 ] 

Duo Zhang commented on HBASE-21510:
---

First, MasterStoppedException should have a constructor which takes a String. 
Second, I do not think MasterStoppedException should be a 
DoNotRetryIOException? At least, for getting the procedure result, we can still 
get the result from the new master.

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---
>
> Key: HBASE-21510
> URL: https://issues.apache.org/jira/browse/HBASE-21510
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, test
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0
>
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN  [Thread-1101] 
> client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result 
> procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate 
> exception received from 
> server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
>   at 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at 
> org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at 
> org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at 
> org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException):
>  org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at 
> org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at 
> org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at 
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at 
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at 
> 

[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697700#comment-16697700
 ] 

Duo Zhang commented on HBASE-21387:
---

+1 on the addendum for now, but I;d say the code is a bit confusing. We will 
stop checking unreferenced files if there are snapshot operations in progress, 
but in the code below we will get snapshots in progress and try to filter out 
files...

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey

2018-11-24 Thread Duo Zhang (JIRA)
Duo Zhang created HBASE-21510:
-

 Summary: Test TestRegisterPeerWorkerWhenRestarting is flakey
 Key: HBASE-21510
 URL: https://issues.apache.org/jira/browse/HBASE-21510
 Project: HBase
  Issue Type: Bug
  Components: Replication, test
Reporter: Duo Zhang
 Fix For: 3.0.0


{noformat}
2018-11-22 16:57:34,066 WARN  [Thread-1101] 
client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result 
procId=26
org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate exception 
received from 
server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
at 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
at 
org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
at 
org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
at 
org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
at 
org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
at 
org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
at 
org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
at 
org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
Caused by: 
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException):
 org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
at 
org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
at 
org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
at 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
at 
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)

at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
at 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at 
org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 

[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly

2018-11-24 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697693#comment-16697693
 ] 

Allan Yang commented on HBASE-21509:


Are they wals end with .meta?

> /hbase/WALs directory has large number of hlogs, which cannot be deleted 
> correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
>  Issue Type: Bug
>Reporter: Bo Cui
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)