[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698077#comment-16698077 ]

Duo Zhang commented on HBASE-21508:
---

Review board link: https://reviews.apache.org/r/69443/

> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
> Issue Type: Sub-task
> Components: amv2
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508-v2.patch, HBASE-21508.patch
>
>
> In our ITBLL test we observed a race between the SCP and TRSP which causes a
> region to be assigned to two region servers.
> We do not fully understand the scenario, but in any case the old code does not
> consider the situation where, after SCP gets the region list of a dead
> server, there could still be a reportRegionStateTransition call from the dead
> server that messes things up.
> In general, I think we should have a fence in the AssignmentManager to
> prevent reportRegionStateTransition calls from dead servers from messing up
> the states.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698078#comment-16698078 ]

Duo Zhang commented on HBASE-21508:
---

[~zghaobac] FYI.
[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698076#comment-16698076 ]

Duo Zhang commented on HBASE-21508:
---

The problem is that we hold a lock while doing reportRegionStateTransition. If we are closing meta and also closing another region on the same server, there will be a deadlock: the reportRegionStateTransition for meta is blocked by the other region's call, but the reportRegionStateTransition for that region cannot finish because meta is not online.

So I changed it to use a ReadWriteLock instead. In reportRegionStateTransition we take the read lock, and in submitServerCrash we take the write lock.
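
The read/write split described in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code in the attached patches: the class DeadServerFence, the single shared lock, and the use of plain String server names are hypothetical. Only the method names reportRegionStateTransition and submitServerCrash and the choice of read lock versus write lock come from the comment.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical fence; names and structure are illustrative, not taken from the HBase patch. */
final class DeadServerFence {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  // Guarded by lock: mutated only under the write lock, read only under the read lock.
  private final Set<String> deadServers = new HashSet<>();

  /** Crash path: exclusive lock, so no report is in flight while the server is marked dead. */
  void submitServerCrash(String serverName) {
    lock.writeLock().lock();
    try {
      deadServers.add(serverName);
    } finally {
      lock.writeLock().unlock();
    }
  }

  /**
   * Report path: shared lock, so two reports (for example one for meta and one
   * for another region) do not block each other.
   * Returns false and skips the transition when the reporting server is already dead.
   */
  boolean reportRegionStateTransition(String serverName, Runnable applyTransition) {
    lock.readLock().lock();
    try {
      if (deadServers.contains(serverName)) {
        return false; // fenced: ignore the call from a dead server
      }
      applyTransition.run();
      return true;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Because readers share the lock, a report that has to wait for meta no longer prevents the meta report itself from proceeding, while the crash path still gets a point after which no further report from that server is accepted.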
[jira] [Updated] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21508:
--
Attachment: HBASE-21508-v2.patch
[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v3.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
> Issue Type: Improvement
> Reporter: Ted Yu
> Priority: Minor
> Attachments: 21511.v1.txt, 21511.v2.txt, 21511.v3.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for in
> progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer
> needed now that snapshot hfile cleaner and taking snapshot are mutually
> exclusive.
> This issue is to address the review comment by removing the check for in
> progress snapshots.
[jira] [Commented] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698063#comment-16698063 ]

Hadoop QA commented on HBASE-21511:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
| 0 | patch | 0m 2s | The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 3m 57s | master passed |
| +1 | compile | 1m 49s | master passed |
| +1 | checkstyle | 1m 5s | master passed |
| +1 | shadedjars | 3m 46s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 58s | master passed |
| +1 | javadoc | 0m 29s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 3m 58s | the patch passed |
| +1 | compile | 1m 49s | the patch passed |
| +1 | javac | 1m 49s | the patch passed |
| -1 | checkstyle | 1m 5s | hbase-server: The patch generated 1 new + 6 unchanged - 2 fixed = 7 total (was 8) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 50s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 18s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 3s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 130m 40s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 26s | The patch does not generate ASF License warnings. |
| | | 166m 30s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21511 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949396/21511.v2.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 38566f0962d0 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 701526d19f |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle |
[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v2.txt
[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698034#comment-16698034 ]

Duo Zhang commented on HBASE-21508:
---

Something wrong with decommission, let me dig.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698022#comment-16698022 ]

Hudson commented on HBASE-21387:

Results for branch branch-2 [build #1521 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521/]: (x) *-1 overall*

details (if available):

(/) +1 general checks
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//General_Nightly_Build_Report/]

(x) -1 jdk8 hadoop2 checks
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) +1 jdk8 hadoop3 checks
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1521//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) +1 source release artifact
-- See build output for details.

(/) +1 client integration test

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt,
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt,
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt,
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch,
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch,
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch,
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> A customer recently reported that ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshots
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshots.
> Suppose that when RefreshCacheTask runs refreshCache, there is an in-progress
> snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress
> snapshots. However, the snapshot has completed by that time, so some files
> are deemed unreferenced.
> Here is a timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a
> snapshot S1 in progress that is referencing a file F1. refreshCache() is
> called, but no completed snapshot references F1. At T2, the snapshot S1,
> which references F1, completes. At T3, we check in-progress snapshots and S1
> is not included. Thus, F1 is marked as unreferenced even though S1 references
> it.
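
The T0 to T3 timeline in the description above can be replayed with a small single-threaded model. This is a sketch under stated assumptions, not HBase code: the class SnapshotRaceSketch, its three sets, and the methods completeSnapshot and isReferencedByCheck are hypothetical stand-ins for the snapshot directories and the cache. Only refreshCache, F1, and S1 are names taken from the description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the race; it only mirrors the ordering of events in the timeline. */
final class SnapshotRaceSketch {
  // Files referenced by completed snapshots, as they exist on "disk".
  final Set<String> completedRefs = new HashSet<>();
  // Files referenced by snapshots that are still in progress.
  final Set<String> inProgressRefs = new HashSet<>();
  // What the last refreshCache() call saw; it deliberately excludes in-progress snapshots.
  final Set<String> cache = new HashSet<>();

  void refreshCache() {
    cache.clear();
    cache.addAll(completedRefs);
  }

  /** The in-progress snapshot finishes: its references move to the completed set. */
  void completeSnapshot() {
    completedRefs.addAll(inProgressRefs);
    inProgressRefs.clear();
  }

  /** The cleaner's view: the (possibly stale) cache plus the current in-progress snapshots. */
  boolean isReferencedByCheck(String file) {
    return cache.contains(file) || inProgressRefs.contains(file);
  }

  /** Replays T0..T3 and returns what the cleaner concludes about F1. */
  static boolean replayTimeline() {
    SnapshotRaceSketch s = new SnapshotRaceSketch();
    // T0: the cleaner starts checking whether F1 is referenced.
    s.inProgressRefs.add("F1");         // T1: snapshot S1 is in progress and references F1
    s.refreshCache();                   // T1: no completed snapshot references F1 yet
    s.completeSnapshot();               // T2: S1 completes
    return s.isReferencedByCheck("F1"); // T3: S1 is no longer listed as in progress
  }
}
```

In this model replayTimeline() returns false even though the completed snapshot references F1, which is the situation in which the cleaner removes a file that a snapshot still needs.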
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698021#comment-16698021 ]

Hudson commented on HBASE-21387:

Results for branch branch-1.4 [build #560 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560/]: (x) *-1 overall*

details (if available):

(x) -1 general checks
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//General_Nightly_Build_Report/]

(x) -1 jdk7 checks
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//JDK7_Nightly_Build_Report/]

(x) -1 jdk8 hadoop2 checks
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/560//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) +1 source release artifact
-- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698001#comment-16698001 ]

Hudson commented on HBASE-21387:

Results for branch branch-1.3 [build #552 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552/]: (x) *-1 overall*

details (if available):

(/) +1 general checks
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//General_Nightly_Build_Report/]

(/) +1 jdk7 checks
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//JDK7_Nightly_Build_Report/]

(x) -1 jdk8 hadoop2 checks
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/552//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) +1 source release artifact
-- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697975#comment-16697975 ]

Hudson commented on HBASE-21387:

Results for branch branch-1.2 [build #563 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563/]: (x) *-1 overall*

details (if available):

(/) +1 general checks
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//General_Nightly_Build_Report/]

(/) +1 jdk7 checks
-- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//JDK7_Nightly_Build_Report/]

(x) -1 jdk8 hadoop2 checks
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/563//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) +1 source release artifact
-- See build output for details.
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697872#comment-16697872 ] Hudson commented on HBASE-21387: Results for branch branch-2.1 [build #632 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632/]: (/) *{color:green}+1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/632//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. 
(/) {color:green}+1 client integration test{color}

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
> ----------------------------------------------------------------------------------------------------------
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
> Issue Type: Bug
> Components: snapshots
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Major
> Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
> A customer recently reported that ExportSnapshot failed with:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table.
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
> if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an in-progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date, so the cleaner proceeds to check in-progress snapshot(s). However, the snapshot has completed by that time, resulting in some file(s) being deemed unreferenced.
> Here is a timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a snapshot S1 in progress that is referencing a file F1. refreshCache() is called, but no completed snapshot references F1. At T2, the snapshot S1, which references F1, completes. At T3, we check in-progress snapshots and S1 is not included. Thus, F1 is marked as unreferenced even though S1 references it.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
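
The T1..T3 timeline in the HBASE-21387 description can be replayed deterministically in a few lines. The following is only a toy, single-threaded model: the class `SnapshotRaceSketch` and all of its fields and methods are hypothetical names chosen for illustration, not the real `SnapshotFileCache` code. It assumes a cache built from completed snapshots only, plus a separate lookup over in-progress snapshots that runs later.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy, single-threaded replay of the T1..T3 timeline; not the real SnapshotFileCache. */
class SnapshotRaceSketch {
  // Snapshots still being written (stand-in for the snapshot tmp directory).
  final Map<String, Set<String>> inProgress = new HashMap<>();
  // Snapshots that have completed.
  final Map<String, Set<String>> completed = new HashMap<>();
  // Files referenced by completed snapshots as of the last refreshCache().
  final Set<String> cache = new HashSet<>();

  /** Rebuilds the cache from completed snapshots only, skipping in-progress ones. */
  void refreshCache() {
    cache.clear();
    for (Set<String> files : completed.values()) {
      cache.addAll(files);
    }
  }

  /** Moves a snapshot from in-progress to completed, if it exists. */
  void complete(String snapshot) {
    Set<String> files = inProgress.remove(snapshot);
    if (files != null) {
      completed.put(snapshot, files);
    }
  }

  /** Returns true if any snapshot that is still in progress references the file. */
  boolean referencedByInProgress(String file) {
    for (Set<String> files : inProgress.values()) {
      if (files.contains(file)) {
        return true;
      }
    }
    return false;
  }

  /** Returns true when F1 ends up deemed unreferenced although S1 references it. */
  static boolean replayTimeline() {
    SnapshotRaceSketch s = new SnapshotRaceSketch();
    s.inProgress.put("S1", Collections.singleton("F1"));
    s.refreshCache();                               // T1: S1 is in progress, so F1 is not cached
    boolean inCache = s.cache.contains("F1");
    s.complete("S1");                               // T2: S1 completes
    boolean inTmp = s.referencedByInProgress("F1"); // T3: S1 is no longer in progress
    return !inCache && !inTmp;
  }
}
```

In this model, `replayTimeline()` reports F1 as unreferenced because each of the two checks looks at a different point in time; a second `refreshCache()` after T2 would have picked F1 up.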
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697864#comment-16697864 ]

Hudson commented on HBASE-21387:

Results for branch branch-2.0
[build #1110 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110/]: (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//JDK8_Nightly_Build_Report_(Hadoop2)/]

(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1110//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697852#comment-16697852 ]

Hadoop QA commented on HBASE-21508:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 48s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 48s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 9s{color} | {color:green} hbase-server: The patch generated 0 new + 189 unchanged - 38 fixed = 189 total (was 227) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 50s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 8m 18s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}132m 23s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}168m 22s{color} | {color:black} {color} |
\\ \\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestAdmin2 |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21508 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949370/HBASE-21508-v1.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux e5f3fd8dcd8b 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 701526d19f |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/15107/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15107/testReport/ |
| Max. process+thread count | 4978 (vs. ulimit of 1) |
|
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-21387:
---------------------------
Attachment: 21511.v2.txt
[jira] [Commented] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697840#comment-16697840 ]

Hadoop QA commented on HBASE-21511:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 2s{color} | {color:blue} The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 31s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 22s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 31s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 1s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 9m 2s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}167m 11s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 57s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}209m 13s{color} | {color:black} {color} |
\\ \\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaSameHosts |
| | hadoop.hbase.master.procedure.TestModifyNamespaceProcedure |
| | hadoop.hbase.master.balancer.TestStochasticLoadBalancerRegionReplicaHighReplication |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21511 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949359/21511.v1.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 7879e353808a 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
| git revision | master / d9c773b0a5 |
| maven | version: Apache Maven
[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697782#comment-16697782 ]

Duo Zhang commented on HBASE-21508:
-----------------------------------

The modification in ProcedureSyncWait may cause long overflow...

> Ignore the reportRegionStateTransition call from a dead server
> --------------------------------------------------------------
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
> Issue Type: Sub-task
> Components: amv2
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21508-v1.patch, HBASE-21508.patch
>
> In our ITBLL test we observed a race between the SCP and TRSP which causes a region to be assigned to two region servers.
> We do not fully understand the scenario yet, but the old code does not consider the situation where, after SCP gets the region list of a dead server, there can still be a reportRegionStateTransition call from the dead server that messes things up.
> In general, I think we should have a fence in the AssignmentManager to prevent reportRegionStateTransition calls from dead servers from messing up the states.
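
The comment above does not say where in ProcedureSyncWait the overflow would arise, and the patch itself is not shown in this thread. As a general illustration only, and assuming the concern is a deadline computed as "current time plus timeout", the sketch below shows how such an addition wraps around for very large timeouts and one way to clamp it. `DeadlineSketch` and its method names are hypothetical and are not taken from HBase.

```java
/** Hypothetical helper names; this is not ProcedureSyncWait code. */
class DeadlineSketch {
  /** Plain addition: silently wraps around when the sum exceeds Long.MAX_VALUE. */
  static long naiveDeadline(long nowMs, long timeoutMs) {
    return nowMs + timeoutMs;
  }

  /** Clamps to Long.MAX_VALUE instead of wrapping; assumes both arguments are non-negative. */
  static long saturatingDeadline(long nowMs, long timeoutMs) {
    return timeoutMs > Long.MAX_VALUE - nowMs ? Long.MAX_VALUE : nowMs + timeoutMs;
  }
}
```

With a timeout of `Long.MAX_VALUE` (a common way to say "wait forever"), the naive version yields a negative deadline that already appears to have expired, while the clamped version stays at `Long.MAX_VALUE`.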
[jira] [Updated] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21508:
------------------------------
Attachment: HBASE-21508-v1.patch
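
The fence proposed in the HBASE-21508 description can be pictured with a small model. This is a sketch under assumptions, not the attached patch and not the AssignmentManager API: `DeadServerFenceSketch`, `markDead`, `reportOpened` and `locationOf` are hypothetical names, and servers and regions are plain strings. The point it illustrates is that the dead-server check and the state update happen under the same lock as the marking of a server as dead, so a late report cannot slip in between.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy fence: reports from servers already marked dead are ignored. */
class DeadServerFenceSketch {
  private final Set<String> deadServers = new HashSet<>();
  private final Map<String, String> regionToServer = new HashMap<>();

  /** Called when crash handling for a server starts; later reports from it are fenced off. */
  synchronized void markDead(String server) {
    deadServers.add(server);
  }

  /**
   * Applies an "opened" report unless the reporting server is dead.
   * Returns true if the report was applied and false if it was ignored.
   */
  synchronized boolean reportOpened(String server, String region) {
    if (deadServers.contains(server)) {
      return false; // fence: a dead server must not change region state
    }
    regionToServer.put(region, server);
    return true;
  }

  /** Returns the server currently recorded for the region, or null if none. */
  synchronized String locationOf(String region) {
    return regionToServer.get(region);
  }
}
```

Because all three methods synchronize on the same object, a report either completes before `markDead` returns or is rejected afterwards; there is no window in which a dead server's report is accepted after the crash handler has taken its region list.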
[jira] [Commented] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697776#comment-16697776 ]

Hadoop QA commented on HBASE-21511:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 4m 10s{color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} patch {color} | {color:blue} 0m 1s{color} | {color:blue} The patch file was not named according to hbase's naming conventions. Please see https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for instructions. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 50s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 31s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 3m 20s{color} | {color:red} root in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 2m 14s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 2m 14s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 16s{color} | {color:red} hbase-server: The patch generated 2 new + 2 unchanged - 0 fixed = 4 total (was 2) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedjars {color} | {color:red} 3m 46s{color} | {color:red} patch has 24 errors when building our shaded downstream artifacts. {color} |
| {color:red}-1{color} | {color:red} hadoopcheck {color} | {color:red} 2m 20s{color} | {color:red} The patch causes 24 errors with Hadoop v2.7.4. {color} |
| {color:red}-1{color} | {color:red} hadoopcheck {color} | {color:red} 4m 47s{color} | {color:red} The patch causes 24 errors with Hadoop v3.0.0. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 32s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 58s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 11s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 39m 16s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21511 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949361/21511.v1.txt |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 211c920a40f5 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / d9c773b0a5 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
|
[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
[ https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697768#comment-16697768 ]

Hadoop QA commented on HBASE-21510:
-----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 31s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 12s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 1s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 6s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 10s{color} | {color:red} hbase-server: The patch generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 59s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 9m 1s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}137m 20s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}175m 37s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21510 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12949348/HBASE-21510.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux f72767f1bc84 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 6d0dc960e6 |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/15104/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/15104/testReport/ |
| Max. process+thread count | 4955 (vs. ulimit of 1) |
| modules | C: hbase-server U: hbase-server |
| Console output |
[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
[ https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697771#comment-16697771 ]

Duo Zhang commented on HBASE-21510:
-----------------------------------

Pushed to master after fixing the checkstyle issue. Let's see how it works.

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---------------------------------------------------
>
> Key: HBASE-21510
> URL: https://issues.apache.org/jira/browse/HBASE-21510
> Project: HBase
> Issue Type: Bug
> Components: Replication, test
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21510.patch
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN [Thread-1101] client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate exception received from server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.<init>(java.lang.String)
>   at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException): org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
>   at
[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21511: --- Attachment: 21511.v1.txt > Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles > --- > > Key: HBASE-21511 > URL: https://issues.apache.org/jira/browse/HBASE-21511 > Project: HBase > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > Attachments: 21511.v1.txt > > > During review of HBASE-21387, [~Apache9] mentioned that the check for in > progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer > needed now that snapshot hfile cleaner and taking snapshot are mutually > exclusive. > This issue is to address the review comment by removing the check for in > progress snapshots. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21511: --- Attachment: (was: 21511.v1.txt) > Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles > --- > > Key: HBASE-21511 > URL: https://issues.apache.org/jira/browse/HBASE-21511 > Project: HBase > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > Attachments: 21511.v1.txt > > > During review of HBASE-21387, [~Apache9] mentioned that the check for in > progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer > needed now that snapshot hfile cleaner and taking snapshot are mutually > exclusive. > This issue is to address the review comment by removing the check for in > progress snapshots. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
Ted Yu created HBASE-21511: -- Summary: Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles Key: HBASE-21511 URL: https://issues.apache.org/jira/browse/HBASE-21511 Project: HBase Issue Type: Improvement Reporter: Ted Yu Attachments: 21511.v1.txt During review of HBASE-21387, [~Apache9] mentioned that the check for in progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer needed now that snapshot hfile cleaner and taking snapshot are mutually exclusive. This issue is to address the review comment by removing the check for in progress snapshots. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21511: --- Status: Patch Available (was: Open) > Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles > --- > > Key: HBASE-21511 > URL: https://issues.apache.org/jira/browse/HBASE-21511 > Project: HBase > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > Attachments: 21511.v1.txt > > > During review of HBASE-21387, [~Apache9] mentioned that the check for in > progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer > needed now that snapshot hfile cleaner and taking snapshot are mutually > exclusive. > This issue is to address the review comment by removing the check for in > progress snapshots. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
[ https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-21511: --- Attachment: 21511.v1.txt > Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles > --- > > Key: HBASE-21511 > URL: https://issues.apache.org/jira/browse/HBASE-21511 > Project: HBase > Issue Type: Improvement >Reporter: Ted Yu >Priority: Minor > Attachments: 21511.v1.txt > > > During review of HBASE-21387, [~Apache9] mentioned that the check for in > progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer > needed now that snapshot hfile cleaner and taking snapshot are mutually > exclusive. > This issue is to address the review comment by removing the check for in > progress snapshots. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
[ https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697748#comment-16697748 ] Allan Yang commented on HBASE-21509: {quote} but meta log is not empty {quote} Yes, that is possible. The meta region was once hosted on this server and was later moved to another RS, but for some reason the log was not cleared after this server closed the meta region. Since this server did not host the meta region when it crashed (that is, when you restarted it), the meta log won't be split, so this directory will remain there forever. But don't panic: it won't cause the meta table to lose any data. > /hbase/WALs directory has large number of hlogs, which cannot be deleted > correctly > -- > > Key: HBASE-21509 > URL: https://issues.apache.org/jira/browse/HBASE-21509 > Project: HBase > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Bo Cui >Priority: Minor > > When HMaster is initializing, if getMetaRegionLocation() returns null value, > then some wal(including metaWAL) cannot be deleted. > for example > before restarts > /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > after restarts > /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21447) HBCK2 tool have questions on holes when HBCK2 checks region chain
[ https://issues.apache.org/jira/browse/HBASE-21447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697747#comment-16697747 ] Nicholas Jiang commented on HBASE-21447: [~stack] I checked, and the start rows and end rows of several adjacent regions do not line up.
> HBCK2 tool have questions on holes when HBCK2 checks region chain
> ---
>
> Key: HBASE-21447
> URL: https://issues.apache.org/jira/browse/HBASE-21447
> Project: HBase
> Issue Type: Improvement
> Components: hbck2
> Affects Versions: 2.0.2
> Reporter: Nicholas Jiang
> Priority: Major
> Attachments: Hole.png
>
> [hbck2]https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
> The HBCK2 tool reports the following hole errors when it checks the region chain.
> {code:java}
> ERROR: There is a hole in the region chain between \x01F\x00\x00 and \x02\x8C\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x05\x18\x00\x00 and \x06^\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x07\x01\x00\x00 and \x07\xA4\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x08G\x00\x00 and \x09\x8D\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x0A0\x00\x00 and \x0Bv\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x0C\x19\x00\x00 and \x0C\xBC\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x0D_\x00\x00 and \x0E\xA5\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x0F\xEB\x00\x00 and \x111\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between \x16I\x00\x00 and \x16\xEC\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> ERROR: There is a hole in the region chain between (\xC0\x00\x00 and *\x06\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
> {code}
> !Hole.png!
> The HBCK2 tool cannot fix these holes.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
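For readers unfamiliar with what a "hole in the region chain" means, here is a minimal sketch of the check. It is not HBCK2 code: the class `RegionChainSketch`, the `Range` type, and `findHoles` are hypothetical names introduced for illustration. The only HBase fact it relies on is that row keys are ordered as unsigned bytes, lexicographically. It assumes regions do not overlap and does not verify that the first region starts at the empty key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not HBCK2 code. */
final class RegionChainSketch {

  /** A region covering [startKey, endKey); an empty endKey means "end of table". */
  static final class Range {
    final byte[] startKey;
    final byte[] endKey;

    Range(byte[] startKey, byte[] endKey) {
      this.startKey = startKey;
      this.endKey = endKey;
    }
  }

  /** Unsigned lexicographic comparison of two row keys. */
  static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }

  /** Returns one "endKey -> startKey" description per gap between adjacent regions. */
  static List<String> findHoles(List<Range> regions) {
    List<Range> sorted = new ArrayList<>(regions);
    sorted.sort((x, y) -> compare(x.startKey, y.startKey));
    List<String> holes = new ArrayList<>();
    for (int i = 1; i < sorted.size(); i++) {
      byte[] prevEnd = sorted.get(i - 1).endKey;
      byte[] nextStart = sorted.get(i).startKey;
      // A hole exists when the previous region ends before the next one starts.
      if (prevEnd.length > 0 && compare(prevEnd, nextStart) < 0) {
        holes.add(Arrays.toString(prevEnd) + " -> " + Arrays.toString(nextStart));
      }
    }
    return holes;
  }
}
```

Each ERROR line in the report corresponds to one entry that a check of this kind would return: the end key of one region is strictly smaller than the start key of the next region in sorted order.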
[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697740#comment-16697740 ] Hadoop QA commented on HBASE-21508:
---
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 3s | master passed |
| +1 | compile | 1m 47s | master passed |
| +1 | checkstyle | 1m 8s | master passed |
| +1 | shadedjars | 3m 49s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 1m 57s | master passed |
| +1 | javadoc | 0m 30s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 1s | the patch passed |
| +1 | compile | 1m 47s | the patch passed |
| +1 | javac | 1m 47s | the patch passed |
| -1 | checkstyle | 1m 8s | hbase-server: The patch generated 1 new + 190 unchanged - 37 fixed = 191 total (was 227) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 3m 50s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 17s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 10s | the patch passed |
| +1 | javadoc | 0m 30s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 140m 31s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 176m 39s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestAsyncRegionAdminApi |
| | hadoop.hbase.client.TestScannersFromClientSide |
| | hadoop.hbase.coprocessor.TestMasterObserver |
| | hadoop.hbase.client.TestSeparateClientZKCluster |
| | hadoop.hbase.client.TestReplicaWithCluster |
| | hadoop.hbase.master.TestWarmupRegion |
| | hadoop.hbase.master.assignment.TestAMServerFailedOpen |
| | hadoop.hbase.master.TestServerCrashProcedureCarryingMetaStuck |
| | hadoop.hbase.coprocessor.TestCoprocessorMetrics |
| | hadoop.hbase.master.assignment.TestReportOnlineRegionsRace |
| | hadoop.hbase.master.assignment.TestAssignmentManager |
| | hadoop.hbase.master.assignment.TestRegionBypass |
| | hadoop.hbase.regionserver.TestRegionMove |
| | hadoop.hbase.master.TestServerCrashProcedureStuck |
| | hadoop.hbase.master.assignment.TestAMAssignWithRandExec |
| | hadoop.hbase.quotas.TestQuotaObserverChoreRegionReports |
| | hadoop.hbase.client.TestAdmin2 |
| | hadoop.hbase.client.TestMetaWithReplicas |
| | hadoop.hbase.master.assignment.TestReportRegionStateTransitionRetry |
| | hadoop.hbase.client.TestZKAsyncRegistry |
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697728#comment-16697728 ] Hudson commented on HBASE-21387: SUCCESS: Integrated in Jenkins build HBase-1.2-IT #1185 (See [https://builds.apache.org/job/HBase-1.2-IT/1185/]) HBASE-21387 Addendum fix TestSnapshotFileCache (zhangduo: rev 96240732bfb4dfa28dc5fe6d445b9551d5ed9814) * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Components: snapshots >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10 > > Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, > 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, > 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, > HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, > HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, > HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, > two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. 
> {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. > Here is timeline given by Josh illustrating the scenario: > At time T0, we are checking if F1 is referenced. At time T1, there is a > snapshot S1 in progress that is referencing a file F1. refreshCache() is > called, but no completed snapshot references F1. At T2, the snapshot S1, > which references F1, completes. At T3, we check in-progress snapshots and S1 > is not included. Thus, F1 is marked as unreferenced even though S1 references > it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
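The T1 to T3 part of the timeline above can be replayed deterministically. The following is a toy, single-threaded model and not the SnapshotFileCache implementation: every name in it (`SnapshotRaceSketch`, `startSnapshot`, `completeSnapshot`, `isReferenced`, and so on) is hypothetical. It only shows why a cache refresh followed by a separate in-progress check can miss a snapshot that completes between the two steps.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the timeline; none of these names are HBase classes. */
final class SnapshotRaceSketch {
  // Snapshot name -> files it references.
  private final Map<String, Set<String>> inProgress = new HashMap<>();
  private final Map<String, Set<String>> completed = new HashMap<>();
  // Files referenced by completed snapshots as of the last refresh.
  private final Set<String> cache = new HashSet<>();

  void startSnapshot(String name, String... files) {
    inProgress.put(name, new HashSet<>(Arrays.asList(files)));
  }

  void completeSnapshot(String name) {
    completed.put(name, inProgress.remove(name));
  }

  /** Step 1 of the check: reload references of completed snapshots only. */
  void refreshCache() {
    cache.clear();
    for (Set<String> files : completed.values()) {
      cache.addAll(files);
    }
  }

  /** Step 2 of the check: consult the (possibly stale) cache, then in-progress snapshots. */
  boolean isReferenced(String file) {
    if (cache.contains(file)) {
      return true;
    }
    for (Set<String> files : inProgress.values()) {
      if (files.contains(file)) {
        return true;
      }
    }
    return false;
  }

  /** Replays T1..T3 with the snapshot completing between the two steps. */
  static boolean replayTimeline() {
    SnapshotRaceSketch s = new SnapshotRaceSketch();
    s.startSnapshot("S1", "F1"); // T1: S1 is in progress and references F1
    s.refreshCache();            // T1: no completed snapshot references F1
    s.completeSnapshot("S1");    // T2: S1 completes
    return s.isReferenced("F1"); // T3: S1 is in neither the stale cache nor the in-progress set
  }

  /** Same events, but the snapshot never completes in the middle of a check. */
  static boolean replaySerialized() {
    SnapshotRaceSketch s = new SnapshotRaceSketch();
    s.startSnapshot("S1", "F1");
    s.refreshCache();
    boolean before = s.isReferenced("F1"); // found via the in-progress set
    s.completeSnapshot("S1");
    s.refreshCache();
    boolean after = s.isReferenced("F1");  // found via the refreshed cache
    return before && after;
  }
}
```

In this model `replayTimeline()` reports F1 as unreferenced, which matches the lost-file symptom in the logs, while `replaySerialized()` always sees F1 as referenced because the refresh and the in-progress check are never interleaved with a completing snapshot.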
[jira] [Comment Edited] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
[ https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697725#comment-16697725 ] Bo Cui edited comment on HBASE-21509 at 11/24/18 9:32 AM: -- but meta log is not empty /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.{color:red}1543048613941.meta{color}(old log) /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.{color:red}1543048907265.meta{color}(new log) was (Author: bo cui): but meta log is not empty > /hbase/WALs directory has large number of hlogs, which cannot be deleted > correctly > -- > > Key: HBASE-21509 > URL: https://issues.apache.org/jira/browse/HBASE-21509 > Project: HBase > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Bo Cui >Priority: Minor > > When HMaster is initializing, if getMetaRegionLocation() returns null value, > then some wal(including metaWAL) cannot be deleted. > for example > before restarts > /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > after restarts > /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
[ https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697725#comment-16697725 ] Bo Cui commented on HBASE-21509: but meta log is not empty > /hbase/WALs directory has large number of hlogs, which cannot be deleted > correctly > -- > > Key: HBASE-21509 > URL: https://issues.apache.org/jira/browse/HBASE-21509 > Project: HBase > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Bo Cui >Priority: Minor > > When HMaster is initializing, if getMetaRegionLocation() returns null value, > then some wal(including metaWAL) cannot be deleted. > for example > before restarts > /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > after restarts > /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697724#comment-16697724 ] Hudson commented on HBASE-21387: SUCCESS: Integrated in Jenkins build HBase-1.3-IT #504 (See [https://builds.apache.org/job/HBase-1.3-IT/504/]) HBASE-21387 Addendum fix TestSnapshotFileCache (zhangduo: rev ec7461d2020b0a375eeb9b725eb3202aeed4fb13) * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotFileCache.java > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Components: snapshots >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10 > > Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, > 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, > 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, > HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, > HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, > HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, > two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. 
> {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. > Here is timeline given by Josh illustrating the scenario: > At time T0, we are checking if F1 is referenced. At time T1, there is a > snapshot S1 in progress that is referencing a file F1. refreshCache() is > called, but no completed snapshot references F1. At T2, the snapshot S1, > which references F1, completes. At T3, we check in-progress snapshots and S1 > is not included. Thus, F1 is marked as unreferenced even though S1 references > it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
[ https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697721#comment-16697721 ] Allan Yang commented on HBASE-21509: It is duplicated by HBASE-21413. I will fix it maybe next week > /hbase/WALs directory has large number of hlogs, which cannot be deleted > correctly > -- > > Key: HBASE-21509 > URL: https://issues.apache.org/jira/browse/HBASE-21509 > Project: HBase > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Bo Cui >Priority: Minor > > When HMaster is initializing, if getMetaRegionLocation() returns null value, > then some wal(including metaWAL) cannot be deleted. > for example > before restarts > /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > after restarts > /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
[ https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697718#comment-16697718 ] Bo Cui commented on HBASE-21509: [~allan163] yes, META_FILTER > /hbase/WALs directory has large number of hlogs, which cannot be deleted > correctly > -- > > Key: HBASE-21509 > URL: https://issues.apache.org/jira/browse/HBASE-21509 > Project: HBase > Issue Type: Bug >Affects Versions: 1.3.1 >Reporter: Bo Cui >Priority: Minor > > When HMaster is initializing, if getMetaRegionLocation() returns null value, > then some wal(including metaWAL) cannot be deleted. > for example > before restarts > /hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > after restarts > /hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta > /hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21508) Ignore the reportRegionStateTransition call from a dead server
[ https://issues.apache.org/jira/browse/HBASE-21508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697719#comment-16697719 ] Allan Yang commented on HBASE-21508: The fence is needed. +1 for the patch, pending QA.
> Ignore the reportRegionStateTransition call from a dead server
> --
>
> Key: HBASE-21508
> URL: https://issues.apache.org/jira/browse/HBASE-21508
> Project: HBase
> Issue Type: Sub-task
> Components: amv2
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
> Fix For: 3.0.0, 2.2.0
> Attachments: HBASE-21508.patch
>
> In our ITBLL test we observed a race between the SCP and TRSP which causes a region to be assigned to two region servers.
> I do not fully understand the scenario yet, but the old code does not consider the situation where, after the SCP gets the region list of a dead server, a reportRegionStateTransition call can still arrive from that dead server and mess things up.
> In general, I think we should have a fence in the AssignmentManager to prevent reportRegionStateTransition calls from dead servers from messing up the states.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
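To make the idea of a "fence" concrete, here is a minimal sketch under stated assumptions. It is not the patch attached to this issue and does not use the AssignmentManager API: `DeadServerFenceSketch`, `markDead`, `reportRegionOpened`, and `locationOf` are hypothetical names. The point it illustrates is that marking a server dead and applying a state report happen under the same lock, so a report from a server that has already been marked dead can never be applied.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical fence; the names below are illustrative, not the AssignmentManager API. */
final class DeadServerFenceSketch {
  private final Set<String> deadServers = new HashSet<>();
  private final Map<String, String> regionToServer = new HashMap<>();

  /** Called when crash handling for a server starts; later reports from it are rejected. */
  synchronized void markDead(String serverName) {
    deadServers.add(serverName);
  }

  /** Returns true if the report was applied, false if the reporting server is fenced off. */
  synchronized boolean reportRegionOpened(String serverName, String regionName) {
    if (deadServers.contains(serverName)) {
      return false;
    }
    regionToServer.put(regionName, serverName);
    return true;
  }

  /** Returns the server currently recorded for the region, or null if none. */
  synchronized String locationOf(String regionName) {
    return regionToServer.get(regionName);
  }
}
```

Because all three methods synchronize on the same object, once `markDead` returns, any crash-handling step that snapshots the region list of that server sees a state that no late report from the dead server can change.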
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21387: -- Component/s: snapshots > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Components: snapshots >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Labels: snapshot > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10 > > Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, > 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, > 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, > HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, > HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, > HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, > two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt > > > During recent report from customer where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is race condition surrounding in progress snapshot(s) handling > between refreshCache() and getUnreferencedFiles(). 
> There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. > Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in progress snapshot(s). > Suppose when the RefreshCacheTask runs refreshCache, there is some in > progress snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date. So cleaner proceeds to check in progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) deemed unreferenced. > Here is timeline given by Josh illustrating the scenario: > At time T0, we are checking if F1 is referenced. At time T1, there is a > snapshot S1 in progress that is referencing a file F1. refreshCache() is > called, but no completed snapshot references F1. At T2, the snapshot S1, > which references F1, completes. At T3, we check in-progress snapshots and S1 > is not included. Thus, F1 is marked as unreferenced even though S1 references > it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21387:
------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Pushed the addendum to all branches. Thanks [~tedyu].

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[jira] [Updated] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
[ https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Cui updated HBASE-21509:
---------------------------
    Affects Version/s: 1.3.1
             Priority: Minor  (was: Major)
          Description:
When HMaster is initializing, if getMetaRegionLocation() returns null, some WALs (including the meta WAL) cannot be deleted.
For example:
Before the restart:
/hbase/WALs/10-10-10-129,21302,1543048601526/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
After the restart:
/hbase/WALs/10-10-10-129,21302,1543048601526-splitting/10-10-10-129%2C21302%2C1543048601526.meta.1543048613941.meta
/hbase/WALs/10-10-10-29,21302,1543048867527/10-10-10-29%2C21302%2C1543048867527.meta.1543048907265.meta

> /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-21509
>                 URL: https://issues.apache.org/jira/browse/HBASE-21509
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.3.1
>            Reporter: Bo Cui
>            Priority: Minor
[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
[ https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697710#comment-16697710 ]

Duo Zhang commented on HBASE-21510:
-----------------------------------
Anyway it is easy to fix the UT itself, just wait until the state transition is finished.
Maybe we could fix the above problem in another issue, as this may be an incompatible change.

> Test TestRegisterPeerWorkerWhenRestarting is flakey
> ---------------------------------------------------
>
>                 Key: HBASE-21510
>                 URL: https://issues.apache.org/jira/browse/HBASE-21510
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, test
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HBASE-21510.patch
>
>
> {noformat}
> 2018-11-22 16:57:34,066 WARN [Thread-1101] client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result procId=26
> org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate exception received from server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.<init>(java.lang.String)
>   at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
>   at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
>   at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
>   at org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
>   at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
>   at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
>   at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
>   at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
>   at org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
>   at org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException): org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
>   at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
>   at org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
>   at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
>   at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
>   at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
>   at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>   at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>   at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
>   at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
>   at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
>   at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
>   at
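The test-side fix suggested in the comment above, waiting until the state transition has finished before continuing, amounts to a bounded polling wait. A minimal sketch, assuming a hypothetical helper named `waitFor` in a hypothetical class `WaitSketch`; this is not the HBase test utility API, and the condition in `main` is only a stand-in for "the transition has finished".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; names and signature are assumptions for illustration. */
class WaitSketch {
  /**
   * Polls the condition every intervalMs until it holds or timeoutMs has elapsed.
   * Returns true if the condition became true, false on timeout or interruption.
   */
  static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out before the condition held
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    // Stand-in condition: pretend the transition finishes about 50 ms after start.
    boolean done = waitFor(5_000, 10, () -> System.currentTimeMillis() - start >= 50);
    System.out.println("transition finished: " + done);
  }
}
```

A test would call such a helper after triggering the transition and only then proceed to the step that restarts the master.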
[jira] [Updated] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
[ https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21510:
------------------------------
    Attachment: HBASE-21510.patch

> Test TestRegisterPeerWorkerWhenRestarting is flakey
[jira] [Updated] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
[ https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-21510:
------------------------------
    Assignee: Duo Zhang
      Status: Patch Available  (was: Open)

> Test TestRegisterPeerWorkerWhenRestarting is flakey
[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
[ https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697707#comment-16697707 ]

Duo Zhang commented on HBASE-21510:
-----------------------------------
And what's more, since these exceptions could also be returned to the client, I think we should move them into hbase-common?

> Test TestRegisterPeerWorkerWhenRestarting is flakey
[jira] [Commented] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
[ https://issues.apache.org/jira/browse/HBASE-21510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697706#comment-16697706 ]

Duo Zhang commented on HBASE-21510:
-----------------------------------
First, MasterStoppedException should have a constructor which takes a String.
Second, I do not think MasterStoppedException should be a DoNotRetryIOException? At least, for getting the procedure result, we can still get the result from the new master.

> Test TestRegisterPeerWorkerWhenRestarting is flakey
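The "Unable to instantiate exception received from server" message in the log is consistent with a client that rebuilds the remote exception reflectively through a constructor taking a String, which is why the first point in the comment above matters. A minimal sketch of that failure mode, using hypothetical stand-in classes (`UnwrapSketch`, `NoMessageException`, `WithMessageException`) rather than the real HBase ones:

```java
import java.io.IOException;
import java.lang.reflect.Constructor;

/** Hypothetical illustration; these are not the HBase classes named in the stack trace. */
class UnwrapSketch {
  /** Stand-in for an exception that only offers a no-arg constructor. */
  static class NoMessageException extends IOException {
    public NoMessageException() {
      super();
    }
  }

  /** Stand-in for an exception that offers a constructor taking a String. */
  static class WithMessageException extends IOException {
    public WithMessageException(String msg) {
      super(msg);
    }
  }

  /**
   * Tries to rebuild an exception of the given class from a message, as a client
   * might do for an exception class name received over RPC. Falls back to a plain
   * IOException if the class has no public (String) constructor.
   */
  static IOException instantiate(Class<? extends IOException> cls, String msg) {
    try {
      Constructor<? extends IOException> ctor = cls.getConstructor(String.class);
      return ctor.newInstance(msg);
    } catch (ReflectiveOperationException e) {
      // For a missing constructor, e is a NoSuchMethodException.
      return new IOException("Unable to instantiate exception received from server: "
          + e.getMessage(), e);
    }
  }

  public static void main(String[] args) {
    System.out.println(instantiate(WithMessageException.class, "master stopped").getClass());
    System.out.println(instantiate(NoMessageException.class, "master stopped").getClass());
  }
}
```

With the String constructor present the original exception type survives the round trip; without it the caller only sees the generic fallback, so the client can no longer decide whether to retry based on the real type.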
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697700#comment-16697700 ]

Duo Zhang commented on HBASE-21387:
-----------------------------------
+1 on the addendum for now, but I'd say the code is a bit confusing. We stop checking for unreferenced files if there are snapshot operations in progress, but in the code below we still get the snapshots in progress and try to filter out files...

> Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
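The behaviour described in the comment above, skipping the unreferenced-file check while a snapshot operation is in progress, might look roughly like the following sketch. `CleanerGuardSketch`, `beginSnapshot`, `endSnapshot` and the semaphore are hypothetical names chosen for illustration; this is not the actual SnapshotFileCache or SnapshotManager code, and the real addendum may guard the check differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Semaphore;

/** Hypothetical guard: the cleaner backs off while a snapshot is being taken. */
class CleanerGuardSketch {
  // One permit: held either by a running snapshot or by a cleaner pass.
  private final Semaphore snapshotPermit = new Semaphore(1);

  /** Called when a snapshot operation starts. */
  void beginSnapshot() {
    snapshotPermit.acquireUninterruptibly();
  }

  /** Called when the snapshot operation has completed. */
  void endSnapshot() {
    snapshotPermit.release();
  }

  /**
   * Returns the candidates not contained in the referenced set, or an empty list
   * if a snapshot is currently in progress (nothing is cleaned in this round).
   */
  List<String> getUnreferencedFiles(List<String> candidates, Set<String> referenced) {
    List<String> unreferenced = new ArrayList<>();
    if (!snapshotPermit.tryAcquire()) {
      return unreferenced; // snapshot running: skip the check entirely
    }
    try {
      for (String file : candidates) {
        if (!referenced.contains(file)) {
          unreferenced.add(file);
        }
      }
      return unreferenced;
    } finally {
      snapshotPermit.release();
    }
  }
}
```

Under such a guard a later in-progress filter inside the same method would never see a running snapshot, which is one way to read the remark that the code is a bit confusing.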
[jira] [Created] (HBASE-21510) Test TestRegisterPeerWorkerWhenRestarting is flakey
Duo Zhang created HBASE-21510:
---

Summary: Test TestRegisterPeerWorkerWhenRestarting is flakey
Key: HBASE-21510
URL: https://issues.apache.org/jira/browse/HBASE-21510
Project: HBase
Issue Type: Bug
Components: Replication, test
Reporter: Duo Zhang
Fix For: 3.0.0

{noformat}
2018-11-22 16:57:34,066 WARN [Thread-1101] client.HBaseAdmin$ProcedureFuture(3528): failed to get the procedure result procId=26
org.apache.hadoop.hbase.DoNotRetryIOException: Unable to instantiate exception received from server:org.apache.hadoop.hbase.master.HMaster$MasterStoppedException.(java.lang.String)
  at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:93)
  at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:361)
  at org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil.handleRemoteException(ProtobufUtil.java:349)
  at org.apache.hadoop.hbase.client.MasterCallable.call(MasterCallable.java:101)
  at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:107)
  at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3133)
  at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3125)
  at org.apache.hadoop.hbase.client.HBaseAdmin.access$700(HBaseAdmin.java:234)
  at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.getProcedureResult(HBaseAdmin.java:3571)
  at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.waitProcedureResult(HBaseAdmin.java:3523)
  at org.apache.hadoop.hbase.client.HBaseAdmin$ProcedureFuture.get(HBaseAdmin.java:3479)
  at org.apache.hadoop.hbase.client.HBaseAdmin.get(HBaseAdmin.java:2199)
  at org.apache.hadoop.hbase.client.HBaseAdmin.transitReplicationPeerSyncReplicationState(HBaseAdmin.java:4073)
  at org.apache.hadoop.hbase.master.replication.TestRegisterPeerWorkerWhenRestarting$1.run(TestRegisterPeerWorkerWhenRestarting.java:102)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.master.HMaster$MasterStoppedException): org.apache.hadoop.hbase.master.HMaster$MasterStoppedException
  at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3080)
  at org.apache.hadoop.hbase.master.MasterRpcServices.getProcedureResult(MasterRpcServices.java:1181)
  at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
  at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:387)
  at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:95)
  at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:410)
  at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:406)
  at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
  at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
  at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:162)
  at org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:192)
  at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
  at org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
  at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
  at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
  at org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
  at
[jira] [Commented] (HBASE-21509) /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
[ https://issues.apache.org/jira/browse/HBASE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697693#comment-16697693 ]

Allan Yang commented on HBASE-21509:

Do these WALs end with .meta?

> /hbase/WALs directory has large number of hlogs, which cannot be deleted correctly
> --
>
> Key: HBASE-21509
> URL: https://issues.apache.org/jira/browse/HBASE-21509
> Project: HBase
> Issue Type: Bug
> Reporter: Bo Cui
> Priority: Major