[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964867#comment-16964867 ] Hudson commented on HBASE-23181: Results for branch master [build #1523 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1523/]: (/) *{color:green}+1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1523//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1523//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1523//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}

> Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-23181
>                 URL: https://issues.apache.org/jira/browse/HBASE-23181
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 2.2.1
>            Reporter: Michael Stack
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 3.0.0, 2.3.0, 2.1.8, 2.2.3
>
> On a heavily loaded cluster, the WAL count keeps rising and we can get into a state where we are not rolling the logs off fast enough. In particular, there is this interesting state at the extreme where we pick a region to flush because of 'Too many WALs' but the region is actually not online. As the WAL count rises, we keep picking a region-to-flush that is no longer on the server. This condition blocks our being able to clear WALs; eventually WALs climb into the hundreds and the RS goes zombie with a full Call queue that starts throwing CallQueueTooLargeExceptions (bad if this server is the one carrying hbase:meta): i.e. clients fail to access the RegionServer.
> One symptom is a fast spike in WAL count for the RS. A restart of the RS will break the bind.
> Here is how it looks in the log:
> {code}
> # Here is region closing
> 2019-10-16 23:10:55,897 INFO org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler: Closed 8ee433ad59526778c53cc85ed3762d0b
>
> # Then soon after ...
> 2019-10-16 23:11:44,041 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
> 2019-10-16 23:11:45,006 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=45, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b
> ...
> # Later...
> 2019-10-16 23:20:25,427 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Too many WALs; count=542, max=32; forcing flush of 1 regions(s): 8ee433ad59526778c53cc85ed3762d0b
> 2019-10-16 23:20:25,427 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 8ee433ad59526778c53cc85ed3762d0b, because it is not online on us
> {code}
> I've seen these runaway WALs on 2.2.1. I've seen runaway WALs regularly in a 1.2.x version that had the HBASE-16721 fix in it, but can't say yet if it was for the same reason as above.
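The description above boils down to a selection loop in the WAL implementation: when the WAL count crosses the limit, the region pinning the oldest WAL (the one with the lowest unflushed sequence id in SequenceIdAccounting) is asked to flush; if that region has already closed, nothing ever flushes and nothing can be archived. A minimal sketch of that loop, with illustrative names only (this is not the actual AbstractFSWAL/LogRoller code):

{code}
// Illustrative sketch only -- names and structure do not match the real
// AbstractFSWAL/LogRoller/SequenceIdAccounting classes.
import java.util.*;

class TooManyWalsSketch {
  static final int MAX_WALS = 32;
  // Per-region lowest unflushed sequence id (what SequenceIdAccounting tracks).
  static final Map<String, Long> lowestUnflushedSeqIds = new HashMap<>();
  // Regions currently online on this RegionServer.
  static final Set<String> onlineRegions = new HashSet<>();
  static int walCount = 45;

  static void maybeForceFlush() {
    if (walCount <= MAX_WALS || lowestUnflushedSeqIds.isEmpty()) {
      return;
    }
    // The region pinning the oldest WAL is the one with the lowest unflushed sequence id.
    String pinningRegion = Collections.min(
        lowestUnflushedSeqIds.entrySet(), Map.Entry.comparingByValue()).getKey();
    if (!onlineRegions.contains(pinningRegion)) {
      // The stuck state: the closed region's entry never leaves the accounting,
      // so every check re-selects it and no WAL can be archived.
      System.out.println("Failed to schedule flush of " + pinningRegion
          + ", because it is not online on us");
      return;
    }
    requestFlush(pinningRegion); // a successful flush would let the oldest WALs be archived
  }

  static void requestFlush(String encodedRegionName) {
    // placeholder: schedule a memstore flush of the region
  }
}
{code}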
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964595#comment-16964595 ] Hudson commented on HBASE-23181: Results for branch branch-2.1 [build #1696 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1696/]: (/) *{color:green}+1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1696//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1696//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1696//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964521#comment-16964521 ] Hudson commented on HBASE-23181: Results for branch branch-2 [build #2340 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2340/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2340//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2340//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2340//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964510#comment-16964510 ] Hudson commented on HBASE-23181: Results for branch branch-2.2 [build #679 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/679/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/679//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/679//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/679//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960662#comment-16960662 ] Michael Stack commented on HBASE-23181: --- Ran cluster test with heavy load overrunning hbase against branch-2.2 (i.e. w/ this patch). This problem did not show (Plenty of others did but that is another story). Will run more but looking good.
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960626#comment-16960626 ] Hudson commented on HBASE-23181: Results for branch master [build #1518 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1518/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/1518//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1518//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/1518//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960504#comment-16960504 ] Hudson commented on HBASE-23181: Results for branch branch-2.1 [build #1691 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691/]: (/) *{color:green}+1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/1691//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960487#comment-16960487 ] Hudson commented on HBASE-23181: Results for branch branch-2.2 [build #674 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/674/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/674//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/674//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/674//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960479#comment-16960479 ] Hudson commented on HBASE-23181: Results for branch branch-2 [build #2335 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2335/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2335//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2335//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2335//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959387#comment-16959387 ] Michael Stack commented on HBASE-23181: --- My first cut at a patch here, where the fix is a 'one-liner' changing the order in which we do things, I now realize is wrong (as [~zhangduo] speculated). It was making sure the MVCC engine was caught up, when our problem is the order in which appends to the WAL complete (and their reflection in SequenceIdAccounting). [~zhangduo], in the PR you say "Issuing a wal sync while holding the writeLock of updateLock is not good, and we have been doing a lot of works to avoid this..." What are you referring to here? Is it the effort at doing the minimum possible while the locks are held, so as to minimize the time that we are not taking writes? If so, that makes sense. I had been thinking of adding a wait on a sync of the flush-edit append.
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957206#comment-16957206 ] Michael Stack commented on HBASE-23181: --- bq. OK, then this could be the problem, where we release the update lock with pending wal entries in WAL's ringbuffer. Let me look more at your suggestion above [~zhangduo]. Thanks.
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957202#comment-16957202 ] Michael Stack commented on HBASE-23181: --- I should have gone back to HBASE-23157 earlier. We are working in the same area. Yeah, HBASE-23157 would help; it could do as a workaround. It does not fix the root cause of what is going on here. Thanks.
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957194#comment-16957194 ] Michael Stack commented on HBASE-23181: --- Ok. My overnight run failed, so indeed this is not the 'fix'. Again the PURGE complains about a sequenceid that is one less than that of the last flush. Let me look at HBASE-23157 too.
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957189#comment-16957189 ] Duo Zhang commented on HBASE-23181: --- OK, then this could be the problem, where we release the update lock with pending WAL entries in the WAL's ringbuffer. The scenario is more or less the same as with ASYNC_WAL, so I think the patch in HBASE-23157 could solve the problem, where we just remove all the records in SequenceIdAccounting when we hit the close marker. And another problem is how to count the flushed sequence id correctly; now it may go backwards and cause a warning on the master, although there is no data loss issue.
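A minimal sketch of the HBASE-23157-style idea described in the comment above; the class and method names are placeholders, not the real SequenceIdAccounting API:

{code}
// Illustrative sketch of the close-marker idea only; not the real SequenceIdAccounting API.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SequenceIdAccountingSketch {
  // encoded region name -> lowest unflushed sequence id for that region
  private final Map<String, Long> lowestUnflushedSequenceIds = new ConcurrentHashMap<>();

  // Record the first (lowest) unflushed sequence id seen for a region since its last flush.
  void onAppend(String encodedRegionName, long sequenceId) {
    lowestUnflushedSequenceIds.putIfAbsent(encodedRegionName, sequenceId);
  }

  // Normal path: a completed flush clears the entry so older WALs become archivable.
  void onFlushCompleted(String encodedRegionName) {
    lowestUnflushedSequenceIds.remove(encodedRegionName);
  }

  // The idea sketched above: when the region's close marker is processed, drop the
  // accounting outright so a closed region can never pin old WALs again.
  void onCloseMarker(String encodedRegionName) {
    lowestUnflushedSequenceIds.remove(encodedRegionName);
  }
}
{code}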
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957155#comment-16957155 ] Michael Stack commented on HBASE-23181: --- I see 'Failed to get sync result after 30 ms for txid=55133210, WAL system stuck?' a bunch in these logs, yeah. Let me take a look at HBASE-23157 too. Thanks for taking a look Duo.
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956985#comment-16956985 ] Duo Zhang commented on HBASE-23181: --- If we do not use ASYNC_WAL then I do not think this could be the problem. Here we hold the writeLock of updateLock, so if we can run this code, all the updates to the region should have been done, which means there should be no entries in the ringbuffer of the WAL. Do you see any SyncFuture timeout exception, like "Failed to get sync result after..."? If that happens then it could be a problem, as we will release the updateLock before actually waiting for the wal entry to be flushed out, but the default timeout is 5 minutes, which should make a timeout very rare, I think? And I think a more general solution is in HBASE-23157, where we could also solve the problem for ASYNC_WAL, where the mvcc will be completed without waiting for the wal entry to be flushed out.
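For illustration only, a rough sketch of the timing being discussed, using plain java.util.concurrent stand-ins rather than the real HRegion/FSHLog machinery: the writer appends under the read half of the update lock but only waits for the WAL sync after releasing it, so a flush that grabs the write lock cannot assume every queued entry has been synced, in particular once a SyncFuture wait has timed out:

{code}
// Illustrative stand-ins only (plain java.util.concurrent), not the real HRegion/FSHLog code.
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class UpdateLockSyncSketch {
  private final ReentrantReadWriteLock updateLock = new ReentrantReadWriteLock();
  private final ExecutorService walRingBufferConsumer = Executors.newSingleThreadExecutor();

  // Writer path: publish the edit under the read lock, wait for the sync only after
  // releasing it; on timeout the writer gives up while the entry may still be unsynced.
  void append(Runnable writeAndSyncEntry, long syncTimeoutMs) throws InterruptedException {
    Future<?> syncFuture;
    updateLock.readLock().lock();
    try {
      syncFuture = walRingBufferConsumer.submit(writeAndSyncEntry); // queued, not yet synced
    } finally {
      updateLock.readLock().unlock();
    }
    try {
      syncFuture.get(syncTimeoutMs, TimeUnit.MILLISECONDS);
    } catch (ExecutionException | TimeoutException e) {
      // analogue of "Failed to get sync result after ... ms, WAL system stuck?"
    }
  }

  // Flush path: holding the write lock only guarantees no writer is inside the locked
  // section above; it does not guarantee every queued WAL entry has been synced and
  // reflected in the sequence id accounting.
  void flushSnapshot(Runnable takeAccountingSnapshot) {
    updateLock.writeLock().lock();
    try {
      takeAccountingSnapshot.run();
    } finally {
      updateLock.writeLock().unlock();
    }
  }
}
{code}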
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956644#comment-16956644 ] Michael Stack commented on HBASE-23181: --- If this is the problem, branch-1 has it too... for a long time.
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956635#comment-16956635 ] Michael Stack commented on HBASE-23181: --- This is a candidate: {code} --- a/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java +++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java @@ -2606,6 +2606,11 @@ public class HRegion implements HeapSize, PropagatingConfigurationObserver, Regi byte[] encodedRegionName = getRegionInfo().getEncodedNameAsBytes(); try { if (wal != null) { +// Get the next sequenceid. The method will not return until mvcc has caught up to the +// returned flushOpSeqId. This means all edits have been committed up to this point. +// This must be done before we call wal#startCacheFlush else edits may complete AFTER +// the call to startCacheFlush. +flushOpSeqId = getNextSequenceId(wal); Long earliestUnflushedSequenceIdForTheRegion = wal.startCacheFlush(encodedRegionName, flushedFamilyNamesToSeq); if (earliestUnflushedSequenceIdForTheRegion == null) { @@ -2616,7 +2621,6 @@ public class HRegion implements HeapSize, PropagatingConfigurationObserver, Regi new FlushResultImpl(FlushResult.Result.CANNOT_FLUSH, msg, false), myseqid); } -flushOpSeqId = getNextSequenceId(wal); // Back up 1, minus 1 from oldest sequence id in memstore to get last 'flushed' edit flushedSeqId = earliestUnflushedSequenceIdForTheRegion.longValue() == HConstants.NO_SEQNUM? {code} HDFS is usually struggling when this issue hits (loaded cluster). I notice after study that the 'stuck' sequenceid in SequenceIdAccounting puked out when the PURGE code runs (see attached patch) is usually one less than the sequence id used to stamp the last successful flush (This flush is usually the flush that happens on region close). Somehow the 'stuck' sequenceid is getting added to the sequenceidaccounting lowest entries for a region AFTER the flush's sequencied. There is a hole between wal.startCacheFlush where we'll do sequenceid accounting for current memstore flush and the following getNextSequenceId which waits until all sequenceids have 'completed' and then gives out the next highest; supposition is that the flush begins, the last appends complete (and get added to sequenceidaccounting), and then we do the wait on all edits in flight. Testing. > Blocked WAL archive: "LogRoller: Failed to schedule flush of , because it > is not online on us" > -- > > Key: HBASE-23181 > URL: https://issues.apache.org/jira/browse/HBASE-23181 > Project: HBase > Issue Type: Bug >Affects Versions: 2.2.1 >Reporter: Michael Stack >Priority: Major > > On a heavily loaded cluster, WAL count keeps rising and we can get into a > state where we are not rolling the logs off fast enough. In particular, there > is this interesting state at the extreme where we pick a region to flush > because 'Too many WALs' but the region is actually not online. As the WAL > count rises, we keep picking a region-to-flush that is no longer on the > server. This condition blocks our being able to clear WALs; eventually WALs > climb into the hundreds and the RS goes zombie with a full Call queue that > starts throwing CallQueueTooLargeExceptions (bad if this servers is the one > carrying hbase:meta): i.e. clients fail to access the RegionServer. > One symptom is a fast spike in WAL count for the RS. A restart of the RS will > break the bind. 
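To make the suspected race concrete, here is a minimal, illustrative sketch (hypothetical class and method names, echoing but not the real HRegion/FSHLog/SequenceIdAccounting code) of how an append that completes between the accounting snapshot and the wait-for-in-flight-edits step can leave a 'stuck' lowest-unflushed entry behind:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Toy model only. Per-region "lowest unflushed sequence id" accounting decides
 * whether a WAL can be archived; a stale entry pins the oldest WAL forever.
 */
public class FlushOrderingSketch {
  private final AtomicLong nextSeqId = new AtomicLong(1);
  // region encoded name -> lowest unflushed sequence id
  private final ConcurrentMap<String, Long> lowestUnflushed = new ConcurrentHashMap<>();

  /** An append stamps a sequence id and records it as unflushed for its region. */
  long append(String region) {
    long seqId = nextSeqId.getAndIncrement();
    lowestUnflushed.putIfAbsent(region, seqId);
    return seqId;
  }

  /** Models startCacheFlush: snapshot and clear the region's accounting. */
  Long startCacheFlush(String region) {
    return lowestUnflushed.remove(region);
  }

  /**
   * Models getNextSequenceId: the real method does not return until all
   * in-flight edits have completed; this toy just hands out the next id.
   */
  long getNextSequenceId() {
    return nextSeqId.getAndIncrement();
  }

  public static void main(String[] args) {
    FlushOrderingSketch wal = new FlushOrderingSketch();
    String region = "8ee433ad59526778c53cc85ed3762d0b";

    wal.append(region);                           // edit already committed
    Long snapshot = wal.startCacheFlush(region);  // buggy order: snapshot first...
    long straggler = wal.append(region);          // ...a late in-flight edit completes
    long flushOpSeqId = wal.getNextSequenceId();  // ...then we take the flush seqid

    System.out.println("flush snapshot=" + snapshot + ", flushOpSeqId=" + flushOpSeqId
        + ", stale entry left behind=" + wal.lowestUnflushed.get(region)
        + " (one less than flushOpSeqId, like the 'stuck' sequenceid above; straggler=" + straggler + ")");
  }
}
{code}
With the candidate patch's ordering, getNextSequenceId (and its wait on in-flight edits) runs before startCacheFlush, so the straggler's entry is already present when the snapshot removes it and nothing is left behind to block WAL archival.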
[jira] [Commented] (HBASE-23181) Blocked WAL archive: "LogRoller: Failed to schedule flush of XXXX, because it is not online on us"
[ https://issues.apache.org/jira/browse/HBASE-23181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956374#comment-16956374 ] Michael Stack commented on HBASE-23181: --- Studying a live 1.2.x deploy (trying to dig up clues on why we go bad), the flush went bad on a region split; thereafter we could no longer flush the parent, yet we kept trying so we could clear the oldest WAL.
{code}
2019-10-18 20:15:03,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=XYZ,7dd2c3e4-a908-42cd-9dd7-bf20651acdbb,1522152449885.0f4cd0e561a660ea6b11d4703258cd60., new regions: XYZ,7dd2c3e4-a908-42cd-9dd7-bf20651acdbb,1571429702384.b912e2a50897e5ca002107dd472efcfe., XYZ,7dd6df6c-1949-4116-a5ed-008f6d4ae35a,1571429702384.0890b696df81a28a1ba4e9e00e8c43c0.. Split took 0sec
{code}
Then later...
{code}
2019-10-18 20:19:33,622 INFO org.apache.hadoop.hbase.regionserver.wal.FSHLog: Too many WALs; count=33, max=32; forcing flush of 8 regions(s): 4ab5672c35baa260ce3742e403c03d3b, 0890b696df81a28a1ba4e9e00e8c43c0, 89ca70c06e66c0fe8732d7ebfc9e6eff, f7b98bf63c74325a7ee703be48cbd991, 0cba425e6b3caec2af1ff0d52eaffd92, 0f4cd0e561a660ea6b11d4703258cd60, 9758363e8d237b8559385b4f2a2da78d, b912e2a50897e5ca002107dd472efcfe
{code}
but...
{code}
2019-10-18 20:19:33,623 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 0f4cd0e561a660ea6b11d4703258cd60, region=null, requester=null
{code}
And so on. So it seems like split can be the problem in branch-1; i.e. we may not clear the region from the SequenceIdAccounting on split.
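For branch-1, the shape of a defensive guard against this symptom might look like the sketch below (illustrative names only, not the committed fix): when the "Too many WALs" check names a region the server no longer hosts, such as a split parent, drop its stale sequence-id accounting instead of re-requesting a flush that can never run:
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

/**
 * Illustrative-only sketch. The split parent's accounting entry is what keeps
 * the oldest WAL from being archived once the region is no longer online.
 */
public class StaleRegionGuardSketch {
  // regions currently online on this server
  private final Set<String> onlineRegions = new ConcurrentSkipListSet<>();
  // region -> lowest unflushed sequence id, which gates WAL archival
  private final Map<String, Long> lowestUnflushedSeqIds = new ConcurrentHashMap<>();

  void open(String region, long seqId) {
    onlineRegions.add(region);
    lowestUnflushedSeqIds.put(region, seqId);
  }

  /** Region split/close: online set is updated, but accounting is (wrongly) left behind. */
  void closeWithoutCleanup(String region) {
    onlineRegions.remove(region);
  }

  /** Called for each region named by the "Too many WALs" check. */
  boolean requestFlush(String region) {
    if (onlineRegions.contains(region)) {
      return true; // schedule a real flush
    }
    // Region is gone (split parent / moved); without this cleanup the same region
    // is re-picked on every roll and the oldest WAL can never be archived.
    lowestUnflushedSeqIds.remove(region);
    return false;
  }

  public static void main(String[] args) {
    StaleRegionGuardSketch rs = new StaleRegionGuardSketch();
    rs.open("0f4cd0e561a660ea6b11d4703258cd60", 42L);
    rs.closeWithoutCleanup("0f4cd0e561a660ea6b11d4703258cd60");
    System.out.println("flush scheduled? " + rs.requestFlush("0f4cd0e561a660ea6b11d4703258cd60"));
    System.out.println("stale accounting remains? "
        + rs.lowestUnflushedSeqIds.containsKey("0f4cd0e561a660ea6b11d4703258cd60"));
  }
}
{code}
This only guards the symptom; the underlying question in the comment above is why the split parent's entry survives in SequenceIdAccounting at all.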
-- This message was sent by Atlassian Jira (v8.3.4#803005)