[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-26042:
-----------------------------------
    Affects Version/s: 2.4.8

                 Key: HBASE-26042
                 URL: https://issues.apache.org/jira/browse/HBASE-26042
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.3.5, 2.4.8
            Reporter: Michael Stack
            Priority: Major
         Attachments: HBASE-26042-test-repro.patch, debug-dump.txt, hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log, js1, js2

Making note of an issue seen in a production cluster.

The node had been struggling under load for a few days: slow syncs of up to 10 seconds, a few STUCK MVCCs from which it recovered, and some Java pauses up to three seconds long.

Then the below happened:

{code:java}
2021-06-27 13:41:27,604 WARN [AsyncFSWAL-0-hdfs://:8020/hbase] wal.AsyncFSWAL: sync failed
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
{code}

... and the WAL was dead in the water. Scanners started expiring. RPC printed text versions of requests in requestsTooSlow complaints. Then we started to see these:

{code:java}
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 30 ms for txid=552128301, WAL system stuck?
{code}

What's supposed to happen when the other side goes away like this is that we roll the WAL, i.e. go set up a new one. You can see it happening if you run:

{code:java}
mvn test -Dtest=org.apache.hadoop.hbase.regionserver.wal.TestAsyncFSWAL#testBrokenWriter
{code}

I tried hacking that test to repro the above hang by throwing the same exception in it (on Linux, because epoll is needed to repro), but it all just worked.
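For anyone poking at a repro: the contract that test exercises reduces to the pattern below. This is a minimal, self-contained sketch with made-up Writer types (not HBase's real WAL/Writer interfaces); it only illustrates that a failed sync should surface as an exception so the caller can close the broken writer and roll to a fresh one, rather than wedge:

{code:java}
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

// Toy model of the roll-on-sync-failure contract. The Writer interface and
// the two factory methods are hypothetical, for illustration only.
public class RollOnSyncFailureSketch {

  interface Writer {
    CompletableFuture<Long> sync(); // completes exceptionally on I/O error
    void close() throws IOException;
  }

  // A writer whose transport has died: every sync fails immediately.
  static Writer brokenWriter() {
    return new Writer() {
      public CompletableFuture<Long> sync() {
        return CompletableFuture.failedFuture(
            new IOException("readAddress(..) failed: Connection reset by peer"));
      }
      public void close() {}
    };
  }

  // A healthy replacement writer.
  static Writer freshWriter() {
    return new Writer() {
      public CompletableFuture<Long> sync() {
        return CompletableFuture.completedFuture(1L);
      }
      public void close() {}
    };
  }

  public static void main(String[] args) throws IOException {
    Writer writer = brokenWriter();
    try {
      writer.sync().join();     // the failure must surface here...
    } catch (Exception e) {
      writer.close();           // ...so the caller can discard the broken writer
      writer = freshWriter();   // and "roll" to a new one instead of hanging
      System.out.println("rolled after: " + e.getCause());
    }
    writer.sync().join();       // writes proceed on the new writer
  }
}
{code}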
Thread dumps of the hung-up WAL subsystem are a little odd. The log roller is stuck, with no timeout, trying to write out the WAL header:

{code:java}
Thread 9464: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
 - java.util.concurrent.CompletableFuture$Signaller.block() @bci=19, line=1707 (Compiled frame)
 - java.util.concurrent.ForkJoinPool.managedBlock(java.util.concurrent.ForkJoinPool$ManagedBlocker) @bci=119, line=3323 (Compiled frame)
 - java.util.concurrent.CompletableFuture.waitingGet(boolean) @bci=115, line=1742 (Compiled frame)
 - java.util.concurrent.CompletableFuture.get() @bci=11, line=1908 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(java.util.function.Consumer) @bci=16, line=189 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeMagicAndWALHeader(byte[], org.apache.hadoop.hbase.shaded.protobuf.generated.WALProtos$WALHeader) @bci=9, line=202 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.init(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.conf.Configuration, boolean, long) @bci=107, line=170 (Compiled frame)
 - org.apache.hadoop.hbase.wal.AsyncFSWALProvider.createAsyncWriter(org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, long, org.apache.hbase.thirdparty.io.netty.channel.EventLoopGroup, java.lang.Class) @bci=61, line=113 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) @bci=22, line=651 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.createWriterInstance(org.apache.hadoop.fs.Path) @bci=2, line=128 (Compiled frame)
 - org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(boolean) @bci=101, line=797 (Compiled frame)
 - org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(long) @bci=18, line=263 (Compiled frame)
 - org.apache.hadoop.hbase.wal.AbstractWALRoller.run() @bci=198, line=179 (Compiled frame)
{code}

Other threads are BLOCKED trying to append flush markers etc. to the WAL, unable to add to the ring buffer:

{code:java}
Thread 9465: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
 - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
 ...
{code}
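The appenders parked in MultiProducerSequencer.next look like plain backpressure: the disruptor ring buffer has presumably filled because consumption is wedged behind the stuck roll, so producers wait for a slot. The more suspicious frame is in the roller dump, where AsyncProtobufLogWriter.write parks in an untimed CompletableFuture.get(); if the dead channel never completes that future, the roll hangs forever. Below is a JDK-only demo of the difference between an untimed and a bounded wait (the 100 ms bound is an arbitrary illustration, not anything HBase configures):

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Shows why an untimed get() on a future nobody will ever complete parks the
// calling thread forever, while a timed get() surfaces a TimeoutException the
// caller can convert into a failed roll instead of a hang.
public class UntimedGetDemo {
  public static void main(String[] args) throws Exception {
    CompletableFuture<Long> neverCompleted = new CompletableFuture<>();

    try {
      // Bounded wait: gives the caller a chance to bail out and retry/abort.
      // The 100 ms bound is arbitrary, purely for the demo.
      neverCompleted.get(100, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      System.out.println("bounded get timed out; caller can fail the roll");
    }

    // Untimed variant: this is the state Thread 9464 above is parked in.
    // Uncommenting the next line would hang this thread indefinitely.
    // neverCompleted.get();
  }
}
{code}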
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-26042:
-----------------------------------
    Attachment: debug-dump.txt
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-26042:
-----------------------------------
    Attachment: hbase-cvp-regionserver-cvp328.sjc.aristanetworks.com.log
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharath Vissapragada updated HBASE-26042:
-----------------------------------------
    Attachment: HBASE-26042-test-repro.patch
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack updated HBASE-26042:
----------------------------------
    Attachment: js2
                js1
[jira] [Updated] (HBASE-26042) WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
[ https://issues.apache.org/jira/browse/HBASE-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack updated HBASE-26042:
----------------------------------
    Summary: WAL lockup on 'sync failed' org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer  (was: WAL lockup on 'sync failed')