[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810158#comment-17810158 ]

ASF GitHub Bot commented on HBASE-28184:

shahrs87 commented on code in PR #5521:
URL: https://github.com/apache/hbase/pull/5521#discussion_r1464207388

## hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java: ##

@@ -259,10 +259,11 @@ private boolean readNextEntryAndRecordReaderPosition() throws IOException {
     Entry readEntry = reader.next();
     long readerPos = reader.getPosition();
     OptionalLong fileLength;
-    if (logQueue.getQueueSize(walGroupId) > 1) {
+    if (logQueue.getQueueSize(walGroupId) > 2) {

Review Comment: @Apache9 ^^

> Tailing the WAL is very slow if there are multiple peers.
> ----------------------------------------------------------
>
>                 Key: HBASE-28184
>                 URL: https://issues.apache.org/jira/browse/HBASE-28184
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
> Noticed in one of our production clusters which has 4 peers.
> Due to a sudden ingestion of data, the size of the log queue increased to a peak of 506. We have configured the log roll size to 256 MB. Most of the edits in the WAL were from a table for which replication is disabled.
> So all the ReplicationSourceWALReader threads had to do was replay the WAL and NOT replicate the edits. Still it took 12 hours to drain the queue.
> Took a few jstacks and found that ReplicationSourceWALReader was waiting to acquire rollWriterLock [here|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L1231]
> {noformat}
> "regionserver/,1" #1036 daemon prio=5 os_prio=0 tid=0x7f44b374e800 nid=0xbd7f waiting on condition [0x7f37b4d19000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x7f3897a3e150> (a java.util.concurrent.locks.ReentrantLock$FairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202)
>         at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:228)
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getLogFileSizeIfBeingWritten(AbstractFSWAL.java:1102)
>         at org.apache.hadoop.hbase.wal.WALProvider.lambda$null$0(WALProvider.java:128)
>         at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$177/1119730685.apply(Unknown Source)
>         at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>         at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
>         at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
>         at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
>         at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
>         at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>         at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
>         at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>         at java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536)
>         at org.apache.hadoop.hbase.wal.WALProvider.lambda$getWALFileLengthProvider$2(WALProvider.java:129)
>         at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$140/1246380717.getLogFileSizeIfBeingWritten(Unknown Source)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:260)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172)
>         at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157)
> {noformat}
> All the peers will contend for this lock during every batch read.
> Look at the code snippet below. We are guarding this section with rollWriterLock if we are replicating the active WAL file. But in our case we are NOT replicating the active WAL file, yet we still acquire this lock only to return OptionalLong.empty();
> {noformat}
> /**
>  * if the given {@code path} is being written currently, then return its length.
>  * ...
> {noformat}
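For context, the threshold under review decides when the tailer asks the WAL itself for a live file length at all. Below is a minimal sketch of the intent, with hypothetical LogQueue/WalLengthProvider types standing in for the real HBase classes: if the group's queue holds more than one file, the file being read has already been rolled and cannot be the active WAL, so the reader can skip the length-provider call entirely (the call that acquires rollWriterLock inside AbstractFSWAL). The review question is whether that threshold should be 1 or 2.

{code:java}
import java.util.OptionalLong;

// Hedged sketch of the queue-size shortcut discussed in the review comment.
// LogQueue and WalLengthProvider are illustrative, not the real HBase types.
final class WalTailerSketch {

  interface LogQueue {
    int getQueueSize(String walGroupId);
  }

  interface WalLengthProvider {
    // In HBase this lookup takes rollWriterLock inside AbstractFSWAL.
    OptionalLong getLogFileSizeIfBeingWritten(String path);
  }

  static OptionalLong fileLengthForRead(LogQueue queue, WalLengthProvider provider,
      String walGroupId, String currentPath) {
    // More than one WAL queued for this group means the file being read has
    // already been rolled; it cannot be the active WAL, so skip the
    // (lock-acquiring) live-length lookup entirely.
    if (queue.getQueueSize(walGroupId) > 1) {
      return OptionalLong.empty();
    }
    // Only a possibly-active file pays the cost of asking the WAL itself.
    return provider.getLogFileSizeIfBeingWritten(currentPath);
  }
}
{code}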
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784216#comment-17784216 ]

Rushabh Shah commented on HBASE-28184:
--------------------------------------

There is one test failure, TestBasicWALEntryStreamFSHLog#testCleanClosedWALs, in branch-2.5 (and, I assume, in branch-2.4 as well). I don't know why I didn't see it in the Jenkins build report for the branch-2.5 PR. It fails with this error:
{noformat}
java.lang.AssertionError:
Expected :0
Actual   :1
{noformat}
I can see the following exception:
{noformat}
2023-11-08T13:50:40,578 DEBUG [Listener at localhost/50204] wal.ProtobufLogReader(442): EOF at position 210
2023-11-08T13:50:40,791 DEBUG [Listener at localhost/50204] wal.ProtobufLogReader(447): Encountered a malformed edit, seeking back to last good position in file, from 218 to 210
java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, currentPosition=218
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) ~[classes/:?]
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) ~[classes/:?]
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) ~[classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) ~[test-classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) ~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) ~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) ~[test-classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) ~[test-classes/:?]
        at org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:726) ~[test-classes/:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) ~[junit-4.13.2.jar:4.13.2]
        at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) ~[junit-4.13.2.jar:4.13.2]
{noformat}
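The "seeking back to last good position" lines above are the tailing reader's normal recovery path for a half-written trailing entry, which is what this test ends up exercising. A rough sketch of that pattern follows; WalStream, Entry, and decodeEntry are illustrative stand-ins, not the ProtobufLogReader API. The idea: record the stream position before decoding, and on an EOF mid-entry seek back to that boundary so the same entry can be retried after the writer flushes more bytes.

{code:java}
import java.io.EOFException;
import java.io.IOException;

// Illustrative sketch of the seek-back-on-partial-read pattern from the log
// above; not the real HBase reader implementation.
final class TailingReaderSketch {

  interface WalStream {
    long position() throws IOException;
    void seek(long pos) throws IOException;
  }

  interface Entry {}

  /** Decode one entry, or return null if the stream ends mid-entry. */
  static Entry readNext(WalStream stream) throws IOException {
    long originalPosition = stream.position(); // last known-good entry boundary
    try {
      return decodeEntry(stream);
    } catch (EOFException e) {
      // A half-written trailing entry: rewind to the boundary so the same
      // entry can be re-read once the writer has appended the rest of it.
      stream.seek(originalPosition);
      return null;
    }
  }

  static Entry decodeEntry(WalStream stream) throws IOException {
    // Placeholder: a real decoder parses one protobuf-framed WAL entry and
    // throws EOFException when the trailing entry is incomplete.
    throw new EOFException("Invalid PB, EOF?");
  }
}
{code}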
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784199#comment-17784199 ]

Hudson commented on HBASE-28184:
--------------------------------

Results for branch branch-2.5 [build #431 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/]: (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784194#comment-17784194 ]

Hudson commented on HBASE-28184:
--------------------------------

Results for branch branch-3 [build #80 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/]: (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784173#comment-17784173 ]

Hudson commented on HBASE-28184:
--------------------------------

Results for branch master [build #940 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783899#comment-17783899 ]

Hudson commented on HBASE-28184:
--------------------------------

Results for branch branch-2 [build #920 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783883#comment-17783883 ]

Hudson commented on HBASE-28184:
--------------------------------

Results for branch branch-2.4 [build #648 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/]: (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/General_20Nightly_20Build_20Report/]
(/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783777#comment-17783777 ]

Rushabh Shah commented on HBASE-28184:
--------------------------------------

Committed the change to branch-2.4, branch-2.5, branch-2, branch-3 and master. Thank you [~zhangduo] [~vjasani] [~Xiaolin Ha] for the reviews!
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782690#comment-17782690 ]

Rushabh Shah commented on HBASE-28184:
--------------------------------------

Thank you [~zhangduo] for your reply. I can see the disk busy at 100% for the first 6-7 hours, but I don't see any read/write load for the next 5 hours.
[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.
[ https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781941#comment-17781941 ]

Duo Zhang commented on HBASE-28184:
-----------------------------------

This could be an optimization, but I wonder whether this is the real reason why the replication is so slow. As acquiring a lock is just an in-memory operation, it will return quickly unless someone holds it for a long time and blocks all the others.

Have you seen the iostats of your cluster? More peers will cause more IO on the HDFS: for the same file, each peer will read it once, so 4 peers means 4 times the IO.
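As a rough back-of-envelope on that multiplier, using the numbers from the issue description and assuming every queued WAL reached the configured roll size: 506 WALs x 256 MB is about 126 GB of WAL data per peer, and with 4 peers each reading every file once, draining the queue implies roughly 506 GB of reads from HDFS.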