[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2024-01-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810158#comment-17810158
 ] 

ASF GitHub Bot commented on HBASE-28184:


shahrs87 commented on code in PR #5521:
URL: https://github.com/apache/hbase/pull/5521#discussion_r1464207388


##
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java:
##
@@ -259,10 +259,11 @@ private boolean readNextEntryAndRecordReaderPosition() throws IOException {
 Entry readEntry = reader.next();
 long readerPos = reader.getPosition();
 OptionalLong fileLength;
-if (logQueue.getQueueSize(walGroupId) > 1) {
+if (logQueue.getQueueSize(walGroupId) > 2) {

Review Comment:
   @Apache9  ^^
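
As a rough, self-contained illustration of why the queue-size check exists at all (all names below are hypothetical stand-ins, not the real WALEntryStream or WALFileLengthProvider classes): when more than one WAL is queued for a group, the file currently being read has already been rolled, so its length is final and there is no need to consult the length provider, which in the real code goes through AbstractFSWAL's rollWriterLock. The diff above only moves the threshold of that check.

{noformat}
import java.util.OptionalLong;

// Hypothetical sketch, not the real HBase code.
public class QueueSizeCheckSketch {

  /** Stand-in for the provider that can report the length of the WAL still being written. */
  interface LengthProvider {
    OptionalLong getLogFileSizeIfBeingWritten(String path);
  }

  static OptionalLong fileLengthForBatch(int queueSizeForGroup, String currentPath,
      LengthProvider provider) {
    if (queueSizeForGroup > 1) {
      // More WALs are queued behind this one, so it is a rolled (closed) file:
      // skip the provider and the writer lock it would otherwise take.
      return OptionalLong.empty();
    }
    // Possibly the live WAL: ask the provider for the length of the file being written.
    return provider.getLogFileSizeIfBeingWritten(currentPath);
  }

  public static void main(String[] args) {
    LengthProvider provider = path -> OptionalLong.of(128L); // pretend the file is still open
    System.out.println(fileLengthForBatch(3, "wal.1", provider)); // prints OptionalLong.empty
    System.out.println(fileLengthForBatch(1, "wal.3", provider)); // prints OptionalLong[128]
  }
}
{noformat}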





> Tailing the WAL is very slow if there are multiple peers.
> ----------------------------------------------------------
>
> Key: HBASE-28184
> URL: https://issues.apache.org/jira/browse/HBASE-28184
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>
> Noticed in one of our production clusters which has 4 peers.
> Due to a sudden ingestion of data, the size of the log queue increased to a peak of 506. We have configured the log roll size to 256 MB. Most of the edits in the WAL were from a table for which replication is disabled.
> So all the ReplicationSourceWALReader threads had to do was replay the WAL and NOT replicate the edits. Still, it took 12 hours to drain the queue.
> Took a few jstacks and found that ReplicationSourceWALReader was waiting to acquire rollWriterLock [here|https://github.com/apache/hbase/blob/branch-2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java#L1231]
> {noformat}
> "regionserver/,1" #1036 daemon prio=5 os_prio=0 tid=0x7f44b374e800 nid=0xbd7f waiting on condition [0x7f37b4d19000]
>    java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x7f3897a3e150> (a java.util.concurrent.locks.ReentrantLock$FairSync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:837)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:872)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1202)
>   at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:228)
>   at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
>   at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.getLogFileSizeIfBeingWritten(AbstractFSWAL.java:1102)
>   at org.apache.hadoop.hbase.wal.WALProvider.lambda$null$0(WALProvider.java:128)
>   at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$177/1119730685.apply(Unknown Source)
>   at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
>   at java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1361)
>   at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
>   at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486)
>   at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>   at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at java.util.stream.ReferencePipeline.findAny(ReferencePipeline.java:536)
>   at org.apache.hadoop.hbase.wal.WALProvider.lambda$getWALFileLengthProvider$2(WALProvider.java:129)
>   at org.apache.hadoop.hbase.wal.WALProvider$$Lambda$140/1246380717.getLogFileSizeIfBeingWritten(Unknown Source)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:260)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172)
>   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:222)
>   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:157)
> {noformat}
> All the peers will contend for this lock during every batch read.
> Look at the code snippet below. We guard this section with rollWriterLock when we are replicating the active WAL file. But in our case we are NOT replicating the active WAL file, yet we still acquire this lock only to return OptionalLong.empty().
> {noformat}
>   /**
>    * if the given {@code path} is being written currently, then return its length.
>    *
>    * This 
> {noformat}
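
To make the contention described above concrete, here is a minimal sketch of the lock-then-check pattern the report points at (hypothetical class and field names, not the actual AbstractFSWAL implementation): the fair writer lock is taken before we even know whether the requested path is the WAL currently being written, so readers of already-rolled WALs pay for the lock only to get OptionalLong.empty() back, and with several peers every ReplicationSourceWALReader thread queues up on the same lock for every batch.

{noformat}
import java.util.OptionalLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the pattern described above, not the real AbstractFSWAL.
class WalLengthLookupSketch {
  private final ReentrantLock rollWriterLock = new ReentrantLock(true); // fair lock, as in the stack trace
  private volatile String currentWalPath = "wal.7";
  private volatile long currentWalLength = 42L;

  OptionalLong getLogFileSizeIfBeingWritten(String path) {
    rollWriterLock.lock(); // every replication reader contends here on every batch...
    try {
      if (!path.equals(currentWalPath)) {
        return OptionalLong.empty(); // ...even when the answer never needed the lock
      }
      return OptionalLong.of(currentWalLength);
    } finally {
      rollWriterLock.unlock();
    }
  }
}
{noformat}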

[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-08 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784216#comment-17784216
 ] 

Rushabh Shah commented on HBASE-28184:
--

There is one test failure: TestBasicWALEntryStreamFSHLog#testCleanClosedWALs in 
branch-2.5 (and, I assume, in branch-2.4 as well). I don't know why I didn't see 
it in the Jenkins build report for the branch-2.5 PR.

It fails with this error:

{noformat}
java.lang.AssertionError: 
Expected :0
Actual   :1

{noformat}

I can see the following exception:
{noformat}
2023-11-08T13:50:40,578 DEBUG [Listener at localhost/50204] wal.ProtobufLogReader(442): EOF at position 210
2023-11-08T13:50:40,791 DEBUG [Listener at localhost/50204] wal.ProtobufLogReader(447): Encountered a malformed edit, seeking back to last good position in file, from 218 to 210
java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, currentPosition=218
  at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) ~[classes/:?]
  at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) ~[classes/:?]
  at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) ~[classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) ~[classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) ~[classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) ~[classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) ~[classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) ~[test-classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) ~[test-classes/:?]
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) ~[test-classes/:?]
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) ~[test-classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) ~[test-classes/:?]
  at org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:726) ~[test-classes/:?]
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_292]
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_292]
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_292]
  at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) ~[junit-4.13.2.jar:4.13.2]
  at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) ~[junit-4.13.2.jar:4.13.2]
{noformat}
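
The DEBUG lines above show the reader hitting a partial protobuf entry at the tail of the WAL and seeking back from offset 218 to the last good position 210. As a rough illustration of that retry pattern only (the EntryReader interface below is a hypothetical stand-in, not the real ProtobufLogReader API):

{noformat}
import java.io.EOFException;
import java.io.IOException;
import java.util.Optional;

// Hypothetical sketch of "seek back to last good position", not the real reader.
public class TailReadSketch {

  interface EntryReader {
    long getPosition() throws IOException;
    void seek(long position) throws IOException;
    /** Decodes the next entry; throws EOFException if the entry is incomplete. */
    String next() throws IOException;
  }

  static Optional<String> tryReadNext(EntryReader reader) throws IOException {
    long lastGoodPosition = reader.getPosition();
    try {
      return Optional.of(reader.next());
    } catch (EOFException partialEntry) {
      // The writer may still be appending this entry: rewind so a later attempt
      // re-reads it from the last fully decoded offset instead of a torn one.
      reader.seek(lastGoodPosition);
      return Optional.empty();
    }
  }
}
{noformat}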

[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784199#comment-17784199
 ] 

Hudson commented on HBASE-28184:


Results for branch branch-2.5
[build #431 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.5/431/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}



[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784194#comment-17784194
 ] 

Hudson commented on HBASE-28184:


Results for branch branch-3
[build #80 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/General_20Nightly_20Build_20Report/]




(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-3/80/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}



[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-08 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784173#comment-17784173
 ] 

Hudson commented on HBASE-28184:


Results for branch master
[build #940 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/General_20Nightly_20Build_20Report/]




(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/940/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}



[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-07 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783899#comment-17783899
 ] 

Hudson commented on HBASE-28184:


Results for branch branch-2
[build #920 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/920/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}



[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-07 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783883#comment-17783883
 ] 

Hudson commented on HBASE-28184:


Results for branch branch-2.4
[build #648 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/]:
 (/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/General_20Nightly_20Build_20Report/]


(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/648/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}



[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-07 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783777#comment-17783777
 ] 

Rushabh Shah commented on HBASE-28184:
--

Committed the change to branch-2.4, branch-2.5, branch-2, branch-3 and master.

Thank you [~zhangduo], [~vjasani] and [~Xiaolin Ha] for the reviews!


[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-03 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782690#comment-17782690
 ] 

Rushabh Shah commented on HBASE-28184:
--

Thank you [~zhangduo]  for your reply.

I can see the disks busy at 100% for the first 6-7 hours, but I don't see any 
read/write load for the next 5 hours.


[jira] [Commented] (HBASE-28184) Tailing the WAL is very slow if there are multiple peers.

2023-11-01 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781941#comment-17781941
 ] 

Duo Zhang commented on HBASE-28184:
---

This could be an optimization, but I wonder whether it is the real reason why the 
replication is so slow. Acquiring a lock is just an in-memory operation; it returns 
quickly unless someone holds it for a long time and blocks all the others.

Have you looked at the iostats of your cluster? More peers cause more IO on HDFS: 
each peer reads the same file once, so 4 peers means 4 times the IO.
