[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-10-13 Thread Karthick (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950677#comment-16950677
 ] 

Karthick commented on HBASE-22784:
--

[~wchevreuil] I've opened the new Jira 
[here|https://jira.apache.org/jira/browse/HBASE-23169]

> OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 
> clusters)
> -
>
> Key: HBASE-22784
> URL: https://issues.apache.org/jira/browse/HBASE-22784
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver, Replication
>Affects Versions: 1.4.9, 1.4.10
>Reporter: Solvannan R M
>Assignee: Wellington Chevreuil
>Priority: Blocker
> Fix For: 1.5.0, 1.4.11
>
> Attachments: HBASE-22784.branch-1.001.patch, 
> HBASE-22784.branch-1.002.patch, HBASE-22784.branch-1.003.patch, 
> HBASE-22784.branch-1.004.patch
>
>
> When a cluster is passive (receiving edits only via replication) in a cyclic 
> replication setup of 2 clusters, the OldWALs size keeps on growing. On 
> analysing, we observed the following behaviour.
>  # A new entry is added to the WAL (an edit replicated from the other cluster).
>  # ReplicationSourceWALReaderThread (RSWALRT) reads it and applies the 
> configured filters (due to the cyclic replication setup, 
> ClusterMarkingEntryFilter discards the entry from the other cluster).
>  # The entry is null, so RSWALRT neither updates the batch stats 
> (WALEntryBatch.lastWalPosition) nor puts anything in the entryBatchQueue.
>  # The ReplicationSource thread is blocked in entryBatchQueue.take().
>  # So ReplicationSource#updateLogPosition is never invoked and the WAL file is 
> never cleared from the ReplicationQueue.
>  # Hence the LogCleaner on the master doesn't delete the oldWAL files from 
> Hadoop.
> NOTE: When a new edit is added via hbase-client, the ReplicationSource thread 
> processes it and clears the oldWAL files from the replication queues, and 
> hence the master cleans up the WALs.
> Please provide us a solution.
>  
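To make the stall concrete, here is a minimal sketch of the reader loop described above, loosely modeled on branch-1's ReplicationSourceWALReaderThread. The names follow the report, but this is an illustrative fragment (assuming the thread's usual fields such as entryStream, batch and entryBatchQueue), not the actual HBase source.

{code:java}
// Illustrative sketch only, not actual HBase code.
while (isReaderRunning()) {
  WAL.Entry entry = entryStream.next();
  // In a passive cluster every edit originated on the peer cluster, so
  // ClusterMarkingEntryFilter returns null for all of them.
  entry = filterEntry(entry);
  if (entry != null) {
    batch.addEntry(entry);
    batch.setLastWalPosition(entryStream.getPosition());
    entryBatchQueue.put(batch); // shipper wakes up and later updates the ZK position
  }
  // When entry is null, nothing is enqueued: the shipper stays blocked in
  // entryBatchQueue.take(), updateLogPosition() never runs, the replication
  // queue keeps referencing the WAL file, and the master's LogCleaner can
  // never delete it from the oldWALs directory.
}
{code}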





[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-10-11 Thread Wellington Chevreuil (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949361#comment-16949361
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

[~KarthickRam], can you open a new Jira and upload logs from one of the failed 
RSes, together with a dump of the replication queues output, if possible from 
before and after the crash?



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-10-10 Thread Karthick (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949119#comment-16949119
 ] 

Karthick commented on HBASE-22784:
--

[~wchevreuil] we applied the patch on hbase-1.4.10 and noticed random region 
server aborts caused by ReplicationQueuesZKImpl#setLogPosition() in 
ReplicationSourceShipperThread.

 
{quote}2019-10-05 08:17:28,132 FATAL 
[regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
 regionserver.HRegionServer: ABORTING region server 
172.20.20.20,16020,1570193969775: Failed to write replication wal position 
(filename=172.20.20.20%2C16020%2C1570193969775.1570288637045, 
position=127494739)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode 
for 
/hbase/replication/rs/172.20.20.20,16020,1570193969775/2/172.20.20.20%2C16020%2C1570193969775.1570288637045
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
 at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1327)
 at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:422)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:824)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:874)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:868)
 at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.setLogPosition(ReplicationQueuesZKImpl.java:155)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:194)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:727)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:698)
 at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:551)
2019-10-05 08:17:28,133 FATAL 
[regionserver//172.20.20.20:16020.replicationSource.172.20.20.20%2C16020%2C1570193969775,2]
 regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: 
[org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint
{quote}
Please provide us a solution for this, and let us know if you need more logs.
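For illustration only: the NoNodeException suggests the WAL's position znode was already removed before the shipper thread tried to record its position. A defensive guard along the following lines could tolerate that race. This is an assumption sketched against branch-1's ZKUtil helpers, not the fix that was eventually committed.

{code:java}
// Hypothetical sketch, NOT the committed fix: skip the position update when
// the WAL's znode is already gone, instead of aborting the region server.
// ZKUtil.checkExists/setData/positionToByteArray exist in branch-1, but wiring
// them this way inside ReplicationQueuesZKImpl#setLogPosition is an assumption.
public void setLogPosition(String queueId, String filename, long position) {
  String znode = ZKUtil.joinZNode(ZKUtil.joinZNode(this.myQueuesZnode, queueId), filename);
  try {
    if (ZKUtil.checkExists(this.zookeeper, znode) == -1) {
      // Another thread already cleaned this WAL from the queue; nothing to record.
      LOG.warn("Replication position znode " + znode + " already removed, skipping update");
      return;
    }
    ZKUtil.setData(this.zookeeper, znode, ZKUtil.positionToByteArray(position));
  } catch (KeeperException e) {
    this.abortable.abort("Failed to write replication wal position (filename=" + filename
        + ", position=" + position + ")", e);
  }
}
{code}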



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-30 Thread Wellington Chevreuil (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919402#comment-16919402
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

[~solvannan], we don't do patched releases of Apache HBase. If you are running 
a commercial distribution (e.g. CDH or HDP), you would need to request such a 
release from the related vendor. If you are running the Apache HBase 
distribution, then you will need to wait for 1.4.11, or, since this patch has 
been committed to branch-1.4, you can check out that branch and build it on 
your own.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-30 Thread Solvannan R M (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919332#comment-16919332
 ] 

Solvannan R M commented on HBASE-22784:
---

[~wchevreuil] Thanks for the quick fix. Currently, all our clusters are running 
HBase 1.4.9. Would it be possible to provide a patch for 1.4.9 until the stable 
release of 1.4.11 or 1.5.0?



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906946#comment-16906946
 ] 

Hudson commented on HBASE-22784:


Results for branch branch-1
[build #1007 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1007/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1007//General_Nightly_Build_Report/]


(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1007//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/1007//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.




[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-13 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906925#comment-16906925
 ] 

Hudson commented on HBASE-22784:


Results for branch branch-1.4
[build #956 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/956/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/956//General_Nightly_Build_Report/]


(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/956//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/956//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.




[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-13 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906739#comment-16906739
 ] 

Andrew Purtell commented on HBASE-22784:


Backport to branch-1.4 had minor conflicts. Just waiting on tests to commit. 



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-13 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906733#comment-16906733
 ] 

Andrew Purtell commented on HBASE-22784:


Let me commit. 
We also have the related HBASE-22380 as a blocker for 1.5.0 now.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-12 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905028#comment-16905028
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Tested branch-2, which does not seem affected by this problem.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-09 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904012#comment-16904012
 ] 

Andrew Purtell commented on HBASE-22784:


Plenty of time. I have been waiting on HBASE-22728, and it may only be landing 
today. Will check this out later today, hopefully.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903946#comment-16903946
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

I believe the TestMasterBalanceThrottling failure is not related; I have it 
passing locally.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-09 Thread HBase QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903890#comment-16903890
 ] 

HBase QA commented on HBASE-22784:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 34m 
30s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 
39s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
46s{color} | {color:green} branch-1 passed {color} |
| {color:red}-1{color} | {color:red} shadedjars {color} | {color:red}  0m 
15s{color} | {color:red} branch has 10 errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
46s{color} | {color:green} branch-1 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
45s{color} | {color:green} hbase-server: The patch generated 0 new + 23 
unchanged - 1 fixed = 23 total (was 24) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedjars {color} | {color:red}  0m 
15s{color} | {color:red} patch has 10 errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
5m 42s{color} | {color:green} Patch does not cause any errors with Hadoop 2.8.5 
2.9.2. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}137m 45s{color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}200m 22s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.master.TestMasterBalanceThrottling |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 base: 
https://builds.apache.org/job/PreCommit-HBASE-Build/745/artifact/patchprocess/Dockerfile
 |
| JIRA Issue | HBASE-22784 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12977134/HBASE-22784.branch-1.004.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux bb19c89e7705 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 
18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-1 / e7114f7 |
| maven | version: Apache Maven 3.0.5 |
| Default Java | 1.8.0_222 |
| shadedjars | 
https://builds.apache.org/job/PreCommit-HBASE-Build/745/artifact/patchprocess/branch-shadedjars.txt
 |
| shadedjars | 
https://builds.apache.org/job/PreCommit-HBASE-Build/745/artifact/patchprocess/patch-shadedjars.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HBASE-Build/745/artifact/patchprocess/patch-unit-hbase-server.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/745/testReport/ |
| Ma

[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903781#comment-16903781
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Fourth patch, addressing the checkstyle issues. Sorry for not having noticed 
those checkstyle issues before, [~apurtell]. I hope there's still time for this 
in your release.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903710#comment-16903710
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Hi [~solvannan],

Thanks for your comments. Yeah, please consider only the latest (third) patch; 
the first two still have some issues. The third patch has some checkstyle 
issues, which I will address shortly, but it is functionally correct.

 
{quote}If the WAL reader thread does not have any entry batch (after passing 
through all the filters) after some configured time threshold, it can queue an 
empty batch, with the last read log position, to the entryBatchQueue. Now the 
ReplicationSourceShipperThread will read this empty batch, update its position 
and invoke the cleanup logic.
{quote}

I think this would also work. I'm not so fond of adding/creating extra objects 
to work around this lack of communication between the two threads. There would 
still be a need for similar checks on whether a new _empty_ batch is really 
needed, e.g. if the shipper thread is stuck delivering a given entry because 
the target cluster is down. Because of its sequential nature, the reader will 
keep trying on this current entry. Meanwhile, if the reader thread relies only 
on timeouts, it may end up creating and enqueuing unneeded _empty_ batches.

 



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-09 Thread Solvannan R M (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903673#comment-16903673
 ] 

Solvannan R M commented on HBASE-22784:
---

Hi [~wchevreuil],

Thanks for the patch! We will set up a test cluster and get back to you after 
trying it out.

Also, we were analysing the patches provided. We see that 
{{logPositionAndCleanOldLogs}} is called from both 
ReplicationSourceWALReaderThread and ReplicationSourceShipperThread, and the 
shipment state is maintained by both threads, whereas originally it was handled 
only by the ReplicationSourceShipperThread, avoiding this state-maintenance 
overhead in two places. We had been exploring the possibility of periodically 
sending an empty batch to the shipper thread, which would handle the log 
position update and cleanup logic organically. The flow being:

If the WAL reader thread does not have any entry batch (after passing through 
all the filters) after some configured time threshold, it can queue an empty 
batch, with the last read log position, to the entryBatchQueue. Now the 
ReplicationSourceShipperThread will read this empty batch, update its position 
and invoke the cleanup logic. 

Please let us know if this logic would lead to any inconsistencies. 
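A rough sketch of that proposed flow, purely for illustration. The field and method names (readWALEntries, emptyBatchIntervalMs, the WALEntryBatch calls) are hypothetical, patterned on branch-1's reader thread; this fragment assumes the thread's surrounding fields and is not the approach the committed patch took.

{code:java}
// Hypothetical sketch of the proposal above, not actual HBase code.
long lastEnqueueTime = System.currentTimeMillis();
while (isReaderRunning()) {
  WALEntryBatch batch = readWALEntries(entryStream); // applies the filter chain
  if (batch != null && batch.getNbEntries() > 0) {
    entryBatchQueue.put(batch);            // normal path: shipper updates position
    lastEnqueueTime = System.currentTimeMillis();
  } else if (System.currentTimeMillis() - lastEnqueueTime > emptyBatchIntervalMs) {
    // Everything was filtered out for the configured interval: enqueue an
    // EMPTY batch carrying only the last read position, so the shipper wakes
    // from take(), updates the position and triggers oldWALs cleanup.
    WALEntryBatch empty = new WALEntryBatch(0, entryStream.getCurrentPath());
    empty.setLastWalPosition(entryStream.getPosition());
    entryBatchQueue.put(empty);
    lastEnqueueTime = System.currentTimeMillis();
  }
}
{code}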

 



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-08 Thread HBase QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903499#comment-16903499
 ] 

HBase QA commented on HBASE-22784:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
37s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 
22s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} branch-1 passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} branch-1 passed with JDK v1.7.0_232 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
17s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
46s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
33s{color} | {color:green} branch-1 passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
34s{color} | {color:green} branch-1 passed with JDK v1.7.0_232 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed with JDK v1.7.0_232 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
15s{color} | {color:red} hbase-server: The patch generated 4 new + 23 unchanged 
- 1 fixed = 27 total (was 24) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  2m 
35s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  
4m 41s{color} | {color:green} Patch does not cause any errors with Hadoop 2.8.5 
2.9.2. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed with JDK v1.8.0_222 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed with JDK v1.7.0_232 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}109m 
30s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}141m  8s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 base: 
https://builds.apache.org/job/PreCommit-HBASE-Build/743/artifact/patchprocess/Dockerfile
 |
| JIRA Issue | HBASE-22784 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12977085/HBASE-22784.branch-1.003.patch
 |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 12a959d

[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-08 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903419#comment-16903419
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Third patch, fixing the problem that caused the previous build's test failures. 
The issue was that the tests were bulkloading edits targeted for replication at 
T1, then also bulkloading entries not targeted for replication at T2, with the 
target cluster down. Because of my changes, we were always updating the log 
position whenever a not-to-be-replicated entry came in. In this case, we had 
advanced the current log position to the entries from T2, but the entries from 
T1 hadn't been replicated yet, because the destination cluster was down. Then, 
if the source and target clusters are restarted, the entries from T1 will never 
get replicated, because the log position is now past those entries. 

The latest patch adds an extra check to advance the log position only if 
there's no entry currently pending replication. 
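A minimal sketch of the kind of check described, with hypothetical names; the real change is in the attached HBASE-22784.branch-1.004.patch and may be wired differently.

{code:java}
// Hypothetical illustration of the extra check above, not the actual patch.
private void onBatchFullyFiltered(long lastReadPosition) {
  // Only advance the recorded log position when no previously read batch is
  // still pending shipment; otherwise a restart could skip unshipped edits
  // (the T1/T2 bulkload scenario that broke the earlier patches).
  if (entryBatchQueue.isEmpty()) {
    updateLogPosition(lastReadPosition);
  }
}
{code}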



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-08 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903378#comment-16903378
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Yep, the failures are due to my changes. For some reason, the second bulk of 
inserted rows is getting filtered by _ScopeWALEntryFilter_, and we are now 
advancing the log position for filtered edits. Trying to figure out why these 
edits are getting filtered on the second bulk only, but not the first.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-08 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903299#comment-16903299
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Checking on those failures. Will confirm soon.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-08 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903269#comment-16903269
 ] 

Andrew Purtell commented on HBASE-22784:


Thanks for the patch!

Precommit failures are all replication unit tests. Valid? 

Hopefully I'll get a chance to test this today, in a similar scenario. 



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-08 Thread HBase QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903230#comment-16903230
 ] 

HBase QA commented on HBASE-22784:
--

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 45s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 1s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || branch-1 Compile Tests ||
| +1 | mvninstall | 9m 14s | branch-1 passed |
| +1 | compile | 0m 44s | branch-1 passed with JDK v1.8.0_222 |
| +1 | compile | 0m 41s | branch-1 passed with JDK v1.7.0_232 |
| +1 | checkstyle | 1m 27s | branch-1 passed |
| +1 | shadedjars | 2m 49s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 38s | branch-1 passed with JDK v1.8.0_222 |
| +1 | javadoc | 0m 39s | branch-1 passed with JDK v1.7.0_232 |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 52s | the patch passed |
| +1 | compile | 0m 35s | the patch passed with JDK v1.8.0_222 |
| +1 | javac | 0m 35s | the patch passed |
| +1 | compile | 0m 40s | the patch passed with JDK v1.7.0_232 |
| +1 | javac | 0m 40s | the patch passed |
| -1 | checkstyle | 1m 20s | hbase-server: The patch generated 4 new + 21 unchanged - 0 fixed = 25 total (was 21) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 2m 46s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 4m 48s | Patch does not cause any errors with Hadoop 2.8.5 2.9.2. |
| +1 | javadoc | 0m 29s | the patch passed with JDK v1.8.0_222 |
| +1 | javadoc | 0m 38s | the patch passed with JDK v1.7.0_232 |
|| || || || Other Tests ||
| -1 | unit | 111m 14s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 29s | The patch does not generate ASF License warnings. |
| | | 144m 9s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.replication.TestReplicationSyncUpToolWithBulkLoadedData |
| | hadoop.hbase.replication.TestReplicationSyncUpTool |
| | hadoop.hbase.replication.multiwal.TestReplicationSyncUpToolWithMultipleWAL |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 base: https://builds.apache.org/job/PreCommit-HBASE-Build/737/artifact/patchprocess/Dockerfile |
| JIRA Issue | HBASE-22784 |
| JIRA Pat

[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-08 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902888#comment-16902888
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Thanks for the heads up, [~apurtell]. I've attached an initial proposal patch 
for branch-1, with a simple fix and an additional UT. I'm going through further 
manual tests for this fix. If other folks want to give this patch a try too, 
that would be great. [~solvannan], would you have a staging/test cluster where 
this could be tested as well?



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-07 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902388#comment-16902388
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Yep, [~solvannan]'s analysis makes sense from what we can see in the logs/jstack. 
It seems this was introduced by the refactorings from HBASE-15995. As 
[~anoop.hbase] mentioned, even if we find nothing to replicate, we should still 
advance the reading position in the WAL. When we do 
[entryStream.hasNext|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L126], 
which ends up in 
[WALEntryStream.readNextAndSetPosition|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/WALEntryStream.java#L280], 
we update the position at the _WALEntryStream_ instance only. We then rely on 
_WALEntryStream.next_ 
[returning a WAL.Entry|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L130] 
that is not filtered out, so that the reader can [set the stream position into 
the WALEntryBatch instance to be 
queued|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L137]. 
Then, as [~solvannan] originally pointed out, we only update the log position if 
we get something back from the queue and call 
[shipEdits|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L551], 
which finally 
[updates the log position|https://github.com/apache/hbase/blob/branch-1.4/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L638].

Prior to HBASE-15995, reading and shipping were done in the same thread. We can 
see it used to properly [set the log position even if no entries were found for 
replication|https://github.com/apache/hbase/commit/3cf4433260b60a0e0455090628cf60a9d5a180f3?diff=split#diff-3ac91d43acf51f23f0ffd8b0e5d2e649L711].

Let me check which branches are affected by this issue, then I'll work on a patch.
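
To make the failure mode concrete, here is a minimal, self-contained sketch of 
the pattern involved. This is plain Java with made-up names (Batch, readBuggy, 
readFixed), not the actual HBase classes; it only models the reader-side 
position bookkeeping for a WAL in which every edit gets filtered out:
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Toy model of the WAL reader's position bookkeeping; not actual HBase code. */
public class FilteredBatchPositionDemo {

  /** Stand-in for WALEntryBatch: the entries to ship plus the last WAL offset read. */
  static class Batch {
    final List<String> entries = new ArrayList<>();
    long lastWalPosition;
  }

  /** Stand-in for ClusterMarkingEntryFilter: drops edits that originated remotely. */
  static boolean passesFilter(String entry) {
    return !entry.startsWith("remote:");
  }

  /**
   * Buggy shape: the batch position is captured only when an entry survives
   * filtering, so a WAL containing nothing but replicated-in edits never advances.
   */
  static Batch readBuggy(String[] wal) {
    Batch batch = new Batch();
    long pos = 0;
    for (String entry : wal) {
      pos += entry.length();          // pretend this is the stream offset
      if (passesFilter(entry)) {
        batch.entries.add(entry);
        batch.lastWalPosition = pos;  // only updated on unfiltered entries
      }
    }
    return batch;
  }

  /** Fixed shape: the batch position advances no matter what the filter decides. */
  static Batch readFixed(String[] wal) {
    Batch batch = new Batch();
    long pos = 0;
    for (String entry : wal) {
      pos += entry.length();
      batch.lastWalPosition = pos;    // always record the read position
      if (passesFilter(entry)) {
        batch.entries.add(entry);
      }
    }
    return batch;
  }

  public static void main(String[] args) {
    // A passive cluster's WAL: every edit arrived via replication.
    String[] wal = {"remote:put-a", "remote:put-b", "remote:put-c"};
    System.out.println("buggy lastWalPosition = " + readBuggy(wal).lastWalPosition); // stays 0
    System.out.println("fixed lastWalPosition = " + readFixed(wal).lastWalPosition); // advances
  }
}
{code}
In the real code the position update also has to reach the shipper even when a 
batch carries no entries, since ReplicationSource#updateLogPosition only runs 
for batches taken off the queue and shipped.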



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-06 Thread Anoop Sam John (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901309#comment-16901309
 ] 

Anoop Sam John commented on HBASE-22784:


So in the replication flow, even if the filter(s) keep WAL entries from being 
replicated to the other cluster, we should make sure to move the log position 
forward.
The fact that nothing changes even after an RS restart rules out a transient 
deadlocked or blocked thread as the cause. Looks like your analysis is correct.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-06 Thread Solvannan R M (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900971#comment-16900971
 ] 

Solvannan R M commented on HBASE-22784:
---

Hi [~wchevreuil],

Thanks for the pointers!

1. *ReplicationSourceWALReaderThread stack trace*
{code:java}
"main-EventThread.replicationSource,3.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554360804184,3" #10121292 daemon prio=5 os_prio=0 tid=0x7f00e0f75000 nid=0x6d4c1 waiting on condition [0x7ef765a8e000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:162)
{code}
2. Even after restarting the regionserver, the logs were not cleared.

3. *RegionServer logs:*

With TRACE logging enabled, the RSWALRT keeps printing the following message:
{code:java}
[regionserver//10.216.xxx.xxx:16020.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554361253037,1] regionserver.ReplicationSourceWALReaderThread: Didn't read any new entries from WAL
2019-08-03 17:48:56,722 TRACE [main-EventThread.replicationSource,3.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554361253037,3] regionserver.ReplicationSourceWALReaderThread: Didn't read any new entries from WAL
2019-08-03 17:48:57,725 TRACE [main-EventThread.replicationSource,3.replicationSource.replicationWALReaderThread.10.216.xxx.xxx%2C16020%2C1554361253037,3] regionserver.ReplicationSourceWALReaderThread: Didn't read any new entries from WAL
{code}
As we analysed the replication source and ran a debugger against the 
RegionServer process, we arrived at the observations mentioned in the 
description: the RSWALRT doesn't queue any entries, leaving the 
ReplicationSourceShipperThread in a blocked state.

We also came across HBASE-22620, which seems relevant.

As for the cyclic replication setup, our use case is active-active, and this 
problem occurs when there is no load on one side.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-05 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900115#comment-16900115
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Hi [~solvannan], thanks for the info. Yeah, those outputs show this oldWALs 
accumulation is definitely something abnormal with replication. Some ideas to 
look into:

1) In the same jstack you showed on your previous comment, do you see any 
running thread whose name contains the 
_.replicationSource.replicationWALReaderThread._ substring? This is the thread 
responsible for reading the WAL, so maybe it hit some unexpected condition and 
halted.

2) Does this issue persist even if the slave cluster RSes are restarted?

3) If so, would it be possible to set TRACE log level for the RSes, then look 
for the following message patterns?

 
{noformat}
Didn't read any new entries from WAL{noformat}
Or
{noformat}
Failed to read stream of replication entries{noformat}
Or
{noformat}
Interrupted while sleeping between WAL reads{noformat}
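
For reference, a minimal way to get those TRACE messages on an HBase 1.4 RS, 
assuming the stock log4j 1.x configuration, is to add the line below to 
log4j.properties and restart the RS (the exact logger to tune is an assumption; 
adjust it to your setup):
{noformat}
log4j.logger.org.apache.hadoop.hbase.replication=TRACE{noformat}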
 

On a side note, when you mention:
{quote}When a cluster is passive (receiving edits only via replication) in a 
cyclic replication setup of 2 clusters,
{quote}
It seems you don't really need cyclic replication, as your slave here only 
receives edits via replication and there are only 2 clusters.



[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-03 Thread Solvannan R M (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899460#comment-16899460
 ] 

Solvannan R M commented on HBASE-22784:
---

Hi [~wchevreuil],

Thanks for the reply!

We have managed to extract the following information relating to the issue:

*HBase DFS Usage Report:*

We could see that the oldWALs size is very high compared to the actual data size:
{code:java}
hbaseu...@10.216.xxx.xxx~>./hadoop-2.7.3/bin/hdfs dfs -du -h /hbasedata
0 /hbasedata/.tmp
0 /hbasedata/xx
0 /hbasedata/MasterProcWALs
50.3 G /hbasedata/WALs
0 /hbasedata/archive
0 /hbasedata/corrupt
561.4 G /hbasedata/data
0 /hbasedata/hbase
42 /hbasedata/hbase.id
7 /hbasedata/hbase.version
405.9 G /hbasedata/oldWALs
0 /hbasedata/
0 /hbasedata/
0 /hbasedata/x{code}
 

*Zookeeper myQueuesZnodes:*

The replication queue entries have not been cleared from ZooKeeper for a long time:
{code:java}
[zk: 10.216.xxx.xxx:2191,10.216.xxx.xxx:2191,10.216.xxx.xxx:2191(CONNECTED) 3] get /hbase/replication/rs/10.216.xxx.xxx,16020,1554361253037/3/10.216.xxx.xxx%2C16020%2C1554361253037.1564
Display all 179 possibilities? (y or n)
10.216.xxx.xxx%2C16020%2C1554361253037.1554638617629
10.216.xxx.xxx%2C16020%2C1554361253037.1554485372383

{code}
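
For a fuller view, the replication queues can also be dumped with the 
DumpReplicationQueues tool, assuming it is present in the deployed 1.4 build 
(verify the class name and flags against your version):
{noformat}
hbase org.apache.hadoop.hbase.replication.regionserver.DumpReplicationQueues --distributed{noformat}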
 

 

*Status 'replication' output of the regionserver:*

 
{code:java}
SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Wed May 01 07:40:55 IST 2019, Replication Lag=8158202958
PeerID=3, AgeOfLastShippedOp=0, SizeOfLogQueue=1, TimeStampsOfLastShippedOp=Thu Jan 01 05:30:00 IST 1970, Replication Lag=1564834858373
SINK: AgeOfLastAppliedOp=378, TimeStampsOfLastAppliedOp=Sat Aug 03 17:50:44 IST 2019
{code}
 

 

 

*Stacktrace of the replicationSource thread:*

We can see the thread has been blocked for a long time.
{code:java}
"main-EventThread.replicationSource,3.replicationSource.10.216.xxx.xxx%2C16020%2C1554360804184,3" #10121291 daemon prio=5 os_prio=0 tid=0x7f00e14ea000 nid=0x6d4c0 waiting on condition [0x7ef7598c9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x7ef8889a1b98> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.take(ReplicationSourceWALReaderThread.java:227)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:550)
{code}
Kindly let us know if you need any specific logs for your analysis.


[jira] [Commented] (HBASE-22784) OldWALs not cleared in a replication slave cluster (cyclic replication bw 2 clusters)

2019-08-02 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899124#comment-16899124
 ] 

Wellington Chevreuil commented on HBASE-22784:
--

Thanks for filing this, [~solvannan]. Would you paste some log snippets, dump 
replication queue output, and status 'replication' output for each of your 
enumerated assumptions?




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)