[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-23 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637693#comment-17637693
 ] 

Duo Zhang commented on HBASE-26913:
---

What I mean by the 'placeholder' table is that it must not exist. Reusing an 
existing table could easily introduce other side effects in the future.

> Replication Observability Framework
> ---
>
> Key: HBASE-26913
> URL: https://issues.apache.org/jira/browse/HBASE-26913
> Project: HBase
> Issue Type: New Feature
> Components: regionserver, Replication
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> In our production clusters, we have seen cases where data is present in the
> source cluster but not in the sink cluster, and one case where data is present
> in the sink cluster but not in the source cluster.
> We have internal tools that take an incremental backup every day on both
> source and sink clusters and compare the hashes of the data in the two
> backups. We have seen many cases where the hashes don't match, which means the
> data is not consistent between source and sink for that day. The Mean Time To
> Detect (MTTD) these inconsistencies is at least 2 days, and detection requires
> a lot of manual debugging.
> We need a tool that reduces MTTD and requires less manual debugging.
> I have attached the design doc. Huge thanks to [~bharathv] for coming up with
> this design at my workplace.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-17 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17635568#comment-17635568
 ] 

Viraj Jasani commented on HBASE-26913:
--

{quote}In the first proposal, we were re-using an existing table we created for 
this framework instead of creating yet another table.
{quote}
[~zhangduo] WDYT?



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17631362#comment-17631362
 ] 

Hudson commented on HBASE-26913:


Results for branch branch-2
[build #677 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/677/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/677/General_20Nightly_20Build_20Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/677/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/677/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2/677/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}




[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-07 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629946#comment-17629946
 ] 

Rushabh Shah commented on HBASE-26913:
--

> After looking at the current PR, I think maybe we could introduce a special 
> region info, for handling the region server level markers, for example, let's 
> call the table 'hbase:replication_marker_placeholder', and we will always use 
> this table's first region info, i.e., the one created by 
> RegionInfoBuilder.newBuilder(tableName).build(), to write region server level 
> markers. And it will be replicated to remote peers, but when splitting, we 
> will just drop it, which is almost the same as the current implementation.

In this case too, we will create an edit for a region that does not reside 
on any region server. In the first proposal, we were re-using an existing table 
we created for this framework instead of creating yet another table. 
[~zhangduo] Am I missing something?



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-04 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629246#comment-17629246
 ] 

Duo Zhang commented on HBASE-26913:
---

{quote}
This is why in the original design we injected "regionserver-level" markers 
using fake WAL keys, but for valid reasons there were concerns about that 
raised on the PR, and so per those design discussions this aspect of the design 
was changed. So now we must live with this drawback, in the current design.
{quote}

E... What I suggested on the PR is to not attach the marker to an existing 
region that may not be on the region server, as that may have other side 
effects. So either we make a better abstraction for writing a region server 
level WAL marker, or we reuse some existing regions to attach the marker.

After looking at the current PR, I think maybe we could introduce a special 
region info, for handling the region server level markers, for example, let's 
call the table 'hbase:replication_marker_placeholder', and we will always use 
this table's first region info, i.e., the one created by 
RegionInfoBuilder.newBuilder(tableName).build(), to write region server level 
markers. And it will be replicated to remote peers, but when splitting, we will 
just drop it, which is almost the same as the current implementation.

And in this way, I think it is easy to add multi WAL support too: just iterate 
over all WAL instances and add a marker with this placeholder region info. Done.
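
A minimal sketch of the placeholder idea, written as a toy model rather than HBase code. The ToyWal and Entry types and the addMarkers and split helpers are hypothetical stand-ins introduced only for illustration; just the table name comes from the proposal. It shows a marker being appended to every WAL instance under the placeholder region, and splitting dropping those entries.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposal; none of these types are real HBase classes. */
public class PlaceholderMarkerSketch {
  /** Table name from the proposal; the table itself must never be created. */
  static final String PLACEHOLDER_TABLE = "hbase:replication_marker_placeholder";

  /** Hypothetical WAL entry: the table of the region it is attached to, plus a payload. */
  static final class Entry {
    final String table;
    final String payload;

    Entry(String table, String payload) {
      this.table = table;
      this.payload = payload;
    }
  }

  /** Hypothetical stand-in for one WAL instance: an append-only list of entries. */
  static final class ToyWal {
    final List<Entry> entries = new ArrayList<>();

    void append(Entry entry) {
      entries.add(entry);
    }
  }

  /** Appends one region server level marker to every WAL, whether or not it has edits. */
  static void addMarkers(List<ToyWal> wals, long timestamp) {
    for (ToyWal wal : wals) {
      wal.append(new Entry(PLACEHOLDER_TABLE, "marker@" + timestamp));
    }
  }

  /** Models WAL splitting: real edits are kept, placeholder-region entries are dropped. */
  static List<Entry> split(ToyWal wal) {
    List<Entry> kept = new ArrayList<>();
    for (Entry entry : wal.entries) {
      if (!PLACEHOLDER_TABLE.equals(entry.table)) {
        kept.add(entry);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    ToyWal walWithEdits = new ToyWal();
    walWithEdits.append(new Entry("t1", "put row1"));
    ToyWal emptyWal = new ToyWal(); // e.g. a WAL with no edits from hosted regions

    List<ToyWal> wals = new ArrayList<>();
    wals.add(walWithEdits);
    wals.add(emptyWal);
    addMarkers(wals, 1000L);

    System.out.println("entries after markers: " + walWithEdits.entries.size()
        + ", " + emptyWal.entries.size());
    System.out.println("entries kept by split: " + split(walWithEdits).size()
        + ", " + split(emptyWal).size());
  }
}
```

Because the marker never depends on which regions a server hosts, the multi WAL case in this model is just the loop in addMarkers.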

WDYT?

[~apurtell] [~shahrs87] [~vjasani].

Thanks.



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-04 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629236#comment-17629236
 ] 

Hudson commented on HBASE-26913:


Results for branch master
[build #714 on 
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/714/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/714/General_20Nightly_20Build_20Report/]




(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/714/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/714/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}




[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-04 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629168#comment-17629168
 ] 

Andrew Kyle Purtell commented on HBASE-26913:
-

bq. If a region server had some regions in the past, but by the time we want to 
add the marker all its regions have been moved to other region servers, then in 
the current implementation we will skip adding the marker. But since the region 
server held some regions in the past, it could still have some pending WAL 
entries which have not been replicated yet...

This is why in the original design we injected "regionserver-level" markers 
using fake WAL keys, but for valid reasons you had concerns about that, and so 
per those design discussions this aspect of the design was changed. So now we 
have to live with this drawback in the current design. 

For the sake of simplicity I see two paths forward:

1. Change the design back to injecting "regionserver-level" markers that are 
not associated with any table or region. 

2. If a regionserver temporarily has no regions, and we miss injecting a marker, 
this is noticed by the sink side tooling as expected, and operators presumably 
will be alerted. So we are actually good, but to avoid false positives like 
this, probabilistically, in a running system no regionserver should have 0 
regions. Perhaps that means we make adjustments to balancer policy, like making 
it run more frequently. This can be documented: "If you enable the replication 
observability framework, then you need these minimum balancer policy settings: 
X, Y, and Z."
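
To illustrate how sink side tooling could notice a missed marker in path 2, here is a minimal sketch. It assumes each regionserver is expected to emit a marker at least every maxGapMillis and that the sink records the timestamp of the last marker seen per regionserver. The class, method, and parameter names are hypothetical and not part of the framework.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sink side check; not part of HBase or of the framework in the PR. */
public class MarkerGapSketch {
  /**
   * Returns, in sorted order, the regionservers whose last marker seen on the sink
   * is older than maxGapMillis relative to nowMillis.
   */
  static List<String> staleServers(Map<String, Long> lastMarkerMillis, long nowMillis,
      long maxGapMillis) {
    List<String> stale = new ArrayList<>();
    // Copy into a TreeMap so the result order is deterministic (sorted by server name).
    for (Map.Entry<String, Long> e : new TreeMap<>(lastMarkerMillis).entrySet()) {
      if (nowMillis - e.getValue() > maxGapMillis) {
        stale.add(e.getKey());
      }
    }
    return stale;
  }

  public static void main(String[] args) {
    Map<String, Long> lastMarker = new TreeMap<>();
    lastMarker.put("rs1", 95_000L); // marker seen recently
    lastMarker.put("rs2", 10_000L); // e.g. rs2 had zero regions and skipped its markers
    System.out.println(staleServers(lastMarker, 100_000L, 30_000L));
  }
}
```

A check like this cannot tell a replication problem apart from a regionserver that simply had zero regions for a while, which is the false positive described above.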



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-03 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628669#comment-17628669
 ] 

Duo Zhang commented on HBASE-26913:
---

And on randomly selecting a region to add the WAL marker, I realized another 
possible problem.

If a region server had some regions in the past, but by the time we want to add 
the marker all its regions have been moved to other region servers, then in the 
current implementation we will skip adding the marker. But since the region 
server held some regions in the past, it could still have some pending WAL 
entries which have not been replicated yet...



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628666#comment-17628666
 ] 

Viraj Jasani commented on HBASE-26913:
--

Sounds good, just awaiting QA results on branch-2 PR, should get merged by 
tomorrow hopefully.



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-03 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628665#comment-17628665
 ] 

Duo Zhang commented on HBASE-26913:
---

OK, then let's resolve the sub tasks as implemented, and clear the fix 
versions...



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-03 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628661#comment-17628661
 ] 

Viraj Jasani commented on HBASE-26913:
--

[~zhangduo] The recent changes diverged quite a bit from the original sub-task 
commits, and hence squash and merge was the only way.



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-03 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628658#comment-17628658
 ] 

Duo Zhang commented on HBASE-26913:
---

Oh, [~vjasani] so you merged the PR with squashed commits? I suppose we should 
merge them while keeping the commits of the sub tasks...

> Replication Observability Framework
> ---
>
> Key: HBASE-26913
> URL: https://issues.apache.org/jira/browse/HBASE-26913
> Project: HBase
> Issue Type: New Feature
> Components: regionserver, Replication
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major
> Fix For: 2.6.0, 3.0.0-alpha-4, 2.5.2


[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-11-03 Thread Duo Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17628649#comment-17628649
 ] 

Duo Zhang commented on HBASE-26913:
---

I do not think we should include this in 2.5. This is a big new feature; 
better to release it in 2.6.0.



[jira] [Commented] (HBASE-26913) Replication Observability Framework

2022-06-21 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556980#comment-17556980
 ] 

Rushabh Shah commented on HBASE-26913:
--

Thank you [~vjasani] for merging all the sub-tasks for this feature. What 
should the fix versions for the sub-tasks be? Since all the sub-tasks are 
merged to the feature branch, should we close them with HBASE-26913 as the fix 
version? Please advise.

> Replication Observability Framework
> ---
>
> Key: HBASE-26913
> URL: https://issues.apache.org/jira/browse/HBASE-26913
> Project: HBase
> Issue Type: New Feature
> Components: regionserver, Replication
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major