Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


virajjasani commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1537084622


##
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
   // the ServerName to use. Here we presume a master has already done
   // that so we'll press on with whatever it gave us for ServerName.
   if (!checkAndRecordNewServer(sn, sl)) {
-LOG.info("RegionServerReport ignored, could not record the server: " + sn);
-return; // Not recorded, so no need to move on
+// Master already registered server with same (host + port) and higher startcode.

Review Comment:
   Sounds good, will add more comments to make it clear. Thanks Duo!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@hbase.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


Apache9 commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1537006507


##
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
   // the ServerName to use. Here we presume a master has already done
   // that so we'll press on with whatever it gave us for ServerName.
   if (!checkAndRecordNewServer(sn, sl)) {
-LOG.info("RegionServerReport ignored, could not record the server: " + sn);
-return; // Not recorded, so no need to move on
+// Master already registered server with same (host + port) and higher startcode.

Review Comment:
   I checked the code, and now I understand why we need to throw an exception
here. Your comment misses the most important part...
   
   At least your comment should include these two points:
   
   1. The exception thrown here is not meant to tell the region server that it
is dead: if there is a new server on the same host and port, the old server
should already be dead.
   2. The exception thrown here is meant to skip the later steps of the
regionServerReport request processing. Usually, after recording the report in
ServerManager, we call the related methods in AssignmentManager to record
region states. If the region server is already dead, we should not perform
these steps, so we throw an exception here to let the upper layer know that it
should not continue processing.
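   The second point above can be sketched roughly as follows. This is a
minimal illustration and not the actual ServerManager or AssignmentManager
code: `StaleServerException`, `ReportFlowSketch`, and both method names are
hypothetical stand-ins, and the `recorded` flag stands in for the result of
`checkAndRecordNewServer`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for YouAreDeadException; not the real HBase class.
class StaleServerException extends Exception {
  StaleServerException(String msg) {
    super(msg);
  }
}

class ReportFlowSketch {
  // Records which processing steps actually ran (for illustration only).
  final List<String> steps = new ArrayList<>();

  // Hypothetical stand-in for the ServerManager step of report processing.
  void recordInServerManager(boolean recorded) throws StaleServerException {
    if (!recorded) {
      // The throw is not mainly a message to the old server; it stops the
      // caller from running the later steps for a stale report.
      throw new StaleServerException("newer server registered on same host and port");
    }
    steps.add("serverManager");
  }

  // Hypothetical upper layer: the AssignmentManager step runs only when the
  // ServerManager step did not throw.
  boolean handleReport(boolean recorded) {
    try {
      recordInServerManager(recorded);
      steps.add("assignmentManager");
      return true;
    } catch (StaleServerException e) {
      return false;
    }
  }
}
```

   With a plain `return` in the first step, the caller would have no signal
and would go on to the region-state step; the exception makes the skip
explicit.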






Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


virajjasani commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1536981102


##
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
   // the ServerName to use. Here we presume a master has already done
   // that so we'll press on with whatever it gave us for ServerName.
   if (!checkAndRecordNewServer(sn, sl)) {
-LOG.info("RegionServerReport ignored, could not record the server: " + sn);
-return; // Not recorded, so no need to move on
+// Master already registered server with same (host + port) and higher startcode.

Review Comment:
   The other way to think about this is: why should we even accept a report
from the old server, rather than throw YouAreDeadException, when we already
know that the new server is alive and already registered?






Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


virajjasani commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1536980319


##
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
   // the ServerName to use. Here we presume a master has already done
   // that so we'll press on with whatever it gave us for ServerName.
   if (!checkAndRecordNewServer(sn, sl)) {
-LOG.info("RegionServerReport ignored, could not record the server: " + sn);
-return; // Not recorded, so no need to move on
+// Master already registered server with same (host + port) and higher startcode.

Review Comment:
   This is more of a safety check; it will prevent inconsistencies. I agree
that anyone looking at this would wonder why we need such extra safety. It's a
valid point, but I can guarantee that not having such strict validation has
caused inconsistencies.






Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


virajjasani commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1536979403


##
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
   // the ServerName to use. Here we presume a master has already done
   // that so we'll press on with whatever it gave us for ServerName.
   if (!checkAndRecordNewServer(sn, sl)) {
-LOG.info("RegionServerReport ignored, could not record the server: " + sn);
-return; // Not recorded, so no need to move on
+// Master already registered server with same (host + port) and higher startcode.

Review Comment:
   When it happened (as per the logs mentioned on the Jira), the master
processed the report, and that generated inconsistencies.
   
   We have seen this happen many times in the past when a regionserver is not
really aborted but loses its connection with ZooKeeper, triggering an SCP by
the master. The regionserver with the new startcode is not only alive but has
also sent its regionserver report to the master. After that, the master
somehow still receives a regionserver report from the regionserver with the
old startcode, processes it, and that results in inconsistencies. I know this
is a rare case, but it has definitely happened more than once on more than one
prod cluster.
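   The check being discussed, rejecting a report when a server with the same
host and port but a higher startcode is already registered, could look
roughly like this. It is a simplified sketch, not the actual
`checkAndRecordNewServer` logic: the class, the map, and the method name are
hypothetical, and the startcodes in the usage note are made-up values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry keyed by "host:port"; not the real ServerManager state.
class StartcodeRegistrySketch {
  // Highest startcode seen so far for each host:port.
  private final Map<String, Long> latest = new HashMap<>();

  // Returns true if the report is accepted, false if a server with the same
  // host:port and a higher startcode has already been registered.
  boolean acceptReport(String hostAndPort, long startcode) {
    Long known = latest.get(hostAndPort);
    if (known != null && known > startcode) {
      return false; // stale report from the old process
    }
    latest.put(hostAndPort, startcode);
    return true;
  }
}
```

   For example, once `acceptReport("host:61020", 200L)` has succeeded, a
later `acceptReport("host:61020", 100L)` returns false, which is the point
where the patch throws instead of silently returning.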






Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


virajjasani commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1536979403


##
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
   // the ServerName to use. Here we presume a master has already done
   // that so we'll press on with whatever it gave us for ServerName.
   if (!checkAndRecordNewServer(sn, sl)) {
-LOG.info("RegionServerReport ignored, could not record the server: " + sn);
-return; // Not recorded, so no need to move on
+// Master already registered server with same (host + port) and higher startcode.

Review Comment:
   When it happened (as per the logs mentioned on the Jira), the master
processed the report, and that generated inconsistencies.
   
   We have seen this happen many times in the past when a regionserver is not
really aborted but loses its connection with ZooKeeper, triggering an SCP by
the master. The regionserver with the new startcode is not only alive but has
also sent its regionserver report to the master. After that, the master
somehow still receives a regionserver report, processes it, and that results
in inconsistencies. I know this is a rare case, but it has definitely happened
more than once on more than one prod cluster.






Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


Apache9 commented on code in PR #5774:
URL: https://github.com/apache/hbase/pull/5774#discussion_r1536961183


##
hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java:
##
@@ -324,8 +324,19 @@ public void regionServerReport(ServerName sn, ServerMetrics sl) throws YouAreDea
   // the ServerName to use. Here we presume a master has already done
   // that so we'll press on with whatever it gave us for ServerName.
   if (!checkAndRecordNewServer(sn, sl)) {
-LOG.info("RegionServerReport ignored, could not record the server: " + sn);
-return; // Not recorded, so no need to move on
+// Master already registered server with same (host + port) and higher startcode.

Review Comment:
   I still do not think this is necessary, because if the new server with the 
same host and port has already registered to master, how can we return this 
`YouAreDeadException` to the old server? Even if there is a race condition, 
when sending we will receive a connection reset because the old server is 
already dead...






Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


Apache-HBase commented on PR #5774:
URL: https://github.com/apache/hbase/pull/5774#issuecomment-2017023902

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   0m 57s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  3s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 45s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 44s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   5m 10s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 26s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 32s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 42s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 42s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   5m  8s |  patch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  the patch passed  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 239m 22s |  hbase-server in the patch passed.  |
   |  |   | 262m 51s |   |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/5774 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux fc283394a300 5.4.0-163-generic #180-Ubuntu SMP Tue Sep 5 13:21:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / ade6ab2148 |
   | Default Java | Temurin-1.8.0_352-b08 |
   | Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/testReport/ |
   | Max. process+thread count | 5350 (vs. ulimit of 3) |
   | modules | C: hbase-server U: hbase-server |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/console |
   | versions | git=2.34.1 maven=3.8.6 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Updated] (HBASE-27826) Region split and merge time while offline is O(n) with respect to number of store files

2024-03-24 Thread Andrew Kyle Purtell (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27826:

Fix Version/s: 2.7.0
   3.0.0-beta-2

> Region split and merge time while offline is O(n) with respect to number of 
> store files
> ---
>
> Key: HBASE-27826
> URL: https://issues.apache.org/jira/browse/HBASE-27826
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.5.4
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Fix For: 2.7.0, 3.0.0-beta-2
>
>
> This is a significant availability issue when HFiles are on S3.
> HBASE-26079 ({_}Use StoreFileTracker when splitting and merging{_}) changed 
> the split and merge table procedure implementations to indirect through the 
> StoreFileTracker implementation when selecting HFiles to be merged or split, 
> rather than directly listing those using file system APIs. It also changed 
> the commit logic in HRegionFileSystem to add the link/ref files on resulting 
> split or merged regions to the StoreFileTracker. However, the creation of a 
> link file is still a filesystem operation, and creating a “file” on S3 can 
> take well over a second. If, for example, there are 20 store files in a 
> region, which is not uncommon, then after the region is taken offline for a 
> split (or merge) it may require more than 20 seconds to create the link files 
> before the results can be brought back online, creating a severe availability 
> problem. Splits and merges are supposed to be fast, completing in less than a 
> second, certainly less than a few seconds. This has been true when HFiles are 
> stored on HDFS only because file creation operations there are nearly 
> instantaneous. 
> There are two issues, but both can be handled with modifications to the store 
> file tracker interface and the file based store file tracker implementation. 
> When the file based store file tracker is enabled, the HFile links should 
> be virtual entities that only exist in the file manifest. We do not require 
> physical files in the filesystem to serve as links now. That is the magic of 
> this file tracker: the manifest file replaces the requirement to list the 
> filesystem.
> Then, when splitting or merging, the HFile links should be collected into a 
> list and committed in one batch using a new FILE file tracker interface, 
> requiring only one update of the manifest file in S3, bringing the time 
> requirement for this operation to O(1), down from O(n).
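
The cost argument in the description can be illustrated with a toy model. This is only a sketch under stated assumptions: `ManifestSketch`, `addOne`, and `addBatch` are hypothetical names, not the StoreFileTracker interface, and each call simply counts as one simulated slow write to S3.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a tracker manifest; every method call counts as one slow write.
class ManifestSketch {
  final List<String> entries = new ArrayList<>();
  int writes = 0; // number of simulated manifest or file writes

  // One write per link: n store files cost n writes.
  void addOne(String link) {
    entries.add(link);
    writes++;
  }

  // All links committed together: n store files cost a single write.
  void addBatch(List<String> links) {
    entries.addAll(links);
    writes++;
  }
}
```

With the 20 store files from the example above, 20 `addOne` calls leave `writes` at 20, while a single `addBatch` of the same links leaves it at 1. If each write takes about a second, that is the difference between roughly 20 seconds and roughly 1 second of region unavailability.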



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


Apache-HBase commented on PR #5774:
URL: https://github.com/apache/hbase/pull/5774#issuecomment-2017011424

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   2m 33s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  2s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   3m 21s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 47s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   5m 48s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 56s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 46s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 46s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   5m 52s |  patch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  the patch passed  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 218m 44s |  hbase-server in the patch passed.  |
   |  |   | 245m 37s |   |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/5774 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux 362b64cb26d2 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / ade6ab2148 |
   | Default Java | Eclipse Adoptium-11.0.17+8 |
   | Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/testReport/ |
   | Max. process+thread count | 5305 (vs. ulimit of 3) |
   | modules | C: hbase-server U: hbase-server |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/console |
   | versions | git=2.34.1 maven=3.8.6 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


Apache-HBase commented on PR #5774:
URL: https://github.com/apache/hbase/pull/5774#issuecomment-2017004220

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   2m 20s |  Docker mode activated.  |
   | -0 :warning: |  yetus  |   0m  3s |  Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck  |
   ||| _ Prechecks _ |
   ||| _ master Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   3m 12s |  master passed  |
   | +1 :green_heart: |  compile  |   0m 53s |  master passed  |
   | +1 :green_heart: |  shadedjars  |   5m 38s |  branch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 27s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 55s |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 53s |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 53s |  the patch passed  |
   | +1 :green_heart: |  shadedjars  |   5m 35s |  patch has no errors when building our shaded downstream artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  the patch passed  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 205m  2s |  hbase-server in the patch passed.  |
   |  |   | 231m 22s |   |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/artifact/yetus-jdk17-hadoop3-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/5774 |
   | Optional Tests | javac javadoc unit shadedjars compile |
   | uname | Linux 19a01ffa4e79 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / ade6ab2148 |
   | Default Java | Eclipse Adoptium-17.0.10+7 |
   | Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/testReport/ |
   | Max. process+thread count | 5362 (vs. ulimit of 3) |
   | modules | C: hbase-server U: hbase-server |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/console |
   | versions | git=2.34.1 maven=3.8.6 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





Re: [PR] HBASE-28366 Mis-order of SCP and regionServerReport results into region inconsistencies [hbase]

2024-03-24 Thread via GitHub


Apache-HBase commented on PR #5774:
URL: https://github.com/apache/hbase/pull/5774#issuecomment-2016939886

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |:----:|----------:|--------:|:--------|
   | +0 :ok: |  reexec  |   1m 41s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  No case conflicting files found.  |
   | +1 :green_heart: |  hbaseanti  |   0m  0s |  Patch does not have any anti-patterns.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any @author tags.  |
   ||| _ master Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   4m  4s |  master passed  |
   | +1 :green_heart: |  compile  |   3m 11s |  master passed  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  master passed  |
   | +1 :green_heart: |  spotless  |   1m  1s |  branch has no errors when running spotless:check.  |
   | +1 :green_heart: |  spotbugs  |   2m  3s |  master passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   2m 56s |  the patch passed  |
   | +1 :green_heart: |  compile  |   2m 33s |  the patch passed  |
   | +1 :green_heart: |  javac  |   2m 33s |  the patch passed  |
   | +1 :green_heart: |  checkstyle  |   0m 35s |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace issues.  |
   | +1 :green_heart: |  hadoopcheck  |   5m 17s |  Patch does not cause any errors with Hadoop 3.3.6.  |
   | +1 :green_heart: |  spotless  |   0m 41s |  patch has no errors when running spotless:check.  |
   | +1 :green_heart: |  spotbugs  |   1m 41s |  the patch passed  |
   ||| _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 13s |  The patch does not generate ASF License warnings.  |
   |  |   |  33m 22s |   |


   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/artifact/yetus-general-check/output/Dockerfile |
   | GITHUB PR | https://github.com/apache/hbase/pull/5774 |
   | Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile |
   | uname | Linux 04506890469b 5.4.0-169-generic #187-Ubuntu SMP Thu Nov 23 14:52:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/hbase-personality.sh |
   | git revision | master / ade6ab2148 |
   | Default Java | Eclipse Adoptium-11.0.17+8 |
   | Max. process+thread count | 80 (vs. ulimit of 3) |
   | modules | C: hbase-server U: hbase-server |
   | Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5774/1/console |
   | versions | git=2.34.1 maven=3.8.6 spotbugs=4.7.3 |
   | Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Updated] (HBASE-28366) Mis-order of SCP and regionServerReport results into region inconsistencies

2024-03-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-28366:
---
Labels: pull-request-available  (was: )

> Mis-order of SCP and regionServerReport results into region inconsistencies
> ---
>
> Key: HBASE-28366
> URL: https://issues.apache.org/jira/browse/HBASE-28366
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.7
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> If the regionserver is online but, due to a network issue, its rs ephemeral 
> node gets deleted in ZooKeeper, the active master schedules the SCP. However, 
> if the regionserver is alive, it can still send regionServerReport to the 
> active master. In the case where the SCP assigns regions that were previously 
> hosted on the old regionserver (which is still alive) to other regionservers, 
> the old rs can continue to send regionServerReport to the active master.
> Eventually this results in region inconsistencies because a region is alive 
> on two regionservers at the same time (though it's a temporary state because 
> the rs will be aborted soon). While the old regionserver can have ZooKeeper 
> connectivity issues, it can still make rpc calls to the active master.
> Logs:
> SCP:
> {code:java}
> 2024-01-29 16:50:33,956 INFO [RegionServerTracker-0] 
> assignment.AssignmentManager - Scheduled ServerCrashProcedure pid=9812440 for 
> server1-114.xyz,61020,1706541866103 (carryingMeta=false) 
> server1-114.xyz,61020,1706541866103/CRASHED/regionCount=364/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5d5fc31[Write
>  locks = 1, Read locks = 0], oldState=ONLINE.
> 2024-01-29 16:50:33,956 DEBUG [RegionServerTracker-0] 
> procedure2.ProcedureExecutor - Stored pid=9812440, 
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
> server1-114.xyz,61020,1706541866103, splitWal=true, meta=false
> 2024-01-29 16:50:33,973 INFO [PEWorker-36] procedure.ServerCrashProcedure - 
> Splitting WALs pid=9812440, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, 
> locked=true; ServerCrashProcedure server1-114.xyz,61020,1706541866103, 
> splitWal=true, meta=false, isMeta: false
>  {code}
> As part of SCP, d743ace9f70d55f55ba1ecc6dc49a5cb was assigned to another 
> server:
>  
> {code:java}
> 2024-01-29 16:50:42,656 INFO [PEWorker-24] procedure.MasterProcedureScheduler 
> - Took xlock for pid=9818494, ppid=9812440, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
> TransitRegionStateProcedure 
> table=PLATFORM_ENTITY.PLATFORM_IMMUTABLE_ENTITY_DATA, 
> region=d743ace9f70d55f55ba1ecc6dc49a5cb, ASSIGN
> 2024-01-29 16:50:43,106 INFO [PEWorker-23] assignment.RegionStateStore - 
> pid=9818494 updating hbase:meta row=d743ace9f70d55f55ba1ecc6dc49a5cb, 
> regionState=OPEN, repBarrier=12867482, openSeqNum=12867482, 
> regionLocation=server1-65.xyz,61020,1706165574050
>  {code}
>  
> rs abort, after ~5 min:
> {code:java}
> 2024-01-29 16:54:27,235 ERROR [regionserver/server1-114:61020] 
> regionserver.HRegionServer - * ABORTING region server 
> server1-114.xyz,61020,1706541866103: Unexpected exception handling getData 
> *
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss for /hbase/master
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
>     at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:414)
>     at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:403)
>     at 
> org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:367)
>     at 
> org.apache.hadoop.hbase.zookeeper.ZKNodeTracker.getData(ZKNodeTracker.java:180)
>     at 
> org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:152)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionServerStatusStub(HRegionServer.java:2892)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1352)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1142)
>  {code}
>  
> Several region transition failure report logs:
> {code:java}
> 2024-01-29 16:55:13,029 INFO  [_REGION-regionserver/server1-114:61020-0] 
> regionserver.HRegionServer - Failed report transition server { host_name: 
> "server1-114.xyz" port: 61020 start_code: 1706541866103 } transition { 
> transition_code: CLOSED region_info { region_id: 1671555604277 table_name { 
> namespace: "default" qualifier: "TABLE1" } start_key: 

[jira] [Commented] (HBASE-28456) HBase Restore restores old data if data for the same timestamp is in different hfiles

2024-03-24 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830259#comment-17830259
 ] 

Bryan Beaudreault commented on HBASE-28456:
---

Actually, it can't work the same way but I do think you're right that we need 
to get the sequenceId in there. Will look into it a bit

> HBase Restore restores old data if data for the same timestamp is in 
> different hfiles
> -
>
> Key: HBASE-28456
> URL: https://issues.apache.org/jira/browse/HBASE-28456
> Project: HBase
> Issue Type: Bug
> Components: backuprestore
> Affects Versions: 2.6.0, 3.0.0
> Reporter: Ruben Van Wanzeele
> Priority: Critical
> Attachments: ChangesOnHFilesOnSameTimestampAreNotCorrectlyRestored.java
>
>
> Executing a restore brings back 'old' data.
> It feels like the hfile sequence id is not respected during the restore.
> See the attached test code. The workaround is to trigger a major 
> compaction before doing the backup, which is not really feasible for daily 
> backups.
> We haven't investigated this yet, but it might also affect the merge of 
> multiple incremental backups, since that follows a similar code path for 
> merging hfiles.
> This currently blocks our support for HBase backup and restore.
> Willing to participate in a solution if necessary.
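
To make the suspected failure mode concrete, here is a toy model of it. This is 
not HBase code: the class, the Cell type, and both comparators are hypothetical 
names invented for illustration, and the only assumption taken from the report 
is that two cells for the same row and column can carry the same timestamp but 
come from hfiles with different sequence ids. The sketch shows that ordering by 
timestamp alone leaves the winner to input order, while adding the sequence id 
as a tie-breaker always picks the later write.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model only; not HBase code. All names here are hypothetical. */
final class SeqIdTieBreakSketch {

  /** A simplified cell for one row/column: timestamp, sequence id of its file, value. */
  static final class Cell {
    final long timestamp;
    final long seqId;
    final String value;

    Cell(long timestamp, long seqId, String value) {
      this.timestamp = timestamp;
      this.seqId = seqId;
      this.value = value;
    }
  }

  /** Newest timestamp first. Ties are not resolved, so input order decides. */
  static final Comparator<Cell> TIMESTAMP_ONLY =
    Comparator.comparingLong((Cell c) -> c.timestamp).reversed();

  /** Newest timestamp first, then highest sequence id first. */
  static final Comparator<Cell> TIMESTAMP_THEN_SEQ_ID =
    TIMESTAMP_ONLY.thenComparing(Comparator.comparingLong((Cell c) -> c.seqId).reversed());

  /** Returns the value of the first cell in the given order, i.e. the one a reader would see. */
  static String visibleValue(List<Cell> cells, Comparator<Cell> order) {
    List<Cell> sorted = new ArrayList<>(cells);
    sorted.sort(order); // List.sort is stable, so tied cells keep their input order
    return sorted.get(0).value;
  }
}
```

With two cells at the same timestamp, one from a file with sequence id 5 and 
one from a file with sequence id 9, TIMESTAMP_ONLY returns whichever cell 
happened to be listed first, whereas TIMESTAMP_THEN_SEQ_ID returns the value 
from sequence id 9 regardless of input order.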



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] HBASE-28443 Return too slow when scanning a table with non-existing REGION_REPLICA_ID [hbase]

2024-03-24 Thread via GitHub


Apache9 commented on PR #5767:
URL: https://github.com/apache/hbase/pull/5767#issuecomment-2016817710

   I think we should try to check the replica id when we hit the 
RegionOfflineException, like what we have done in AsyncRpcRetryingCaller.





[jira] [Commented] (HBASE-28456) HBase Restore restores old data if data for the same timestamp is in different hfiles

2024-03-24 Thread Bryan Beaudreault (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830248#comment-17830248
 ] 

Bryan Beaudreault commented on HBASE-28456:
---

Restore uses a custom map reduce job, MapReduceHFileSplitterJob. I’m wondering 
if a full restore would be more efficient if it did distcp + restore snapshot, 
but that’s an aside. Anyway, we can try a solution similar to the one we used in 
HBASE-28456. It’s a one-line change; do you want to give it a try on your test?



[jira] [Commented] (HBASE-28456) HBase Restore restores old data if data for the same timestamp is in different hfiles

2024-03-24 Thread Ruben Van Wanzeele (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-28456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830215#comment-17830215
 ] 

Ruben Van Wanzeele commented on HBASE-28456:


[~bbeaudreault] 

Same issue with puts and flushes in between. I changed the load method to the 
following:
{code:java}
private void load(TableName tableName, Instant timestamp, String data)
  throws IOException {
  LOG.info("Writing new data to HBase: " + data);
  try (Connection connection = ConnectionFactory.createConnection(cluster.getConf());
    Table table = connection.getTable(SOURCE_TABLE_NAME)) {
    List<Put> puts = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      // Every cell carries the caller-supplied timestamp, so repeated loads
      // with the same timestamp write cells that differ only in value.
      Put put = new Put(Bytes.toBytes(i), timestamp.toEpochMilli());
      put.add(new KeyValue(Bytes.toBytes(i), COLUMN_FAMILY, Bytes.toBytes("data"),
        timestamp.toEpochMilli(), Bytes.toBytes(data)));
      puts.add(put);
    }
    table.put(puts);
    // Flush after each load so the data is written out to an hfile.
    connection.getAdmin().flush(tableName);
  }
}
{code}
