[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762012#comment-16762012 ]

Sergey Shelukhin commented on HBASE-21811:
------------------------------------------

So you're saying the procedure will mismatch the original open procedure? I guess that can also work, but it's much less obvious IMO. Other methods in the same class (reportTransitionOpen, etc.) all compare the server name just fine.

> region can be opened on two servers due to race condition with procedures and
> server reports
>
>                 Key: HBASE-21811
>                 URL: https://issues.apache.org/jira/browse/HBASE-21811
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Blocker
>             Fix For: 3.0.0, 2.2.0, 2.3.0
>
>         Attachments: HBASE-21811-UT.patch, HBASE-21811-v1.patch,
>                      HBASE-21811-v2.patch, HBASE-21811.patch
>
> Looks like the region server responses are being processed incorrectly in
> places, allowing the region to be opened on two servers.
> * The region server report handling in procedures should check which server
> is reporting.
> * Also, although I didn't check (and it isn't implicated in this bug), the RS
> must check in OPEN that it's actually the correct RS the master sent the open
> to (w.r.t. start timestamp).
> This was previously "mitigated" by the master killing the RS with incorrect
> reports, but due to race conditions with reports and assignment the kill was
> replaced with a warning, so now this condition persists.
> Regardless, the kill approach is not a good fix because there's still a
> window in which a region can be opened on two servers.
> A region is being opened by server_48c. The server dies, and we process the
> retry correctly (retry=3 because 2 previous similar open failures were
> processed correctly).
> We start opening it on server_1aa now.
> {noformat}
> 2019-01-28 18:12:09,862 INFO [KeepAlivePEWorker-104] assignment.RegionStateStore: pid=4915 updating hbase:meta row=8be2a423b16471b9417f0f7de04281c6, regionState=ABNORMALLY_CLOSED
> 2019-01-28 18:12:09,862 INFO [KeepAlivePEWorker-104] procedure.ServerCrashProcedure: pid=11944, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true; ServerCrashProcedure server=server_48c,17020,1548726406632, splitWal=true, meta=false found RIT pid=4915, ppid=7, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=table, region=8be2a423b16471b9417f0f7de04281c6, ASSIGN; rit=OPENING, location=server_48c,17020,1548726406632, table=table, region=8be2a423b16471b9417f0f7de04281c6
> 2019-01-28 18:12:10,778 INFO [KeepAlivePEWorker-80] assignment.TransitRegionStateProcedure: Retry=3 of max=2147483647; pid=4915, ppid=7, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=table, region=8be2a423b16471b9417f0f7de04281c6, ASSIGN; rit=ABNORMALLY_CLOSED, location=null
> ...
> 2019-01-28 18:12:10,902 INFO [KeepAlivePEWorker-80] assignment.TransitRegionStateProcedure: Starting pid=4915, ppid=7, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=table, region=8be2a423b16471b9417f0f7de04281c6, ASSIGN; rit=ABNORMALLY_CLOSED, location=null; forceNewPlan=true, retain=false
> 2019-01-28 18:12:11,114 INFO [PEWorker-7] assignment.RegionStateStore: pid=4915 updating hbase:meta row=8be2a423b16471b9417f0f7de04281c6, regionState=OPENING, regionLocation=server_1aa,17020,1548727658713
> {noformat}
> However, we get the remote procedure failure from 48c after we've already
> started that.
> It actually tried to open on the restarted RS, which makes me wonder whether
> this is safe also w.r.t. other races - what if the RS had already initialized
> and didn't error out?
> Need to check whether we verify the start code expected by the master on the
> RS when opening.
> {noformat}
> 2019-01-28 18:12:12,179 WARN [RSProcedureDispatcher-pool4-t362] assignment.RegionRemoteProcedureBase: The remote operation pid=11050, ppid=4915, state=SUCCESS, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => 8be2a423b16471b9417f0f7de04281c6 ... to server server_48c,17020,1548726406632 failed
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server server_48c,17020,1548727752747 is not running yet
> 2019-01-28 18:12:12,179 WARN [RSProcedureDispatcher-pool4-t362] procedure.RSProcedureDispatcher: server server_48c,17020,1548726406632 is not up for a while; try a new one
> {noformat}
> Without any other reason (at least logged), the RIT immediately retries again
> and chooses a new candidate. It then retries again and goes to the new 48c,
> but that's unrelated.
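The first bullet in the description - that report handling should check which server is reporting - can be sketched as a simple guard on the master side. This is an illustrative model with hypothetical names, not the actual HBase classes: a transition report is acted on only if it comes from the server the region is currently assigned to.

```java
import java.util.Objects;

class ReportGuard {
    // Accept a region transition report only when it comes from the server
    // the region is currently assigned to; a report from any other server
    // (e.g. a stale or restarted incarnation) is ignored.
    static boolean shouldAcceptReport(String assignedServer, String reportingServer) {
        return Objects.equals(assignedServer, reportingServer);
    }
}
```

Under this sketch, a late report from server_48c would be dropped once the region has been reassigned to server_1aa, instead of driving the state machine forward.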
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761517#comment-16761517 ]

Duo Zhang commented on HBASE-21811:
-----------------------------------

As I explained above, checking the ServerName does not show us the root cause. The problem here is that the OpenRegionProcedure/CloseRegionProcedure has already completed, as we have already interrupted it in the SCP. But the remoteCallFailed call comes a bit late, after we have already scheduled a new OpenRegionProcedure for the same region, and then the remoteCallFailed call wakes up the wrong procedure and leads to a double assign.
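The root cause Duo describes can be modeled with a minimal sketch (hypothetical names, not the real HBase classes): the region node remembers which remote procedure it is currently waiting on, so a late remoteCallFailed() from an already-interrupted OpenRegionProcedure cannot wake the newly scheduled one.

```java
// Minimal model of the stale-callback race: only the procedure the region
// node is currently waiting on may wake it.
class RegionNode {
    Long currentProcId;   // pid of the remote procedure the region waits on now
    boolean woken;        // stands in for "wake the suspended TRSP"

    void remoteCallFailed(long reportingProcId) {
        if (currentProcId == null || currentProcId.longValue() != reportingProcId) {
            // Stale failure report: the SCP already interrupted that procedure
            // and a new one owns the region. Ignore it instead of double-waking.
            return;
        }
        woken = true;
    }
}
```

Comparing procedure identity rather than server name is what makes this a root-cause fix: even a late report from the *same* server is rejected if it belongs to a superseded procedure.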
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761130#comment-16761130 ]

Sergey Shelukhin commented on HBASE-21811:
------------------------------------------

There are 2 separate bug fixes in the original (and new) patch... the one on the server for the start code is OK in both. I'm talking about the one in the master for responding to Open... The original fix checked the server name, just like the existing code in the successful-open path. The new code checks whether the event is waiting. However, it looks to me like the event will be waiting in this case, because it's waiting for the response from the wrong server. I wonder: if the successful open already checks the server name (see the log line "but the region is not on it, should be a retry, ignore"), shouldn't the failed open also check the server name?
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760309#comment-16760309 ]

Duo Zhang commented on HBASE-21811:
-----------------------------------

We have checked the start code, but only before executing any procedures; please see the code carefully.
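The start-code check discussed above - that an RS should execute an open only if the master addressed it to this exact incarnation - can be sketched as follows. The names are illustrative, not the actual HBase RS logic: a restarted server has a new start code, so a request targeting the previous incarnation should be rejected rather than executed.

```java
// Hypothetical model of a server incarnation: same host/port across restarts,
// but a fresh start code each time the process comes up.
class Incarnation {
    final String hostAndPort;
    final long startCode;

    Incarnation(String hostAndPort, long startCode) {
        this.hostAndPort = hostAndPort;
        this.startCode = startCode;
    }

    // True only if the request targets exactly this incarnation
    // (same host, port, and start code).
    boolean matches(Incarnation target) {
        return hostAndPort.equals(target.hostAndPort)
            && startCode == target.startCode;
    }
}
```

With the start codes from the log, a request addressed to server_48c,...,1548726406632 would not match the restarted incarnation ...,1548727752747, so the restarted RS would refuse the stale open instead of silently executing it.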
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760112#comment-16760112 ]

Sergey Shelukhin commented on HBASE-21811:
------------------------------------------

Hmm... the patch that got committed doesn't have the server check in the procedure wake-up. I think this check is still needed - in our case, the procedure was waiting for an event, just from a different server, so it would still be woken up, right?
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759084#comment-16759084 ]

Hudson commented on HBASE-21811:
--------------------------------

Results for branch master [build #766 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/766/]: (x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/766//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/766//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/766//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color} -- See build output for details.

(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759080#comment-16759080 ] Hudson commented on HBASE-21811:

Results for branch branch-2.2 [build #12 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/12/]: (x) *{color:red}-1 overall{color}*

details (if available):
(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/12//General_Nightly_Build_Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/12//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.2/12//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}

> Need to check if we verify the start code expected by master on RS when opening.
> {noformat}
> 2019-01-28 18:12:12,179 WARN [RSProcedureDispatcher-pool4-t362] assignment.RegionRemoteProcedureBase: The remote operation pid=11050, ppid=4915, state=SUCCESS, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => 8be2a423b16471b9417f0f7de04281c6 ... to server server_48c,17020,1548726406632 failed
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server server_48c,17020,1548727752747 is not running yet
> 2019-01-28 18:12:12,179 WARN [RSProcedureDispatcher-pool4-t362] procedure.RSProcedureDispatcher: server server_48c,17020,1548726406632 is not up for a while; try a new one
> {noformat}
> Without any other reason (at least logged), the RIT immediately retries again and chooses a new candidate. It then retries again and goes to the new 48c, but that's unrelated.
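The fix direction discussed in this thread -- the master's report handling must check which server is actually reporting -- can be sketched in plain Java. The names here (`ServerName`, `acceptOpenReport`) are illustrative stand-ins, not the real amv2 API: the procedure remembers where it sent the open request, and an OPEN report from any other server name, including an older incarnation, is ignored.

```java
// Minimal sketch of a master-side report check. ServerName mirrors
// HBase's host,port,startcode triple; class and method names are
// illustrative, not the actual HBase classes.
import java.util.Objects;

public class ReportCheck {
    static final class ServerName {
        final String host; final int port; final long startCode;
        ServerName(String host, int port, long startCode) {
            this.host = host; this.port = port; this.startCode = startCode;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof ServerName)) return false;
            ServerName s = (ServerName) o;
            return host.equals(s.host) && port == s.port && startCode == s.startCode;
        }
        @Override public int hashCode() { return Objects.hash(host, port, startCode); }
    }

    // Accept an OPEN report only from the exact server the procedure
    // recorded as the region's location, start code included.
    static boolean acceptOpenReport(ServerName expectedLocation, ServerName reporter) {
        return expectedLocation != null && expectedLocation.equals(reporter);
    }

    public static void main(String[] args) {
        ServerName sent = new ServerName("server_1aa", 17020, 1548727658713L);
        ServerName stale = new ServerName("server_48c", 17020, 1548726406632L);
        assert acceptOpenReport(sent, sent);   // report from the assigned server
        assert !acceptOpenReport(sent, stale); // stale report is rejected
        System.out.println("report checks passed");
    }
}
```

With a guard like this, the late report from server_48c in the log above could not flip the region state after the assignment had already moved to server_1aa.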
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759023#comment-16759023 ] Hudson commented on HBASE-21811:

Results for branch branch-2 [build #1654 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1654/]: (x) *{color:red}-1 overall{color}*

details (if available):
(/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1654//General_Nightly_Build_Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1654//JDK8_Nightly_Build_Report_(Hadoop2)/]
(/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1654//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color} -- See build output for details.
(/) {color:green}+1 client integration test{color}
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758938#comment-16758938 ] Duo Zhang commented on HBASE-21811:

Good. Let me commit.
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758937#comment-16758937 ] Hadoop QA commented on HBASE-21811:

(/) *+1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 11s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 1s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 27s | Maven dependency ordering for branch |
| +1 | mvninstall | 4m 22s | master passed |
| +1 | compile | 2m 16s | master passed |
| +1 | checkstyle | 1m 22s | master passed |
| +1 | shadedjars | 4m 9s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 36s | master passed |
| +1 | javadoc | 0m 43s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 15s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 7s | the patch passed |
| +1 | compile | 2m 16s | the patch passed |
| +1 | javac | 2m 16s | the patch passed |
| +1 | checkstyle | 1m 19s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 11s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 43s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 44s | the patch passed |
| +1 | javadoc | 0m 43s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 19s | hbase-procedure in the patch passed. |
| +1 | unit | 127m 25s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 52s | The patch does not generate ASF License warnings. |
| | | 172m 50s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21811 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12957362/HBASE-21811-v2.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 7df90c84adc9 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 5784a09fff |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs |
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758898#comment-16758898 ] Duo Zhang commented on HBASE-21811:

Forgot to add the ClassRule annotation...
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758890#comment-16758890 ] Hadoop QA commented on HBASE-21811:

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 14s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || master Compile Tests ||
| 0 | mvndep | 0m 26s | Maven dependency ordering for branch |
| +1 | mvninstall | 4m 15s | master passed |
| +1 | compile | 2m 15s | master passed |
| +1 | checkstyle | 1m 22s | master passed |
| +1 | shadedjars | 4m 5s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 39s | master passed |
| +1 | javadoc | 0m 43s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 16s | Maven dependency ordering for patch |
| +1 | mvninstall | 4m 9s | the patch passed |
| +1 | compile | 2m 16s | the patch passed |
| +1 | javac | 2m 16s | the patch passed |
| +1 | checkstyle | 1m 19s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 8s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 8m 38s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 49s | the patch passed |
| +1 | javadoc | 0m 45s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 3m 23s | hbase-procedure in the patch passed. |
| -1 | unit | 127m 54s | hbase-server in the patch failed. |
| +1 | asflicense | 0m 53s | The patch does not generate ASF License warnings. |
| | | 173m 19s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.master.assignment.TestWakeUpUnexpectedProcedure |
| | hadoop.hbase.replication.TestSyncReplicationStandbyKillRS |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21811 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12957355/HBASE-21811-v1.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux ca520ca841fd 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758849#comment-16758849 ] Duo Zhang commented on HBASE-21811:

And we need to check for the start code before executing any procedures in executeProcedures, otherwise it will also lead to multiple assigns. We need to fail all the procedures.
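Duo's point above -- verify the start code before executing anything in executeProcedures, and fail the whole batch otherwise -- might look like the following sketch. These are plain-Java stand-ins, not the actual RSRpcServices signatures: the key property is that on a start-code mismatch every procedure in the request is reported failed and none is executed, since partial execution would let a stale open slip through.

```java
// Sketch of an RS-side guard for a batched executeProcedures call.
// Types and names are stand-ins for illustration only.
import java.util.ArrayList;
import java.util.List;

public class BatchGuard {
    // Result of one procedure: its id plus whether it was rejected up front.
    static final class Result {
        final long procId; final boolean failed;
        Result(long procId, boolean failed) { this.procId = procId; this.failed = failed; }
    }

    // If the start code the master targeted does not match this server's
    // own start code, fail every procedure before executing any of them.
    static List<Result> execute(long targetStartCode, long myStartCode, List<Long> procIds) {
        List<Result> results = new ArrayList<>();
        boolean reject = targetStartCode != myStartCode;
        for (long id : procIds) {
            if (reject) {
                results.add(new Result(id, true)); // fail all, execute none
            } else {
                // ... actually run the procedure here ...
                results.add(new Result(id, false));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Long> batch = List.of(11050L, 11051L);
        // Master still targets the dead incarnation's start code: all fail.
        for (Result res : execute(1548726406632L, 1548727752747L, batch)) assert res.failed;
        // Matching incarnation: everything runs.
        for (Result res : execute(42L, 42L, batch)) assert !res.failed;
        System.out.println("batch guard checks passed");
    }
}
```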
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758819#comment-16758819 ] Duo Zhang commented on HBASE-21811: --- I know that checking the ServerName could work, but it may confuse people about what is actually going on here.
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758563#comment-16758563 ] Sergey Shelukhin commented on HBASE-21811: -- [~Apache9] the patch with a fix is already attached :) It's a straightforward server name check. This check actually already exists in the success path; it is reported in this bug's logs, where the master says the wrong server opened the region and ignores it.
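The server name check mentioned above can be sketched as follows. This is a minimal, hypothetical stand-in for HBase's real `ServerName` and procedure classes (names and signatures here are illustrative, not the actual API): a region transition report only matches the pending open when host, port, and start code all agree, so a restarted RS reusing the same host:port is rejected.

```java
// Hypothetical, simplified stand-in for HBase's ServerName; the real class
// lives in org.apache.hadoop.hbase and carries the same three fields.
final class ServerName {
    final String host;
    final int port;
    final long startCode; // RS start timestamp; changes on every restart

    ServerName(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    // Equality must include the start code, otherwise a restarted RS on the
    // same host:port looks identical to the one the master targeted.
    boolean equalsWithStartCode(ServerName other) {
        return other != null && host.equals(other.host)
            && port == other.port && startCode == other.startCode;
    }
}

public class ReportCheck {
    /** Accept a transition report only from the exact server the master sent the open to. */
    static boolean acceptReport(ServerName expected, ServerName reporting) {
        return expected != null && expected.equalsWithStartCode(reporting);
    }

    public static void main(String[] args) {
        // Start codes taken from the log excerpt in this issue.
        ServerName original = new ServerName("server_48c", 17020, 1548726406632L);
        ServerName restarted = new ServerName("server_48c", 17020, 1548727752747L);
        // Same host:port but a different start code -> stale report, ignored.
        System.out.println(acceptReport(original, restarted)); // false
        System.out.println(acceptReport(original, original));  // true
    }
}
```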
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758305#comment-16758305 ] Hadoop QA commented on HBASE-21811: ---
(x) -1 overall
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 5s | HBASE-21811 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/0.8.0/precommit-patchnames for help. |
|| Subsystem || Report/Notes ||
| JIRA Issue | HBASE-21811 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12957265/HBASE-21811-UT.patch |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/15835/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |
This message was automatically generated.
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758298#comment-16758298 ] Duo Zhang commented on HBASE-21811: --- A UT to reproduce the problem. Let me think about the best way to fix it...
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758077#comment-16758077 ] Duo Zhang commented on HBASE-21811: --- [~sershe] Let me try to write a UT to reproduce the problem if you do not mind.
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758043#comment-16758043 ] Duo Zhang commented on HBASE-21811: --- Ping [~zghaobac], I think this should be a blocker for 2.2.0. In general there is no way to interrupt an ongoing executeProcedures call if it is stuck, so the problem here can happen. I think the sequence is: the Open/CloseRegionProcedure has already been woken up by the SCP, then we schedule a new OpenRegionProcedure and suspend on the ProcedureEvent in the RegionStateNode, and then remoteCallFailed is called again for the already-completed Open/CloseRegionProcedure, which wakes up the new OpenRegionProcedure waiting on that ProcedureEvent and leads to a double assign. So we need to know that we should not wake up the procedure. But I wonder if we have a better way to determine that the wake-up is stale, other than checking the ServerName... Maybe we should check whether we are actually waiting on the event before waking it up...
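The stale-wakeup idea at the end of that comment could look something like the sketch below. This is a hypothetical, heavily simplified model (not the real amv2 classes): the event remembers the pid of the remote procedure that suspended it, and a late remoteCallFailed callback from an earlier, already-finished OpenRegionProcedure becomes a no-op instead of spuriously waking the newly scheduled one.

```java
// Hypothetical, simplified model of a ProcedureEvent held in a
// RegionStateNode. The real HBase classes differ; this only illustrates
// the "is this wake-up stale?" check discussed in the comment.
final class RegionStateNodeEvent {
    private long waitingPid = -1; // pid of the remote procedure we expect to hear from

    /** Called when a new remote procedure is dispatched and the TRSP suspends. */
    synchronized void suspend(long pid) {
        waitingPid = pid;
    }

    /**
     * Wake the event only if the caller is the procedure we are actually
     * waiting on; returns false for a stale callback from a previous attempt.
     */
    synchronized boolean wake(long callerPid) {
        if (callerPid != waitingPid) {
            return false; // stale remoteCallFailed: ignore, no double assign
        }
        waitingPid = -1;
        return true;
    }
}

public class StaleWakeDemo {
    public static void main(String[] args) {
        RegionStateNodeEvent event = new RegionStateNodeEvent();
        event.suspend(11050L);                  // new OpenRegionProcedure is waiting
        System.out.println(event.wake(10999L)); // callback from an old pid -> false
        System.out.println(event.wake(11050L)); // callback from the expected pid -> true
    }
}
```

The same effect could be achieved by comparing ServerNames, as the attached patch does; keying on the waiting procedure makes the staleness check explicit rather than implicit.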
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756857#comment-16756857 ] Hadoop QA commented on HBASE-21811: ---
(x) -1 overall
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | Docker mode activated. |
|| Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -0 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| master Compile Tests ||
| +1 | mvninstall | 5m 6s | master passed |
| +1 | compile | 2m 4s | master passed |
| +1 | checkstyle | 1m 15s | master passed |
| +1 | shadedjars | 4m 37s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 30s | master passed |
| +1 | javadoc | 0m 31s | master passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 4m 45s | the patch passed |
| +1 | compile | 2m 2s | the patch passed |
| +1 | javac | 2m 2s | the patch passed |
| -1 | checkstyle | 1m 14s | hbase-server: The patch generated 4 new + 64 unchanged - 0 fixed = 68 total (was 64) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 32s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 9m 57s | Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 43s | the patch passed |
| +1 | javadoc | 0m 41s | the patch passed |
|| Other Tests ||
| +1 | unit | 135m 23s | hbase-server in the patch passed. |
| +1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
| | | 178m 28s | |
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:b002b0b |
| JIRA Issue | HBASE-21811 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12956979/HBASE-21811.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux a09c5f4b0491 4.4.0-139-generic #165~14.04.1-Ubuntu SMP Wed Oct 31 10:55:11 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
| git revision | master / c90e9ff5ef |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC3 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/15801/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
| Test Results |
[jira] [Commented] (HBASE-21811) region can be opened on two servers due to race condition with procedures and server reports
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756626#comment-16756626 ] Sergey Shelukhin commented on HBASE-21811: -- We are not running branch-2, but I expect it will affect at least 2.2, because the assignment logic is very similar. W.r.t. 2.1 I'm not sure; I suspect it is affected too, but there have been some changes to RITs since then.
> Need to check if we verify the start code expected by master on RS when opening.
> {noformat}
> 2019-01-28 18:12:12,179 WARN [RSProcedureDispatcher-pool4-t362] assignment.RegionRemoteProcedureBase: The remote operation pid=11050, ppid=4915, state=SUCCESS, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => 8be2a423b16471b9417f0f7de04281c6 ... to server server_48c,17020,1548726406632 failed
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server server_48c,17020,1548727752747 is not running yet
> 2019-01-28 18:12:12,179 WARN [RSProcedureDispatcher-pool4-t362] procedure.RSProcedureDispatcher: server server_48c,17020,1548726406632 is not up for a while; try a new one
> {noformat}
> Without any other reason (at least none logged), the RIT immediately retries and chooses a new candidate. It then retries again and goes to the new 48c, but that's unrelated.
> {noformat}
> 2019-01-28 18:12:12,289 INFO [KeepAlivePEWorker-100] assignment.TransitRegionStateProcedure: Retry=4 of max=2147483647; pid=4915, ppid=7, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=table,
> {noformat}
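The "verify the start code" point above boils down to comparing the full server identity, not just host and port: a restarted RS keeps the same host:port but gets a new start code (visible in the logs as `server_48c,17020,1548726406632` vs `server_48c,17020,1548727752747`). A minimal sketch of that guard, using hypothetical `ServerId`/`OpenReportCheck` classes rather than the real HBase `ServerName`/`AssignmentManager` API:

```java
// Sketch only (assumed names, not actual HBase classes): a region open is
// dispatched to one specific server *instance*, identified by host, port AND
// start code. A report or failure from a different instance on the same
// host:port (i.e. a restarted RS) must not be matched to the old dispatch.
final class ServerId {
    final String host;
    final int port;
    final long startCode; // start timestamp; changes when the RS restarts

    ServerId(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    boolean sameInstance(ServerId other) {
        return host.equals(other.host)
            && port == other.port
            && startCode == other.startCode; // host:port alone is not enough
    }
}

final class OpenReportCheck {
    /** Accept an "opened" report only from the exact instance the open went to. */
    static boolean acceptReport(ServerId dispatchedTo, ServerId reporter) {
        return dispatchedTo.sameInstance(reporter);
    }

    public static void main(String[] args) {
        // Start codes taken from the log excerpt above.
        ServerId original = new ServerId("server_48c", 17020, 1548726406632L);
        ServerId restarted = new ServerId("server_48c", 17020, 1548727752747L);
        System.out.println(acceptReport(original, restarted)); // false: restarted instance
        System.out.println(acceptReport(original, original));  // true: same instance
    }
}
```

The same comparison would apply on the RS side when it receives an open request: it should verify that the master addressed this instance (matching start code), not a previous incarnation.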
[ https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756578#comment-16756578 ] Josh Elser commented on HBASE-21811: Is this unique to master, [~sershe], or do you see this on branch-2\* too?