[jira] [Updated] (HBASE-17625) Slow to enable a table
[ https://issues.apache.org/jira/browse/HBASE-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-17625: Description: Tried to enable a table with 10k+ regions; it took more time to generate the plan than to do the actual assignment. This is so embarrassing. :) It turns out that it took quite some time to get the top HDFS block locations when registering regions while creating the Cluster object. was: Tried to enable a table with 10k+ regions; it took more time to generate the plan than to do the actual assignment. This is so embarrassing. :) It turns out that it took quite some time to get the top HDFS block locations when registering regions while creating the Cluster object. There is no new region server, so why do we need such info when trying to retain the assignment? Is the region availability thing related to region replicas? Can we avoid such a penalty if region replicas are not needed? > Slow to enable a table > -- > > Key: HBASE-17625 > URL: https://issues.apache.org/jira/browse/HBASE-17625 > Project: HBase > Issue Type: Improvement >Reporter: Jimmy Xiang > > Tried to enable a table with 10k+ regions; it took more time to generate the > plan than to do the actual assignment. This is so embarrassing. :) > It turns out that it took quite some time to get the top HDFS block locations > when registering regions while creating the Cluster object. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HBASE-17625) Slow to enable a table
Jimmy Xiang created HBASE-17625: --- Summary: Slow to enable a table Key: HBASE-17625 URL: https://issues.apache.org/jira/browse/HBASE-17625 Project: HBase Issue Type: Improvement Reporter: Jimmy Xiang Tried to enable a table with 10k+ regions; it took more time to generate the plan than to do the actual assignment. This is so embarrassing. :) It turns out that it took quite some time to get the top HDFS block locations when registering regions while creating the Cluster object. There is no new region server, so why do we need such info when trying to retain the assignment? Is the region availability thing related to region replicas? Can we avoid such a penalty if region replicas are not needed?
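The messages above attribute slow plan generation to fetching the top HDFS block locations once per region while building the Cluster object. One common mitigation for this kind of repeated, expensive lookup is memoization. The sketch below is illustrative only: `RegionLocalityCache`, its loader function, and the host lists are hypothetical names, not HBase's actual balancer or FSUtils API.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical memoizing cache for per-region top block locations.
 * Each region's locality is fetched at most once, instead of once per
 * plan-generation pass over the region list.
 */
public class RegionLocalityCache {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> loader;
    final AtomicInteger loaderCalls = new AtomicInteger(); // visible for demonstration

    RegionLocalityCache(Function<String, List<String>> loader) {
        this.loader = loader;
    }

    List<String> topHosts(String regionName) {
        // computeIfAbsent invokes the loader only on a cache miss.
        return cache.computeIfAbsent(regionName, r -> {
            loaderCalls.incrementAndGet(); // the expensive NameNode round trip in real life
            return loader.apply(r);
        });
    }

    public static void main(String[] args) {
        RegionLocalityCache c =
            new RegionLocalityCache(r -> List.of("host-a", "host-b"));
        // 10,000 lookups over only 100 distinct regions:
        for (int i = 0; i < 10_000; i++) {
            c.topHosts("region-" + (i % 100));
        }
        System.out.println("loader calls: " + c.loaderCalls.get()); // prints: loader calls: 100
    }
}
```

The issue also hints at a cheaper option when retaining assignment with no new servers: skipping the locality fetch entirely, since the old plan is reused anyway.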
[jira] [Commented] (HBASE-11611) Clean up ZK-based region assignment
[ https://issues.apache.org/jira/browse/HBASE-11611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649386#comment-14649386 ] Jimmy Xiang commented on HBASE-11611: - This change is in the trunk branch (2.0). We cannot upgrade 0.94 to 2.0 directly. One upgrade path could be 0.94 -> 0.96/0.98 -> 1.0 -> 2.0. > Clean up ZK-based region assignment > --- > > Key: HBASE-11611 > URL: https://issues.apache.org/jira/browse/HBASE-11611 > Project: HBase > Issue Type: Improvement > Components: Region Assignment >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0 > > Attachments: hbase-11611.addendum, hbase-11611.patch, > hbase-11611_v1.patch, hbase-11611_v2.patch > > > We can clean up the ZK-based region assignment code and use the ZK-less one > in the master branch, to make the code easier to understand and maintain.
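The comment above describes a multi-hop upgrade path because 0.94 cannot jump straight to 2.0. The chain of allowed hops can be modeled as a small graph and resolved with a breadth-first search. This is a sketch under the assumption that the only supported single hops are the ones the comment names; the `HOPS` table and `findPath` helper are illustrative, not an HBase tool.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Toy resolver for a supported-upgrade-path graph (versions as strings). */
public class UpgradePath {
    // Hypothetical single-hop upgrades, per the 0.94 -> 0.96/0.98 -> 1.0 -> 2.0 chain.
    static final Map<String, List<String>> HOPS = Map.of(
        "0.94", List.of("0.96", "0.98"),
        "0.96", List.of("1.0"),
        "0.98", List.of("1.0"),
        "1.0", List.of("2.0"),
        "2.0", List.of());

    /** BFS for a shortest supported path; empty list means no supported path. */
    static List<String> findPath(String from, String to) {
        Map<String, String> prev = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>(List.of(from));
        prev.put(from, null);
        while (!queue.isEmpty()) {
            String v = queue.poll();
            if (v.equals(to)) {
                LinkedList<String> path = new LinkedList<>();
                for (String s = v; s != null; s = prev.get(s)) path.addFirst(s);
                return path;
            }
            for (String next : HOPS.getOrDefault(v, List.of())) {
                if (!prev.containsKey(next)) {
                    prev.put(next, v);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(findPath("0.94", "2.0")); // e.g. [0.94, 0.96, 1.0, 2.0]
    }
}
```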
[jira] [Commented] (HBASE-13605) RegionStates should not keep its list of dead servers
[ https://issues.apache.org/jira/browse/HBASE-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527748#comment-14527748 ] Jimmy Xiang commented on HBASE-13605: - The patch is fine with me for the master branch. For branch-1, I am not sure. > RegionStates should not keep its list of dead servers > - > > Key: HBASE-13605 > URL: https://issues.apache.org/jira/browse/HBASE-13605 > Project: HBase > Issue Type: Bug >Reporter: Enis Soztutar >Assignee: Enis Soztutar > Fix For: 2.0.0, 1.0.2, 1.1.1 > > Attachments: hbase-13605_v1.patch > > > As mentioned in > https://issues.apache.org/jira/browse/HBASE-9514?focusedCommentId=13769761&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13769761 > and HBASE-12844 we should have only one source of cluster membership. > The list of dead servers and RegionStates doing its own liveness check > (ServerManager.isServerReachable()) has caused an assignment problem again in > a test cluster where the region states "thinks" that the server is dead and > SSH will handle the region assignment. However the RS is not dead at all, > living happily, and never gets a zk expiry or YouAreDeadException or anything. > This leaves the list of regions unassigned in the OFFLINE state. > master assigning the region: > {code} > 15-04-20 09:02:25,780 DEBUG [AM.ZK.Worker-pool3-t330] master.RegionStates: > Onlined 77dddcd50c22e56bfff133c0e1f9165b on > os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 {ENCODED => > 77dddcd50c > {code} > Master then disabled the table, and unassigned the region: > {code} > 2015-04-20 09:02:27,158 WARN [ProcedureExecutorThread-1] > zookeeper.ZKTableStateManager: Moving table loadtest_d1 state from DISABLING > to DISABLING > Starting unassign of > loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b.
(offlining), > current state: {77dddcd50c22e56bfff133c0e1f9165b state=OPEN, > ts=1429520545780, > server=os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268} > bleProcedure$BulkDisabler-0] master.AssignmentManager: Sent CLOSE to > os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 for region > loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b. > 2015-04-20 09:02:27,414 INFO [AM.ZK.Worker-pool3-t316] master.RegionStates: > Offlined 77dddcd50c22e56bfff133c0e1f9165b from > os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 > {code} > On table re-enable, AM does not assign the region: > {code} > 2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3] > balancer.BaseLoadBalancer: Reassigned 25 regions. 25 retained the pre-restart > assignment.· > 2015-04-20 09:02:30,415 INFO [ProcedureExecutorThread-3] > procedure.EnableTableProcedure: Bulk assigning 25 region(s) across 5 > server(s), retainAssignment=true > l,16000,1429515659726-GeneralBulkAssigner-4] master.RegionStates: Couldn't > reach online server > os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 > l,16000,1429515659726-GeneralBulkAssigner-4] master.AssignmentManager: > Updating the state to OFFLINE to allow to be reassigned by SSH > nmentManager: Skip assigning > loadtest_d1,,1429520544378.77dddcd50c22e56bfff133c0e1f9165b., it is on a dead > but not processed yet server: > os-amb-r6-us-1429512014-hbase4-6.novalocal,16020,1429520535268 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13605) RegionStates should not keep its list of dead servers
[ https://issues.apache.org/jira/browse/HBASE-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526944#comment-14526944 ] Jimmy Xiang commented on HBASE-13605: - I confused the dead server list with the processed server list. With ZK-less region assignment, we don't need to ping the server anymore: the server could be restarted right after the ping returns, so we should not rely on the ping result, i.e. the ping result is no better than the DeadServer list.
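The point about not trusting a ping can be made concrete. HBase identifies a server process as "host,port,startcode", so a ping that reaches host:port after a restart is answered by a *different* process: regions last hosted by the old startcode still need their logs split. A minimal sketch, with hypothetical server-name strings:

```java
/**
 * Why a liveness ping cannot replace the dead-server bookkeeping:
 * an HBase server name "host,port,startcode" identifies a process,
 * not a machine. A restarted RegionServer answers pings on the same
 * host:port but is a new process with a new startcode.
 */
public class PingStaleness {
    /** Two server names denote the same process only if all three parts match. */
    static boolean sameProcess(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String before = "rs-1.example.com,16020,1429520535268"; // hypothetical
        String after  = "rs-1.example.com,16020,1429520599999"; // after restart: new startcode
        // A TCP-level ping to rs-1.example.com:16020 succeeds in both cases,
        // yet regions assigned to `before` still require dead-server processing.
        System.out.println(sameProcess(before, after)); // prints false
    }
}
```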
[jira] [Commented] (HBASE-13605) RegionStates should not keep its list of dead servers
[ https://issues.apache.org/jira/browse/HBASE-13605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523986#comment-14523986 ] Jimmy Xiang commented on HBASE-13605: - The dead server list in RegionStates is a list of servers that are dead and have already been processed by SSH. The AM uses it to make sure regions are not assigned before SSH has finished log splitting, which is critical to avoid data loss. RegionStates does not know a server is dead unless SSH tells it. If the server is not dead but is on the dead server list in RegionStates, that is possible only if your cluster has a time-sync issue. I was wondering how this actually happened; there may be some clue in the log.
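The invariant described above (never reassign a region off a dead server until SSH has split that server's logs) can be sketched as a simple guard. The class and method names below are illustrative, not the real AssignmentManager API.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of the assignment guard the comment describes: a region whose
 * last host is dead may be reassigned only after server shutdown handling
 * (SSH) has finished log splitting for that server; otherwise unflushed
 * edits in the WAL could be lost.
 */
public class AssignGuard {
    final Set<String> deadServers = new HashSet<>();   // servers known dead
    final Set<String> logSplitDone = new HashSet<>();  // dead servers fully processed by SSH

    boolean safeToAssign(String lastHost) {
        // Alive server, or dead-and-processed: safe.
        // Dead but logs not yet split: assigning now risks data loss.
        return !deadServers.contains(lastHost) || logSplitDone.contains(lastHost);
    }

    public static void main(String[] args) {
        AssignGuard g = new AssignGuard();
        String rs = "rs-2.example.com,16020,1429520535268"; // hypothetical
        g.deadServers.add(rs);
        System.out.println(g.safeToAssign(rs)); // prints false: wait for SSH
        g.logSplitDone.add(rs);                 // SSH finished log splitting
        System.out.println(g.safeToAssign(rs)); // prints true
    }
}
```

This is also why the "Skip assigning ... dead but not processed yet server" log line in the quoted description is correct behavior in itself; the bug was that the server was wrongly placed on the dead list.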
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393279#comment-14393279 ] Jimmy Xiang commented on HBASE-13337: - I looked into it again and tried to reproduce it. Whenever a server is restarted, we get lots of java.nio.channels.ClosedChannelException. This is something new. So I tried some old code, and found that there is no such problem at master-branch commit 1723245282ba39567f7da4234cdd31ba534cb869 (Add in an hbasecon2015 logo for the banner). This does not seem to be a problem with assignment. I am not sure if branch-1 is affected now. It looks like this has something to do with the async RPC changes. Connections do not seem to recover from server restarts. > Table regions are not assigning back, after restarting all regionservers at > once. > - > > Key: HBASE-13337 > URL: https://issues.apache.org/jira/browse/HBASE-13337 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: Y. SREENIVASULU REDDY >Priority: Blocker > Fix For: 2.0.0 > > Attachments: HBASE-13337.patch > > > Regions of the table are continuously in state=FAILED_CLOSE. > {noformat} > RegionState > > RIT time (ms) > 8f62e819b356736053e06240f7f7c6fd > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM1,16040,1427362531818 113929 > caf59209ae65ea80fca6bdc6996a7d68 > t1,,1427362431330.caf59209ae65ea80fca6bdc6996a7d68. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM2,16040,1427362533691 113929 > db52a74988f71e5cf257bbabf31f26f3 > t1,,1427362431330.db52a74988f71e5cf257bbabf31f26f3. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM3,16040,1427362533691 113920 > 43f3a65b9f9ff283f598c5450feab1f8 > t1,,1427362431330.43f3a65b9f9ff283f598c5450feab1f8. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM1,16040,1427362531818 113920 > {noformat} > *Steps to reproduce:* > 1. Start HBase cluster with more than one regionserver. > 2. Create a table with precreated regions. (let's say 15 regions) > 3. Make sure the regions are well balanced. > 4. Restart all the RegionServer processes at once across the cluster, except the > HMaster process. > 5. After restarting, the RegionServers successfully reconnect to the > HMaster. > *Bug:* > But no regions are assigned back to the RegionServers. > *Master log shows as follows:* > {noformat} > 2015-03-26 15:05:36,201 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd > state=OFFLINE, ts=1427362536106, server=VM2,16040,1427362242602} to > {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, > server=VM1,16040,1427362531818} > 2015-03-26 15:05:36,202 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStateStore: Updating row > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with > state=PENDING_OPEN&sn=VM1,16040,1427362531818 > 2015-03-26 15:05:36,244 DEBUG [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.AssignmentManager: Force region state offline > {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, > server=VM1,16040,1427362531818} > 2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd > state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818} to > {8f62e819b356736053e06240f7f7c6fd state=PENDING_CLOSE, ts=1427362536244, > server=VM1,16040,1427362531818} > 2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStateStore: Updating row > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. 
with > state=PENDING_CLOSE > 2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.AssignmentManager: Server VM1,16040,1427362531818 returned > java.nio.channels.ClosedChannelException for > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=1 of 10 > 2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.AssignmentManager: Server VM1,16040,1427362531818 returned > java.nio.channels.ClosedChannelException for > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=2 of 10 > 2015-03-26 15:05:36,249 IN
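The try=1..10 log lines above suggest the master kept retrying the CLOSE over a connection that never recovered after the restart. One standard fix for that failure mode is to discard the cached connection on ClosedChannelException so the next attempt reconnects. The sketch below is illustrative, not HBase's actual RPC client; the `Rpc` interface and the fresh-connection flag are hypothetical.

```java
import java.nio.channels.ClosedChannelException;

/**
 * Sketch of a retry loop that forces a reconnect after a
 * ClosedChannelException instead of reusing a dead cached channel.
 */
public class RetryWithReconnect {
    /** Hypothetical RPC: the flag says whether to dial a fresh connection. */
    interface Rpc<T> {
        T call(boolean freshConnection) throws Exception;
    }

    static <T> T callWithRetries(Rpc<T> rpc, int maxTries) throws Exception {
        boolean fresh = false;
        Exception last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return rpc.call(fresh);
            } catch (ClosedChannelException e) {
                last = e;
                fresh = true; // drop the dead cached channel; reconnect on the next try
            }
        }
        throw last; // all attempts exhausted
    }

    public static void main(String[] args) throws Exception {
        // Simulated RPC: fails on the stale channel, succeeds once reconnected.
        String result = callWithRetries(fresh -> {
            if (!fresh) throw new ClosedChannelException();
            return "CLOSED region";
        }, 10);
        System.out.println(result); // prints: CLOSED region
    }
}
```

Without the reconnect, the loop above would burn all ten tries on the same dead channel, which matches the repeated failures in the quoted log.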
[jira] [Updated] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-13337: Assignee: (was: Jimmy Xiang)
[jira] [Updated] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-13337: Status: Open (was: Patch Available)
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392835#comment-14392835 ] Jimmy Xiang commented on HBASE-13337: - Branch-1 is affected if ZK-less assignment is turned on.
[jira] [Updated] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-13337:
--------------------------------
    Status: Patch Available  (was: Open)

> Table regions are not assigning back, after restarting all regionservers at once.
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-13337
>                 URL: https://issues.apache.org/jira/browse/HBASE-13337
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: Y. SREENIVASULU REDDY
>            Assignee: Jimmy Xiang
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: HBASE-13337.patch
>
> Regions of the table are continuously in state=FAILED_CLOSE.
> {noformat}
> RegionState                                                                                RIT time (ms)
> 8f62e819b356736053e06240f7f7c6fd  t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd.  state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM1,16040,1427362531818  113929
> caf59209ae65ea80fca6bdc6996a7d68  t1,,1427362431330.caf59209ae65ea80fca6bdc6996a7d68.  state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM2,16040,1427362533691  113929
> db52a74988f71e5cf257bbabf31f26f3  t1,,1427362431330.db52a74988f71e5cf257bbabf31f26f3.  state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM3,16040,1427362533691  113920
> 43f3a65b9f9ff283f598c5450feab1f8  t1,,1427362431330.43f3a65b9f9ff283f598c5450feab1f8.  state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), server=VM1,16040,1427362531818  113920
> {noformat}
> *Steps to reproduce:*
> 1. Start an HBase cluster with more than one RegionServer.
> 2. Create a table with pre-created regions (say, 15 regions).
> 3. Make sure the regions are well balanced.
> 4. Restart all the RegionServer processes at once across the cluster, leaving the HMaster process running.
> 5. After restarting, the RegionServers successfully reconnect to the HMaster.
> *Bug:*
> No regions are assigned back to the RegionServers.
> *Master log shows the following:*
> {noformat}
> 2015-03-26 15:05:36,201 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd state=OFFLINE, ts=1427362536106, server=VM2,16040,1427362242602} to {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818}
> 2015-03-26 15:05:36,202 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStateStore: Updating row t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with state=PENDING_OPEN&sn=VM1,16040,1427362531818
> 2015-03-26 15:05:36,244 DEBUG [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Force region state offline {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818}
> 2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818} to {8f62e819b356736053e06240f7f7c6fd state=PENDING_CLOSE, ts=1427362536244, server=VM1,16040,1427362531818}
> 2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.RegionStateStore: Updating row t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with state=PENDING_CLOSE
> 2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=1 of 10
> 2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=2 of 10
> 2015-03-26 15:05:36,249 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=3 of 10
> 2015-03-26 15:05:36,249 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] master.AssignmentManager: Server VM1,16040,1427362531818 returned java.nio.channels.ClosedChannelException for t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=4 of 10
> 2015-03-26 15:05:36,249 INFO [VM2,16020,1
> {noformat}
[jira] [Updated] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-13337:
--------------------------------
    Attachment: HBASE-13337.patch

Attached a patch that checks for such a scenario in case a region is forced to assign with a new plan. [~sreenivasulureddy], could you give it a try?
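The check the patch describes can be sketched roughly as below. This is a hedged illustration only, not the actual patch: `AssignCheckSketch`, `canForceNewPlan`, and the server-tracking set are invented names, and HBase's real AssignmentManager/ServerManager APIs differ.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (invented names): before forcing a region offline for a
// new assignment plan, confirm the plan's destination server is still online.
// Otherwise the close RPC can only fail with ClosedChannelException and the
// region gets stuck in FAILED_CLOSE, as in the master log above.
public class AssignCheckSketch {
    private final Set<String> onlineServers = new HashSet<>();

    public void serverJoined(String serverName) { onlineServers.add(serverName); }

    public void serverDied(String serverName) { onlineServers.remove(serverName); }

    /** A forced new plan should only proceed if its target is still live. */
    public boolean canForceNewPlan(String destinationServer) {
        return onlineServers.contains(destinationServer);
    }

    public static void main(String[] args) {
        AssignCheckSketch am = new AssignCheckSketch();
        am.serverJoined("VM1,16040,1427362531818");
        // Rolling restart: the old server incarnation disappears.
        am.serverDied("VM1,16040,1427362531818");
        System.out.println(am.canForceNewPlan("VM1,16040,1427362531818")); // prints "false"
    }
}
```

The key point is that the liveness check happens before the plan is forced, rather than after ten failed RPC retries.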
[jira] [Assigned] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang reassigned HBASE-13337:
-----------------------------------
    Assignee: Jimmy Xiang
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385920#comment-14385920 ]

Jimmy Xiang commented on HBASE-13337:
-------------------------------------
Thanks a lot for verifying it.
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384523#comment-14384523 ]

Jimmy Xiang commented on HBASE-13337:
-------------------------------------
By the way, as a workaround: if you restart the master (as a step 6), all regions should be assigned as expected.
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384489#comment-14384489 ]

Jimmy Xiang commented on HBASE-13337:
-------------------------------------
For a graceful shutdown there is no need for log splitting. The regions on the dead server are still re-assigned by SSH if the master is not restarted.
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384394#comment-14384394 ]

Jimmy Xiang commented on HBASE-13337:
-------------------------------------
Or the master doesn't learn of it soon enough. It is a race between SSH and the master's ZK event handling (for a regionserver going away). Possible fixes: (1) fail the region assignments and re-queue the dead server for SSH, or (2) fail the master process (by shutting itself down) if such a scenario is detected.
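Proposed fix (1) can be sketched as follows. This is an assumption-laden illustration: `RequeueSketch`, `tryAssign`, and `sshQueue` are invented names, and real re-queueing in HBase goes through ServerShutdownHandler rather than a plain deque.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Sketch of fix (1) (invented names): instead of retrying an assignment RPC
// against a dead server ten times, fail the assignment immediately and
// re-queue the dead server so SSH can re-plan its regions onto live servers.
public class RequeueSketch {
    final Set<String> onlineServers = new HashSet<>();
    final Deque<String> sshQueue = new ArrayDeque<>();

    /** Returns true if the open RPC would be issued; false if re-queued for SSH. */
    boolean tryAssign(String region, String targetServer) {
        if (!onlineServers.contains(targetServer)) {
            sshQueue.addLast(targetServer); // let SSH handle the dead server
            return false;
        }
        return true; // real code would send the openRegion RPC here
    }

    public static void main(String[] args) {
        RequeueSketch master = new RequeueSketch();
        master.onlineServers.add("VM2,16040,1427362533691");
        // VM1's old incarnation is gone, so the plan is abandoned, not retried.
        boolean assigned = master.tryAssign(
            "8f62e819b356736053e06240f7f7c6fd", "VM1,16040,1427362531818");
        System.out.println(assigned + " queued=" + master.sshQueue.size()); // prints "false queued=1"
    }
}
```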
[jira] [Commented] (HBASE-13337) Table regions are not assigning back, after restarting all regionservers at once.
[ https://issues.apache.org/jira/browse/HBASE-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384353#comment-14384353 ] Jimmy Xiang commented on HBASE-13337: - At step 4, all the region servers are down, master needs to split log and recover those regions. In ServerShutdownHandler, we have {noformat} while (!this.server.isStopped() && serverManager.countOfRegionServers() < 2) { {noformat} to wait till some regionserver joins in before assigning any region. This looks like a racing issue. Although all regionservers are down, master doesn't know it yet. So SSH starts to assign regions, which fails. > Table regions are not assigning back, after restarting all regionservers at > once. > - > > Key: HBASE-13337 > URL: https://issues.apache.org/jira/browse/HBASE-13337 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0 >Reporter: Y. SREENIVASULU REDDY >Priority: Blocker > Fix For: 2.0.0 > > > Regions of the table are continouly in state=FAILED_CLOSE. > {noformat} > RegionState > > RIT time (ms) > 8f62e819b356736053e06240f7f7c6fd > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM1,16040,1427362531818 113929 > caf59209ae65ea80fca6bdc6996a7d68 > t1,,1427362431330.caf59209ae65ea80fca6bdc6996a7d68. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM2,16040,1427362533691 113929 > db52a74988f71e5cf257bbabf31f26f3 > t1,,1427362431330.db52a74988f71e5cf257bbabf31f26f3. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM3,16040,1427362533691 113920 > 43f3a65b9f9ff283f598c5450feab1f8 > t1,,1427362431330.43f3a65b9f9ff283f598c5450feab1f8. > state=FAILED_CLOSE, ts=Thu Mar 26 15:05:36 IST 2015 (113s ago), > server=VM1,16040,1427362531818 113920 > {noformat} > *Steps to reproduce:* > 1. Start HBase cluster with more than one regionserver. > 2. 
Create a table with precreated regions. (lets say 15 regions) > 3. Make sure the regions are well balanced. > 4. Restart all the Regionservers process at once across the cluster, except > HMaster process > 5. After restarting the Regionservers, successfully will connect to the > HMaster. > *Bug:* > But no regions are assigning back to the Regionservers. > *Master log shows as follows:* > {noformat} > 2015-03-26 15:05:36,201 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd > state=OFFLINE, ts=1427362536106, server=VM2,16040,1427362242602} to > {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, > server=VM1,16040,1427362531818} > 2015-03-26 15:05:36,202 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStateStore: Updating row > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. with > state=PENDING_OPEN&sn=VM1,16040,1427362531818 > 2015-03-26 15:05:36,244 DEBUG [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.AssignmentManager: Force region state offline > {8f62e819b356736053e06240f7f7c6fd state=PENDING_OPEN, ts=1427362536201, > server=VM1,16040,1427362531818} > 2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStates: Transition {8f62e819b356736053e06240f7f7c6fd > state=PENDING_OPEN, ts=1427362536201, server=VM1,16040,1427362531818} to > {8f62e819b356736053e06240f7f7c6fd state=PENDING_CLOSE, ts=1427362536244, > server=VM1,16040,1427362531818} > 2015-03-26 15:05:36,244 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.RegionStateStore: Updating row > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd. 
with > state=PENDING_CLOSE > 2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.AssignmentManager: Server VM1,16040,1427362531818 returned > java.nio.channels.ClosedChannelException for > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=1 of 10 > 2015-03-26 15:05:36,248 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.AssignmentManager: Server VM1,16040,1427362531818 returned > java.nio.channels.ClosedChannelException for > t1,,1427362431330.8f62e819b356736053e06240f7f7c6fd., try=2 of 10 > 2015-03-26 15:05:36,249 INFO [VM2,16020,1427362216887-GeneralBulkAssigner-0] > master.AssignmentManager: Server VM1,16040,1427362531818 returned > java.nio.channels.ClosedChannelException
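The comment above describes a race: ServerShutdownHandler gates assignment on the live-regionserver count, but that count can be stale because the master has not yet noticed that every regionserver is down. A minimal, hypothetical sketch of that gate follows; the names are illustrative stand-ins, not HBase's actual ServerManager API.

```java
// Illustrative stand-in for the SSH gate quoted above: assignment should
// not start until the master knows of at least two live regionservers.
// The race is that the count may still include servers that are already
// dead but whose ZK sessions have not yet expired.
public class AssignmentGate {
    // Mirror of: while (!server.isStopped() && countOfRegionServers() < 2)
    static boolean mayStartAssigning(boolean masterStopped, int knownLiveServers) {
        return !masterStopped && knownLiveServers >= 2;
    }
}
```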
[jira] [Commented] (HBASE-13194) TableNamespaceManager not ready cause MasterQuotaManager initialization fail
[ https://issues.apache.org/jira/browse/HBASE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358931#comment-14358931 ] Jimmy Xiang commented on HBASE-13194: - Looks good to me. > TableNamespaceManager not ready cause MasterQuotaManager initialization fail > - > > Key: HBASE-13194 > URL: https://issues.apache.org/jira/browse/HBASE-13194 > Project: HBase > Issue Type: Bug > Components: master >Affects Versions: 2.0.0 >Reporter: zhangduo >Assignee: zhangduo > Fix For: 2.0.0 > > Attachments: HBASE-13194.patch > > > This cause TestNamespaceAuditor to fail. > https://builds.apache.org/job/HBase-TRUNK/6237/testReport/junit/org.apache.hadoop.hbase.namespace/TestNamespaceAuditor/testRegionOperations/ > {noformat} > 2015-03-10 22:42:01,372 ERROR [hemera:48616.activeMasterManager] > namespace.NamespaceStateManager(204): Error while update namespace state. > java.io.IOException: Table Namespace Manager not ready yet, try again later > at > org.apache.hadoop.hbase.master.HMaster.checkNamespaceManagerReady(HMaster.java:1912) > at > org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:2131) > at > org.apache.hadoop.hbase.namespace.NamespaceStateManager.initialize(NamespaceStateManager.java:188) > at > org.apache.hadoop.hbase.namespace.NamespaceStateManager.start(NamespaceStateManager.java:63) > at > org.apache.hadoop.hbase.namespace.NamespaceAuditor.start(NamespaceAuditor.java:57) > at > org.apache.hadoop.hbase.quotas.MasterQuotaManager.start(MasterQuotaManager.java:88) > at > org.apache.hadoop.hbase.master.HMaster.initQuotaManager(HMaster.java:902) > at > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:756) > at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:161) > at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1455) > at java.lang.Thread.run(Thread.java:744) > {noformat} > The direct reason is that we do not have a retry here, if init fails 
then it > always fails. But I skimmed the code, seems there is no async init operations > when calling finishActiveMasterInitialization, so it is very strange. Need to > dig more. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
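The report above says the direct cause is the missing retry: if init fails once, it always fails. A hedged sketch of what a bounded retry around such an init could look like; the helper name and shape are hypothetical, not the actual HMaster code.

```java
import java.util.function.Supplier;

// Hypothetical sketch of the missing retry described above: wrap an init
// step in a bounded retry with a fixed backoff instead of failing on the
// first attempt. Assumes attempts > 0.
public class RetryInit {
    static <T> T initWithRetry(Supplier<T> init, int attempts, long backoffMs)
            throws InterruptedException {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return init.get();
            } catch (RuntimeException e) {
                last = e;               // remember the failure and retry
                Thread.sleep(backoffMs);
            }
        }
        throw last;                     // exhausted all retries
    }
}
```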
[jira] [Commented] (HBASE-13194) TableNamespaceManager not ready cause MasterQuotaManager initialization fail
[ https://issues.apache.org/jira/browse/HBASE-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357906#comment-14357906 ] Jimmy Xiang commented on HBASE-13194: - The code doesn't match the comment any more, right? If failed to init namespace table, we throw an exception now. It used to not throw an exception when the namespace table is just introduced. > TableNamespaceManager not ready cause MasterQuotaManager initialization fail > - > > Key: HBASE-13194 > URL: https://issues.apache.org/jira/browse/HBASE-13194 > Project: HBase > Issue Type: Bug > Components: master >Reporter: zhangduo > > This cause TestNamespaceAuditor to fail. > https://builds.apache.org/job/HBase-TRUNK/6237/testReport/junit/org.apache.hadoop.hbase.namespace/TestNamespaceAuditor/testRegionOperations/ > {noformat} > 2015-03-10 22:42:01,372 ERROR [hemera:48616.activeMasterManager] > namespace.NamespaceStateManager(204): Error while update namespace state. > java.io.IOException: Table Namespace Manager not ready yet, try again later > at > org.apache.hadoop.hbase.master.HMaster.checkNamespaceManagerReady(HMaster.java:1912) > at > org.apache.hadoop.hbase.master.HMaster.listNamespaceDescriptors(HMaster.java:2131) > at > org.apache.hadoop.hbase.namespace.NamespaceStateManager.initialize(NamespaceStateManager.java:188) > at > org.apache.hadoop.hbase.namespace.NamespaceStateManager.start(NamespaceStateManager.java:63) > at > org.apache.hadoop.hbase.namespace.NamespaceAuditor.start(NamespaceAuditor.java:57) > at > org.apache.hadoop.hbase.quotas.MasterQuotaManager.start(MasterQuotaManager.java:88) > at > org.apache.hadoop.hbase.master.HMaster.initQuotaManager(HMaster.java:902) > at > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:756) > at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:161) > at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1455) > at java.lang.Thread.run(Thread.java:744) > 
{noformat} > The direct reason is that we do not have a retry here, if init fails then it > always fails. But I skimmed the code, seems there is no async init operations > when calling finishActiveMasterInitialization, so it is very strange. Need to > dig more. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
[ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353452#comment-14353452 ] Jimmy Xiang commented on HBASE-13172: - +1. Looks good to me. As to the issue [~jeffreyz] pointed out, that part is needed. It is preferred that a RS dies naturally (i.e., per ZK) instead of being marked dead by the AM. Calling isServerReachable should not return false information after retries, even if the retries take longer than the ZK session timeout, since we check the start code. > TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1 > > > Key: HBASE-13172 > URL: https://issues.apache.org/jira/browse/HBASE-13172 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 1.1.0 >Reporter: zhangduo >Assignee: zhangduo > Attachments: HBASE-13172-branch-1.patch > > > The direct reason is we are stuck in ServerManager.isServerReachable. > https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/ > {noformat} > 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): > Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10 > 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): > Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10 > {noformat} > The interval between first and last retry log is about 1 minute, and we only > wait 1 minute so the test is timeout. > Still do not know why this happen. 
> And at last there are lots of this > {noformat} > 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): > Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10 > org.apache.hadoop.hbase.ipc.StoppedRpcClientException > at > org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261) > at > org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287) > at > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031) > at > org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797) > at > org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850) > at > org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843) > at > org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576) > at > org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > {noformat} > I think the problem is here > {code:title=ServerManager.java} > while (retryCounter.shouldRetry()) { > ... > try { > retryCounter.sleepUntilNextRetry(); > } catch(InterruptedException ie) { > Thread.currentThread().interrupt(); > } > ... 
> } > {code} > We need to break out of the while loop when getting InterruptedException, not > just mark current thread as interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
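The fix proposed above is to break out of the retry loop on InterruptedException instead of only re-setting the interrupt flag. A hedged, simplified sketch of that corrected loop; the method is a stand-in, not the actual ServerManager.isServerReachable.

```java
// Sketch of the suggested fix for the ServerManager retry loop: when the
// sleep between retries is interrupted, restore the thread's interrupt
// status AND break out, instead of spinning through the remaining
// retries with sleeps that now throw immediately.
public class RetryLoop {
    static int retryWithInterruptHandling(int maxRetries) {
        int attempts = 0;
        while (attempts < maxRetries) {
            attempts++;
            // ... attempt the RPC here (omitted) ...
            try {
                Thread.sleep(5L); // stand-in for retryCounter.sleepUntilNextRetry()
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                break;                              // the fix: stop retrying
            }
        }
        return attempts;
    }
}
```

With the interrupt status already set, Thread.sleep throws immediately, so the loop exits after the first attempt rather than burning all ten retries.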
[jira] [Commented] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
[ https://issues.apache.org/jira/browse/HBASE-13172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352339#comment-14352339 ] Jimmy Xiang commented on HBASE-13172: - Some tests at branch-1 are more flaky than in master because we may kill RS holding meta which takes longer to recover. In master, there is no such issue since meta is on master all the time. This also means it is usually a bug if some assignment related test is flaky in master. For branch-1, it is a little complicated. You are right this test is not meant to test region assignment. If we can assure the 3 RS killed don't hold meta, the test may not be that flaky. We can have another test for meta handling if there is not such a testcase already. > TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1 > > > Key: HBASE-13172 > URL: https://issues.apache.org/jira/browse/HBASE-13172 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 1.1.0 >Reporter: zhangduo > > The direct reason is we are stuck in ServerManager.isServerReachable. > https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/ > {noformat} > 2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): > Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10 > 2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): > Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10 > {noformat} > The interval between first and last retry log is about 1 minute, and we only > wait 1 minute so the test is timeout. > Still do not know why this happen. 
> And at last there are lots of this > {noformat} > 2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): > Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10 > org.apache.hadoop.hbase.ipc.StoppedRpcClientException > at > org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261) > at > org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287) > at > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031) > at > org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797) > at > org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850) > at > org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843) > at > org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576) > at > org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > {noformat} > I think the problem is here > {code:title=ServerManager.java} > while (retryCounter.shouldRetry()) { > ... > try { > retryCounter.sleepUntilNextRetry(); > } catch(InterruptedException ie) { > Thread.currentThread().interrupt(); > } > ... 
> } > {code} > We need to break out of the while loop when getting InterruptedException, not > just mark current thread as interrupted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13150) TestMasterObserver failing disable table at end of test
[ https://issues.apache.org/jira/browse/HBASE-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349319#comment-14349319 ] Jimmy Xiang commented on HBASE-13150: - Good analysis. I think we are good to remove that part as Andrey did in HBASE-13076. We should not change (write) a table state when assigning any region. We only need to check (read) the state instead. > TestMasterObserver failing disable table at end of test > --- > > Key: HBASE-13150 > URL: https://issues.apache.org/jira/browse/HBASE-13150 > Project: HBase > Issue Type: Bug > Components: test >Reporter: stack >Assignee: stack > > I see in > https://builds.apache.org/view/H-L/view/HBase/job/HBase-TRUNK/6202/testReport/junit/org.apache.hadoop.hbase.coprocessor/TestMasterObserver/testRegionTransitionOperations/ > , now we have added in timeouts, that we are failing to disable a table. It > looks like table is disabled but regions are being opened on the disabled > table still, like HBASE-6537 > Let me see if can figure why this happening. Will be back. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
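The comment above argues the assignment path should only read the table state, never write it. A minimal sketch of such a read-only guard; the enum and method are illustrative, not HBase's actual TableStateManager API.

```java
// Read-only check advocated above: refuse to assign when the table is
// not ENABLED/ENABLING, instead of forcing the table state to ENABLED
// from the assignment path.
public class TableStateGuard {
    enum State { ENABLED, ENABLING, DISABLING, DISABLED }

    // Pure read: decide whether assignment may proceed for this table.
    static boolean mayAssign(State tableState) {
        return tableState == State.ENABLED || tableState == State.ENABLING;
    }
}
```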
[jira] [Commented] (HBASE-13076) Table can be forcibly enabled in AssignmentManager during table disabling.
[ https://issues.apache.org/jira/browse/HBASE-13076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349313#comment-14349313 ] Jimmy Xiang commented on HBASE-13076: - I remember that the code removed in this patch was initially introduced in HBASE-5155 (and later changed a little in HBASE-6229). At that time, we had some problem telling if a table was enabled when it was already enabled. For this issue, I think we can remove the code, or fail the assignment if the table is not enabled/enabling. I prefer to remove the code since the table state is checked later anyway (and the change is simpler/safer). (Note: If we fail the assignment now, it is good, but we need to update the state accordingly. That's some enhancement. If this doesn't happen a lot, we may not need the enhancement.) > Table can be forcibly enabled in AssignmentManager during table disabling. > -- > > Key: HBASE-13076 > URL: https://issues.apache.org/jira/browse/HBASE-13076 > Project: HBase > Issue Type: Bug > Components: master, Region Assignment >Affects Versions: 2.0.0 >Reporter: Andrey Stepachev >Assignee: Andrey Stepachev > Attachments: 23757f039d83f4f17ca18815eae70b28.log, HBASE-13076.patch > > > Got situation where region can be opened while table is disabling by > DisableTableHandler. Here is relevant log for such situation. There is no > clues who issued OPEN to region. > Log file attached. > UPD: A bit more details. It seems that even in case of new state put into > meta, it still possible to get previous state. > That leads to one more round of assignment invoked in > AssignmentManager#onRegionClosed. > UPD: Table become ENABLED, thats leads to regions instructed to assign > immediately on onRegionClosed. BulkDisabler will not know about that and will > wait indefinitely, because it will not issue unassign for newly opened > regions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13076) Table can be forcibly enabled in AssignmentManager during table disabling.
[ https://issues.apache.org/jira/browse/HBASE-13076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333533#comment-14333533 ] Jimmy Xiang commented on HBASE-13076: - Looks like table state is out of sync with region states. This patch probably doesn't fix the problem. However, if table state is persisted in meta (including DISABLING and DISABLED tables?), it is good for this patch to remove the dead code. > Table can be forcibly enabled in AssignmentManager during table disabling. > -- > > Key: HBASE-13076 > URL: https://issues.apache.org/jira/browse/HBASE-13076 > Project: HBase > Issue Type: Bug > Components: master, Region Assignment >Affects Versions: 2.0.0 >Reporter: Andrey Stepachev >Assignee: Andrey Stepachev > Attachments: 23757f039d83f4f17ca18815eae70b28.log, HBASE-13076.patch > > > Got situation where region can be opened while table is disabling by > DisableTableHandler. Here is relevant log for such situation. There is no > clues who issued OPEN to region. > Log file attached. > UPD: A bit more details. It seems that even in case of new state put into > meta, it still possible to get previous state. > That leads to one more round of assignment invoked in > AssignmentManager#onRegionClosed. > UPD: Table become ENABLED, thats leads to regions instructed to assign > immediately on onRegionClosed. BulkDisabler will not know about that and will > wait indefinitely, because it will not issue unassign for newly opened > regions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12958) SSH doing hbase:meta get but hbase:meta not assigned
[ https://issues.apache.org/jira/browse/HBASE-12958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306180#comment-14306180 ] Jimmy Xiang commented on HBASE-12958: - +1, good fix. Just one nit, the change to MetaTableAccessor.java, the null check should be at the beginning of the method. > SSH doing hbase:meta get but hbase:meta not assigned > > > Key: HBASE-12958 > URL: https://issues.apache.org/jira/browse/HBASE-12958 > Project: HBase > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: stack >Assignee: stack > Fix For: 1.0.0, 2.0.0, 1.1.0, 0.98.11 > > Attachments: 12958.txt > > > All master threads are blocked waiting on this call to return: > {code} > "MASTER_SERVER_OPERATIONS-c2020:16020-2" #189 prio=5 os_prio=0 > tid=0x7f4b0408b000 nid=0x7821 in Object.wait() [0x7f4ada24d000] >java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:168) > - locked <0x00041c374f50> (a > java.util.concurrent.atomic.AtomicBoolean) > at org.apache.hadoop.hbase.client.HTable.get(HTable.java:881) > at > org.apache.hadoop.hbase.MetaTableAccessor.get(MetaTableAccessor.java:208) > at > org.apache.hadoop.hbase.MetaTableAccessor.getRegionLocation(MetaTableAccessor.java:250) > at > org.apache.hadoop.hbase.MetaTableAccessor.getRegion(MetaTableAccessor.java:225) > at > org.apache.hadoop.hbase.master.RegionStates.serverOffline(RegionStates.java:634) > - locked <0x00041c1f0d80> (a > org.apache.hadoop.hbase.master.RegionStates) > at > org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:3298) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:226) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Master is stuck trying to find hbase:meta on the server that just crashed and > that we just recovered: > Mon Feb 02 23:00:02 PST 2015, null, java.net.SocketTimeoutException: > callTimeout=6, callDuration=68181: row '' on table 'hbase:meta' at > region=hbase:meta,,1.1588230740, > hostname=c2022.halxg.cloudera.com,16020,1422944918568, seqNum=0 > Will add more detail in a sec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
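The review nit above asks for the null check to sit at the beginning of the method. A guard clause up front keeps every later line free to assume a non-null argument. The names below are hypothetical, not the actual MetaTableAccessor signature.

```java
// Illustrative guard-clause pattern per the review nit: validate the
// argument first, before any code that could dereference it.
public class GuardClause {
    static String describeRegion(byte[] regionName) {
        if (regionName == null) {             // null check first, per the review comment
            return "unknown";
        }
        return "region:" + regionName.length; // safe: regionName is non-null here
    }
}
```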
[jira] [Commented] (HBASE-12034) If I kill single RS in branch-1, all regions end up on Master!
[ https://issues.apache.org/jira/browse/HBASE-12034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293813#comment-14293813 ] Jimmy Xiang commented on HBASE-12034: - Thanks a lot for pointing that out. I updated the release notes for HBASE-10923 a little about the "none" value. It is not reliable/feasible to use a space char. As to upper case, I saw similar usage in hbase-default.xml, for example, hbase.regionserver.regionSplitLimit, hbase.zookeeper.property.maxClientCnxns, etc. We have mixed usage. I am open to either way. One of the reasons that it is not documented in hbase-default.xml is that this is not turned on by default in branch 1, so probably users should not touch it. > If I kill single RS in branch-1, all regions end up on Master! > -- > > Key: HBASE-12034 > URL: https://issues.apache.org/jira/browse/HBASE-12034 > Project: HBase > Issue Type: Bug > Components: master >Reporter: stack >Assignee: Jimmy Xiang >Priority: Critical > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12034_1.patch, hbase-12034_2.patch > > > This is unexpected. M should not be carrying regions in branch-1. Right > [~jxiang]? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-10923) Control where to put meta region
[ https://issues.apache.org/jira/browse/HBASE-10923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-10923: Release Note: This patch introduced a new configuration "hbase.balancer.tablesOnMaster" to control what tables' regions should be put on the master by a load balancer. By default, we will put regions of table acl, namespace, and meta on master, i.e. the default configuration is the same as "hbase:acl,hbase:namespace,hbase:meta". To put no region on the master, you need to set "hbase.balancer.tablesOnMaster" to "none" instead of an empty string(the default will be used if it is empty). (was: This patch introduced a new configuration "hbase.balancer.tablesOnMaster" to control what tables' regions should be put on the master by a load balancer. By default, we will put regions of table acl, namespace, and meta on master, i.e. the default configuration is the same as "hbase:acl,hbase:namespace,hbase:meta". To put no region on the master, you need to set "hbase.balancer.tablesOnMaster" to " " instead of an empty string(the default will be used if it is empty).) > Control where to put meta region > > > Key: HBASE-10923 > URL: https://issues.apache.org/jira/browse/HBASE-10923 > Project: HBase > Issue Type: Improvement >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 0.99.0 > > Attachments: hbase-10923.patch > > > There is a concern on placing meta regions on the master, as in the comments > of HBASE-10569. I was thinking we should have a configuration for a load > balancer to decide where to put it. Adjusting this configuration we can > control whether to put the meta on master, or other region server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
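As a concrete illustration of the release note above, keeping no regions on the master would look roughly like this in hbase-site.xml. This is a sketch based solely on the note; verify the property name and semantics against your HBase version.

```xml
<!-- Keep no regions on the master. "none" is the sentinel value; an
     empty value falls back to the default
     "hbase:acl,hbase:namespace,hbase:meta". -->
<property>
  <name>hbase.balancer.tablesOnMaster</name>
  <value>none</value>
</property>
```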
[jira] [Commented] (HBASE-12880) RegionState in state SPLIT doesn't removed from region states
[ https://issues.apache.org/jira/browse/HBASE-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284462#comment-14284462 ] Jimmy Xiang commented on HBASE-12880: - [~octo47], these states in the map are not removed. Do you have too many regions in such states? If so, I think it is fine to remove them after 30 minutes to a couple of hours. > RegionState in state SPLIT doesn't removed from region states > - > > Key: HBASE-12880 > URL: https://issues.apache.org/jira/browse/HBASE-12880 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 2.0.0, 1.1.0 >Reporter: Andrey Stepachev >Assignee: Andrey Stepachev > Attachments: HBASE-12880.patch, master-with-split-regions-2-1.jpg > > > During my work on patch HBASE-7332 I stumbled on strange behaviour in > RegionStates. Split region doesn't removed from regionStates in > regionOffline() method and RegionState for this region sits in regionStates > map indefinitely long (until RS rebooted). > (that is clearly seen in HBASE-7332 by simple creating table and splitting it > from command line). > Is that was intended to be so and some chore eventually will remove it from > regionStates (didn't find with fast code scanning) or here can be resource > leak? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
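The suggestion above is to evict stale SPLIT entries after a TTL of 30 minutes to a couple of hours. A hedged sketch of what such a cleanup pass could look like; the map and entry type are simplified stand-ins, not HBase's RegionStates.

```java
import java.util.Iterator;
import java.util.Map;

// Sketch of the suggested cleanup: drop SPLIT region-state entries once
// they are older than a TTL. Only terminal states are safe to forget.
public class StaleStateCleaner {
    static final class RegionState {
        final String state; // e.g. "SPLIT", "OPEN"
        final long ts;      // last transition timestamp, ms
        RegionState(String state, long ts) { this.state = state; this.ts = ts; }
    }

    // Remove SPLIT entries older than ttlMs; return how many were dropped.
    static int evictStaleSplitStates(Map<String, RegionState> states, long ttlMs, long now) {
        int removed = 0;
        Iterator<Map.Entry<String, RegionState>> it = states.entrySet().iterator();
        while (it.hasNext()) {
            RegionState rs = it.next().getValue();
            if ("SPLIT".equals(rs.state) && now - rs.ts > ttlMs) {
                it.remove(); // iterator removal avoids ConcurrentModificationException
                removed++;
            }
        }
        return removed;
    }
}
```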
[jira] [Commented] (HBASE-12640) Add Thrift-over-HTTPS and doAs support for Thrift Server
[ https://issues.apache.org/jira/browse/HBASE-12640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250928#comment-14250928 ] Jimmy Xiang commented on HBASE-12640: - Done. Thanks. > Add Thrift-over-HTTPS and doAs support for Thrift Server > > > Key: HBASE-12640 > URL: https://issues.apache.org/jira/browse/HBASE-12640 > Project: HBase > Issue Type: Improvement > Components: Thrift >Reporter: Srikanth Srungarapu >Assignee: Srikanth Srungarapu > Fix For: 1.0.0, 2.0.0 > > Attachments: HBASE-12640_addendum.patch, HBASE-12640_v1.patch, > HBASE-12640_v2.patch, HBASE-12640_v3.patch > > > In HBASE-11349, impersonation support has been added to Thrift Server. But > the limitation is thrift client must use same set of credentials throughout > the session. These changes will help us in circumventing this problem, by > allowing user to populate doAs parameter as per his needs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12640) Add Thrift-over-HTTPS and doAs support for Thrift Server
[ https://issues.apache.org/jira/browse/HBASE-12640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12640: Resolution: Fixed Fix Version/s: 2.0.0 1.0.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Thanks Srikanth for the patch. Integrated into branch 1 and master. > Add Thrift-over-HTTPS and doAs support for Thrift Server > > > Key: HBASE-12640 > URL: https://issues.apache.org/jira/browse/HBASE-12640 > Project: HBase > Issue Type: Improvement > Components: Thrift >Reporter: Srikanth Srungarapu >Assignee: Srikanth Srungarapu > Fix For: 1.0.0, 2.0.0 > > Attachments: HBASE-12640_v1.patch, HBASE-12640_v2.patch, > HBASE-12640_v3.patch > > > In HBASE-11349, impersonation support has been added to Thrift Server. But > the limitation is thrift client must use same set of credentials throughout > the session. These changes will help us in circumventing this problem, by > allowing user to populate doAs parameter as per his needs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12704) Add demo client which uses doAs functionality on Thrift-over-HTTPS.
[ https://issues.apache.org/jira/browse/HBASE-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12704: Resolution: Fixed Fix Version/s: 2.0.0 1.0.0 Status: Resolved (was: Patch Available) Thanks Srikanth for the patch. Integrated into branch 1 and master. > Add demo client which uses doAs functionality on Thrift-over-HTTPS. > --- > > Key: HBASE-12704 > URL: https://issues.apache.org/jira/browse/HBASE-12704 > Project: HBase > Issue Type: Sub-task > Components: Thrift >Reporter: Srikanth Srungarapu >Assignee: Srikanth Srungarapu >Priority: Minor > Fix For: 1.0.0, 2.0.0 > > Attachments: HBASE-12704.patch > > > As per the description. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12704) Add demo client which uses doAs functionality on Thrift-over-HTTPS.
[ https://issues.apache.org/jira/browse/HBASE-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250603#comment-14250603 ] Jimmy Xiang commented on HBASE-12704: - +1 > Add demo client which uses doAs functionality on Thrift-over-HTTPS. > --- > > Key: HBASE-12704 > URL: https://issues.apache.org/jira/browse/HBASE-12704 > Project: HBase > Issue Type: Sub-task > Components: Thrift >Reporter: Srikanth Srungarapu >Assignee: Srikanth Srungarapu >Priority: Minor > Attachments: HBASE-12704.patch > > > As per the description. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12640) Add Thrift-over-HTTPS and doAs support for Thrift Server
[ https://issues.apache.org/jira/browse/HBASE-12640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250584#comment-14250584 ] Jimmy Xiang commented on HBASE-12640: - +1. Looks good to me. Just one nit: authFilter is not used/needed. > Add Thrift-over-HTTPS and doAs support for Thrift Server > > > Key: HBASE-12640 > URL: https://issues.apache.org/jira/browse/HBASE-12640 > Project: HBase > Issue Type: Improvement > Components: Thrift >Reporter: Srikanth Srungarapu >Assignee: Srikanth Srungarapu > Attachments: HBASE-12640_v1.patch, HBASE-12640_v2.patch > > > In HBASE-11349, impersonation support has been added to Thrift Server. But > the limitation is thrift client must use same set of credentials throughout > the session. These changes will help us in circumventing this problem, by > allowing user to populate doAs parameter as per his needs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12572) Meta flush hangs
[ https://issues.apache.org/jira/browse/HBASE-12572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224872#comment-14224872 ] Jimmy Xiang commented on HBASE-12572: - Probably you won't be able to find this commit. It's my local commit to revert surefire to 2.17 (just a simple one-line pom.xml change). The parent sha is b1f7d7cd32d4c1ea1b9207472dfab6ca257aa800 (HBASE-12448). > Meta flush hangs > > > Key: HBASE-12572 > URL: https://issues.apache.org/jira/browse/HBASE-12572 > Project: HBase > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Jimmy Xiang > Attachments: master.jstack, meta-flushing.png > > > Not sure if this is still an issue with the latest branch 1 code. I ran into > this with branch 1 commit: 0.99.2-SNAPSHOT, > revision=290749fc56d07461441bd532f62d70f562eee588. > Jstack shows lots of scanners blocked at close. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12572) Meta flush hangs
[ https://issues.apache.org/jira/browse/HBASE-12572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12572: Attachment: meta-flushing.png > Meta flush hangs > > > Key: HBASE-12572 > URL: https://issues.apache.org/jira/browse/HBASE-12572 > Project: HBase > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Jimmy Xiang > Attachments: master.jstack, meta-flushing.png > > > Not sure if this is still an issue with the latest branch 1 code. I ran into > this with branch 1 commit: 0.99.2-SNAPSHOT, > revision=290749fc56d07461441bd532f62d70f562eee588. > Jstack shows lots of scanners blocked at close. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12572) Meta flush hangs
[ https://issues.apache.org/jira/browse/HBASE-12572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12572: Attachment: master.jstack > Meta flush hangs > > > Key: HBASE-12572 > URL: https://issues.apache.org/jira/browse/HBASE-12572 > Project: HBase > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Jimmy Xiang > Attachments: master.jstack > > > Not sure if this is still an issue with the latest branch 1 code. I ran into > this with branch 1 commit: 0.99.2-SNAPSHOT, > revision=290749fc56d07461441bd532f62d70f562eee588. > Jstack shows lots of scanners blocked at close. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12572) Meta flush hangs
Jimmy Xiang created HBASE-12572: --- Summary: Meta flush hangs Key: HBASE-12572 URL: https://issues.apache.org/jira/browse/HBASE-12572 Project: HBase Issue Type: Bug Affects Versions: 1.0.0 Reporter: Jimmy Xiang Not sure if this is still an issue with the latest branch 1 code. I ran into this with branch 1 commit: 0.99.2-SNAPSHOT, revision=290749fc56d07461441bd532f62d70f562eee588. Jstack shows lots of scanners blocked at close. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12555) Region mover should not try to move regions to master
Jimmy Xiang created HBASE-12555: --- Summary: Region mover should not try to move regions to master Key: HBASE-12555 URL: https://issues.apache.org/jira/browse/HBASE-12555 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang If meta and master are co-located, the master is a region server. The region mover script may try to move regions to the master, which will fail since the load balancer doesn't allow that. The script should be fixed not to move regions to the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12464) meta table region assignment stuck in the FAILED_OPEN state due to region server not fully ready to serve
[ https://issues.apache.org/jira/browse/HBASE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219591#comment-14219591 ] Jimmy Xiang commented on HBASE-12464: - The patch for 2.0 looks good to me. Thanks. > meta table region assignment stuck in the FAILED_OPEN state due to region > server not fully ready to serve > - > > Key: HBASE-12464 > URL: https://issues.apache.org/jira/browse/HBASE-12464 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 1.0.0, 2.0.0, 0.99.1 >Reporter: Stephen Yuan Jiang >Assignee: Stephen Yuan Jiang > Fix For: 2.0.0 > > Attachments: HBASE-12464.v1-1.0.patch, HBASE-12464.v1-2.0.patch, > HBASE-12464.v2-2.0.patch > > Original Estimate: 24h > Time Spent: 7.4h > Remaining Estimate: 1h > > meta table region assignment could reach the 'FAILED_OPEN' state, which > makes the region not available unless the target region server shuts down or > manual resolution occurs. This is an undesirable state for the meta table region. > Here is the sequence of how this could happen (the code is in > AssignmentManager#assign()): > Step 1: Master detects that a region server (RS1) that hosts one meta table region > is down; it changes the meta region state from 'online' to 'offline' > Step 2: In a loop (with a configurable maximumAttempts count, default 10, > minimum 1), AssignmentManager tries to find a RS to host the meta table > region. If there is no RS available, it would loop forever by resetting the > loop count (BUG#1 from this logic - a small bug) > {code} >if (region.isMetaRegion()) { > try { > Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment); > if (i == maximumAttempts) i = 1; // ==> BUG: if > maximumAttempts is 1, then the loop will end. > continue; > } catch (InterruptedException e) { > ... >} > {code} > Step 3: Once a new RS is found (RS2), inside the same loop as Step 2, > AssignmentManager tries to assign the meta region to RS2 (OFFLINE, RS1 => > PENDING_OPEN, RS2). 
If for some reason opening the region in RS2 failed > (e.g. the target RS2 is not ready to serve - ServerNotRunningYetException), > AssignmentManager would change the state from (PENDING_OPEN, RS2) to > (FAILED_OPEN, RS2). Then it would retry (and even change the RS server to go > to). The retry is up to maximumAttempts. Once maximumAttempts is > reached, the meta region will be in the 'FAILED_OPEN' state, unless either > (1) RS2 shuts down to trigger region assignment again or (2) it is > reassigned by an operator via HBase Shell. > Based on the document ( http://hbase.apache.org/book/regions.arch.html ), > this is by design - "17. For regions in FAILED_OPEN or FAILED_CLOSE states, > the master tries to close them again when they are reassigned by an operator > via HBase Shell.". > However, this is bad design, especially for the meta table region (it is arguable > that the design is good for regular tables - for this ticket, I am more focused > on fixing the meta region availability issue). > I propose 2 possible fixes: > Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta > table region, reset the loop count so that it would not leave the loop with the > meta table region in the FAILED_OPEN state. > Fix#2 (more involved): if a region is in the FAILED_OPEN state, we should provide > a way to automatically trigger AssignmentManager::assign() after a short > period of time (leaving any region in the FAILED_OPEN state or other states like > 'FAILED_CLOSE' is undesirable; there should be some way to retry and auto-heal > the region). > I think at least for 1.0.0, Fix#1 is good enough. We can open a task-type > JIRA for Fix#2 in a future release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
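The loop-count reset described in BUG#1 can be modeled outside HBase. The following is a hypothetical simulation (not the actual AssignmentManager code): when maximumAttempts is 1, the reset fires on the very first pass, the for-update then bumps i to 2, and the loop exits after a single attempt instead of retrying forever.

```java
// Hypothetical simulation of the meta-assignment retry loop (not HBase's
// actual AssignmentManager code). Counts how many attempts the loop makes
// before giving up; a cap keeps the intended "retry forever" case testable.
class MetaAssignRetrySketch {
    static int attempts(int maximumAttempts, int cap) {
        int total = 0;
        for (int i = 1; i <= maximumAttempts && total < cap; i++) {
            total++;
            // The reset from the description: meant to make the meta region
            // retry forever, but when maximumAttempts == 1 the for-update
            // bumps i to 2 right after the reset and the loop terminates.
            if (i == maximumAttempts) i = 1;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(attempts(1, 10)); // gives up after one attempt (the bug)
        System.out.println(attempts(3, 10)); // hits the cap: retries "forever"
    }
}
```

With any maximumAttempts greater than 1 the reset keeps i cycling below the bound, which is the intended infinite retry; only the minimum setting of 1 falls through.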
[jira] [Commented] (HBASE-12479) Backport HBASE-11689 (Track meta in transition) to 0.98 and branch-1
[ https://issues.apache.org/jira/browse/HBASE-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217219#comment-14217219 ] Jimmy Xiang commented on HBASE-12479: - My bad. I missed that part. > Backport HBASE-11689 (Track meta in transition) to 0.98 and branch-1 > > > Key: HBASE-12479 > URL: https://issues.apache.org/jira/browse/HBASE-12479 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 0.98.9, 0.99.2 > > Attachments: HBASE-12479-0.98.patch > > > Required for zk-less assignment -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12480) Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover
[ https://issues.apache.org/jira/browse/HBASE-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217212#comment-14217212 ] Jimmy Xiang commented on HBASE-12480: - You meant testOpenFailed? Yes, it may be null. I see. It's better to handle it. Thanks. > Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover > --- > > Key: HBASE-12480 > URL: https://issues.apache.org/jira/browse/HBASE-12480 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 2.0.0, 0.98.9, 0.99.2 > > Attachments: HBASE-12480.patch > > > For zk assignment, we used to process these regions. For zk-less assignment, > we should do the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12479) Backport HBASE-11689 (Track meta in transition) to 0.98 and branch-1
[ https://issues.apache.org/jira/browse/HBASE-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217187#comment-14217187 ] Jimmy Xiang commented on HBASE-12479: - It is used for ZK-less region assignment and non-colocated meta and master. When we committed HBASE-11689, we didn't do a patch for 1.0/0.98 because it's hard to get the master right for these branches for both ZK-based and ZK-less region assignments. The attached patch didn't touch HMaster at all. It seems the patch is not complete. > Backport HBASE-11689 (Track meta in transition) to 0.98 and branch-1 > > > Key: HBASE-12479 > URL: https://issues.apache.org/jira/browse/HBASE-12479 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 0.98.9, 0.99.2 > > Attachments: HBASE-12479-0.98.patch > > > Required for zk-less assignment -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12464) meta table region assignment stuck in the FAILED_OPEN state due to region server not fully ready to serve
[ https://issues.apache.org/jira/browse/HBASE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216603#comment-14216603 ] Jimmy Xiang commented on HBASE-12464: - It's not good for meta to be stuck in FAILED_OPEN. Agree we should handle it differently. The patch looks good. Just a couple of things: 1. Can we add a log (info/debug level may be fine) when we reset the retry count to 0? 2. We also need to prevent the meta region from going to FAILED_OPEN in the method AssignmentManager#onRegionFailedOpen. How about FAILED_CLOSE? It should be fine since the meta region is still available? bq. Fix#2 (more involved): if a region is in the FAILED_OPEN state, we should provide a way to automatically trigger AssignmentManager::assign() after a short period of time (leaving any region in the FAILED_OPEN state or other states like 'FAILED_CLOSE' is undesirable; there should be some way to retry and auto-heal the region). Is this essentially the same as setting maximumAttempts to a huge number? In many cases, a region may not be able to heal automatically without a pill. Personally, I think a good monitoring system would be better in this case. > meta table region assignment stuck in the FAILED_OPEN state due to region > server not fully ready to serve > - > > Key: HBASE-12464 > URL: https://issues.apache.org/jira/browse/HBASE-12464 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 1.0.0, 2.0.0, 0.99.1 >Reporter: Stephen Yuan Jiang >Assignee: Stephen Yuan Jiang > Fix For: 1.0.0, 2.0.0, 0.99.2 > > Attachments: HBASE-12464.v1-2.0.patch > > Original Estimate: 3h > Remaining Estimate: 3h > > meta table region assignment could reach the 'FAILED_OPEN' state, which > makes the region not available unless the target region server shuts down or > manual resolution occurs. This is an undesirable state for the meta table region. 
> Here is the sequence of how this could happen (the code is in > AssignmentManager::assign()): > Step 1: Master detects that a region server (RS1) that hosts one meta table region > is down; it changes the meta region state from 'online' to 'offline' > Step 2: In a loop (with a configurable maximumAttempts count, default 10, > minimum 1), AssignmentManager tries to find a RS to host the meta table > region. If there is no RS available, it would loop forever by resetting the > loop count (BUG#1 from this logic - a small bug) > {code} >if (region.isMetaRegion()) { > try { > Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment); > if (i == maximumAttempts) i = 1; // ==> BUG: if > maximumAttempts is 1, then the loop will end. > continue; > } catch (InterruptedException e) { > ... >} > {code} > Step 3: Once a new RS is found (RS2), inside the same loop as Step 2, > AssignmentManager tries to assign the meta region to RS2 (OFFLINE, RS1 => > PENDING_OPEN, RS2). If for some reason opening the region in RS2 failed > (e.g. the target RS2 is not ready to serve - ServerNotRunningYetException), > AssignmentManager would change the state from (PENDING_OPEN, RS2) to > (FAILED_OPEN, RS2). Then it would retry (and even change the RS server to go > to). The retry is up to maximumAttempts. Once maximumAttempts is > reached, the meta region will be in the 'FAILED_OPEN' state, unless either > (1) RS2 shuts down to trigger region assignment again or (2) it is > reassigned by an operator via HBase Shell. > Based on the document ( http://hbase.apache.org/book/regions.arch.html ), > this is by design - "17. For regions in FAILED_OPEN or FAILED_CLOSE states, > the master tries to close them again when they are reassigned by an operator > via HBase Shell.". > However, this is bad design, especially for the meta table region (it is arguable > that the design is good for regular tables - for this ticket, I am more focused > on fixing the meta region availability issue). 
> I propose 2 possible fixes: > Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta > table region, reset the loop count so that it would not leave the loop with the > meta table region in the FAILED_OPEN state. > Fix#2 (more involved): if a region is in the FAILED_OPEN state, we should provide > a way to automatically trigger AssignmentManager::assign() after a short > period of time (leaving any region in the FAILED_OPEN state or other states like > 'FAILED_CLOSE' is undesirable; there should be some way to retry and auto-heal > the region). > I think at least for 1.0.0, Fix#1 is good enough. We can open a task-type > JIRA for Fix#2 in a future release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12480) Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover
[ https://issues.apache.org/jira/browse/HBASE-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215430#comment-14215430 ] Jimmy Xiang commented on HBASE-12480: - bq. Passing null to ConcurrentMap.keySet().contains() will throw NPE Right. OK. bq. Hmm, will make the change. My initial thinking was that we need to make blocking calls but that doesn't seem to matter. May not use invokeUnAssign directly. It's better to do it asynchronously. > Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover > --- > > Key: HBASE-12480 > URL: https://issues.apache.org/jira/browse/HBASE-12480 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 2.0.0, 0.98.9, 0.99.2 > > Attachments: HBASE-12480.patch > > > For zk assignment, we used to process these regions. For zk-less assignment, > we should do the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
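The NPE acknowledged above is a property of ConcurrentHashMap itself, which forbids null keys, so even a lookup with null throws rather than returning false. A minimal sketch (hypothetical names; the assumption that the online-servers collection is backed by a ConcurrentHashMap matches the discussion, not verified code):

```java
// Shows why a null-key lookup needs guarding when the backing map is a
// ConcurrentHashMap: unlike HashMap, it does not permit null keys, so even
// keySet().contains(null) throws NullPointerException.
import java.util.concurrent.ConcurrentHashMap;

class NullKeyLookup {
    static boolean nullLookupThrows() {
        ConcurrentHashMap<String, Integer> onlineServers = new ConcurrentHashMap<>();
        onlineServers.put("rs1,60020,1414546531945", 1); // hypothetical server name
        try {
            onlineServers.keySet().contains(null);
            return false; // a HashMap-backed keySet would reach here and return false
        } catch (NullPointerException expected) {
            return true;  // ConcurrentHashMap rejects the null key outright
        }
    }
}
```

This is why the explicit `serverName != null &&` guard in the patch is needed even though "onlineServers should not contain it".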
[jira] [Commented] (HBASE-12480) Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover
[ https://issues.apache.org/jira/browse/HBASE-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213334#comment-14213334 ] Jimmy Xiang commented on HBASE-12480: - I see. {noformat} +&& serverName != null && onlineServers.contains(serverName)) { {noformat} No need for this change. If serverName is null, onlineServers should not contain it. bq. isServerOnline(ServerName) will return false when serverName is null (It will be null in case 2 above) In the master branch, the server should never be null if it is in these states. {noformat} + case FAILED_CLOSE: + case FAILED_OPEN: +unassign(regionInfo, regionState.getServerName(), null); ... {noformat} Should use invokeUnAssign. > Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover > --- > > Key: HBASE-12480 > URL: https://issues.apache.org/jira/browse/HBASE-12480 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 2.0.0, 0.98.9, 0.99.2 > > Attachments: HBASE-12480.patch > > > For zk assignment, we used to process these regions. For zk-less assignment, > we should do the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12480) Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover
[ https://issues.apache.org/jira/browse/HBASE-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213294#comment-14213294 ] Jimmy Xiang commented on HBASE-12480: - Before regions get into such states, we have tried many times already. If the region server dies, SSH will retry in case things have changed. If the region server stays up, there may be no need to retry at all. If an admin fixes the problem causing the failed open/close, they can re-assign the region from the shell. What do you think? BTW, I think there's no need to change serverManager.isServerOnline(regionState.getServerName()); it should do exactly what you want. > Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover > --- > > Key: HBASE-12480 > URL: https://issues.apache.org/jira/browse/HBASE-12480 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 2.0.0, 0.98.9, 0.99.2 > > Attachments: HBASE-12480.patch > > > For zk assignment, we used to process these regions. For zk-less assignment, > we should do the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-12480) Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover
[ https://issues.apache.org/jira/browse/HBASE-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213291#comment-14213291 ] Jimmy Xiang edited comment on HBASE-12480 at 11/15/14 2:30 AM: --- Are you sure this is an issue for 2.0.0? I remember I tried to file a similar jira before and didn't because it is not an issue after I looked into it (for master branch, not other branches). was (Author: jxiang): Are you sure this is an issue? I remember I tried to file a similar jira before and didn't because it is not an issue after I looked into it (for master branch, not other branches). > Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover > --- > > Key: HBASE-12480 > URL: https://issues.apache.org/jira/browse/HBASE-12480 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 2.0.0, 0.98.9, 0.99.2 > > Attachments: HBASE-12480.patch > > > For zk assignment, we used to process these regions. For zk-less assignment, > we should do the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12480) Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover
[ https://issues.apache.org/jira/browse/HBASE-12480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213291#comment-14213291 ] Jimmy Xiang commented on HBASE-12480: - Are you sure this is an issue? I remember I tried to file a similar jira before and didn't because it is not an issue after I looked into it (for master branch, not other branches). > Regions in FAILED_OPEN/FAILED_CLOSE should be processed on master failover > --- > > Key: HBASE-12480 > URL: https://issues.apache.org/jira/browse/HBASE-12480 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Reporter: Virag Kothari >Assignee: Virag Kothari > Fix For: 2.0.0, 0.98.9, 0.99.2 > > Attachments: HBASE-12480.patch > > > For zk assignment, we used to process these regions. For zk-less assignment, > we should do the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12453) Make region available once it's open
[ https://issues.apache.org/jira/browse/HBASE-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206644#comment-14206644 ] Jimmy Xiang commented on HBASE-12453: - We used to update the znode, then update meta. In this case, it is something like "update znode" vs "notify master". This should not be an issue. The issue is that if the master is down, we can't notify the master now. That's what I was thinking about. > Make region available once it's open > > > Key: HBASE-12453 > URL: https://issues.apache.org/jira/browse/HBASE-12453 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > > Currently (in trunk, with zk-less assignment), a region is available for > serving requests only after the RS notifies the master that the region is open, and > the meta is updated with the new location. We may be able to do better than > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12453) Make region available once it's open
[ https://issues.apache.org/jira/browse/HBASE-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206621#comment-14206621 ] Jimmy Xiang commented on HBASE-12453: - I looked into it and found it would introduce quite a few race conditions. > Make region available once it's open > > > Key: HBASE-12453 > URL: https://issues.apache.org/jira/browse/HBASE-12453 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > > Currently (in trunk, with zk-less assignment), a region is available for > serving requests only after the RS notifies the master that the region is open, and > the meta is updated with the new location. We may be able to do better than > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-12453) Make region available once it's open
[ https://issues.apache.org/jira/browse/HBASE-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang resolved HBASE-12453. - Resolution: Invalid > Make region available once it's open > > > Key: HBASE-12453 > URL: https://issues.apache.org/jira/browse/HBASE-12453 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > > Currently (in trunk, with zk-less assignment), a region is available for > serving requests only after the RS notifies the master that the region is open, and > the meta is updated with the new location. We may be able to do better than > this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12453) Make region available once it's open
Jimmy Xiang created HBASE-12453: --- Summary: Make region available once it's open Key: HBASE-12453 URL: https://issues.apache.org/jira/browse/HBASE-12453 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang Currently (in trunk, with zk-less assignment), a region is available for serving requests only after the RS notifies the master that the region is open, and the meta is updated with the new location. We may be able to do better than this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12398) Region isn't assigned in an extreme race condition
[ https://issues.apache.org/jira/browse/HBASE-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193657#comment-14193657 ] Jimmy Xiang commented on HBASE-12398: - The master branch should not have such a problem because only master updates the region states (step b won't happen). So I think we don't need a patch for master. > Region isn't assigned in an extreme race condition > -- > > Key: HBASE-12398 > URL: https://issues.apache.org/jira/browse/HBASE-12398 > Project: HBase > Issue Type: Bug > Components: Region Assignment >Affects Versions: 0.98.7 >Reporter: Jeffrey Zhong >Assignee: Jeffrey Zhong > Attachments: HBASE-12398.patch > > > In a test, [~enis] has seen a condition which made one of the regions > unassigned. > The client failed since the region is not online anywhere: > {code} > 2014-10-29 01:51:40,731 WARN [HBaseReaderThread_13] > util.MultiThreadedReader: > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after > attempts=35, exceptions: > Wed Oct 29 01:39:51 UTC 2014, > org.apache.hadoop.hbase.client.RpcRetryingCaller@cc21330, > org.apache.hadoop.hbase.NotServingRegionException: > org.apache.hadoop.hbase.NotServingRegionException: Region > IntegrationTestRegionReplicaReplication,0666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9. 
> is not online on hor8n08.gq1.ygridcore.net,60020,1414546670414 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2774) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4257) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2906) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29990) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:722) > {code} > The root cause of the issue is an extreme race condition: > a) a region is about to open and receives a closeRpc request triggered by a > second re-assignment > b) the second re-assignment updates the region state to offline, which is immediately > overwritten to OPEN by the previous region-open ZK notification > c) when the region is reopened on the same RS by the second assignment, AM forces > the region to close as its region state isn't in the PendingOpenOrOpening > state. > d) the region ends up offline & can't serve any requests > Region Server Side: > 1) Region 689b77e1bad7e951b0d9ef4663b217e9 is almost open when the > RS (hor8n10) receives a closeRegion request. > {noformat} > 2014-10-29 01:39:43,153 INFO > [PriorityRpcServer.handler=2,queue=0,port=60020] regionserver.HRegionServer: > Received CLOSE for the region:689b77e1bad7e951b0d9ef4663b217e9 , which we are > already trying to OPEN. Cancelling OPENING. 
> {noformat} > 2) Since region 689b77e1bad7e951b0d9ef4663b217e9 was already opened right > before some final steps, the RS logs the following message and closes > 689b77e1bad7e951b0d9ef4663b217e9 immediately after the RS updates the ZK node > state to 'OPENED'. > {noformat} > 2014-10-29 01:39:43,198 ERROR [RS_OPEN_REGION-hor8n10:60020-0] > handler.OpenRegionHandler: Race condition: we've finished to open a region, > while a close was requested on > region=IntegrationTestRegionReplicaReplication,0666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.. > It can be a critical error, as a region that should be closed is now opened. > Closing it now > {noformat} > In Master Server Side: > {noformat} > 2014-10-29 01:39:43,177 DEBUG [AM.ZK.Worker-pool2-t55] > master.AssignmentManager: Handling RS_ZK_REGION_OPENED, > server=hor8n10.gq1.ygridcore.net,60020,1414546531945, > region=689b77e1bad7e951b0d9ef4663b217e9, > current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPENING, > ts=1414546783152, server=hor8n10.gq1.ygridcore.net,60020,1414546531945} > > 2014-10-29 01:39:43,255 DEBUG [AM.-pool1-t16] master.AssignmentManager: > Offline > IntegrationTestRegionReplicaReplication,0666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9., > it's not any more on hor8n10.gq1.ygridcore.net,60020,1414546531945 > > 2014-10-29 01:39:43,942 DEBUG [AM.ZK.Worker-pool2-t58] > master.AssignmentManager: Handling RS_ZK_REGION_OPENED, > server=hor8n10.gq1.ygridcore.net,60020,1414546531945, > region=689b77e1bad7e951b0d9ef4663b217e9, > current_state={689b77e1bad7e951b0d9ef4663b
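The overwrite in step b) comes down to applying a stale notification without checking the state it was generated from. A toy model (hypothetical; not HBase's RegionStates class) contrasting the unguarded last-write-wins update with a guarded compare-and-set transition:

```java
// Toy model of the step b) race (not HBase's actual RegionStates): a stale
// OPENED notification overwrites the OFFLINE state set by the second
// re-assignment unless the transition validates the expected prior state.
import java.util.concurrent.atomic.AtomicReference;

class RegionStateSketch {
    enum RState { OFFLINE, OPENING, OPEN }

    final AtomicReference<RState> state = new AtomicReference<>(RState.OPENING);

    // Last write wins: a stale ZK OPENED event clobbers OFFLINE.
    void applyUnguarded(RState next) {
        state.set(next);
    }

    // Guarded: apply the event only if the region is still in the state the
    // event was generated from; a stale event is simply rejected.
    boolean applyGuarded(RState expected, RState next) {
        return state.compareAndSet(expected, next);
    }
}
```

With the guarded transition, the second re-assignment's OFFLINE survives the late OPENED event, which is the kind of state validation the comment above says only the master branch (where only the master updates region states) gets for free.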
[jira] [Updated] (HBASE-12380) TestRegionServerNoMaster#testMultipleOpen is flaky after HBASE-11760
[ https://issues.apache.org/jira/browse/HBASE-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12380: Resolution: Fixed Fix Version/s: 2.0.0 Assignee: Esteban Gutierrez Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Integrated into branch master. Thanks Esteban for the patch. > TestRegionServerNoMaster#testMultipleOpen is flaky after HBASE-11760 > > > Key: HBASE-12380 > URL: https://issues.apache.org/jira/browse/HBASE-12380 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 2.0.0 >Reporter: Esteban Gutierrez >Assignee: Esteban Gutierrez > Fix For: 2.0.0 > > Attachments: HBASE-12380.v0.patch > > > Noticed this while trying to fix faulty test while working on a fix for > HBASE-12219: > {code} > Tests in error: > TestRegionServerNoMaster.testMultipleOpen:237 » Service > java.io.IOException: R... > TestRegionServerNoMaster.testCloseByRegionServer:211->closeRegionNoZK:201 » > Service > {code} > Initially I thought the problem was on my patch for HBASE-12219 but I noticed > that the issue was occurring on the 7th attempt to open the region. However I > was able to reproduce the same problem in the master branch after increasing > the number of requests in testMultipleOpen(): > {code} > 2014-10-29 15:03:45,043 INFO [Thread-216] regionserver.RSRpcServices(1334): > Receiving OPEN for the > region:TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca., > which we are already trying to OPEN - ignoring this new request for this > region. > Submitting openRegion attempt: 16 < > 2014-10-29 15:03:45,044 INFO [Thread-216] regionserver.RSRpcServices(1311): > Open TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca. > 2014-10-29 15:03:45,044 INFO > [PostOpenDeployTasks:025198143197ea68803e49819eae27ca] > hbase.MetaTableAccessor(1307): Updated row > TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca. 
> with server=192.168.1.105,63082,1414620220789 > Submitting openRegion attempt: 17 < > 2014-10-29 15:03:45,046 ERROR [RS_OPEN_REGION-192.168.1.105:63082-2] > handler.OpenRegionHandler(88): Region 025198143197ea68803e49819eae27ca was > already online when we started processing the opening. Marking this new > attempt as failed > 2014-10-29 15:03:45,047 FATAL [Thread-216] regionserver.HRegionServer(1931): > ABORTING region server 192.168.1.105,63082,1414620220789: Received OPEN for > the > region:TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca., > which is already online > 2014-10-29 15:03:45,047 FATAL [Thread-216] regionserver.HRegionServer(1937): > RegionServer abort: loaded coprocessors are: > [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] > 2014-10-29 15:03:45,054 WARN [Thread-216] regionserver.HRegionServer(1955): > Unable to report fatal error to master > com.google.protobuf.ServiceException: java.io.IOException: Call to > /192.168.1.105:63079 failed on local exception: java.io.IOException: > Connection to /192.168.1.105:63079 is closing. 
Call id=4, waitTime=2 > at > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1707) > at > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1757) > at > org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.reportRSFatalError(RegionServerStatusProtos.java:8301) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1952) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abortRegionServer(MiniHBaseCluster.java:174) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$100(MiniHBaseCluster.java:108) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$2.run(MiniHBaseCluster.java:167) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:356) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528) > at > org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:277) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abort(MiniHBaseCluster.java:165) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1964) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1308) > at > org.apache.hadoop.hbase.regionserver.TestRegionServerNoMaster.testMultiple
[jira] [Commented] (HBASE-12380) TestRegionServerNoMaster#testMultipleOpen is flaky after HBASE-11760
[ https://issues.apache.org/jira/browse/HBASE-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190479#comment-14190479 ] Jimmy Xiang commented on HBASE-12380: - +1 > TestRegionServerNoMaster#testMultipleOpen is flaky after HBASE-11760 > > > Key: HBASE-12380 > URL: https://issues.apache.org/jira/browse/HBASE-12380 > Project: HBase > Issue Type: Bug > Components: test >Affects Versions: 2.0.0 >Reporter: Esteban Gutierrez > Attachments: HBASE-12380.v0.patch > > > Noticed this while trying to fix faulty test while working on a fix for > HBASE-12219: > {code} > Tests in error: > TestRegionServerNoMaster.testMultipleOpen:237 » Service > java.io.IOException: R... > TestRegionServerNoMaster.testCloseByRegionServer:211->closeRegionNoZK:201 » > Service > {code} > Initially I thought the problem was on my patch for HBASE-12219 but I noticed > that the issue was occurring on the 7th attempt to open the region. However I > was able to reproduce the same problem in the master branch after increasing > the number of requests in testMultipleOpen(): > {code} > 2014-10-29 15:03:45,043 INFO [Thread-216] regionserver.RSRpcServices(1334): > Receiving OPEN for the > region:TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca., > which we are already trying to OPEN - ignoring this new request for this > region. > Submitting openRegion attempt: 16 < > 2014-10-29 15:03:45,044 INFO [Thread-216] regionserver.RSRpcServices(1311): > Open TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca. > 2014-10-29 15:03:45,044 INFO > [PostOpenDeployTasks:025198143197ea68803e49819eae27ca] > hbase.MetaTableAccessor(1307): Updated row > TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca. 
> with server=192.168.1.105,63082,1414620220789 > Submitting openRegion attempt: 17 < > 2014-10-29 15:03:45,046 ERROR [RS_OPEN_REGION-192.168.1.105:63082-2] > handler.OpenRegionHandler(88): Region 025198143197ea68803e49819eae27ca was > already online when we started processing the opening. Marking this new > attempt as failed > 2014-10-29 15:03:45,047 FATAL [Thread-216] regionserver.HRegionServer(1931): > ABORTING region server 192.168.1.105,63082,1414620220789: Received OPEN for > the > region:TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca., > which is already online > 2014-10-29 15:03:45,047 FATAL [Thread-216] regionserver.HRegionServer(1937): > RegionServer abort: loaded coprocessors are: > [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] > 2014-10-29 15:03:45,054 WARN [Thread-216] regionserver.HRegionServer(1955): > Unable to report fatal error to master > com.google.protobuf.ServiceException: java.io.IOException: Call to > /192.168.1.105:63079 failed on local exception: java.io.IOException: > Connection to /192.168.1.105:63079 is closing. 
Call id=4, waitTime=2 > at > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1707) > at > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1757) > at > org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.reportRSFatalError(RegionServerStatusProtos.java:8301) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1952) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abortRegionServer(MiniHBaseCluster.java:174) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$100(MiniHBaseCluster.java:108) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$2.run(MiniHBaseCluster.java:167) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:356) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528) > at > org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:277) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abort(MiniHBaseCluster.java:165) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1964) > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1308) > at > org.apache.hadoop.hbase.regionserver.TestRegionServerNoMaster.testMultipleOpen(TestRegionServerNoMaster.java:237) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.Deleg
[jira] [Commented] (HBASE-12380) Too many attempts to open a region can crash the RegionServer
[ https://issues.apache.org/jira/browse/HBASE-12380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190360#comment-14190360 ] Jimmy Xiang commented on HBASE-12380: - I have discussed it with Esteban. We agree that it is better not to abort. We can log a warning/error message instead and let it go. The reason for aborting is that this scenario should never happen naturally: the master has a state machine and won't send the open call again if the region is already opened. My concern with not aborting is that we may hide some serious bug in the master if that indeed happens. This test is an old test. My suggestion is to remove this test. > Too many attempts to open a region can crash the RegionServer > - > > Key: HBASE-12380 > URL: https://issues.apache.org/jira/browse/HBASE-12380 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Esteban Gutierrez >Priority: Critical > > Noticed this while trying to fix a faulty test while working on a fix for > HBASE-12219: > {code} > Tests in error: > TestRegionServerNoMaster.testMultipleOpen:237 » Service > java.io.IOException: R... > TestRegionServerNoMaster.testCloseByRegionServer:211->closeRegionNoZK:201 » > Service > {code} > Initially I thought the problem was in my patch for HBASE-12219, but I noticed > that the issue was occurring on the 7th attempt to open the region. However, I > was able to reproduce the same problem in the master branch after increasing > the number of requests in testMultipleOpen(): > {code} > 2014-10-29 15:03:45,043 INFO [Thread-216] regionserver.RSRpcServices(1334): > Receiving OPEN for the > region:TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca., > which we are already trying to OPEN - ignoring this new request for this > region. > Submitting openRegion attempt: 16 < > 2014-10-29 15:03:45,044 INFO [Thread-216] regionserver.RSRpcServices(1311): > Open TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca. 
> 2014-10-29 15:03:45,044 INFO > [PostOpenDeployTasks:025198143197ea68803e49819eae27ca] > hbase.MetaTableAccessor(1307): Updated row > TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca. > with server=192.168.1.105,63082,1414620220789 > Submitting openRegion attempt: 17 < > 2014-10-29 15:03:45,046 ERROR [RS_OPEN_REGION-192.168.1.105:63082-2] > handler.OpenRegionHandler(88): Region 025198143197ea68803e49819eae27ca was > already online when we started processing the opening. Marking this new > attempt as failed > 2014-10-29 15:03:45,047 FATAL [Thread-216] regionserver.HRegionServer(1931): > ABORTING region server 192.168.1.105,63082,1414620220789: Received OPEN for > the > region:TestRegionServerNoMaster,,1414620223682.025198143197ea68803e49819eae27ca., > which is already online > 2014-10-29 15:03:45,047 FATAL [Thread-216] regionserver.HRegionServer(1937): > RegionServer abort: loaded coprocessors are: > [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint] > 2014-10-29 15:03:45,054 WARN [Thread-216] regionserver.HRegionServer(1955): > Unable to report fatal error to master > com.google.protobuf.ServiceException: java.io.IOException: Call to > /192.168.1.105:63079 failed on local exception: java.io.IOException: > Connection to /192.168.1.105:63079 is closing. 
Call id=4, waitTime=2 > at > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1707) > at > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1757) > at > org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.reportRSFatalError(RegionServerStatusProtos.java:8301) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1952) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abortRegionServer(MiniHBaseCluster.java:174) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$100(MiniHBaseCluster.java:108) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$2.run(MiniHBaseCluster.java:167) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:356) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1528) > at > org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:277) > at > org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.abort(MiniHBaseCluster.java:165) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:1964) > at > org.apache
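The non-aborting behavior proposed in the comment above can be sketched roughly as follows. This is a hypothetical illustration, not the actual HBase code: the class and method names (RegionOpenTracker, handleOpenRequest) are invented for the example, and a real RegionServer would also consult its regions-in-transition map before deciding.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: when an OPEN arrives for a region that is already online,
// reject it with a warning instead of aborting the whole server.
public class RegionOpenTracker {
  private final Set<String> onlineRegions = new HashSet<>();

  /**
   * Returns true if the open request is accepted, false if it is
   * rejected because the region is already online on this server.
   */
  public boolean handleOpenRequest(String encodedRegionName) {
    if (onlineRegions.contains(encodedRegionName)) {
      // Previously this path called abort(); the proposal is to warn
      // and reject, since a master retry can race with an open that
      // already completed.
      System.err.println("WARN: received OPEN for already-online region "
          + encodedRegionName + " - rejecting instead of aborting");
      return false;
    }
    onlineRegions.add(encodedRegionName);
    return true;
  }
}
```

The trade-off noted in the comment still applies: rejecting quietly may hide a genuine master-side state-machine bug, so the warning log is what keeps the condition visible.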
[jira] [Commented] (HBASE-12319) Inconsistencies during region recovery due to close/open of a region during recovery
[ https://issues.apache.org/jira/browse/HBASE-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181901#comment-14181901 ] Jimmy Xiang commented on HBASE-12319: - +1. Looks good to me. > Inconsistencies during region recovery due to close/open of a region during > recovery > > > Key: HBASE-12319 > URL: https://issues.apache.org/jira/browse/HBASE-12319 > Project: HBase > Issue Type: Bug >Affects Versions: 0.98.7, 0.99.1 >Reporter: Devaraj Das >Assignee: Jeffrey Zhong > Attachments: HBASE-12319.patch > > > In one of my test runs, I saw the following: > {noformat} > 2014-10-14 13:45:30,782 DEBUG > [StoreOpener-51af4bd23dc32a940ad2dd5435f00e1d-1] regionserver.HStore: loaded > hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestIngest/51af4bd23dc32a940ad2dd5435f00e1d/test_cf/d6df5cfe15ca41d68c619489fbde4d04, > isReference=false, isBulkLoadResult=false, seqid=141197, majorCompaction=true > 2014-10-14 13:45:30,788 DEBUG [RS_OPEN_REGION-hor9n01:60020-1] > regionserver.HRegion: Found 3 recovered edits file(s) under > hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestIngest/51af4bd23dc32a940ad2dd5435f00e1d > . > . > 2014-10-14 13:45:31,916 WARN [RS_OPEN_REGION-hor9n01:60020-1] > regionserver.HRegion: Null or non-existent edits file: > hdfs://hor9n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestIngest/51af4bd23dc32a940ad2dd5435f00e1d/recovered.edits/0198080 > {noformat} > The above logs are from a regionserver, say RS2. From the initial analysis it > seemed like the master asked a certain regionserver to open the region (let's > say RS1) and for some reason asked it to close soon after. The open was still > proceeding on RS1 but the master reassigned the region to RS2. 
This also > started recovery on RS2, but RS2 ended up seeing an inconsistent view of the > recovered-edits files (it reports missing files, as in the logs above) since > the first regionserver (RS1) deleted some files after it completed its > recovery. When RS2 really opens the region, it might not see the recent data > that was written by flushes on hor9n10 during the recovery process. Reads of > that data would be inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12228) Backport HBASE-11373 (hbase-protocol compile failed for name conflict of RegionTransition) to 0.98
[ https://issues.apache.org/jira/browse/HBASE-12228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168264#comment-14168264 ] Jimmy Xiang commented on HBASE-12228: - +1. Thanks! > Backport HBASE-11373 (hbase-protocol compile failed for name conflict of > RegionTransition) to 0.98 > -- > > Key: HBASE-12228 > URL: https://issues.apache.org/jira/browse/HBASE-12228 > Project: HBase > Issue Type: Bug >Reporter: Andrew Purtell >Assignee: Andrew Purtell > Fix For: 0.98.8 > > Attachments: HBASE-12228-0.98.patch > > > {quote} > RegionServerStatus.proto:81:9: "RegionTransition" is already defined in file > "ZooKeeper.proto". > RegionServerStatus.proto:114:12: "RegionTransition" seems to be defined in > "ZooKeeper.proto", which is not imported by "RegionServerStatus.proto". To > use it here, please add the necessary import. > {quote} > This was introduced into 0.98 in e6ffa86e > {noformat} > commit e6ffa86e33ee173afcff15ca4b614e6ec56357ed > Author: Andrew Purtell > Date: Tue Aug 26 08:01:09 2014 -0700 > HBASE-11546 Backport ZK-less region assignment to 0.98 (Virag Kothari) > [1/8] > > HBASE-11059 ZK-less region assignment (Jimmy Xiang > {noformat} > There's a later fix for this that needs to be applied: > {noformat} > commit 175f133dbc127d7eb2ba5693cc6b2e4fe3c51655 > Author: Jimmy Xiang > Date: Wed Jun 18 08:38:05 2014 -0700 > HBASE-11373 hbase-protocol compile failed for name conflict of > RegionTransition > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12230) User impersonation does not work in 'simple' mode.
[ https://issues.apache.org/jira/browse/HBASE-12230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167699#comment-14167699 ] Jimmy Xiang commented on HBASE-12230: - You want to use doAs without authentication? > User impersonation does not work in 'simple' mode. > -- > > Key: HBASE-12230 > URL: https://issues.apache.org/jira/browse/HBASE-12230 > Project: HBase > Issue Type: Bug > Components: REST, security >Affects Versions: 0.98.6.1 >Reporter: Aditya Kishore >Assignee: Aditya Kishore > Attachments: > HBASE-12230-User-impersonation-does-not-work-in-simp.patch > > > The [code responsible for initializing proxy > configuration|https://github.com/apache/hbase/blob/7cfdb38c9274e306ac37374c147a978c2cef31d6/hbase-server/src/main/java/org/apache/hadoop/hbase/security/HBasePolicyProvider.java#L54] > does not execute unless {{"hadoop.security.authorization"}} is set to true. > This is a departure from other Hadoop components. Impersonation should not be > tied to authorization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
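For context, REST-gateway impersonation is normally driven by the standard Hadoop-style proxy-user properties. The fragment below is an illustrative hbase-site.xml sketch, not taken from the patch: the proxy user name "rest" is an assumption, and the point of this issue is that before the fix these settings only took effect when hadoop.security.authorization was also set to true.

```xml
<!-- Illustrative sketch: Hadoop-style proxy-user (doAs) settings.
     "rest" is a hypothetical service user name. -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value> <!-- pre-fix: impersonation setup only ran when true -->
</property>
<property>
  <name>hadoop.proxyuser.rest.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.rest.hosts</name>
  <value>*</value>
</property>
```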
[jira] [Updated] (HBASE-12216) Lower closed region logging level
[ https://issues.apache.org/jira/browse/HBASE-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12216: Resolution: Fixed Fix Version/s: (was: 0.99.1) Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) The test is ok locally. Integrated into branch master. Thanks. > Lower closed region logging level > - > > Key: HBASE-12216 > URL: https://issues.apache.org/jira/browse/HBASE-12216 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0 > > Attachments: hbase-12216.patch > > > There are quite a few ERROR messages in the log that sound like problems > but actually are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12216) Lower closed region logging level
[ https://issues.apache.org/jira/browse/HBASE-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165415#comment-14165415 ] Jimmy Xiang commented on HBASE-12216: - Now, double closes are due to retries when the master restarts, in case the previous close request (sent before the master crashed) wasn't received by the region server yet. They are no longer a sign of things going wrong. > Lower closed region logging level > - > > Key: HBASE-12216 > URL: https://issues.apache.org/jira/browse/HBASE-12216 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12216.patch > > > There are quite a few ERROR messages in the log that sound like problems > but actually are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12216) Lower closed region logging level
[ https://issues.apache.org/jira/browse/HBASE-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12216: Fix Version/s: 0.99.1 Status: Patch Available (was: Open) > Lower closed region logging level > - > > Key: HBASE-12216 > URL: https://issues.apache.org/jira/browse/HBASE-12216 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12216.patch > > > There are quite some ERROR messages in the log, which sounds some problems > but actually they are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12216) Lower closed region logging level
[ https://issues.apache.org/jira/browse/HBASE-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12216: Attachment: hbase-12216.patch > Lower closed region logging level > - > > Key: HBASE-12216 > URL: https://issues.apache.org/jira/browse/HBASE-12216 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0 > > Attachments: hbase-12216.patch > > > There are quite some ERROR messages in the log, which sounds some problems > but actually they are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12216) Lower closed region logging level
Jimmy Xiang created HBASE-12216: --- Summary: Lower closed region logging level Key: HBASE-12216 URL: https://issues.apache.org/jira/browse/HBASE-12216 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Fix For: 2.0.0 There are quite a few ERROR messages in the log that sound like problems but actually are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12209) NPE in HRegionServer#getLastSequenceId
[ https://issues.apache.org/jira/browse/HBASE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12209: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Integrated into branch 1 and master. Thanks. > NPE in HRegionServer#getLastSequenceId > -- > > Key: HBASE-12209 > URL: https://issues.apache.org/jira/browse/HBASE-12209 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12209.patch > > > The region server got the log splitting task, but the master is gone. > {noformat} > 2014-10-08 08:31:22,089 ERROR [RS_LOG_REPLAY_OPS-a2428:20020-1] > executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY > java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getLastSequenceId(HRegionServer.java:2113) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:317) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:218) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:103) > at > org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:72) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12209) NPE in HRegionServer#getLastSequenceId
[ https://issues.apache.org/jira/browse/HBASE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12209: Attachment: hbase-12209.patch > NPE in HRegionServer#getLastSequenceId > -- > > Key: HBASE-12209 > URL: https://issues.apache.org/jira/browse/HBASE-12209 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12209.patch > > > The region server got the logging splitting task, but the master is gone. > {noformat} > 2014-10-08 08:31:22,089 ERROR [RS_LOG_REPLAY_OPS-a2428:20020-1] > executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY > java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getLastSequenceId(HRegionServer.java:2113) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:317) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:218) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:103) > at > org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:72) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12209) NPE in HRegionServer#getLastSequenceId
[ https://issues.apache.org/jira/browse/HBASE-12209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12209: Status: Patch Available (was: Open) > NPE in HRegionServer#getLastSequenceId > -- > > Key: HBASE-12209 > URL: https://issues.apache.org/jira/browse/HBASE-12209 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12209.patch > > > The region server got the logging splitting task, but the master is gone. > {noformat} > 2014-10-08 08:31:22,089 ERROR [RS_LOG_REPLAY_OPS-a2428:20020-1] > executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY > java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getLastSequenceId(HRegionServer.java:2113) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:317) > at > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:218) > at > org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:103) > at > org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:72) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12209) NPE in HRegionServer#getLastSequenceId
Jimmy Xiang created HBASE-12209: --- Summary: NPE in HRegionServer#getLastSequenceId Key: HBASE-12209 URL: https://issues.apache.org/jira/browse/HBASE-12209 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Fix For: 2.0.0, 0.99.1 The region server got the log splitting task, but the master is gone. {noformat} 2014-10-08 08:31:22,089 ERROR [RS_LOG_REPLAY_OPS-a2428:20020-1] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegionServer.getLastSequenceId(HRegionServer.java:2113) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:317) at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:218) at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:103) at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:72) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
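A minimal sketch of the kind of guard this fix implies: if the master's status service is unavailable while a log-splitting task runs, return a sentinel instead of dereferencing a null stub. The names below (LastSequenceIdGuard, MasterService) are illustrative stand-ins, not actual HBase classes, and -1 plays the role of HBase's "no sequence number" sentinel.

```java
// Sketch of a null-guard for the master connection during log splitting.
public class LastSequenceIdGuard {
  public static final long NO_SEQNUM = -1L;

  // Stand-in for the master's RegionServerStatusService stub, which was
  // null in this bug because the master was gone.
  public interface MasterService {
    long getLastFlushedSequenceId(String encodedRegionName);
  }

  public static long getLastSequenceId(MasterService master,
                                       String encodedRegionName) {
    if (master == null) {
      // Master unreachable: report "unknown" instead of throwing an NPE,
      // so the log-splitting task can proceed conservatively.
      return NO_SEQNUM;
    }
    return master.getLastFlushedSequenceId(encodedRegionName);
  }
}
```

Returning the sentinel is the conservative choice here: the splitter then replays all edits rather than skipping any, which is safe (if slower) when the last flushed sequence id cannot be determined.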
[jira] [Updated] (HBASE-12206) NPE in RSRpcServices
[ https://issues.apache.org/jira/browse/HBASE-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12206: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Integrated into branch 1 and master. Thanks. > NPE in RSRpcServices > > > Key: HBASE-12206 > URL: https://issues.apache.org/jira/browse/HBASE-12206 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12206.patch > > > It looks like "leases" is null, which is possible since the region server is not open > yet. Will add a check. > {noformat} > 2014-10-08 08:38:17,985 ERROR > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: Unexpected > throwable object > java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > 2014-10-08 08:38:17,988 DEBUG > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: > B.defaultRpcServer.handler=0,queue=0,port=20020: callId: 645 service: > ClientService methodName: Scan size: 22 connection: 10.20.212.36:53810 > java.io.IOException > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2054) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > Caused by: 
java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > ... 4 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12206) NPE in RSRpcServices
[ https://issues.apache.org/jira/browse/HBASE-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163805#comment-14163805 ] Jimmy Xiang commented on HBASE-12206: - Sure. Will make it DEBUG. Thanks. > NPE in RSRpcServices > > > Key: HBASE-12206 > URL: https://issues.apache.org/jira/browse/HBASE-12206 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12206.patch > > > Looks "leases" is null, which is possible since the region server is not open > yet. Will add a check. > {noformat} > 2014-10-08 08:38:17,985 ERROR > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: Unexpected > throwable object > java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > 2014-10-08 08:38:17,988 DEBUG > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: > B.defaultRpcServer.handler=0,queue=0,port=20020: callId: 645 service: > ClientService methodName: Scan size: 22 connection: 10.20.212.36:53810 > java.io.IOException > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2054) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > Caused by: java.lang.NullPointerException > at > 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > ... 4 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12206) NPE in RSRpcServices
[ https://issues.apache.org/jira/browse/HBASE-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12206: Fix Version/s: 0.99.1 2.0.0 Status: Patch Available (was: Open) > NPE in RSRpcServices > > > Key: HBASE-12206 > URL: https://issues.apache.org/jira/browse/HBASE-12206 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12206.patch > > > Looks "leases" is null, which is possible since the region server is not open > yet. Will add a check. > {noformat} > 2014-10-08 08:38:17,985 ERROR > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: Unexpected > throwable object > java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > 2014-10-08 08:38:17,988 DEBUG > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: > B.defaultRpcServer.handler=0,queue=0,port=20020: callId: 645 service: > ClientService methodName: Scan size: 22 connection: 10.20.212.36:53810 > java.io.IOException > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2054) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > Caused by: java.lang.NullPointerException > at > 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > ... 4 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12206) NPE in RSRpcServices
[ https://issues.apache.org/jira/browse/HBASE-12206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12206: Attachment: hbase-12206.patch > NPE in RSRpcServices > > > Key: HBASE-12206 > URL: https://issues.apache.org/jira/browse/HBASE-12206 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12206.patch > > > Looks "leases" is null, which is possible since the region server is not open > yet. Will add a check. > {noformat} > 2014-10-08 08:38:17,985 ERROR > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: Unexpected > throwable object > java.lang.NullPointerException > at > org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > 2014-10-08 08:38:17,988 DEBUG > [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: > B.defaultRpcServer.handler=0,queue=0,port=20020: callId: 645 service: > ClientService methodName: Scan size: 22 connection: 10.20.212.36:53810 > java.io.IOException > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2054) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:724) > Caused by: java.lang.NullPointerException > at > 
org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > ... 4 more > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12206) NPE in RSRpcServices
Jimmy Xiang created HBASE-12206: --- Summary: NPE in RSRpcServices Key: HBASE-12206 URL: https://issues.apache.org/jira/browse/HBASE-12206 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor It looks like "leases" is null, which is possible since the region server is not open yet. Will add a check. {noformat} 2014-10-08 08:38:17,985 ERROR [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: Unexpected throwable object java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:724) 2014-10-08 08:38:17,988 DEBUG [B.defaultRpcServer.handler=0,queue=0,port=20020] ipc.RpcServer: B.defaultRpcServer.handler=0,queue=0,port=20020: callId: 645 service: ClientService methodName: Scan size: 22 connection: 10.20.212.36:53810 java.io.IOException at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2054) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:724) Caused by: java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:1957) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30422) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) ... 4 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
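The fix described above is a defensive null check: while the region server is still opening, its leases instance may not exist yet, so the scan path should fail fast with a descriptive error instead of an NPE. A minimal sketch of the pattern, with hypothetical class and field names standing in for the actual HBase types:

```java
// Illustrative sketch of the null-guard pattern; ScanService and its
// "leases" field are hypothetical stand-ins, not the real HBase classes.
class ScanService {
    private final Object leases; // set only once the server is fully open

    ScanService(Object leases) {
        this.leases = leases;
    }

    String scan(String scannerId) {
        // Fail fast with a meaningful exception instead of an NPE when
        // the region server has not finished opening yet.
        if (leases == null) {
            throw new IllegalStateException("Region server is not open yet");
        }
        return "scanned:" + scannerId;
    }
}
```

The check converts an opaque `NullPointerException` in the RPC handler into an error the client can interpret and retry.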
[jira] [Updated] (HBASE-12196) SSH should retry in case failed to assign regions
[ https://issues.apache.org/jira/browse/HBASE-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12196: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Integrated into branch 1 and master. Thanks. > SSH should retry in case failed to assign regions > - > > Key: HBASE-12196 > URL: https://issues.apache.org/jira/browse/HBASE-12196 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12196.patch, hbase-12196_v2.patch > > > If there is only master alive, all regionservers are down, SSH can't find a > plan to assign user regions. In this case, SSH should retry. > {noformat} > 2014-10-07 14:05:18,310 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.io.IOException: Unable to determine a plan to assign region(s) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1411) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:272) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12196) SSH should retry in case failed to assign regions
[ https://issues.apache.org/jira/browse/HBASE-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12196: Attachment: hbase-12196_v2.patch Attached v2 that added one test to cover this. > SSH should retry in case failed to assign regions > - > > Key: HBASE-12196 > URL: https://issues.apache.org/jira/browse/HBASE-12196 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12196.patch, hbase-12196_v2.patch > > > If there is only master alive, all regionservers are down, SSH can't find a > plan to assign user regions. In this case, SSH should retry. > {noformat} > 2014-10-07 14:05:18,310 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.io.IOException: Unable to determine a plan to assign region(s) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1411) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:272) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12196) SSH should retry in case failed to assign regions
[ https://issues.apache.org/jira/browse/HBASE-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12196: Status: Patch Available (was: Open) > SSH should retry in case failed to assign regions > - > > Key: HBASE-12196 > URL: https://issues.apache.org/jira/browse/HBASE-12196 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12196.patch, hbase-12196_v2.patch > > > If there is only master alive, all regionservers are down, SSH can't find a > plan to assign user regions. In this case, SSH should retry. > {noformat} > 2014-10-07 14:05:18,310 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.io.IOException: Unable to determine a plan to assign region(s) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1411) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:272) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12196) SSH should retry in case failed to assign regions
[ https://issues.apache.org/jira/browse/HBASE-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162704#comment-14162704 ] Jimmy Xiang commented on HBASE-12196: - Yes, we can have a test for this one. Let me add one. Thanks. > SSH should retry in case failed to assign regions > - > > Key: HBASE-12196 > URL: https://issues.apache.org/jira/browse/HBASE-12196 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12196.patch > > > If there is only master alive, all regionservers are down, SSH can't find a > plan to assign user regions. In this case, SSH should retry. > {noformat} > 2014-10-07 14:05:18,310 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.io.IOException: Unable to determine a plan to assign region(s) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1411) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:272) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12196) SSH should retry in case failed to assign regions
[ https://issues.apache.org/jira/browse/HBASE-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12196: Attachment: hbase-12196.patch > SSH should retry in case failed to assign regions > - > > Key: HBASE-12196 > URL: https://issues.apache.org/jira/browse/HBASE-12196 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12196.patch > > > If there is only master alive, all regionservers are down, SSH can't find a > plan to assign user regions. In this case, SSH should retry. > {noformat} > 2014-10-07 14:05:18,310 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.io.IOException: Unable to determine a plan to assign region(s) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1411) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:272) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12196) SSH should retry in case failed to assign regions
Jimmy Xiang created HBASE-12196: --- Summary: SSH should retry in case failed to assign regions Key: HBASE-12196 URL: https://issues.apache.org/jira/browse/HBASE-12196 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 2.0.0, 0.99.1 If only the master is alive and all region servers are down, SSH can't find a plan to assign user regions. In this case, SSH should retry. {noformat} 2014-10-07 14:05:18,310 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN java.io.IOException: Unable to determine a plan to assign region(s) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1411) at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:272) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
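The resolution amounts to retrying the assignment instead of letting the shutdown handler die when no plan can be generated (e.g. while no region server is live). A hedged sketch of such a bounded retry loop; the helper name, attempt count, and backoff are assumptions for illustration, not the actual ServerShutdownHandler code:

```java
import java.util.concurrent.Callable;

// Illustrative bounded-retry helper. Retries a task that may fail
// transiently (like "Unable to determine a plan to assign region(s)")
// until it succeeds or the attempt budget is exhausted.
class RetryingAssigner {
    static <T> T retry(Callable<T> task, int maxAttempts, long sleepMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;                  // remember the most recent failure
                Thread.sleep(sleepMillis); // wait, e.g. for a region server to rejoin
            }
        }
        throw last; // give up after maxAttempts, surfacing the last error
    }
}
```

With this shape, a transient "no plan" condition resolves itself as soon as a region server registers again, instead of leaving regions permanently unassigned.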
[jira] [Resolved] (HBASE-11838) Enable PREFIX_TREE in integration tests
[ https://issues.apache.org/jira/browse/HBASE-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang resolved HBASE-11838. - Resolution: Fixed Fix Version/s: 0.99.1 2.0.0 Hadoop Flags: Reviewed With HBASE-11728 and HBASE-12078, ITBLL with PREFIX_TREE encoding works fine for me now. Integrated the patch to branch 1 and master. ITBLL tests all supported data encodings from now on. Thanks. > Enable PREFIX_TREE in integration tests > --- > > Key: HBASE-11838 > URL: https://issues.apache.org/jira/browse/HBASE-11838 > Project: HBase > Issue Type: Test > Components: test >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang >Priority: Minor > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-11838.patch > > > HBASE-11728 fixed a PREFIX_TREE encoding bug. Let's try to enable the > encoding in integration tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12184) ServerShutdownHandler throws NPE
[ https://issues.apache.org/jira/browse/HBASE-12184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12184: Resolution: Fixed Fix Version/s: 0.98.7 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Integrated into branch 0.98, 1, and master. Thanks. > ServerShutdownHandler throws NPE > > > Key: HBASE-12184 > URL: https://issues.apache.org/jira/browse/HBASE-12184 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.98.7, 0.99.1 > > Attachments: hbase-12184.patch > > > {noformat} > 2014-10-06 16:59:22,219 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.lang.NullPointerException > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:190) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12078) Missing Data when scanning using PREFIX_TREE DATA-BLOCK-ENCODING
[ https://issues.apache.org/jira/browse/HBASE-12078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161268#comment-14161268 ] Jimmy Xiang commented on HBASE-12078: - Playing with ITBLL with PREFIX_TREE encoding enabled (HBASE-11838). It seems there is no bug with this encoding anymore. Good job! > Missing Data when scanning using PREFIX_TREE DATA-BLOCK-ENCODING > > > Key: HBASE-12078 > URL: https://issues.apache.org/jira/browse/HBASE-12078 > Project: HBase > Issue Type: Bug >Affects Versions: 0.98.6.1 > Environment: CentOS 6.3 > hadoop 2.5.0(hdfs) > hadoop 2.2.0(hbase) > hbase 0.98.6.1 > sun-jdk 1.7.0_67-b01 >Reporter: zhangduo >Assignee: zhangduo >Priority: Critical > Fix For: 2.0.0, 0.98.7, 0.99.1 > > Attachments: HBASE-12078-0.98.patch, HBASE-12078.patch, > HBASE-12078_1.patch, prefix_tree_error.patch > > > Our row key is composed of two ints, and we found that sometimes when we > use only the first int part to scan, the returned result may be missing some > rows. But when we dump the whole hfile, the row is still there. > We have written a testcase to reproduce the bug. It works like this: > put 1-12345 > put 12345-0x0100 > put 12345-0x0101 > put 12345-0x0200 > put 12345-0x0202 > put 12345-0x0300 > put 12345-0x0303 > put 12345-0x0400 > put 12345-0x0404 > flush memstore > then scan using 12345, the returned row key will be > 12345-0x2000 (12345-0x1000 expected) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12184) ServerShutdownHandler throws NPE
[ https://issues.apache.org/jira/browse/HBASE-12184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12184: Status: Patch Available (was: Open) > ServerShutdownHandler throws NPE > > > Key: HBASE-12184 > URL: https://issues.apache.org/jira/browse/HBASE-12184 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12184.patch > > > {noformat} > 2014-10-06 16:59:22,219 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.lang.NullPointerException > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:190) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12184) ServerShutdownHandler throws NPE
[ https://issues.apache.org/jira/browse/HBASE-12184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12184: Attachment: hbase-12184.patch > ServerShutdownHandler throws NPE > > > Key: HBASE-12184 > URL: https://issues.apache.org/jira/browse/HBASE-12184 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12184.patch > > > {noformat} > 2014-10-06 16:59:22,219 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.lang.NullPointerException > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:190) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12184) ServerShutdownHandler throws NPE
Jimmy Xiang created HBASE-12184: --- Summary: ServerShutdownHandler throws NPE Key: HBASE-12184 URL: https://issues.apache.org/jira/browse/HBASE-12184 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 2.0.0, 0.99.1 {noformat} 2014-10-06 16:59:22,219 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-2] executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN java.lang.NullPointerException at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:190) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-12175) Can't create table
[ https://issues.apache.org/jira/browse/HBASE-12175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang resolved HBASE-12175. - Resolution: Invalid Could be my env issue. > Can't create table > -- > > Key: HBASE-12175 > URL: https://issues.apache.org/jira/browse/HBASE-12175 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang > > Trying to create a table from hbase shell and couldn't get region assigned: > {noformat} > ^Gdefault^R^Dtest > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2213) > at > org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879) > Caused by: java.lang.IllegalArgumentException: Illegal character <10> at 0. > Namespaces can only contain 'alphanumeric characters': i.e. [a-zA-Z_0-9]: > ^Gdefault^R^Dtest > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:215) > at > org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:204) > at org.apache.hadoop.hbase.TableName.(TableName.java:302) > at > org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:339) > at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:460) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12175) Can't create table
Jimmy Xiang created HBASE-12175: --- Summary: Can't create table Key: HBASE-12175 URL: https://issues.apache.org/jira/browse/HBASE-12175 Project: HBase Issue Type: Bug Reporter: Jimmy Xiang Trying to create a table from hbase shell and couldn't get region assigned: {noformat} ^Gdefault^R^Dtest at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2213) at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1879) Caused by: java.lang.IllegalArgumentException: Illegal character <10> at 0. Namespaces can only contain 'alphanumeric characters': i.e. [a-zA-Z_0-9]: ^Gdefault^R^Dtest at org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:215) at org.apache.hadoop.hbase.TableName.isLegalNamespaceName(TableName.java:204) at org.apache.hadoop.hbase.TableName.(TableName.java:302) at org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:339) at org.apache.hadoop.hbase.TableName.valueOf(TableName.java:460) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
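The exception above comes from namespace-name validation: the input contains protobuf framing bytes (the `^G`/`^R` control characters), which suggests a serialized table name was passed where a plain string was expected. The rule quoted in the message — namespaces may contain only `[a-zA-Z_0-9]` — can be sketched as follows; this mirrors the stated rule, not HBase's exact `TableName` implementation:

```java
// Sketch of the namespace-name rule cited in the exception message:
// only [a-zA-Z_0-9] is legal. Any other byte (such as the protobuf
// framing character <10> above) is rejected with its position.
class NamespaceNames {
    static void checkLegal(String ns) {
        for (int i = 0; i < ns.length(); i++) {
            char c = ns.charAt(i);
            boolean ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                      || (c >= '0' && c <= '9') || c == '_';
            if (!ok) {
                throw new IllegalArgumentException(
                    "Illegal character <" + (int) c + "> at " + i
                    + ". Namespaces can only contain 'alphanumeric characters'");
            }
        }
    }
}
```

A `\n` byte (code point 10) at position 0 reproduces the `Illegal character <10> at 0` message seen in the stack trace.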
[jira] [Updated] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12166: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Integrated into branch 1 and master. Thanks. > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, hbase-12166_v2.patch, > log.txt > > > See > https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/ > The namespace region gets stuck. It is never 'recovered' even though we have > finished log splitting. Here is the main exception: > {code} > 4941 2014-10-03 02:00:36,862 DEBUG > [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): > B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: > ClientService methodName: Get > size: 99 connection: 67.195.81.144:44526 > 4942 org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: > hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. 
is recovering > 4943 at > org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058) > 4944 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086) > 4945 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072) > 4946 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014) > 4947 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988) > 4948 at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690) > 4949 at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418) > 4950 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > 4951 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > 4952 at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > 4953 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > 4954 at java.lang.Thread.run(Thread.java:744) > {code} > See how we've finished log splitting long time previous: > {code} > 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] > master.SplitLogManager(294): finished splitting (more than or equal to) > 197337 bytes in 1 log files in > [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] > in 379ms > {code} > If I grep for the deleting of znodes on recovery, which is when we set the > recovering flag to false, I see a bunch of regions but not my namespace one: > 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 > znode deleted. Region: 1588230740 completes recovery. > 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. > Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery. 
> 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. > Region: 41d438848305831b61d708a406d5ecde completes recovery. > 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. > Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery. > 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. > Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery. > 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. > Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery. > 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. > Region: a4337ad2874ee7e599ca2344fce21583 completes recovery. > 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted.
[jira] [Updated] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12166: Attachment: hbase-12166_v2.patch Attached v2 that fixed the issue Stack found. > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, hbase-12166_v2.patch, > log.txt > > > See > https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/ > The namespace region gets stuck. It is never 'recovered' even though we have > finished log splitting. Here is the main exception: > {code} > 4941 2014-10-03 02:00:36,862 DEBUG > [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): > B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: > ClientService methodName: Get > size: 99 connection: 67.195.81.144:44526 > 4942 org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: > hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. 
is recovering > 4943 at > org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058) > 4944 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086) > 4945 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072) > 4946 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014) > 4947 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988) > 4948 at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690) > 4949 at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418) > 4950 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > 4951 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > 4952 at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > 4953 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > 4954 at java.lang.Thread.run(Thread.java:744) > {code} > See how we've finished log splitting long time previous: > {code} > 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] > master.SplitLogManager(294): finished splitting (more than or equal to) > 197337 bytes in 1 log files in > [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] > in 379ms > {code} > If I grep for the deleting of znodes on recovery, which is when we set the > recovering flag to false, I see a bunch of regions but not my namespace one: > 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 > znode deleted. Region: 1588230740 completes recovery. > 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. > Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery. 
> 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. > Region: 41d438848305831b61d708a406d5ecde completes recovery. > 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. > Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery. > 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. > Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery. > 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. > Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery. > 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. > Region: a4337ad2874ee7e599ca2344fce21583 completes recovery. > 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted. > Region: 9d91d6eafe260ce33e8d7d23ccd13192 completes recovery. >
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158679#comment-14158679 ] Jimmy Xiang commented on HBASE-12166: - [~stack], good catch! Unbelievable! > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt > > > See > https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/ > The namespace region gets stuck. It is never 'recovered' even though we have > finished log splitting. Here is the main exception: > {code} > 4941 2014-10-03 02:00:36,862 DEBUG > [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): > B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: > ClientService methodName: Get > size: 99 connection: 67.195.81.144:44526 > 4942 org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: > hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. 
is recovering > 4943 at > org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058) > 4944 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086) > 4945 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072) > 4946 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014) > 4947 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988) > 4948 at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690) > 4949 at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418) > 4950 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > 4951 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > 4952 at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > 4953 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > 4954 at java.lang.Thread.run(Thread.java:744) > {code} > See how we've finished log splitting long time previous: > {code} > 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] > master.SplitLogManager(294): finished splitting (more than or equal to) > 197337 bytes in 1 log files in > [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] > in 379ms > {code} > If I grep for the deleting of znodes on recovery, which is when we set the > recovering flag to false, I see a bunch of regions but not my namespace one: > 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 > znode deleted. Region: 1588230740 completes recovery. > 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. > Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery. 
> 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. > Region: 41d438848305831b61d708a406d5ecde completes recovery. > 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. > Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery. > 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. > Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery. > 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. > Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery. > 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. > Region: a4337ad2874ee7e599ca2344fce21583 completes recovery. > 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted. > Region: 9d91d6eafe260ce33e8d7d23ccd13192 completes recovery. > This would see
[jira] [Comment Edited] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158515#comment-14158515 ] Jimmy Xiang edited comment on HBASE-12166 at 10/3/14 11:08 PM: --- I think I found out the cause. In ZKSplitLogManagerCoordination#removeRecoveringRegions: {noformat} listSize = failedServers.size(); for (int j = 0; j < listSize; j++) { {noformat} The listSize is redefined. was (Author: jxiang): I think I found out the cause. In ZKSplitLogManagerCoordination#removeRecoveringRegions: {noformat} listSize = failedServers.size(); for (int j = 0; j < listSize; j++) { {noformat} The listSize is redefined. That's not a bug, it is a hidden bomb :) > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt
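The bug pattern behind that comment can be sketched in isolation. This is a minimal hypothetical example, not the actual HBase code: the method and variable names besides `listSize` are illustrative. The JIRA comment shows that `listSize`, which bounds the outer loop over recovering regions, is reassigned inside the loop to `failedServers.size()`, so the outer loop silently runs against the wrong bound and can skip regions (which would leave a region like hbase:namespace stuck in recovery):

```java
import java.util.Arrays;
import java.util.List;

class LoopBoundBug {
    // Buggy shape: one variable reused for two loop bounds. After the first
    // iteration, the outer bound is whatever failedServers.size() was.
    static int processedBuggy(List<String> regions, List<String> failedServers) {
        int processed = 0;
        int listSize = regions.size();        // intended outer bound
        for (int i = 0; i < listSize; i++) {
            processed++;
            listSize = failedServers.size();  // bug: outer bound silently replaced
            for (int j = 0; j < listSize; j++) {
                // per-failed-server work would go here
            }
        }
        return processed;
    }

    // Fixed shape: a distinct bound variable per loop.
    static int processedFixed(List<String> regions, List<String> failedServers) {
        int processed = 0;
        int regionCount = regions.size();
        for (int i = 0; i < regionCount; i++) {
            processed++;
            int serverCount = failedServers.size();
            for (int j = 0; j < serverCount; j++) {
                // per-failed-server work would go here
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> regions = Arrays.asList("r1", "r2", "r3");
        List<String> servers = Arrays.asList("s1"); // fewer servers than regions
        System.out.println(processedBuggy(regions, servers)); // exits after 1 region
        System.out.println(processedFixed(regions, servers)); // handles all 3 regions
    }
}
```

With three regions but one failed server, the buggy loop processes only one region before its bound shrinks to 1, which matches the "hidden bomb" description: harmless whenever the two sizes happen to agree, wrong otherwise.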
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158649#comment-14158649 ] Jimmy Xiang commented on HBASE-12166: - TestRegionReplicaReplicationEndpoint is ok locally. I can increase the timeout a little at checkin (from 1000 to 6000?). > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158645#comment-14158645 ] Jimmy Xiang commented on HBASE-12166: - [~stack], [~jeffreyz], could you take a look at the patch? Thanks. > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt
[jira] [Commented] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158641#comment-14158641 ] Jimmy Xiang commented on HBASE-12166: - TestMasterObserver should be fixed by the addendum of HBASE-12167. > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt
[jira] [Commented] (HBASE-12167) NPE in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158598#comment-14158598 ] Jimmy Xiang commented on HBASE-12167: - Checked in an addendum to fix TestMasterObserver. > NPE in AssignmentManager > > > Key: HBASE-12167 > URL: https://issues.apache.org/jira/browse/HBASE-12167 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12167.patch > > > If we can't find a region plan, we should check. > {noformat} > 2014-10-02 18:36:27,719 ERROR [MASTER_SERVER_OPERATIONS-a2424:20020-0] > executor.EventHandler: Caught throwable while processing event > M_SERVER_SHUTDOWN > java.lang.NullPointerException > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1417) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1409) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:271) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12167) NPE in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-12167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12167: Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Integrated into branch 1 and master. Thanks. > NPE in AssignmentManager > > > Key: HBASE-12167 > URL: https://issues.apache.org/jira/browse/HBASE-12167 > Project: HBase > Issue Type: Bug >Reporter: Jimmy Xiang >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: hbase-12167.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12166: Component/s: wal > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test, wal >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt > > > See > https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/ > The namespace region gets stuck. It is never 'recovered' even though we have > finished log splitting. Here is the main exception: > {code} > 4941 2014-10-03 02:00:36,862 DEBUG > [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): > B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: > ClientService methodName: Get > size: 99 connection: 67.195.81.144:44526 > 4942 org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: > hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. 
is recovering > 4943 at > org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058) > 4944 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086) > 4945 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072) > 4946 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014) > 4947 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988) > 4948 at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690) > 4949 at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418) > 4950 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > 4951 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > 4952 at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > 4953 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > 4954 at java.lang.Thread.run(Thread.java:744) > {code} > See how we've finished log splitting long time previous: > {code} > 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] > master.SplitLogManager(294): finished splitting (more than or equal to) > 197337 bytes in 1 log files in > [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] > in 379ms > {code} > If I grep for the deleting of znodes on recovery, which is when we set the > recovering flag to false, I see a bunch of regions but not my namespace one: > 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 > znode deleted. Region: 1588230740 completes recovery. > 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. > Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery. 
> 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. > Region: 41d438848305831b61d708a406d5ecde completes recovery. > 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. > Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery. > 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. > Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery. > 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. > Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery. > 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. > Region: a4337ad2874ee7e599ca2344fce21583 completes recovery. > 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted. > Region: 9d91d6eafe260ce33e8d7d23ccd13192 completes recovery. > This would seem to indicate that we successfully wrote zk that we are > recovering: >
[jira] [Updated] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12166: Status: Patch Available (was: Open) Attached a simple patch. The test is ok locally now. Let's see what the jenkins says. Hope this is the last DLR bug. > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt > > > See > https://builds.apache.org/job/PreCommit-HBASE-Build/11204//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testMasterStartsUpWithLogReplayWork/ > The namespace region gets stuck. It is never 'recovered' even though we have > finished log splitting. Here is the main exception: > {code} > 4941 2014-10-03 02:00:36,862 DEBUG > [B.defaultRpcServer.handler=1,queue=0,port=37113] ipc.CallRunner(111): > B.defaultRpcServer.handler=1,queue=0,port=37113: callId: 211 service: > ClientService methodName: Get > size: 99 connection: 67.195.81.144:44526 > 4942 org.apache.hadoop.hbase.exceptions.RegionInRecoveryException: > hbase:namespace,,1412301462277.eba5d23de65f2718715eeb22edf7edc2. 
is recovering > 4943 at > org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:6058) > 4944 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2086) > 4945 at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2072) > 4946 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:5014) > 4947 at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4988) > 4948 at > org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:1690) > 4949 at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30418) > 4950 at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020) > 4951 at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > 4952 at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > 4953 at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > 4954 at java.lang.Thread.run(Thread.java:744) > {code} > See how we've finished log splitting long time previous: > {code} > 2014-10-03 01:57:48,129 INFO [M_LOG_REPLAY_OPS-asf900:37113-1] > master.SplitLogManager(294): finished splitting (more than or equal to) > 197337 bytes in 1 log files in > [hdfs://localhost:49601/user/jenkins/hbase/WALs/asf900.gq1.ygridcore.net,40732,1412301461887-splitting] > in 379ms > {code} > If I grep for the deleting of znodes on recovery, which is when we set the > recovering flag to false, I see a bunch of regions but not my namespace one: > 2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 > znode deleted. Region: 1588230740 completes recovery. > 2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. > Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery. 
> 2014-10-03 01:57:48,121 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/41d438848305831b61d708a406d5ecde znode deleted. > Region: 41d438848305831b61d708a406d5ecde completes recovery. > 2014-10-03 01:57:48,122 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/6a7cada80de2ae5d774fe8cd33bd4cda znode deleted. > Region: 6a7cada80de2ae5d774fe8cd33bd4cda completes recovery. > 2014-10-03 01:57:48,124 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/65451bd5b38bd16a31e25b62b3305533 znode deleted. > Region: 65451bd5b38bd16a31e25b62b3305533 completes recovery. > 2014-10-03 01:57:48,125 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/07afdc3748894cf2b56e0075272a95a0 znode deleted. > Region: 07afdc3748894cf2b56e0075272a95a0 completes recovery. > 2014-10-03 01:57:48,126 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/a4337ad2874ee7e599ca2344fce21583 znode deleted. > Region: a4337ad2874ee7e599ca2344fce21583 completes recovery. > 2014-10-03 01:57:48,128 INFO [Thread-9216-EventThread] > zookeeper.RecoveringRegionWatcher(66): > /hbase/recovering-regions/9d91d6eafe260ce33e8d7d23ccd13192 znode deleted. > Region: 9d91d6eafe260ce33e8d7d23ccd13192 completes recovery. > This would seem to indicate that we successfully wrote zk that we are > recovering:
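The grep workflow described above (collect the znode-deletion lines, then see which region never had its recovering flag cleared) can be scripted. Below is a minimal sketch, not part of the issue; the set of regions that entered recovery is assumed for illustration (in practice it would be grepped from the master log), and the `eba5d...` encoded name is taken from the RegionInRecoveryException in the stack trace above.

```python
import re

# Two of the RecoveringRegionWatcher lines quoted in this issue: each
# /hbase/recovering-regions/<region> znode deletion marks recovery complete.
log = """\
2014-10-03 01:57:47,330 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/1588230740 znode deleted. Region: 1588230740 completes recovery.
2014-10-03 01:57:48,119 INFO [Thread-9216-EventThread] zookeeper.RecoveringRegionWatcher(66): /hbase/recovering-regions/adfdcf958dd958f0e2ce59072ce2209d znode deleted. Region: adfdcf958dd958f0e2ce59072ce2209d completes recovery.
"""

# Assumed for this sketch: the regions that were marked as recovering.
# eba5d23de65f2718715eeb22edf7edc2 is hbase:namespace, per the
# RegionInRecoveryException above.
entered_recovery = {
    "1588230740",
    "adfdcf958dd958f0e2ce59072ce2209d",
    "eba5d23de65f2718715eeb22edf7edc2",
}

# Regions whose recovering-regions znode was deleted, i.e. recovery done.
completed = set(re.findall(r"/hbase/recovering-regions/(\w+) znode deleted", log))

# Regions still flagged as recovering: these are the stuck ones.
stuck = entered_recovery - completed
print(sorted(stuck))  # -> ['eba5d23de65f2718715eeb22edf7edc2']
```

Run against the full master log, the diff would immediately surface the namespace region as the one whose znode was never deleted.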
[jira] [Updated] (HBASE-12166) TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork
[ https://issues.apache.org/jira/browse/HBASE-12166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jimmy Xiang updated HBASE-12166: Attachment: hbase-12166.patch > TestDistributedLogSplitting.testMasterStartsUpWithLogReplayWork > --- > > Key: HBASE-12166 > URL: https://issues.apache.org/jira/browse/HBASE-12166 > Project: HBase > Issue Type: Bug > Components: test >Reporter: stack >Assignee: Jimmy Xiang > Fix For: 2.0.0, 0.99.1 > > Attachments: 12166.txt, hbase-12166.patch, log.txt