[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
[ https://issues.apache.org/jira/browse/HBASE-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jimmy Xiang updated HBASE-9480:
---
Resolution: Fixed
Fix Version/s: 0.98.0
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

Integrated into 0.96 and trunk. Thanks.

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Fix For: 0.98.0, 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, trunk-9480_v1.2.patch, trunk-9480_v2.patch

Came across this issue (HBASE-9338 test):
1. Client issues a request to move a region from ServerA to ServerB.
2. ServerA is compacting that region and does not close the region immediately. In fact, it takes a while to complete the request.
3. The master, in the meantime, sends another close request.
4. ServerA sends it a NotServingRegionException.
5. Master handles the exception, deletes the znode, and invokes regionOffline for the said region.
6. ServerA fails to operate on ZK in the CloseRegionHandler since the node is deleted.

The region is permanently offline.

There are potentially other situations where, when a RegionServer is offline and the client asks to move a region off that server, the master makes the region offline.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
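The six reported steps can be replayed in a small toy model. This is only an illustrative sketch in Python, not HBase code: ToyCluster, its methods, and the state strings are hypothetical names invented here, and the real master, region server, and ZooKeeper interactions are far more involved. It assumes a single region and treats the znode as a plain set entry.

```python
from dataclasses import dataclass, field


class NotServingRegionException(Exception):
    """Toy stand-in for the exception named in the report."""


@dataclass
class ToyCluster:
    """Hypothetical in-memory model of the master view, ZK, and ServerA."""
    znodes: set = field(default_factory=set)          # transition znodes in "ZK"
    master_state: dict = field(default_factory=dict)  # master's view per region
    closing: set = field(default_factory=set)         # regions ServerA is still closing

    def master_request_close(self, region):
        # Steps 1-2: master creates the znode and asks ServerA to close;
        # the close is slow (compaction), so it only starts here.
        self.znodes.add(region)
        self.master_state[region] = "PENDING_CLOSE"
        self.server_close(region)

    def server_close(self, region):
        if region in self.closing:
            # Step 4: a second close arrives while the first is in progress.
            raise NotServingRegionException(region)
        self.closing.add(region)

    def master_second_close(self, region):
        # Step 3: master sends another close request.
        try:
            self.server_close(region)
        except NotServingRegionException:
            # Step 5: master deletes the znode and marks the region offline.
            self.znodes.discard(region)
            self.master_state[region] = "OFFLINE"

    def server_finish_close(self, region):
        # Step 6: the close handler needs the znode to report the close.
        self.closing.discard(region)
        if region not in self.znodes:
            return False  # znode is gone; the master never hears of the close
        self.master_state[region] = "CLOSED"
        return True


def replay_reported_race():
    """Run steps 1-6 and return (close_reported, master_view_of_region)."""
    cluster = ToyCluster()
    cluster.master_request_close("r1")
    cluster.master_second_close("r1")
    reported = cluster.server_finish_close("r1")
    return reported, cluster.master_state["r1"]
```

In this model the region ends up marked offline on the master and the close is never reported, which mirrors the "permanently offline" outcome in the description.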
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
stack updated HBASE-9480:
---
Priority: Major (was: Blocker)

After talking w/ [~jxiang], this is not a blocker; the state comes about because of outside intervention, not normal operation. (Yes, we need to make it so that outside intervention cannot put HBase into an odd state, but we also need to allow administrators to do fixup. TODO: make it so only a privileged user can do status-threatening ops.)

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, trunk-9480_v1.2.patch, trunk-9480_v2.patch
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Jimmy Xiang updated HBASE-9480:
---
Status: Open (was: Patch Available)

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Priority: Blocker
Fix For: 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, trunk-9480_v1.2.patch
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Jimmy Xiang updated HBASE-9480:
---
Attachment: trunk-9480_v2.patch

Attached v2; the test should no longer be flaky.

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Priority: Blocker
Fix For: 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, trunk-9480_v1.2.patch, trunk-9480_v2.patch
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Jimmy Xiang updated HBASE-9480:
---
Status: Patch Available (was: Open)

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Priority: Blocker
Fix For: 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, trunk-9480_v1.2.patch, trunk-9480_v2.patch
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Jimmy Xiang updated HBASE-9480:
---
Attachment: trunk-9480.patch

Attached a patch to make sure the znode is not deleted in this case. The RS throws a region-already-in-transition exception. The master moves the region to the failed_to_close state once its retries run out. The region can still be reassigned once it is eventually closed. Trying to add a test case now.

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Priority: Blocker
Fix For: 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch
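The behaviour described in this comment can be sketched as a toy model. This is a hedged illustration in Python, not the attached patch: ToyRegionServer, master_close, the max_attempts parameter, and the state strings are hypothetical names chosen here, and only the failed_to_close state name comes from the comment. The sketch assumes the master simply retries a fixed number of times and never touches the znode when the server reports that the region is already in transition.

```python
class RegionAlreadyInTransitionException(Exception):
    """Toy stand-in for the exception the RS is said to throw."""


class ToyRegionServer:
    """Hypothetical server that can have one slow close in progress per region."""

    def __init__(self):
        self.closing = set()

    def close(self, region):
        if region in self.closing:
            # A close is already running; say so instead of "not serving".
            raise RegionAlreadyInTransitionException(region)
        self.closing.add(region)

    def finish_close(self, region, znodes):
        # The close can only be reported if the znode still exists.
        self.closing.discard(region)
        return region in znodes


def master_close(server, region, znodes, states, max_attempts=3):
    """Send close requests; keep the znode when the region is in transition."""
    znodes.add(region)
    states[region] = "PENDING_CLOSE"
    for _ in range(max_attempts):
        try:
            server.close(region)
            return True  # the server accepted the close request
        except RegionAlreadyInTransitionException:
            continue  # znode stays in place; try again
    states[region] = "failed_to_close"  # retries exhausted, znode still present
    return False


def replay_with_fix():
    """Return (first_accepted, second_accepted, state_after_retries, final_state)."""
    server = ToyRegionServer()
    znodes, states = set(), {}
    first = master_close(server, "r1", znodes, states)   # accepted; close is slow
    second = master_close(server, "r1", znodes, states)  # rejected on every attempt
    state_after_retries = states["r1"]
    if server.finish_close("r1", znodes):
        states["r1"] = "CLOSED"  # close reported; region could now be reassigned
    return first, second, state_after_retries, states["r1"]
```

Because the znode survives the failed retries in this model, the eventual close can still be reported and the region is not stranded.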
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
stack updated HBASE-9480:
---
Priority: Blocker (was: Critical)

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Priority: Blocker
Fix For: 0.96.0
Attachments: 9480-1.txt
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Jimmy Xiang updated HBASE-9480:
---
Attachment: trunk-9480_v1.1.patch

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Priority: Blocker
Fix For: 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Jimmy Xiang updated HBASE-9480:
---
Attachment: trunk-9480_v1.2.patch

[~jeffreyz], how about version 1.2? If you still see a problem, can you show me how to reproduce it so that I can add a test case to cover it?

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Assignee: Jimmy Xiang
Priority: Blocker
Fix For: 0.96.0
Attachments: 9480-1.txt, trunk-9480.patch, trunk-9480_v1.1.patch, trunk-9480_v1.2.patch
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Devaraj Das updated HBASE-9480:
---
Attachment: 9480-1.txt

This is the patch that I have been working with.

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Priority: Critical
Attachments: 9480-1.txt
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Devaraj Das updated HBASE-9480:
---
Fix Version/s: 0.96.0

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Priority: Critical
Fix For: 0.96.0
Attachments: 9480-1.txt
[jira] [Updated] (HBASE-9480) Regions are unexpectedly made offline in certain failure conditions
Ted Yu updated HBASE-9480:
---
Status: Patch Available (was: Open)

Regions are unexpectedly made offline in certain failure conditions
---
Key: HBASE-9480
URL: https://issues.apache.org/jira/browse/HBASE-9480
Project: HBase
Issue Type: Bug
Reporter: Devaraj Das
Priority: Critical
Fix For: 0.96.0
Attachments: 9480-1.txt