[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267790#comment-14267790 ]

Cosmin Lehene commented on HBASE-4497:
--------------------------------------

[~saint@gmail.com] is this still valid?

If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
---------------------------------------------------------------------------------------------------------------

Key: HBASE-4497
URL: https://issues.apache.org/jira/browse/HBASE-4497
Project: HBase
Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
Assignee: stack
Priority: Critical

This JIRA is created as per the discussion in the mail chain "HBCK reporting of possible mismatch in RS assignment".

Consider two region servers, RS1 and RS2. A region tries to open on RS1, but it takes a while. RS1 has still not updated META or transitioned the node from OPENING to OPENED, so the timeout assigns the region to RS2. RS2 successfully updates META and opens the region.
Now RS1 tries to act on the region by first updating META and then transitioning the node from OPENING to OPENED. The transition by RS1 will fail, but the META entry will have RS1 as the latest.
Now HBCK reports this as an inconsistency, and if we try to scan the region we get NotServingRegionException.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
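The interleaving in the issue description can be replayed with a small in-memory model. This is only a sketch under assumptions: the class `StaleMetaUpdateDemo`, its static fields, and `transitionToOpened` are hypothetical stand-ins for the znode and the META row, not HBase or ZooKeeper code.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleMetaUpdateDemo {
    // Toy stand-in for the region's znode: current owner plus state.
    static String znodeOwner;
    static String znodeState;
    // Toy stand-in for the META location column, keyed by region name.
    static final Map<String, String> meta = new HashMap<>();

    // Transition OPENING -> OPENED only if the caller still owns the znode.
    static boolean transitionToOpened(String server) {
        if (!server.equals(znodeOwner) || !"OPENING".equals(znodeState)) {
            return false;
        }
        znodeState = "OPENED";
        return true;
    }

    static String run() {
        String region = "R1";
        znodeOwner = "RS1"; znodeState = "OPENING";    // RS1 starts opening, then stalls
        znodeOwner = "RS2"; znodeState = "OPENING";    // timeout: region reassigned to RS2
        meta.put(region, "RS2");                        // RS2 updates META ...
        boolean rs2Opened = transitionToOpened("RS2");  // ... and moves the znode to OPENED
        meta.put(region, "RS1");                        // RS1 resumes: unconditional META put
        boolean rs1Opened = transitionToOpened("RS1");  // fails, RS1 no longer owns the znode
        return "meta=" + meta.get(region) + " servedBy=" + znodeOwner
                + " rs2Opened=" + rs2Opened + " rs1Opened=" + rs1Opened;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model META ends up naming RS1 while RS2 holds the region, which is the mismatch HBCK flags and the reason a scan routed to RS1 gets NotServingRegionException.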
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118539#comment-13118539 ]

stack commented on HBASE-4497:
------------------------------

So if it doesn't need to be monotonically increasing -- it'd be nice but not necessary, and a monotonically increasing id for a cluster is a bit of a pain to do -- how about we do this (the below comes out of Todd's input over in 4507 and from the back and forth above):

From here on out, every edit of meta will also update a new column, info:editid. This info:editid will hold a UUID generated by the client making the edit.

On open of a region, the open runs as it currently does, with the following additions:

+ Just after the regionserver has moved the znode to OPENING the first time, confirming it 'owns' the region, the RS reads the current info:editid value.
+ After opening the region, when we go to update the region's location in meta, the RS will do a checkAndPut where the check checks the info:editid value.

How's that?
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
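The info:editid proposal above can be sketched as a compare-and-set on an in-memory row. This is a minimal illustration under assumptions: `EditIdCheckAndPutSketch`, `MetaRow`, and this `checkAndPut` signature are hypothetical and only mimic the atomic check the proposal relies on; they are not the HBase client API.

```java
import java.util.UUID;

public class EditIdCheckAndPutSketch {
    // Toy stand-in for one region's row in meta: location plus info:editid.
    static final class MetaRow {
        String server;
        String editId;
    }

    // Atomically apply the edit only if info:editid still holds the value the caller read.
    static synchronized boolean checkAndPut(MetaRow row, String expectedEditId,
                                            String newServer, String newEditId) {
        if (!row.editId.equals(expectedEditId)) {
            return false; // the row was edited since the caller read it
        }
        row.server = newServer;
        row.editId = newEditId; // every edit of meta also writes a fresh editid
        return true;
    }

    static String run() {
        MetaRow row = new MetaRow();
        row.server = "RS0";
        row.editId = UUID.randomUUID().toString();

        String seenByRs1 = row.editId; // RS1 moves the znode to OPENING, reads editid, stalls
        String seenByRs2 = row.editId; // after the timeout RS2 does the same

        boolean rs2Ok = checkAndPut(row, seenByRs2, "RS2", UUID.randomUUID().toString());
        boolean rs1Ok = checkAndPut(row, seenByRs1, "RS1", UUID.randomUUID().toString());
        return "meta=" + row.server + " rs2Ok=" + rs2Ok + " rs1Ok=" + rs1Ok;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because RS2's successful edit replaced the editid, the late write from RS1 fails its check and META keeps pointing at RS2. The ids only need to be unique, not ordered, which is why a UUID is enough here.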
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118546#comment-13118546 ]

Ted Yu commented on HBASE-4497:
-------------------------------

+1 on the plan above.
This info:editid would be helpful in debugging as well.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118617#comment-13118617 ]

dhruba borthakur commented on HBASE-4497:
-----------------------------------------

Stack: the proposal looks solid.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117617#comment-13117617 ]

jirapos...@reviews.apache.org commented on HBASE-4497:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2118/
-----------------------------------------------------------

Review request for hbase.

Summary
-------

Adds a checkAndPut that takes a timestamp.

This addresses bug hbase-4497.
    https://issues.apache.org/jira/browse/hbase-4497

Diffs
-----

  src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 3679c02
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 0c06f4f
  src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java 99b34cc

Diff: https://reviews.apache.org/r/2118/diff

Testing
-------

Thanks,

Michael
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13117881#comment-13117881 ]

Ming Ma commented on HBASE-4497:
--------------------------------

1. Agree the checkAndPut solution is good enough. I am just trying to find holes here. :)
2. Does the RS need to have access to a global counter? If it is only for the region assignment scenario, agree there is no such need. I initially thought of it as a region operation id, where the RS will also get a new ID when the state changes, for example from OPENING to OPENED. We would use such a counter to track every region state change in the system.
3. Persistent vs. ephemeral. I thought there would be a way to provide a reliable ZK-based AtomicLong that can survive HBase and ZK restarts. That would give us a good picture of the event sequence in the system. Performance isn't that important, given that region state changes happen less frequently.
4. Unique vs. monotonically increasing. For this issue, a unique number seems to be fine. I thought it might be used in other contexts to track the event sequence. So monotonically increasing is better, given that the comparison of two values can indicate the order in the time dimension. It doesn't have to be sequential.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116232#comment-13116232 ]

Ming Ma commented on HBASE-4497:
--------------------------------

OK, Ram. Adding some more clarification.

1. It looks like ZKAssign.transitionNode already provides atomicity via the expected version feature in ZK. So we are good here.
2. A global AtomicInteger isn't necessary in this context. We can just use the expected version from ZK for a given znode, given that the expected version just needs to be unique on a given znode, not global.
3. With regard to the HBase .META. update, we can put the expected version as an ID into the .META. table and enforce that a new update's ID has to be greater than the previous version for a given region, via some new HBase API checkGreaterAndPut. This ID value is local to the region node, which should be OK; for a given region node, this value will increment all the time. Currently this expected version is passed via the RPC RegionOpeningState openRegion(HRegionInfo region, int versionOfOfflineNode).

Will that address the issue, Jonathan?

Jonathan Dhruba's suggestion is interesting. Could scale be an issue when HBase scales to the next level in terms of number of machines, number of regions and number of region movements? The .META. table will be distributed to different RSs; putting it on the Master could be a bottleneck. However, we might first run into other, more important issues at such large scale.
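A rough sketch of the checkGreaterAndPut idea from point 3, under loud assumptions: no such API exists in the thread beyond its name, so `CheckGreaterAndPutSketch`, `MetaRow`, and the method signature below are hypothetical, and the row is an in-memory object rather than a .META. row.

```java
public class CheckGreaterAndPutSketch {
    // Toy stand-in for one region's row in .META.: location plus the last accepted ID.
    static final class MetaRow {
        String server = "none";
        long id = 0;
    }

    // Accept the update only if its ID is strictly greater than the stored one.
    static synchronized boolean checkGreaterAndPut(MetaRow row, long newId, String newServer) {
        if (newId <= row.id) {
            return false; // stale assignment: an update with a newer ID already landed
        }
        row.id = newId;
        row.server = newServer;
        return true;
    }

    static String run() {
        MetaRow row = new MetaRow();
        // RS1 was handed ID 1 and stalled; RS2 was later handed ID 2.
        boolean rs2Ok = checkGreaterAndPut(row, 2, "RS2");
        boolean rs1Ok = checkGreaterAndPut(row, 1, "RS1"); // late write from RS1 is rejected
        return "meta=" + row.server + " id=" + row.id + " rs2Ok=" + rs2Ok + " rs1Ok=" + rs1Ok;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The check only works if the IDs handed out for a region never go backwards, so the choice of ID source matters more than the comparison itself.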
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116255#comment-13116255 ]

Ming Ma commented on HBASE-4497:
--------------------------------

Did some testing on ZK. My assumption that ZK's data version keeps incrementing after node deletion is incorrect. So perhaps we still need some global AtomicLong based on ZK.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116267#comment-13116267 ]

ramkrishna.s.vasudevan commented on HBASE-4497:
-----------------------------------------------

@Ming
The expected version of a znode will not keep increasing once the node gets deleted. For example, if the region gets balanced, a new znode will be created and the expected version will be 0 again.
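The version reset described here can be illustrated with a toy model of znode data versions. This is a sketch only: `ZnodeVersionResetSketch` and its methods are hypothetical in-memory stand-ins, not the ZooKeeper client; the model assumes a newly created node starts at version 0 and each successful set increments it.

```java
import java.util.HashMap;
import java.util.Map;

public class ZnodeVersionResetSketch {
    // Path -> data version, for nodes that currently exist.
    private final Map<String, Integer> versions = new HashMap<>();

    void create(String path) {
        if (versions.containsKey(path)) {
            throw new IllegalStateException("node exists: " + path);
        }
        versions.put(path, 0); // a new node always starts at version 0
    }

    // Conditional update: succeeds only if the caller's expected version matches.
    int setData(String path, int expectedVersion) {
        Integer current = versions.get(path);
        if (current == null || current != expectedVersion) {
            throw new IllegalStateException("bad version for " + path);
        }
        versions.put(path, current + 1);
        return current + 1;
    }

    void delete(String path) {
        versions.remove(path); // the version history is gone with the node
    }

    int version(String path) {
        return versions.get(path);
    }

    static String run() {
        ZnodeVersionResetSketch zk = new ZnodeVersionResetSketch();
        String path = "/unassigned/R1";
        zk.create(path);
        zk.setData(path, 0);
        zk.setData(path, 1);
        int before = zk.version(path); // version after two transitions
        zk.delete(path);               // region moves, node is removed
        zk.create(path);               // next assignment recreates the node
        int after = zk.version(path);
        return "before=" + before + " after=" + after;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Since the recreated node starts from 0 again, the znode version cannot serve as an ever-increasing ID for a region across assignments.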
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116272#comment-13116272 ]

ramkrishna.s.vasudevan commented on HBASE-4497:
-----------------------------------------------

@Ming
I did not see your latest comment, as I had not refreshed. :)
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116498#comment-13116498 ]

Ted Yu commented on HBASE-4497:
-------------------------------

Ming's idea @ 28/Sep/11 04:56, especially point 3, is interesting. I like that as a long-term solution.
We need to be careful writing migration code to accommodate the new operation ID.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116524#comment-13116524 ]

ramkrishna.s.vasudevan commented on HBASE-4497:
-----------------------------------------------

As Ming suggested, we can generate an incrementing integer per region on the master side and pass that value over RPC, where it can be checked before updating META.
This value can be maintained on the master side in a map with the region as the key.
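A minimal sketch of such a master-side map, assuming region names as plain strings. The class `RegionAssignmentIdSketch` and its `nextId` method are hypothetical names for illustration, not existing HBase master code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class RegionAssignmentIdSketch {
    // One counter per region, keyed by region name.
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Returns the next ID for this region; IDs for one region only ever increase.
    long nextId(String regionName) {
        return counters.computeIfAbsent(regionName, k -> new AtomicLong()).incrementAndGet();
    }

    static String run() {
        RegionAssignmentIdSketch master = new RegionAssignmentIdSketch();
        long first = master.nextId("R1");  // first assignment of R1
        long second = master.nextId("R1"); // reassignment of R1 after a timeout
        long other = master.nextId("R2");  // R2 has its own independent sequence
        return first + "," + second + "," + other;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

A map held only in master memory starts over when the master restarts, so this sketch leaves open how the counters would be persisted or recovered.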
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116525#comment-13116525 ]

Jonathan Gray commented on HBASE-4497:
--------------------------------------

I don't think we can use the same ID as the ZK node. But we could just use some incrementing number.

An alternative would be to instead allow the roll-back of the META edit using a checkAndDelete, which might be simpler but less optimal.
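The roll-back alternative might look roughly like the following, where the conditional delete is modelled with `ConcurrentHashMap.remove(key, value)`. The class `MetaRollbackSketch` and the in-memory map are assumptions for illustration; this is not how META is actually stored or edited.

```java
import java.util.concurrent.ConcurrentHashMap;

public class MetaRollbackSketch {
    // Toy stand-in for the META location column, keyed by region name.
    static final ConcurrentHashMap<String, String> meta = new ConcurrentHashMap<>();

    // Delete the location only if it still holds the value this server wrote.
    static boolean checkAndDelete(String region, String expectedServer) {
        return meta.remove(region, expectedServer);
    }

    static String run() {
        String region = "R1";
        meta.put(region, "RS2");                 // RS2 opened the region
        meta.put(region, "RS1");                 // RS1's late, unconditional edit
        // RS1's znode transition failed, so it rolls its own edit back.
        boolean rolledBack = checkAndDelete(region, "RS1");
        String afterRollback = meta.get(region); // null: no location recorded any more

        meta.put(region, "RS2");                 // location written again for RS2
        boolean skipped = checkAndDelete(region, "RS1"); // value is not RS1's, nothing deleted
        return "rolledBack=" + rolledBack + " afterRollback=" + afterRollback
                + " skipped=" + skipped + " final=" + meta.get(region);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the roll-back removes the stale entry but does not bring back the location RS2 had written, so something would still have to rewrite it.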
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116695#comment-13116695 ]

Jonathan Gray commented on HBASE-4497:
--------------------------------------

startcode and timestamp is what i initially thought of. seems like there could be some weird situations. for example, what is to say that the server already in META didn't somehow become the new assignment destination?

or what if... M tells RS1 to OPEN R1 and to expect RS3:StartCode3 in META. RS1 locks up right before doing the META edit, M tells RS2 to OPEN R1 and to also expect RS3:StartCode3 in META. I guess this is the atomicity we need, so that should be okay.

one neat idea would be to introduce this region assignment incrementing ID into META. it would provide a nice way to debug the movement of a region across the cluster over time and could also provide the necessary info to use CheckAndPut.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116721#comment-13116721 ]

stack commented on HBASE-4497:
------------------------------

bq. startcode and timestamp is what i initially thought of. seems like there could be some weird situations. for example, what is to say that the server already in META didn't somehow become the new assignment destination?

The timestamp will be different in this case? (It'll have been updated by the new open.)

bq. or what if... M tells RS1 to OPEN R1 and to expect RS3:StartCode3

I'm not suggesting the master tell the RS anything new. I'm suggesting that on receiving the open, the RS itself reads .META. at the start of the open transaction, before it does anything else, and uses this read as input for the later checkAndSet write.

bq. one neat idea would be to introduce this region assignment incrementing ID into META. it would provide a nice way to debug the movement of a region across the cluster over time and could also provide the necessary info to use CheckAndPut.

This could work. The downsides are that M has to write meta first before doing the assign, which will be a bit of a new burden on meta (doubled write load?), and this new write is now inline with an assign; we'd have to do some hackery in here around bulk assign.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116727#comment-13116727 ]

stack commented on HBASE-4497:
------------------------------

I just checked the checkAndPut. It doesn't expose the timestamp. So: either fix checkAndPut so it exposes the timestamp, or write a timestamp or uuid to meta in a new column info:editid whenever we do the metadata open update. (I'd prefer adding a checkAndPut override -- it seems like a hole in checkAndPut that we don't allow version checking.)
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116729#comment-13116729 ] Jonathan Gray commented on HBASE-4497:

Sounds like it could work. I'm +1 on exposing version to checkAndPut and using it for META edits. Good point, we can just do the read on the RS first.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116740#comment-13116740 ] Ted Yu commented on HBASE-4497:

HBASE-4507 has been opened.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116981#comment-13116981 ] Ming Ma commented on HBASE-4497:

Using startcode and timestamp is a good idea. However, I want to confirm whether there is a case where it won't work. Given there is no such thing as a global clock, the timestamp value generated by the RS that hosts the .META. region at that moment might not be unique if the .META. region is moved to another RS. So there is a possibility of

startcode and timestamp is what i initially thought of. seems like there could be some weird situations. for example, what is to say that the server already in META didn't somehow become the new assignment destination?

Here is how:

1. For a given region, the .META. table has RS1 as the RS serverName and T1 as the timestamp value: {RS1, T1}.
2. .META. is moved to another RS whose clock is behind that of the original RS that wrote {RS1, T1}.
3. RS2 starts openRegion first; it has an older ZK node version to check. RS1 starts openRegion later; it has an up-to-date ZK node version.
4. Both RS2 and RS1 are about to do checkAndPut on the .META. table. Both will use {RS1, T1} as the condition for checkAndPut.
5. RS1 updates it first and succeeds. There is a chance that after the update the value is still {RS1, T1}, given T1 is generated by an RS whose clock is behind.
6. RS2 updates it next and also succeeds, given {RS1, T1} hasn't changed even though RS1 made an update earlier.
7. RS1 has the up-to-date ZK node version, so it will continue and succeed with the rest of the open operation. The region is considered OPENED from the AM's point of view.
8. RS2 has the older ZK node version, so it will fail later when it tries to update the ZK node. The region won't be opened on RS2.
9. In the .META. table, the region is on RS2.

Adding support for a version check in checkAndPut should address such a scenario.

Regarding the region assignment ID approach:

1. I didn't imply it will only be incremented by the Master. I suggested a ZK-based AtomicLong that the Master and all RSs can get hold of. So this could be considered a global clock.
2. Such an ID could also help to track all the region transition events, HBASE-4354.
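Steps 4 to 6 of the scenario above can be reproduced with a small simulation. This is only a sketch under stated assumptions: `ValueCheckedCell` is a hypothetical in-memory cell whose check compares the stored {serverName, timestamp} pair, which is the condition the scenario describes, and the skewed clock is modelled by simply passing the same timestamp again.

```java
/** Hypothetical cell whose check compares the stored (server, timestamp) pair. */
final class ValueCheckedCell {
    private String server;
    private long timestamp;

    ValueCheckedCell(String server, long timestamp) {
        this.server = server;
        this.timestamp = timestamp;
    }

    /** Succeeds only if the stored pair equals the expected pair. */
    synchronized boolean checkAndPut(String expectedServer, long expectedTimestamp,
                                     String newServer, long newTimestamp) {
        if (!server.equals(expectedServer) || timestamp != expectedTimestamp) {
            return false;
        }
        server = newServer;
        timestamp = newTimestamp;
        return true;
    }

    synchronized String server() {
        return server;
    }
}
```

If RS1 writes {RS1, T1} over {RS1, T1}, the pair is unchanged, so a later checkAndPut from RS2 with the same condition also passes and META ends up pointing at RS2. The check cannot tell "nobody wrote" apart from "somebody wrote the same thing".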
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117007#comment-13117007 ] stack commented on HBASE-4497:

Good stuff Ming. Looking at your pathological case, I think it is possible. I could add to the checkAndPut that takes a version a check that we never write back the same version: if the edit would go in with a timestamp that is exactly the one we are checking, add a millisecond (especially if the value we write back is the same again). I think we should do this even though the probability of the scenario you postulate is extremely low.

Why would RSs need access to a global counter? The Master assigns. It'd need to keep its running counter in zk in case it crashed, but I'd think only the assigner would need to use it. (Here are some notes on counters in zk from the zk mailing list: http://www.mail-archive.com/zookeeper-user@hadoop.apache.org/msg01968.html)

Would this counter be other than ephemeral data? The design dictum up to now has been that zk is for ephemeral data only. Would keeping a counter change that?

Does the 'region assignment id' need to monotonically increase? Can it just be unique (uuid?)?

Good stuff Ming.
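The "add a millisecond" rule can be sketched as a one-line helper. `MonotonicStamp` is a hypothetical name, and the sketch generalizes slightly by also bumping when the local clock is behind the checked timestamp, which is an assumption beyond what the comment states.

```java
/** Hypothetical helper; picks a write timestamp that never repeats the checked one. */
final class MonotonicStamp {
    private MonotonicStamp() {}

    /**
     * Returns the local clock reading unless it is not strictly newer than
     * the timestamp being replaced, in which case it returns that timestamp
     * plus one millisecond.
     */
    static long next(long previousTimestamp, long clockMillis) {
        return Math.max(clockMillis, previousTimestamp + 1);
    }
}
```

A caller would pass the timestamp it read from meta as `previousTimestamp` and, for example, `System.currentTimeMillis()` as `clockMillis`.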
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115788#comment-13115788 ] stack commented on HBASE-4497:

Do we need to add an extra tickle of the OPENING znode after opening the region and before we do the meta update?
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115795#comment-13115795 ] Jean-Daniel Cryans commented on HBASE-4497:

Stack, it seems that it's already the case:

{code}
if (tickleOpening(post_region_open)) {
  if (updateMeta(region)) failed = false;
}
{code}

In any case, there's still a hole as those two operations aren't done in an atomic fashion.
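The hole between the tickle and the meta update can be shown with a deterministic, single-threaded model. This is a sketch only: `OpenRace` and its methods are hypothetical stand-ins for the znode and the META row, and they mirror the snippet above in name but not in implementation.

```java
/** Hypothetical single-region model of the race; not HBase code. */
final class OpenRace {
    private String znodeOwner = "RS1"; // server the master currently expects to open the region
    private int znodeVersion = 0;      // bumped on every znode write
    private String metaLocation = null;

    /** Re-asserts OPENING; fails if the znode changed since expectedVersion was read. */
    boolean tickleOpening(String rs, int expectedVersion) {
        if (!rs.equals(znodeOwner) || znodeVersion != expectedVersion) {
            return false;
        }
        znodeVersion++;
        return true;
    }

    /** Timeout monitor hands the region to another server. */
    void masterReassign(String newRs) {
        znodeOwner = newRs;
        znodeVersion++;
    }

    /** Unconditional Put: nothing ties this write to the znode state. */
    void updateMeta(String rs) {
        metaLocation = rs;
    }

    String metaLocation() {
        return metaLocation;
    }

    int znodeVersion() {
        return znodeVersion;
    }
}
```

If the master reassigns between RS1's successful tickle and RS1's `updateMeta`, the Put still lands and the row points at RS1, even though RS1's later znode transition is rejected.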
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115828#comment-13115828 ] stack commented on HBASE-4497:

Thanks J-D. That's what I was too lazy to looksee for myself. Looks like we are doing enough tickling. Weird that the timeout monitor can cut in and the region can be assigned elsewhere AND successfully update meta before this comes back. Here is from Ram's email up on the list earlier, with log snippets:

{code}
RS1
===
2011-09-23 22:34:34,000 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: addToOnlineRegions is doneREGION = {NAME = 't5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.', TableName = 't5', STARTKEY = '', ENDKEY = '', ENCODED = 2d06b3ca4d398ec96920ae86441a68c9,}
2011-09-23 22:34:34,009 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Updated row t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9. in region .META.,,1 with serverName=linux76,60020,1316796517682
2011-09-23 22:34:34,009 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Done with post open deploy taks for region=t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9., daughter=false
2011-09-23 22:34:34,009 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff0037 Attempting to transition node 2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-09-23 22:34:34,038 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Completed the OPEN of region t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9. but when transitioning from OPENING to OPENED got a version mismatch, someone else clashed so now unassigning -- closing region
2011-09-23 22:34:34,038 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.: disabling compactions flushes
2011-09-23 22:34:34,038 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.
2011-09-23 22:34:34,038 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed f5
2011-09-23 22:34:34,038 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.

RS2
===
2011-09-23 22:33:56,546 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff0039 Successfully transitioned node 2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
2011-09-23 22:33:56,845 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for region=t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9., daughter=false
2011-09-23 22:33:56,845 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: addToOnlineRegions is doneREGION = {NAME = 't5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.', TableName = 't5', STARTKEY = '', ENDKEY = '', ENCODED = 2d06b3ca4d398ec96920ae86441a68c9,}
2011-09-23 22:33:56,856 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Updated row t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9. in region .META.,,1 with serverName=linux146,60020,1316796499216
2011-09-23 22:33:56,856 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Done with post open deploy taks for region=t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9., daughter=false
2011-09-23 22:33:58,887 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff0039 Attempting to transition node 2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-09-23 22:33:58,893 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1328ceaa1ff0039 Successfully transitioned node 2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
2011-09-23 22:33:58,893 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.
{code}
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116032#comment-13116032 ] Jonathan Gray commented on HBASE-4497:

I was just discussing this scenario with Dhruba a few days back. There's definitely a race condition here and I don't see a trivial fix.

We use HLog IO-fencing to ensure that edits don't slip into an HLog after a server is considered dead by the Master. But the Master has no way to prevent this META update from slipping in.

We need to make some modification to how the master can safely time out an OPENING. One possibility is for the master to require either an acknowledgment from the RS before moving the region elsewhere, or for the RS to die. It seems unlikely that we will actually see the RS-to-Master acknowledgment, since OPENING taking too long is usually a sign of brokenness or of the RS being backed up, I think. But in any case I'd imagine some kind of OPEN_CANCEL_REQUESTED state that the Master transitions the node to, and only when the RS transitions to OPEN_CANCELED or OFFLINE or something is it safe to reassign elsewhere. I think this design still has a hole in it though, because there are scenarios where the RS doesn't actually die but for some reason doesn't OPEN or ack the cancel.

Another option would be to do the RS-performed META edits using a CheckAndPut rather than a straight Put. Or we could move META editing back to the Master, where it's easy to do things atomically :)

The CheckAndPut idea is kind of neat, but we'd probably have to send more data on the OPEN RPC. For example, the existing server start code, or server name + start code, or something guaranteed unique (guaranteed that a conflicting RS opening stuff wouldn't be able to use the same thing). Then the atomicity is on the META region.
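The cancel handshake described above can be sketched as a small state machine. The state names OPEN_CANCEL_REQUESTED, OPEN_CANCELED and OFFLINE come from the comment; the enum itself, the allowed transitions, and the rule that an RS may no longer move to OPENED once a cancel is requested are assumptions made for illustration.

```java
/** Hypothetical znode states for an open with a cancel handshake. */
enum OpenState {
    OPENING, OPENED, OPEN_CANCEL_REQUESTED, OPEN_CANCELED, OFFLINE;

    /** Transitions allowed in this sketch; everything else is rejected. */
    boolean canTransitionTo(OpenState next) {
        switch (this) {
            case OPENING:
                // The RS finishes the open, or the master asks it to stop.
                return next == OPENED || next == OPEN_CANCEL_REQUESTED;
            case OPEN_CANCEL_REQUESTED:
                // Only the RS acknowledging (or going offline) moves things on.
                return next == OPEN_CANCELED || next == OFFLINE;
            default:
                return false;
        }
    }

    /** The master may hand the region to another server only from these states. */
    boolean safeToReassign() {
        return this == OPEN_CANCELED || this == OFFLINE;
    }
}
```

The remaining hole is visible in the model: nothing forces a live but stuck RS to leave OPEN_CANCEL_REQUESTED, so the master could wait indefinitely.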
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116134#comment-13116134 ] ramkrishna.s.vasudevan commented on HBASE-4497:

I got this problem for 3 regions before HBASE-4452 went in. Now HBASE-4452 will definitely reduce the probability of this happening.
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116139#comment-13116139 ] dhruba borthakur commented on HBASE-4497:

Can somebody please elaborate on any disadvantages if we make the master the only entity that can update META about where the region is being served from?
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116151#comment-13116151 ] Ming Ma commented on HBASE-4497:

checkAndPut might work. We will use checkAndPut on both ZK and HBase. There are other bugs due to the lack of strong synchronization on the ZK nodes between the AssignmentManager and the RSs.

Here is another scenario, a race between the AM timeoutMonitor and the first RS's openRegion operation. RS1 successfully transitions to the OPENED state around the same time as the timeoutMonitor kicks in. The timeoutMonitor gets data from ZK right before RS1 sets it to OPENED, so the timeoutMonitor sees RS_ZK_REGION_OPENING and tries to reassign the region. In that case, we will end up with the same region on two RSs.

Will the following work?

1. ZKAssign.transitionNode has some sort of checkAndPut semantics when it tries to enforce that the original state is the correct one. However, it isn't atomic. It first does a getData from ZK and then compares. Instead, we can use ZK's checkAndPut API to enforce the atomicity.
2. Introduce a ZK-based global AtomicInteger for region operations; e.g., each openRegion operation will use a new incremental region_operation_ID. Each openRegion operation will validate its own ID against the ZK state via checkAndPut. Thus one of the two openRegion operations on the RSs won't work.
3. With regard to the HBase .META. update, we can put the region_operation_ID into the table and enforce that a new update's region operation ID has to be greater than the previous version for a given region. In that way the older RS won't be able to update the table properly. We will need to introduce a new API for HBase, similar to checkAndPut, more like checkGreaterandPut.
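Points 2 and 3 above can be sketched together with in-memory stand-ins. `AssignmentTable` is a hypothetical class: its `AtomicLong` plays the part of the proposed ZK-based counter, and `checkGreaterAndPut` is an illustration of the proposed API, not an existing HBase call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-memory model of META rows guarded by a region operation ID. */
final class AssignmentTable {
    /** Immutable (server, operationId) pair stored per region. */
    static final class Location {
        final String server;
        final long operationId;

        Location(String server, long operationId) {
            this.server = server;
            this.operationId = operationId;
        }
    }

    // Stand-in for the proposed ZK-based global counter.
    private final AtomicLong nextOperationId = new AtomicLong();
    private final ConcurrentHashMap<String, Location> rows = new ConcurrentHashMap<>();

    /** Hands out a new, strictly increasing region operation ID. */
    long newOperationId() {
        return nextOperationId.incrementAndGet();
    }

    /**
     * Stores the location only if the row is empty or the given operation ID
     * is greater than the stored one. Returns true if the write was applied.
     */
    boolean checkGreaterAndPut(String region, String server, long operationId) {
        Location candidate = new Location(server, operationId);
        Location result = rows.merge(region, candidate,
                (old, cand) -> cand.operationId > old.operationId ? cand : old);
        return result == candidate;
    }

    /** Returns the server recorded for the region, or null if none. */
    String serverFor(String region) {
        Location location = rows.get(region);
        return location == null ? null : location.server;
    }
}
```

An RS holding an older operation ID is rejected even if it writes last, which is the property the comment is after.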
[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE
[ https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116160#comment-13116160 ] ramkrishna.s.vasudevan commented on HBASE-4497:

I am not very familiar with ZK. Your 3rd point looks good to me, Ming. I think HBASE-4015 may handle the race you described above.

{code}
if (hijack && null != curDataInZNode) {
  EventType eventType = curDataInZNode.getEventType();
  if (eventType.equals(EventType.RS_ZK_REGION_CLOSING)
      || eventType.equals(EventType.RS_ZK_REGION_CLOSED)
      || eventType.equals(EventType.RS_ZK_REGION_OPENED)) {
    return -1;
  }
}
{code}

Also, if the timeout succeeds, the transition from OPENING to OPENED will fail in the RS. There may still be an entry in META with the old RS.