[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2015-01-07 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267790#comment-14267790
 ] 

Cosmin Lehene commented on HBASE-4497:
--

[~saint@gmail.com] is this still valid? 

 If region opening fails after updating META HBCK reports it as inconsistent 
 and scanning the region throws NSRE
 ---

 Key: HBASE-4497
 URL: https://issues.apache.org/jira/browse/HBASE-4497
 Project: HBase
  Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
Assignee: stack
Priority: Critical

 This JIRA was created following the mail thread "HBCK reporting of possible 
 mismatch in RS assignment".
 Consider two region servers, RS1 and RS2.
 A region starts opening on RS1, but the open takes a while: RS1 has not yet 
 updated META or transitioned the znode from OPENING to OPENED.
 On timeout the region is reassigned to RS2.  RS2 successfully updates META and 
 opens the region.
 Now RS1 acts on the region, first updating META and then trying to transition 
 the znode from OPENING to OPENED.
 RS1's OPENING-to-OPENED transition fails, but the META entry now names RS1 as 
 the latest location.
 HBCK reports this as an inconsistency, and scanning the region throws 
 NotServingRegionException.
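
The sequence above can be replayed with a toy, single-threaded model. This is only a sketch: the class StaleMetaRace, its Znode holder and the method names are hypothetical stand-ins, not HBase or ZooKeeper APIs. It shows how an unconditional META write from the stalled server leaves META pointing at RS1 while RS2 actually serves the region.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy replay of the race; every name here is hypothetical, not an HBase class. */
final class StaleMetaRace {
    /** Stand-in for the region-in-transition znode: current owner and state. */
    static final class Znode {
        final String owner;
        String state;
        Znode(String owner, String state) { this.owner = owner; this.state = state; }
    }

    final Map<String, String> meta = new HashMap<>(); // region -> server recorded in META
    Znode znode;
    String actuallyServing; // server that really has the region open

    /** OPENING -> OPENED succeeds only for the server that currently owns the znode. */
    boolean transitionToOpened(String server) {
        if (znode == null || !znode.owner.equals(server) || !znode.state.equals("OPENING")) {
            return false;
        }
        znode.state = "OPENED";
        return true;
    }

    /** Unconditional META update, as in the flow described in the issue. */
    void blindPutMeta(String region, String server) {
        meta.put(region, server);
    }

    static StaleMetaRace replay() {
        StaleMetaRace c = new StaleMetaRace();
        c.znode = new Znode("RS1", "OPENING"); // RS1 starts opening, then stalls
        c.znode = new Znode("RS2", "OPENING"); // timeout: region reassigned to RS2
        c.blindPutMeta("R1", "RS2");           // RS2 updates META ...
        if (c.transitionToOpened("RS2")) {
            c.actuallyServing = "RS2";         // ... and opens the region
        }
        c.blindPutMeta("R1", "RS1");           // RS1 wakes up and overwrites META
        if (c.transitionToOpened("RS1")) {     // fails: RS1 no longer owns the znode
            c.actuallyServing = "RS1";
        }
        return c;
    }
}
```

After replay(), the model's META names RS1 while RS2 is the server holding the region, which is the mismatch HBCK flags.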



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-30 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118539#comment-13118539
 ] 

stack commented on HBASE-4497:
--

So if it doesn't need to be monotonically increasing (it'd be nice but not 
necessary, and a monotonically increasing id for a cluster is a bit of a pain 
to do), how about we do the following? It comes from Todd's input over in 4507 
and from the back and forth above:

From here on out, every edit of meta will also update a new column, 
info:editid.  This info:editid will hold a UUID generated by the client making 
the edit.

On open of a region, the open runs as it currently does with the following 
additions:

+ Just after the regionserver has moved the znode to OPENING the first time, 
confirming it 'owns' the region, the RS reads the current info:editid value.
+ After opening the region, when we go to update the region's location in meta, 
the RS will do a checkAndPut where the check compares against the info:editid 
value it read earlier.

How's that?
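
A minimal in-memory sketch of the info:editid idea follows. It is not the HBase client API: EditIdMeta, Row and the method names are hypothetical, and the table is modelled as a synchronized map. Each edit stamps a fresh UUID, and checkAndPut applies only if the edit id is still the one the caller read earlier.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;

/** In-memory stand-in for the meta table; hypothetical, not the HBase API. */
final class EditIdMeta {
    /** One meta row: the region's server plus the id of the last edit. */
    static final class Row {
        final String server;
        final String editId;
        Row(String server, String editId) { this.server = server; this.editId = editId; }
    }

    private final Map<String, Row> rows = new HashMap<>();

    /** Unconditional edit; every edit stamps a fresh client-generated UUID. */
    synchronized void put(String region, String server) {
        rows.put(region, new Row(server, UUID.randomUUID().toString()));
    }

    /** Returns the current edit id for the region, or null if there is no row. */
    synchronized String readEditId(String region) {
        Row r = rows.get(region);
        return r == null ? null : r.editId;
    }

    /** Returns the server currently recorded for the region, or null. */
    synchronized String readServer(String region) {
        Row r = rows.get(region);
        return r == null ? null : r.server;
    }

    /** Applies the edit only if the stored edit id still equals expectedEditId. */
    synchronized boolean checkAndPut(String region, String expectedEditId, String server) {
        Row current = rows.get(region);
        String currentId = current == null ? null : current.editId;
        if (!Objects.equals(currentId, expectedEditId)) {
            return false; // someone else edited the row since we read it
        }
        rows.put(region, new Row(server, UUID.randomUUID().toString()));
        return true;
    }
}
```

In this model, if RS1 and RS2 both read the same edit id and RS2 writes first, RS1's later checkAndPut returns false and meta keeps RS2.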


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-30 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118546#comment-13118546
 ] 

Ted Yu commented on HBASE-4497:
---

+1 on the plan above.
This info:editid would be helpful in debugging as well.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-30 Thread dhruba borthakur (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118617#comment-13118617
 ] 

dhruba borthakur commented on HBASE-4497:
-

Stack: the proposal looks solid.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-29 Thread jirapos...@reviews.apache.org (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117617#comment-13117617
 ] 

jirapos...@reviews.apache.org commented on HBASE-4497:
--


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2118/
---

Review request for hbase.


Summary
---

Adds a checkAndPut that takes a timestamp


This addresses bug hbase-4497.
https://issues.apache.org/jira/browse/hbase-4497


Diffs
-

  src/main/java/org/apache/hadoop/hbase/ipc/HRegionInterface.java 3679c02 
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java 7cbdb98 
  src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 0c06f4f 
  src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java 99b34cc 

Diff: https://reviews.apache.org/r/2118/diff


Testing
---


Thanks,

Michael







[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-29 Thread Ming Ma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117881#comment-13117881
 ] 

Ming Ma commented on HBASE-4497:


1. Agree the checkAndPut solution is good enough. I am just trying to find 
holes here. :)
2. Does the RS need access to a global counter? If it is only for the region 
assignment scenario, agree there is no such need. I initially thought of it as 
a region operation ID, where the RS would also get a new ID when state changes, 
for example from OPENING to OPENED. We would use such a counter to track every 
region state change in the system.
3. Persistent vs. ephemeral. I thought there would be a way to provide a 
reliable ZK-based AtomicLong that can survive HBase and ZK restarts. That would 
give us a good picture of the event sequence in the system. Performance isn't 
that important given that region state changes happen infrequently.
4. Unique vs. monotonically increasing. For this issue, a unique number seems 
to be fine. I thought it might be used in other contexts to track event 
sequence, so monotonically increasing is better, given that comparing two 
values indicates their order in time. It doesn't have to be sequential.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Ming Ma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116232#comment-13116232
 ] 

Ming Ma commented on HBASE-4497:


OK, Ram.

Some more clarification:

1. It looks like ZKAssign.transitionNode already provides atomicity via the 
expected-version feature in ZK. So we are good here.
2. A global AtomicInteger isn't necessary in this context; we can just use the 
expected version from ZK for a given znode, since the expected version only 
needs to be unique on a given znode, not globally.
3. With regard to the HBase .META. update, we can put the expected version as 
an ID into the .META. table and enforce that a new update's ID has to be 
greater than the previous one for a given region via some new HBase API, 
checkGreaterAndPut. This ID value is local to the region node, which should be 
OK; for a given region node, this value will increment all the time. Currently 
this expected version is passed via the RPC RegionOpeningState 
openRegion(HRegionInfo region, int versionOfOfflineNode). Will that address the 
issue, Jonathan?
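
Point 3 can be sketched as follows. checkGreaterAndPut is only a proposed API in this comment, so the class below is a hypothetical in-memory model (GreaterCheckedMeta and its methods are invented for illustration), and where the ID comes from is left open.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a meta table guarded by a "greater than" check. */
final class GreaterCheckedMeta {
    /** One row: the server hosting the region and the ID of the edit that set it. */
    static final class Row {
        final String server;
        final long id;
        Row(String server, long id) { this.server = server; this.id = id; }
    }

    private final Map<String, Row> rows = new HashMap<>();

    /** Accepts the edit only if id is strictly greater than the stored id. */
    synchronized boolean checkGreaterAndPut(String region, long id, String server) {
        Row current = rows.get(region);
        if (current != null && id <= current.id) {
            return false; // an equal or newer edit is already recorded
        }
        rows.put(region, new Row(server, id));
        return true;
    }

    /** Returns the server recorded for the region, or null if none. */
    synchronized String serverOf(String region) {
        Row r = rows.get(region);
        return r == null ? null : r.server;
    }
}
```

With this check, a delayed writer carrying an older ID is rejected regardless of when its write arrives.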



Jonathan, Dhruba's suggestion is interesting. Could scale be an issue when 
HBase scales to the next level in terms of number of machines, number of 
regions and number of region movements? The .META. table will be distributed 
across different RSs, so putting it on the Master could be a bottleneck. 
However, we might first run into other, more important issues at such a large 
scale.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Ming Ma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116255#comment-13116255
 ] 

Ming Ma commented on HBASE-4497:


I did some testing on ZK: my assumption that a znode's data version keeps 
incrementing after the node is deleted was incorrect. So perhaps we still need 
some global AtomicLong based on ZK.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116267#comment-13116267
 ] 

ramkrishna.s.vasudevan commented on HBASE-4497:
---

@Ming

The expected version of a znode will not keep increasing once the node gets 
deleted. For example, if the region gets balanced, a new znode will be created 
and the expected version will be 0 again.
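
The reset can be illustrated with a toy model of znode data versions. This is not the ZooKeeper client; ToyZnodeStore and its methods are hypothetical and only mimic the behaviour described here: a version starts at 0 on create, grows with each data update, and starts over when the node is deleted and recreated.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of znode data versions; hypothetical, not the ZooKeeper API. */
final class ToyZnodeStore {
    private final Map<String, Integer> versions = new HashMap<>();

    /** Creates a node with data version 0; fails if it already exists. */
    void create(String path) {
        if (versions.putIfAbsent(path, 0) != null) {
            throw new IllegalStateException("node exists: " + path);
        }
    }

    /** Updates the node's data and returns the new data version. */
    int setData(String path) {
        Integer v = versions.get(path);
        if (v == null) {
            throw new IllegalStateException("no node: " + path);
        }
        versions.put(path, v + 1);
        return v + 1;
    }

    /** Deletes the node; its version history is lost with it. */
    void delete(String path) {
        versions.remove(path);
    }

    /** Returns the current data version of an existing node. */
    int version(String path) {
        Integer v = versions.get(path);
        if (v == null) {
            throw new IllegalStateException("no node: " + path);
        }
        return v;
    }
}
```

Because the count restarts on recreate, a per-znode version cannot serve as an ever-increasing ID across region moves.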





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116272#comment-13116272
 ] 

ramkrishna.s.vasudevan commented on HBASE-4497:
---

@Ming I did not see your latest comment, as I had not refreshed. :)







[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116498#comment-13116498
 ] 

Ted Yu commented on HBASE-4497:
---

Ming's idea @ 28/Sep/11 04:56, especially point 3, is interesting.
I like that as a long-term solution.
We need to be careful writing migration code to accommodate the new operation 
ID.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116524#comment-13116524
 ] 

ramkrishna.s.vasudevan commented on HBASE-4497:
---

As Ming suggested, we can generate an incrementing integer per region on the 
master side and pass that value over RPC so that it can be checked before 
updating META.

This value can be maintained on the master side in a map keyed by region.
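
A minimal sketch of such a master-side map, assuming a hypothetical AssignmentIds class keyed by region name (neither the class nor the method exists in HBase):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical master-side generator of per-region incrementing IDs. */
final class AssignmentIds {
    private final ConcurrentMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    /** Returns the next ID for the region: 1 on first use, then 2, 3, ... */
    long next(String regionName) {
        return counters
            .computeIfAbsent(regionName, k -> new AtomicLong())
            .incrementAndGet();
    }
}
```

The sketch keeps the counters in memory only; how they would survive a master restart or failover is not addressed here.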







[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Jonathan Gray (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116525#comment-13116525
 ] 

Jonathan Gray commented on HBASE-4497:
--

I don't think we can use the same ID as the ZK node.  But we could just use 
some incrementing number.

An alternative would be to instead allow the roll-back of the META edit using a 
checkAndDelete, which might be simpler but less optimal.
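
The roll-back alternative might look like the following in-memory sketch. RollbackMeta is hypothetical and models META as a flat region-to-server map; the conditional delete uses ConcurrentHashMap.remove(key, value), which removes the entry only if it still holds the expected value.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical flat model of META supporting a conditional roll-back. */
final class RollbackMeta {
    private final ConcurrentMap<String, String> location = new ConcurrentHashMap<>();

    /** Unconditional location update. */
    void put(String region, String server) {
        location.put(region, server);
    }

    /** Deletes the entry only if it still names expectedServer. */
    boolean checkAndDelete(String region, String expectedServer) {
        return location.remove(region, expectedServer);
    }

    /** Returns the recorded server, or null if the region has no entry. */
    String get(String region) {
        return location.get(region);
    }
}
```

Note that in this flat model a successful roll-back leaves no location recorded at all, so something would still have to restore the correct server's entry.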





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Jonathan Gray (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116695#comment-13116695
 ] 

Jonathan Gray commented on HBASE-4497:
--

startcode and timestamp is what i initially thought of.  seems like there could 
be some weird situations.  for example, what is to say that the server already 
in META didn't somehow become the new assignment destination?

or what if... M tells RS1 to OPEN R1 and to expect RS3:StartCode3 in META.  RS1 
locks up right before doing the META edit, M tells RS2 to OPEN R1 and to also 
expect RS3:StartCode3 in META.  I guess this is the atomicity we need, so that 
should be okay.

one neat idea would be to introduce this region assignment incrementing ID into 
META.  it would provide a nice way to debug the movement of a region across the 
cluster over time and could also provide the necessary info to use CheckAndPut.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116721#comment-13116721
 ] 

stack commented on HBASE-4497:
--

bq. startcode and timestamp is what i initially thought of. seems like there 
could be some weird situations. for example, what is to say that the server 
already in META didn't somehow become the new assignment destination?

The timestamp will be different in this case? (It'll have been updated by the 
new open).

bq. or what if... M tells RS1 to OPEN R1 and to expect RS3:StartCode3

I'm not suggesting the master tell the RS anything new.  I'm suggesting that on 
receiving the open, the RS itself read .META. at start of the open transaction 
before it does anything else and use this read as input for the later 
checkAndSet write.

bq. one neat idea would be to introduce this region assignment incrementing ID 
into META. it would provide a nice way to debug the movement of a region across 
the cluster over time and could also provide the necessary info to use 
CheckAndPut.

This could work.  Downsides are that M has to write meta first before doing the 
assign, which will be a bit of a new burden on meta (doubled write load?), and 
this new write is now inline with an assign; we'd have to do some hackery in 
here around bulk assign.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116727#comment-13116727
 ] 

stack commented on HBASE-4497:
--

I just checked checkAndPut.  It doesn't expose the timestamp.  So either fix 
checkAndPut so it exposes the timestamp, or write a timestamp or uuid to meta 
in a new column, info:editid, whenever we do the metadata open update.  (I'd 
prefer adding a checkAndPut override; it seems like a hole in checkAndPut that 
we don't allow version checking.)





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Jonathan Gray (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116729#comment-13116729
 ] 

Jonathan Gray commented on HBASE-4497:
--

Sounds like it could work.  I'm +1 on exposing version to checkAndPut and using 
it for META edits.  Good point, we can just do the read on the RS first.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116740#comment-13116740
 ] 

Ted Yu commented on HBASE-4497:
---

HBASE-4507 has been opened.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread Ming Ma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116981#comment-13116981
 ] 

Ming Ma commented on HBASE-4497:


Using startcode and timestamp is a good idea.  However, I want to confirm 
whether there is a case where it won't work.  Given that there is no such thing 
as a global clock, the timestamp value generated by the RS that hosts the 
.META. region at that moment might not be unique if the .META. region is moved 
to another RS.  So there is a possibility that startcode and timestamp, which 
is what I initially thought of, run into some weird situations.  For example, 
what is to say that the server already in META didn't somehow become the new 
assignment destination?  Here is how:

1. For a given region, the .META. table has RS1 as the RS serverName and T1 as 
the timestamp value: {RS1, T1}.
2. .META. is moved to another RS whose clock is behind that of the original RS 
that wrote {RS1, T1}.
3. RS2 starts openRegion first; it has an older ZK node version to check.  RS1 
starts openRegion later; it has an up-to-date ZK node version.
4. Both RS2 and RS1 are about to do checkAndPut on the .META. table.  Both will 
use {RS1, T1} as the condition for checkAndPut.
5. RS1 updates it first and succeeds.  There is a chance that after the update 
the value is still {RS1, T1}, since the new timestamp is generated by an RS 
whose clock is behind.
6. RS2 updates it next and also succeeds, given that {RS1, T1} hasn't changed 
even though RS1 made an update earlier.
7. RS1 has the up-to-date ZK node version, so it will continue and succeed 
with the rest of the open operation.  The region is considered OPENED from the 
AM's point of view.
8. RS2 has the older ZK node version, so it will fail later when it tries to 
update the ZK node.  The region won't be opened on RS2.
9. In the .META. table, the region is on RS2, although it is actually open on 
RS1.


Adding support for a version check in checkAndPut should address such a scenario.


Regarding the region assignment ID approach:

1. I didn't mean to imply that it would only be incremented by the Master.  I 
suggested a ZK-based AtomicLong that the Master and all RSs can get hold of, so 
this could be considered a global clock.
2. Such an ID could also help to track all the region transition events 
(HBASE-4354).
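
A sketch of how such a ZK-based counter could be incremented safely by several 
processes, assuming a single znode whose data is the counter.  The classes 
below are an in-memory stand-in, not ZooKeeper client code, and all names are 
hypothetical.  The conditional write mirrors the shape of ZooKeeper's 
setData(path, data, version), which rejects a write whose expected version is 
stale.

```java
// In-memory stand-in for one znode: data plus a version that bumps on every write.
class VersionedNodeSketch {
  private long data = 0;
  private int version = 0;

  // Returns {data, version} as one consistent snapshot.
  synchronized long[] read() {
    return new long[] {data, version};
  }

  // Conditional write: succeeds only if nobody has written since expectedVersion.
  synchronized boolean setIfVersion(long newData, int expectedVersion) {
    if (version != expectedVersion) {
      return false;
    }
    data = newData;
    version++;
    return true;
  }
}

// Hypothetical region-assignment ID generator shared by the Master and the RSs.
class AssignmentIdCounterSketch {
  private final VersionedNodeSketch node;

  AssignmentIdCounterSketch(VersionedNodeSketch node) {
    this.node = node;
  }

  // Optimistic increment: re-read and retry until our conditional write wins.
  long next() {
    while (true) {
      long[] snapshot = node.read();
      long candidate = snapshot[0] + 1;
      if (node.setIfVersion(candidate, (int) snapshot[1])) {
        return candidate;
      }
    }
  }

  public static void main(String[] args) {
    VersionedNodeSketch znode = new VersionedNodeSketch();
    AssignmentIdCounterSketch master = new AssignmentIdCounterSketch(znode);
    AssignmentIdCounterSketch rs = new AssignmentIdCounterSketch(znode);
    System.out.println(master.next()); // 1
    System.out.println(rs.next());     // 2
  }
}
```

Every caller gets a distinct, increasing ID because only one conditional write 
per version can succeed.  The cost is a ZK round trip, or more under 
contention, for each ID handed out.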







[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-28 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13117007#comment-13117007
 ] 

stack commented on HBASE-4497:
--

Good stuff Ming.

Looking at your pathological case, I think it is possible.  I could add to the 
checkAndPut that takes a version a check that we never write back the same 
version; if the version we are checking will go in with a timestamp that is 
exactly what we are checking, add a millisecond (especially if the value we 
write back is the same again).
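
A sketch of that "never write back the same version" rule, assuming the 
version is a millisecond timestamp.  The helper is hypothetical and not part of 
any HBase API.  It generalizes the rule slightly: it also bumps when the 
server clock is behind the version being checked, not only when it is equal.

```java
// Hypothetical helper, not part of any HBase API.
final class VersionBumpSketch {
  private VersionBumpSketch() {}

  // Returns a timestamp strictly greater than the version the caller checked against,
  // even if the server clock is equal to or behind that version.
  static long nextTimestamp(long checkedTimestamp, long serverNowMillis) {
    return Math.max(serverNowMillis, checkedTimestamp + 1);
  }

  public static void main(String[] args) {
    System.out.println(nextTimestamp(1000L, 1000L)); // 1001
    System.out.println(nextTimestamp(1000L, 900L));  // 1001
    System.out.println(nextTimestamp(1000L, 2000L)); // 2000
  }
}
```

With this rule a second checkAndPut carrying the same expected version can no 
longer match after a successful first one.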

I think we should do this, though the probability of the scenario you postulate 
is extremely low.

Why would RSs need access to a global counter?  Master assigns.  It'd need to 
keep its running counter in zk in case it crashed but I'd think only the 
assigner would need to use it (Here are some notes on counter in zk from zk 
mailing list: 
http://www.mail-archive.com/zookeeper-user@hadoop.apache.org/msg01968.html)

Would this counter be other than ephemeral data?  Design dictum up to this has 
been that zk is for ephemeral data only.  Would keeping a counter change that?

Does the 'region assignment id' need to monotonically increase?  Can it just be 
unique (uuid?)?

Good stuff Ming.









[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115788#comment-13115788
 ] 

stack commented on HBASE-4497:
--

Do we need to add an extra tickle of OPENING znode after open of region and 
before we go to do meta update?





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread Jean-Daniel Cryans (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115795#comment-13115795
 ] 

Jean-Daniel Cryans commented on HBASE-4497:
---

Stack, it seems that it's already the case:

{code}
if (tickleOpening(post_region_open)) {
  if (updateMeta(region)) failed = false;
}
{code}

In any case, there's still a hole as those two operations aren't done in an 
atomic fashion.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13115828#comment-13115828
 ] 

stack commented on HBASE-4497:
--

Thanks J-D.  That's what I was too lazy to look up for myself.  Looks like we 
are doing enough tickling.  Weird that the timeout monitor can cut in, and the 
region can be assigned elsewhere AND successfully update meta, before this 
comes back.  Here is Ram's email from up on the list earlier, with log snippets:

{code}
RS1
===
2011-09-23 22:34:34,000 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: addToOnlineRegions is
doneREGION = {NAME =
't5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.', TableName = 't5',
STARTKEY = '', ENDKEY = '', ENCODED = 2d06b3ca4d398ec96920ae86441a68c9,}
2011-09-23 22:34:34,009 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
Updated row t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9. in region
.META.,,1 with serverName=linux76,60020,1316796517682
2011-09-23 22:34:34,009 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Done with post open
deploy taks for region=t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.,
daughter=false
2011-09-23 22:34:34,009 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x1328ceaa1ff0037 Attempting to transition node
2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to
RS_ZK_REGION_OPENED
2011-09-23 22:34:34,038 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Completed
the OPEN of region t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9. but
when transitioning from  OPENING to OPENED got a version mismatch, someone
else clashed so now unassigning -- closing region
2011-09-23 22:34:34,038 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Closing t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.: disabling
compactions  flushes
2011-09-23 22:34:34,038 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.
2011-09-23 22:34:34,038 DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed f5
2011-09-23 22:34:34,038 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.

RS2
===
2011-09-23 22:33:56,546 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x1328ceaa1ff0039 Successfully transitioned node
2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to
RS_ZK_REGION_OPENING
2011-09-23 22:33:56,845 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks
for region=t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.,
daughter=false
2011-09-23 22:33:56,845 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: addToOnlineRegions is
doneREGION = {NAME =
't5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.', TableName = 't5',
STARTKEY = '', ENDKEY = '', ENCODED = 2d06b3ca4d398ec96920ae86441a68c9,}
2011-09-23 22:33:56,856 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
Updated row t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9. in region
.META.,,1 with serverName=linux146,60020,1316796499216
2011-09-23 22:33:56,856 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Done with post open
deploy taks for region=t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.,
daughter=false
2011-09-23 22:33:58,887 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x1328ceaa1ff0039 Attempting to transition node
2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to
RS_ZK_REGION_OPENED
2011-09-23 22:33:58,893 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x1328ceaa1ff0039 Successfully transitioned node
2d06b3ca4d398ec96920ae86441a68c9 from RS_ZK_REGION_OPENING to
RS_ZK_REGION_OPENED
2011-09-23 22:33:58,893 DEBUG
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened
t5,,1316797380065.2d06b3ca4d398ec96920ae86441a68c9.
{code}


[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread Jonathan Gray (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116032#comment-13116032
 ] 

Jonathan Gray commented on HBASE-4497:
--

I was just discussing this scenario with Dhruba a few days back.  There's 
definitely a race condition here and I don't see a trivial fix.

We use HLog IO-fencing to ensure that edits don't slip into an HLog after a 
server is considered dead by the Master.  But the Master has no way to prevent 
this META update from slipping in.

We need to make some modification to how the master can safely timeout an 
OPENING.  One possibility is for the master to require either an acknowledgment 
from the RS before moving the region elsewhere or for the RS to die.  It seems 
unlikely that we will actually see the RS to Master acknowledgment since 
OPENING taking too long is usually a sign of brokenness or the RS being backed 
up, I think.  But in any case I'd imagine some kind of OPEN_CANCEL_REQUESTED 
state that the Master transitions the node to and only when the RS transitions 
to OPEN_CANCELED or OFFLINE or something, then it's safe to reassign elsewhere.
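
A rough sketch of that handshake as a transition table.  The state names 
OPEN_CANCEL_REQUESTED and OPEN_CANCELED are taken from the paragraph above and 
do not exist in HBase; the class, the Actor enum and both methods are 
hypothetical.

```java
// Hypothetical model of the proposed cancel handshake; not HBase code.
class OpenCancelSketch {
  enum Actor { MASTER, REGION_SERVER }

  enum State { OFFLINE, OPENING, OPENED, OPEN_CANCEL_REQUESTED, OPEN_CANCELED }

  // Which actor may move the region's node from one state to another.
  static boolean allowed(Actor who, State from, State to) {
    switch (from) {
      case OPENING:
        // The RS may finish the open; the Master may only ask for a cancel.
        return (who == Actor.REGION_SERVER && to == State.OPENED)
            || (who == Actor.MASTER && to == State.OPEN_CANCEL_REQUESTED);
      case OPEN_CANCEL_REQUESTED:
        // Only the RS may acknowledge the cancel.
        return who == Actor.REGION_SERVER
            && (to == State.OPEN_CANCELED || to == State.OFFLINE);
      default:
        return false;
    }
  }

  // The Master reassigns only after the RS has acknowledged.
  static boolean safeToReassign(State state) {
    return state == State.OPEN_CANCELED || state == State.OFFLINE;
  }

  public static void main(String[] args) {
    System.out.println(allowed(Actor.MASTER, State.OPENING, State.OPEN_CANCEL_REQUESTED)); // true
    System.out.println(allowed(Actor.REGION_SERVER, State.OPEN_CANCEL_REQUESTED, State.OPENED)); // false
    System.out.println(safeToReassign(State.OPEN_CANCEL_REQUESTED)); // false
  }
}
```

The table makes the gap visible: nothing in it moves the node out of 
OPEN_CANCEL_REQUESTED if the RS never responds.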

I think this design still has a hole in it though because there are scenarios 
where the RS doesn't actually die but for some reason doesn't OPEN or ack the 
cancel. 

Another option would be to do the RS performed META edits using a CheckAndPut 
rather than straight Put.  Or we could move META editing back to the Master 
where it's easy to do things atomically :)

The CheckAndPut idea is kind of neat but we'd probably have to send more data 
on the OPEN_RPC.  For example, the existing server start code or server name + 
start code or something guaranteed unique (guaranteed that a conflicting RS 
opening stuff wouldn't be able to use the same thing).  Then the atomicity is 
on the META region.





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116134#comment-13116134
 ] 

ramkrishna.s.vasudevan commented on HBASE-4497:
---

I got this problem for 3 regions before HBASE-4452 went in. Now HBASE-4452 will 
definitely reduce the probability of this happening.







[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread dhruba borthakur (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116139#comment-13116139
 ] 

dhruba borthakur commented on HBASE-4497:
-

Can somebody please elaborate on any disadvantages of making the master the 
only entity that can update META about where the region is being served from?





[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread Ming Ma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116151#comment-13116151
 ] 

Ming Ma commented on HBASE-4497:


checkAndPut might work. We will use checkAndPut on both ZK as well as HBase. 
There are other bugs due to the lack of strong synchronization on the ZK nodes 
among AssignmentManager and RSs. Here is another scenario for race between AM 
timeoutMonitor and the first RS's openRegion operation.

RS1 successfully transitions to the OPENED state around the same time as the 
timeoutMonitor kicks in.  The timeoutMonitor gets data from ZK right before RS1 
sets it to OPENED, so the timeoutMonitor sees RS_ZK_REGION_OPENING and tries to 
reassign the region.  In that case, we will end up with the same region on two 
RSs.


Would the following work?

1. ZKAssign.transitionNode has some sort of checkAndPut semantics when it tries 
to enforce that the original state is the correct one.  However, it isn't 
atomic: it first does a getData from ZK and then compares.  Instead, we can use 
ZK's checkAndPut API to enforce the atomicity.
2. Introduce a ZK-based global AtomicInteger for region operations; e.g., each 
openRegion operation will use a new, incremented region_operation_ID.  Each 
openRegion operation will validate its own ID against the ZK state via 
checkAndPut.  Thus one of the two openRegion operations on the RSs won't work.
3. With regard to the HBase .META. update, we can put the region_operation_ID 
into the table and enforce that a new update's region operation ID has to be 
greater than the previous version for a given region.  That way the older RS 
won't be able to update the table.  We will need to introduce a new API for 
HBase, similar to checkAndPut, more like checkGreaterAndPut.
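
A sketch of the semantics point 3 asks for, assuming an in-memory table.  No 
such call exists in HBase; the class, the method name checkGreaterAndPut and 
the row layout are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a .META. row guarded by region_operation_ID.
class GreaterIdMetaSketch {
  private static final class Entry {
    final long operationId;
    final String server;

    Entry(long operationId, String server) {
      this.operationId = operationId;
      this.server = server;
    }
  }

  private final Map<String, Entry> rows = new HashMap<>();

  // Accept the update only if its operation ID is greater than the stored one.
  synchronized boolean checkGreaterAndPut(String region, long operationId, String server) {
    Entry current = rows.get(region);
    if (current != null && operationId <= current.operationId) {
      return false; // update from an older (or the same) open attempt
    }
    rows.put(region, new Entry(operationId, server));
    return true;
  }

  // Server currently recorded for the region, or null if none.
  synchronized String serverFor(String region) {
    Entry current = rows.get(region);
    return current == null ? null : current.server;
  }

  public static void main(String[] args) {
    GreaterIdMetaSketch meta = new GreaterIdMetaSketch();
    // RS1 was given operation 1, timed out, and RS2 was given operation 2.
    System.out.println(meta.checkGreaterAndPut("R1", 2L, "RS2")); // true
    System.out.println(meta.checkGreaterAndPut("R1", 1L, "RS1")); // false
    System.out.println(meta.serverFor("R1")); // RS2
  }
}
```

Unlike an equality check, the ordering check rejects the older RS no matter 
which of the two writes arrives first, provided the IDs really are handed out 
in assignment order.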






[jira] [Commented] (HBASE-4497) If region opening fails after updating META HBCK reports it as inconsistent and scanning the region throws NSRE

2011-09-27 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13116160#comment-13116160
 ] 

ramkrishna.s.vasudevan commented on HBASE-4497:
---

I don't know ZK well.
Your 3rd point looks good to me, Ming.
I think HBASE-4015 may handle the race you described above.
{code}
  if (hijack && null != curDataInZNode) {
    EventType eventType = curDataInZNode.getEventType();
    if (eventType.equals(EventType.RS_ZK_REGION_CLOSING)
        || eventType.equals(EventType.RS_ZK_REGION_CLOSED)
        || eventType.equals(EventType.RS_ZK_REGION_OPENED)) {
      return -1;
    }
{code}
Also, if the timeout succeeds, the transition from OPENING to OPENED will fail 
in the RS.
There may still be an entry in META with the old RS, though.

