[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258788#comment-13258788
 ] 

nkeywal commented on HBASE-5844:


I didn't know about this parameter. It's interesting, because with ZK the 
default timeout is 30 seconds, but with HBase it's now 180s (from 
hbase-default.xml). It was first increased to 60s in HBASE-1772. It seems it 
was increased because of the GC.

But it means that deleting the znode immediately represents a huge MTTR 
improvement for the region server crash case.




 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 Today, if the region server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will start only after a timeout, usually 30s.
 By deleting the znode in the start script, we remove this delay and the 
 recovery starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258803#comment-13258803
 ] 

nkeywal commented on HBASE-5844:


As for the tracker, it's my private workaround for HBASE-5666; it should not 
have been included in this patch. Sorry about this.

I think it's better to delete the file explicitly, just after the znode 
deletion. HRegionServer#deleteMyEphemeralNode is called only once, and I added 
deleteMyEphemeralNodeOnDisk just after this call. If we rely on #deleteOnExit, 
I fear we could have the file deleted with a still alive znode. I'm not sure 
and I have not tried it, but I think it's too easy to enter into the 
jvm-specific-behavior space here.
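
A minimal sketch of the ordering argued for above: delete the znode first, then delete the local file explicitly, instead of relying on #deleteOnExit. The class name, the `Runnable` standing in for the znode deletion, and the marker file are hypothetical; this is not the code from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: removes the on-disk marker only after the znode is gone. */
final class EphemeralNodeCleanup {

    /**
     * Runs the znode deletion first, then deletes the local file explicitly.
     * If the znode deletion throws, the file is left in place, so the file
     * never disappears while the znode may still be alive.
     *
     * @return true if the file existed and was deleted
     */
    static boolean deleteZNodeThenFile(Runnable deleteZNode, Path markerFile) {
        deleteZNode.run(); // stand-in for the real znode deletion
        try {
            return Files.deleteIfExists(markerFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this ordering the file's lifetime is tied to an explicit call rather than to JVM shutdown behavior.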

I will fix the java code and try the whole fix on a real cluster for the v2.

Thank you for the review.





 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch






[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5844:
---

Attachment: 5844.v2.patch

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch






[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259692#comment-13259692
 ] 

nkeywal commented on HBASE-5844:


v2 should be OK. It no longer includes the fix for HBASE-5666, so it 
cannot be tested locally, but I tested it before removing the workaround. 

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch






[jira] [Created] (HBASE-5859) Optimize the rolling restart script

2012-04-23 Thread nkeywal (JIRA)
nkeywal created HBASE-5859:
--

 Summary: Optimize the rolling restart script
 Key: HBASE-5859
 URL: https://issues.apache.org/jira/browse/HBASE-5859
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Priority: Minor


There is a graceful_stop script. Its algorithm is:

{noformat}
for i = 0 to servers.size {
 regionsInServer = servers[i].regions
 move servers[i].regions to random
 stop servers[i]
 start servers[i]
 move regionsInServer to servers[i] //filled back with the same regions
}
{noformat}

It would be possible to optimize it while keeping data locality with

{noformat}
for i = 0 to servers.size {
 start servers[i*2+1] on the computer of servers[i] // Two RS on the same box
 move servers[i].regions to servers[i*2+1]  // The one on the same box
 stop servers[i]
}
{noformat}
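
The proposed loop can be modelled with a toy in-memory sketch. Everything here is an assumption made for illustration: the map from host to region list, the class and method names, and the idea that "starting" and "stopping" a region server is just swapping a list. No real HBase call is involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy in-memory model of the proposed restart loop; no real HBase calls. */
final class RollingRestartSketch {

    /**
     * hostToRegions maps each host to the regions served by its current
     * region server. Returns the number of region moves performed.
     */
    static int restartKeepingLocality(Map<String, List<String>> hostToRegions) {
        int moves = 0;
        for (Map.Entry<String, List<String>> e : hostToRegions.entrySet()) {
            List<String> oldServer = e.getValue();
            List<String> newServer = new ArrayList<>(); // second RS started on the same host
            newServer.addAll(oldServer);                // move regions to the RS on the same box
            moves += oldServer.size();
            e.setValue(newServer);                      // old RS stopped; the new one takes over
        }
        return moves;
    }
}
```

In this model each region moves once and never leaves its host, whereas the current script moves each region twice (away to a random server, then back).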

There would be an impact with a fixed port configuration. To fix this, we could:
- use a range of ports instead of a single port. This could be an issue for the 
web port.
- start on another port, then reuse the fixed ones when they become available. 
This is not very elegant if client code is already using the previous port. 
Moreover, the region server port is written in the meta table.
- do a mix of the two solutions: a range for the server itself, while waiting 
for the web port to be available.
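
As a rough illustration of the "range of ports" option, the sketch below tries each port in a range and keeps the first one that binds. The class and method names are hypothetical, and this only shows the binding step, not how HBase would publish the chosen port.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical helper illustrating the "range of ports" option. */
final class PortRangeSketch {

    /**
     * Tries each port in [first, last] and returns a socket bound to the
     * first free one. Throws IOException if every port in the range is taken.
     */
    static ServerSocket bindFirstFree(int first, int last) throws IOException {
        IOException lastFailure = null;
        for (int port = first; port <= last; port++) {
            try {
                return new ServerSocket(port); // throws if the port is already in use
            } catch (IOException e) {
                lastFailure = e;               // remember the failure, try the next port
            }
        }
        throw new IOException("No free port in range " + first + "-" + last, lastFailure);
    }
}
```

A second region server started on the same box would then pick the next free port in the range while the first one still holds the fixed port.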


To be discussed...






[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13259904#comment-13259904
 ] 

nkeywal commented on HBASE-5844:


You're right. I propose to commit this patch; I will then generalize the 
solution to the master in another JIRA.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch






[jira] [Created] (HBASE-5877) When a query fails because the region has moved, let the regionserver returns the new address to the client

2012-04-25 Thread nkeywal (JIRA)
nkeywal created HBASE-5877:
--

 Summary: When a query fails because the region has moved, let the 
regionserver returns the new address to the client
 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor


This is mainly useful when we do a rolling restart. This will decrease the load 
on the master and the network load.

Note that a region is not immediately opened after a close. So:
- it seems preferable to wait before retrying on the other server. An 
optimisation would be to have a heuristic depending on when the region was 
closed.
- during a rolling restart, the server moves the regions then stops. So we may 
have failures when the server is stopped, and this patch won't help.


The implementation in the first patch does the following:
- on the region move, there is an added parameter on regionserver#close to 
say where we are sending the region.
- the regionserver keeps a list of what was moved. Each entry is kept for 100 
seconds.
- the regionserver sends a specific exception when it receives a query on a 
moved region. This exception contains the new address.
- the client analyses the exceptions and updates its cache accordingly...
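
The server-side part of these steps can be sketched as follows. This is an illustration under assumptions, not the patch: the class names, the string-typed region and address, and the injectable clock are all hypothetical; only the 100-second retention and the "exception carries the new address" idea come from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Sketch of a region server's "recently moved" list; names are illustrative. */
final class MovedRegionsSketch {

    /** Exception carrying the new location, as the description proposes. */
    static final class MovedException extends Exception {
        final String newAddress;

        MovedException(String region, String newAddress) {
            super("Region " + region + " moved to " + newAddress);
            this.newAddress = newAddress;
        }
    }

    private static final long TTL_MILLIS = 100_000L; // each entry is kept 100 seconds

    private static final class Entry {
        final String destination;
        final long movedAt;

        Entry(String destination, long movedAt) {
            this.destination = destination;
            this.movedAt = movedAt;
        }
    }

    private final Map<String, Entry> moved = new ConcurrentHashMap<>();
    private final LongSupplier clock; // current time in milliseconds, injectable for tests

    MovedRegionsSketch(LongSupplier clock) {
        this.clock = clock;
    }

    /** Called when a region is closed with a known destination. */
    void recordMove(String region, String destination) {
        moved.put(region, new Entry(destination, clock.getAsLong()));
    }

    /** Called when a query arrives for a region this server no longer hosts. */
    void checkNotMoved(String region) throws MovedException {
        Entry e = moved.get(region);
        if (e == null) {
            return; // nothing known about this region
        }
        if (clock.getAsLong() - e.movedAt > TTL_MILLIS) {
            moved.remove(region); // entry expired: forget it
            return;
        }
        throw new MovedException(region, e.destination);
    }
}
```

A client catching such an exception would read `newAddress` and update its location cache instead of asking the master.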










[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver returns the new address to the client

2012-04-25 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Attachment: 5877.v1.patch

 When a query fails because the region has moved, let the regionserver returns 
 the new address to the client
 ---

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5877.v1.patch






[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver returns the new address to the client

2012-04-25 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261972#comment-13261972
 ] 

nkeywal commented on HBASE-5877:


v1. On an old trunk, so it's just to give an overview. Includes some bits of 
HBASE-5844 as well.

There are 3 workarounds in the implementation:
1) As a ServerName is not serializable, we use the String dedicated to this kind 
of issue. Acceptable I think.
2) hadoop.ipc serialization of exceptions is based on #getMessage, so we 
have to parse it internally. It's not visible to the exception user. Still 
acceptable (? :-)
3) The code to manage the error in the client package is quite complex. We have 
the exception at the very beginning, and then it's checked again, but we don't 
have the real exception anymore. I used a new updateList to make it work; 
I'm looking for another solution here...
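
Workaround 2 can be illustrated with an exception whose message doubles as its wire format. This is a sketch under assumptions: the class name and the `hostname=... port=...` message layout are invented for the example and are not taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: an exception whose message doubles as its wire format. */
final class MovedMessageSketch extends Exception {

    // Hypothetical message layout; the real patch may use a different one.
    private static final Pattern FORMAT =
        Pattern.compile("hostname=(\\S+) port=(\\d{1,5})");

    private final String hostname;
    private final int port;

    /** Server side: build the message from the structured fields. */
    MovedMessageSketch(String hostname, int port) {
        super("Region moved to: hostname=" + hostname + " port=" + port);
        this.hostname = hostname;
        this.port = port;
    }

    /**
     * Client side: only the message survived the RPC layer, so parse it back.
     * Returns null when the message does not follow the expected layout.
     */
    static MovedMessageSketch fromMessage(String message) {
        Matcher m = FORMAT.matcher(message);
        if (!m.find()) {
            return null;
        }
        return new MovedMessageSketch(m.group(1), Integer.parseInt(m.group(2)));
    }

    String getHostname() {
        return hostname;
    }

    int getPort() {
        return port;
    }
}
```

The parsing stays inside the exception class, so code that catches it only sees the typed accessors.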

 When a query fails because the region has moved, let the regionserver returns 
 the new address to the client
 ---

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5877.v1.patch






[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-04-26 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262462#comment-13262462
 ] 

nkeywal commented on HBASE-5877:


bq. Can we mark the failure and make this RegionMovedException behave the same 
as NotServingRegionException ?
Done.

bq. For updateCachedLocations(), please put explanation for parameter on the 
same line as the parameter:
Done.

bq. 'Failed all' - 'Failed call'
It's an existing comment that we can find again later in the code. It really 
means failed all: all the queries on this server failed. I don't mind 
changing it to something better, but I think we should keep the all.

bq. 'which the server' - 'which the region'
Done.

bq. Please increase the VERSION of HRegionInterface
Done.

bq. How is the server removed from cache since I see 'continue' above ?
That's what makes this code complex and difficult to change: the error is 
actually managed later, when we don't have the real exception anymore.

bq. For ServerManager.sendRegionClose(), please add javadoc for destServerName 
param.
Done.

bq. Is it possible that destServerName is null ?
Safety checks added.

bq. Please change the above to debug log.  Why is the above fatal 
(regionResult != null) ? Step 4 appears in a comment below the above code. 
Should the above say step 3 ?
Bad logs fixed.

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch






[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-04-26 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13262717#comment-13262717
 ] 

nkeywal commented on HBASE-5877:


Note that I'm currently rewriting the patch, as it conflicts with the protobuf 
stuff that was committed recently... But the logic hasn't changed.

@ted What we're saving in the current implementation is a call to the master. 
This can be worthwhile in itself if the moved region is used by a lot of 
clients. We could do better by letting the client know that the region is now 
fully available somewhere else and that there is no need to wait before 
retrying. But right now the region server only knows that the region is closed 
and moved to another server. It doesn't know if the region is opened yet. We 
could have this by adding the info in zk, but it would increase the zk load...

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch






[jira] [Reopened] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-26 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal reopened HBASE-5844:



There is a regression when the cluster is fully distributed: the start command 
hangs. I'm on it. In the meantime, would it be possible to undo the commit?

Sorry about this.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch






[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-04-26 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263140#comment-13263140
 ] 

nkeywal commented on HBASE-5877:


bq. This patch will benefit any move, not just rolling restart, right?
Yes, but as there is a wait time between two tries, I think the benefit will be 
minimal vs. the wait time for a single client. I could add a heuristic like: if 
the region was closed more than 2 seconds ago, consider that it's now available 
on the new server and don't sleep before the next retry. That could lead to 
more network messages if the rule is wrong (and the rule will be wrong when the 
system is overloaded), and it will add some complexity to the client code. 
Having the real status of the region would solve this. 
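
The heuristic could look roughly like the sketch below. It assumes the client somehow learns when the region was closed (for instance from the exception), which is not something the current patch provides; the class, method, and parameter names are hypothetical, and only the 2-second threshold comes from the comment.

```java
/** Illustrative retry heuristic; only the 2-second threshold comes from the comment. */
final class RetryHeuristicSketch {

    static final long ASSUME_OPEN_AFTER_MILLIS = 2_000L;

    /** Returns how long the client should sleep before retrying on the new server. */
    static long sleepBeforeRetry(long closedAtMillis, long nowMillis, long defaultPauseMillis) {
        long sinceClose = nowMillis - closedAtMillis;
        if (sinceClose > ASSUME_OPEN_AFTER_MILLIS) {
            return 0L; // closed long enough ago: assume the region is open elsewhere
        }
        return defaultPauseMillis; // recently closed: keep the normal pause
    }
}
```

When the guess is wrong, the client pays for one extra failed call, which is the additional network traffic mentioned above.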

Anyway, with the dev already done to cut the link between master and clients, it 
can help to save a reconnect to master. And during a rolling restart with 
regions moving everywhere, I think it will make a real difference.


bq. I don't see changes to make use of this new functionality? I'd expect the 
balancer in master to make use of it?
Yes, these are the changes in AssignmentManager. They are in the patch, but 
are quite small in the end; basically:
{noformat}
-unassign(plan.getRegionInfo());
+unassign(plan.getRegionInfo(), false, plan.getDestination());
{noformat}

I still need to manage the case when the destination is not specified at the 
beginning.


 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch






[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-27 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263778#comment-13263778
 ] 

nkeywal commented on HBASE-5844:


I found the issue and (hopefully) a fix. I will have a new patch by the middle 
of next week, and I will include the master znode in it...

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch






[jira] [Created] (HBASE-5902) Some scripts are not executable

2012-04-30 Thread nkeywal (JIRA)
nkeywal created HBASE-5902:
--

 Summary: Some scripts are not executable
 Key: HBASE-5902
 URL: https://issues.apache.org/jira/browse/HBASE-5902
 Project: HBase
  Issue Type: Bug
  Components: scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Trivial


-rw-rw-r--  graceful_stop.sh
-rw-rw-r--  hbase-config.sh
-rw-rw-r--  local-master-backup.sh
-rw-rw-r--  local-regionservers.sh






[jira] [Updated] (HBASE-5902) Some scripts are not executable

2012-04-30 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5902:
---

Attachment: 5902.v1.patch

 Some scripts are not executable
 ---

 Key: HBASE-5902
 URL: https://issues.apache.org/jira/browse/HBASE-5902
 Project: HBase
  Issue Type: Bug
  Components: scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Trivial
 Attachments: 5902.v1.patch






[jira] [Created] (HBASE-5903) Detect the test classes without categories

2012-04-30 Thread nkeywal (JIRA)
nkeywal created HBASE-5903:
--

 Summary: Detect the test classes without categories
 Key: HBASE-5903
 URL: https://issues.apache.org/jira/browse/HBASE-5903
 Project: HBase
  Issue Type: Improvement
  Components: build, test
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor


The tests are executed by category. When a test does not have a category, it's 
not run on prebuild nor central build.

This new test checks the test classes and lists the ones without a category. It 
fails if it finds one. As it's a small test, it will be executed on the 
developer machine and will fail immediately on the central builds.
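
The core of such a check might look like the sketch below. It is an illustration only: `TestCategory` is a stand-in annotation defined here to keep the example self-contained, the class list is passed in by hand rather than discovered on the classpath, and none of the names come from the patch.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a "no test class without a category" check; names are hypothetical. */
final class CategoryCheckSketch {

    /** Stand-in for the real category annotation, defined here for self-containment. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface TestCategory {
        String value();
    }

    /** Returns the classes that carry no category annotation. */
    static List<Class<?>> withoutCategory(List<Class<?>> testClasses) {
        List<Class<?>> missing = new ArrayList<>();
        for (Class<?> c : testClasses) {
            if (!c.isAnnotationPresent(TestCategory.class)) {
                missing.add(c);
            }
        }
        return missing;
    }

    // Two sample classes used to exercise the check.
    @TestCategory("small")
    static final class TestWithCategory {}

    static final class TestWithoutCategory {}
}
```

A test built on this would fail whenever the returned list is not empty and print the offending class names.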





[jira] [Updated] (HBASE-5903) Detect the test classes without categories

2012-04-30 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5903:
---

Attachment: 5903.v3.patch

 Detect the test classes without categories
 --

 Key: HBASE-5903
 URL: https://issues.apache.org/jira/browse/HBASE-5903
 Project: HBase
  Issue Type: Improvement
  Components: build, test
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5903.v3.patch






[jira] [Updated] (HBASE-5903) Detect the test classes without categories

2012-04-30 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5903:
---

Fix Version/s: 0.96.0
   Status: Patch Available  (was: Open)

 Detect the test classes without categories
 --

 Key: HBASE-5903
 URL: https://issues.apache.org/jira/browse/HBASE-5903
 Project: HBase
  Issue Type: Improvement
  Components: build, test
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5903.v3.patch


 The tests are executed by category. When a test does not have a category, 
 it is run on neither the prebuild nor the central build.
 This new test checks the test classes and lists the ones without a category. It 
 fails if it finds one. As it's a small test, it will be executed on the 
 developer machine and will fail immediately on the central builds.





[jira] [Commented] (HBASE-5903) Detect the test classes without categories

2012-04-30 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13264978#comment-13264978
 ] 

nkeywal commented on HBASE-5903:


Considering the actual patch, we can just consider TestAssignmentManager as a 
little bit flaky ;-)

 Detect the test classes without categories
 --

 Key: HBASE-5903
 URL: https://issues.apache.org/jira/browse/HBASE-5903
 Project: HBase
  Issue Type: Improvement
  Components: build, test
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5903.v3.patch


 The tests are executed by category. When a test does not have a category, 
 it is run on neither the prebuild nor the central build.
 This new test checks the test classes and lists the ones without a category. It 
 fails if it finds one. As it's a small test, it will be executed on the 
 developer machine and will fail immediately on the central builds.





[jira] [Created] (HBASE-5905) Protobuf interface for Admin: split between the internal and the external/customer interface

2012-04-30 Thread nkeywal (JIRA)
nkeywal created HBASE-5905:
--

 Summary: Protobuf interface for Admin: split between the internal 
and the external/customer interface
 Key: HBASE-5905
 URL: https://issues.apache.org/jira/browse/HBASE-5905
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal


After a short discussion with Stack, I created a jira.
--
I'm a little bit confused by the protobuf interface for closeRegion.

We have two types of closeRegion today:
1) The external ones, available in client.HBaseAdmin. They take the server and 
the region identifier as parameters and nothing else.
2) The internal ones, called for example by the master. They have more 
parameters (like versionOfClosingNode or transitionInZK).

When I look at protobuf.ProtobufUtil, I see:

  public static void closeRegion(final AdminProtocol admin,
      final byte[] regionName, final boolean transitionInZK) throws IOException {
    CloseRegionRequest closeRegionRequest =
      RequestConverter.buildCloseRegionRequest(regionName, transitionInZK);
    try {
      admin.closeRegion(null, closeRegionRequest);
    } catch (ServiceException se) {
      throw getRemoteException(se);
    }
  }


In other words, it seems that we merged the two interfaces into a single one. 
Is that the intent?
I checked: the internal fields in closeRegionRequest are all optional (that's 
good). Still, it means that the end user could use them, or at least would need 
to distinguish between the fields that are optional for functional reasons and 
the fields that are optional because they should not be used.
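
One way to picture the split discussed here is to keep a single request type but expose it through two separate Java interfaces. The sketch below is only an illustration: every type name in it (`CloseRegionRequestSketch`, `ExternalCloseRegion`, `InternalCloseRegion`, `CloseRegionRequests`) is hypothetical, none of it is HBase code, and the default value used for the external call is an assumption.

```java
import java.util.OptionalInt;

// Hypothetical request: one message carries the public field and the
// fields meant for internal callers only.
final class CloseRegionRequestSketch {
  final String regionName;
  final boolean transitionInZK;
  final OptionalInt versionOfClosingNode;

  CloseRegionRequestSketch(String regionName, boolean transitionInZK,
      OptionalInt versionOfClosingNode) {
    this.regionName = regionName;
    this.transitionInZK = transitionInZK;
    this.versionOfClosingNode = versionOfClosingNode;
  }
}

// External view: the caller can only name the region.
interface ExternalCloseRegion {
  CloseRegionRequestSketch closeRegion(String regionName);
}

// Internal view: the caller also controls the internal fields.
interface InternalCloseRegion {
  CloseRegionRequestSketch closeRegion(String regionName,
      boolean transitionInZK, int versionOfClosingNode);
}

// One implementation can back both views while keeping them apart by type.
final class CloseRegionRequests
    implements ExternalCloseRegion, InternalCloseRegion {
  @Override
  public CloseRegionRequestSketch closeRegion(String regionName) {
    // Assumption: external calls leave the internal fields at a default.
    return new CloseRegionRequestSketch(regionName, true, OptionalInt.empty());
  }

  @Override
  public CloseRegionRequestSketch closeRegion(String regionName,
      boolean transitionInZK, int versionOfClosingNode) {
    return new CloseRegionRequestSketch(regionName, transitionInZK,
        OptionalInt.of(versionOfClosingNode));
  }
}
```

With this shape, code that only holds an `ExternalCloseRegion` reference cannot set the internal fields, even though the underlying message still carries them as optional values.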






[jira] [Commented] (HBASE-5905) Protobuf interface for Admin: split between the internal and the external/customer interface

2012-05-01 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13265710#comment-13265710
 ] 

nkeywal commented on HBASE-5905:


I don't think so. But we could add a specific comment?

 Protobuf interface for Admin: split between the internal and the 
 external/customer interface
 

 Key: HBASE-5905
 URL: https://issues.apache.org/jira/browse/HBASE-5905
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal

 After a short discussion with Stack, I created a jira.
 --
 I'm a little bit confused by the protobuf interface for closeRegion.
 We have two types of closeRegion today:
 1) The external ones, available in client.HBaseAdmin. They take the server 
 and the region identifier as parameters and nothing else.
 2) The internal ones, called for example by the master. They have more 
 parameters (like versionOfClosingNode or transitionInZK).
 When I look at protobuf.ProtobufUtil, I see:
   public static void closeRegion(final AdminProtocol admin,
       final byte[] regionName, final boolean transitionInZK) throws IOException {
     CloseRegionRequest closeRegionRequest =
       RequestConverter.buildCloseRegionRequest(regionName, transitionInZK);
     try {
       admin.closeRegion(null, closeRegionRequest);
     } catch (ServiceException se) {
       throw getRemoteException(se);
     }
   }
 In other words, it seems that we merged the two interfaces into a single one. 
 Is that the intent?
 I checked: the internal fields in closeRegionRequest are all optional (that's 
 good). Still, it means that the end user could use them, or at least would 
 need to distinguish between the fields that are optional for functional 
 reasons and the fields that are optional because they should not be used.





[jira] [Commented] (HBASE-5902) Some scripts are not executable

2012-05-02 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13266464#comment-13266464
 ] 

nkeywal commented on HBASE-5902:


I haven't submitted the patch as it does not change the java code, but it can 
be committed.

 Some scripts are not executable
 ---

 Key: HBASE-5902
 URL: https://issues.apache.org/jira/browse/HBASE-5902
 Project: HBase
  Issue Type: Bug
  Components: scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Trivial
 Attachments: 5902.v1.patch


 -rw-rw-r--  graceful_stop.sh
 -rw-rw-r--  hbase-config.sh
 -rw-rw-r--  local-master-backup.sh
 -rw-rw-r--  local-regionservers.sh





[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-02 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Attachment: 5877.v6.patch

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v6.patch


 This is mainly useful when we do a rolling restart. This will decrease the 
 load on the master and the network load.
 Note that a region is not immediately opened after a close. So:
 - it seems preferable to wait before retrying on the other server. An 
 optimisation would be to have a heuristic depending on when the region was 
 closed.
 - during a rolling restart, the server moves the regions then stops. So we 
 may have failures when the server is stopped, and this patch won't help.
 The implementation in the first patch works as follows:
 - on the region move, there is an added parameter on the regionserver#close 
 to say where we are sending the region
 - the regionserver keeps a list of what was moved. Each entry is kept for 100 
 seconds.
 - the regionserver sends a specific exception when it receives a query on a 
 moved region. This exception contains the new address.
 - the client analyses the exceptions and updates its cache accordingly...
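
The four steps above can be sketched as follows. This is a minimal illustration under assumptions, not the code of the attached patches: all class and method names are hypothetical, servers and regions are plain strings, and the current time is passed in explicitly so that the 100-second expiry is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical exception carrying the new location of a moved region.
class RegionMovedSketchException extends Exception {
  final String newServer;

  RegionMovedSketchException(String region, String newServer) {
    super("Region " + region + " moved to " + newServer);
    this.newServer = newServer;
  }
}

// Region server side: remembers where recently moved regions went.
class MovedRegionsSketch {
  static final long TTL_MILLIS = 100_000L; // each entry is kept 100 seconds

  private static final class Entry {
    final String destination;
    final long movedAtMillis;

    Entry(String destination, long movedAtMillis) {
      this.destination = destination;
      this.movedAtMillis = movedAtMillis;
    }
  }

  private final Map<String, Entry> moved = new ConcurrentHashMap<>();

  // Called when a region is closed as part of a move to 'destination'.
  void addToMovedRegions(String region, String destination, long nowMillis) {
    moved.put(region, new Entry(destination, nowMillis));
  }

  // Called when a query arrives for a region this server does not host.
  void checkNotMoved(String region, long nowMillis)
      throws RegionMovedSketchException {
    Entry e = moved.get(region);
    if (e == null) {
      return;
    }
    if (nowMillis - e.movedAtMillis > TTL_MILLIS) {
      moved.remove(region); // the entry has expired
      return;
    }
    throw new RegionMovedSketchException(region, e.destination);
  }
}

// Client side: updates its location cache from the exception.
class ClientCacheSketch {
  final Map<String, String> locations = new ConcurrentHashMap<>();

  void query(MovedRegionsSketch server, String region, long nowMillis) {
    try {
      server.checkNotMoved(region, nowMillis);
    } catch (RegionMovedSketchException e) {
      locations.put(region, e.newServer);
    }
  }
}
```

In this sketch the client learns the new location from the failed call itself, so it does not need to ask the master.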





[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-02 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Patch Available  (was: Open)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v6.patch


 This is mainly useful when we do a rolling restart. This will decrease the 
 load on the master and the network load.
 Note that a region is not immediately opened after a close. So:
 - it seems preferable to wait before retrying on the other server. An 
 optimisation would be to have a heuristic depending on when the region was 
 closed.
 - during a rolling restart, the server moves the regions then stops. So we 
 may have failures when the server is stopped, and this patch won't help.
 The implementation in the first patch works as follows:
 - on the region move, there is an added parameter on the regionserver#close 
 to say where we are sending the region
 - the regionserver keeps a list of what was moved. Each entry is kept for 100 
 seconds.
 - the regionserver sends a specific exception when it receives a query on a 
 moved region. This exception contains the new address.
 - the client analyses the exceptions and updates its cache accordingly...





[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-02 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13266656#comment-13266656
 ] 

nkeywal commented on HBASE-5877:


Generated with protobuf 2.4.1

Here are the things I'm not a big fan of, but for which I don't have a better 
solution:
- the move management in the client code: I think it's possible to change the 
way we manage errors (don't wait for all results before retrying), but that 
would be for another JIRA
- the destination in the closeRegion interface is a kind of interface 
hijacking. Other options would be:
  - sharing the region state in zookeeper
  - letting the regionserver call the master to get the new server. On paper 
  this would be more efficient than a client - master call. In both cases we 
  could consider that the client should not connect to the master except for 
  cluster administration (create table, split region, ...). That would increase 
  global reliability. That's for another discussion as well I think.

Here is what I plan to do in the final version:
- move the handler functional code into a function in HRegionServer: this would 
allow the function addToMovedRegion to be private instead of public.
- Change all the CloseRegionHandler to take a RegionServer instead of a server? 
I'm not really keen on adding a class RegionServerServices, but maybe I should?
- Manage the case when the destination is not specified at the beginning of the 
move (maybe in a different Jira if it's not simple)...

All the previous comments should have been taken into account.

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v6.patch


 This is mainly useful when we do a rolling restart. This will decrease the 
 load on the master and the network load.
 Note that a region is not immediately opened after a close. So:
 - it seems preferable to wait before retrying on the other server. An 
 optimisation would be to have a heuristic depending on when the region was 
 closed.
 - during a rolling restart, the server moves the regions then stops. So we 
 may have failures when the server is stopped, and this patch won't help.
 The implementation in the first patch works as follows:
 - on the region move, there is an added parameter on the regionserver#close 
 to say where we are sending the region
 - the regionserver keeps a list of what was moved. Each entry is kept for 100 
 seconds.
 - the regionserver sends a specific exception when it receives a query on a 
 moved region. This exception contains the new address.
 - the client analyses the exceptions and updates its cache accordingly...





[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-02 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13266827#comment-13266827
 ] 

nkeywal commented on HBASE-5877:


I have the same locally, so it's likely my patch...

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v6.patch


 This is mainly useful when we do a rolling restart. This will decrease the 
 load on the master and the network load.
 Note that a region is not immediately opened after a close. So:
 - it seems preferable to wait before retrying on the other server. An 
 optimisation would be to have a heuristic depending on when the region was 
 closed.
 - during a rolling restart, the server moves the regions then stops. So we 
 may have failures when the server is stopped, and this patch won't help.
 The implementation in the first patch works as follows:
 - on the region move, there is an added parameter on the regionserver#close 
 to say where we are sending the region
 - the regionserver keeps a list of what was moved. Each entry is kept for 100 
 seconds.
 - the regionserver sends a specific exception when it receives a query on a 
 moved region. This exception contains the new address.
 - the client analyses the exceptions and updates its cache accordingly...





[jira] [Commented] (HBASE-5905) Protobuf interface for Admin: split between the internal and the external/customer interface

2012-05-02 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13266839#comment-13266839
 ] 

nkeywal commented on HBASE-5905:


This would make sense if we think that customers should/will use the protobuf 
interface.

 Protobuf interface for Admin: split between the internal and the 
 external/customer interface
 

 Key: HBASE-5905
 URL: https://issues.apache.org/jira/browse/HBASE-5905
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal

 After a short discussion with Stack, I created a jira.
 --
 I'm a little bit confused by the protobuf interface for closeRegion.
 We have two types of closeRegion today:
 1) The external ones, available in client.HBaseAdmin. They take the server 
 and the region identifier as parameters and nothing else.
 2) The internal ones, called for example by the master. They have more 
 parameters (like versionOfClosingNode or transitionInZK).
 When I look at protobuf.ProtobufUtil, I see:
   public static void closeRegion(final AdminProtocol admin,
       final byte[] regionName, final boolean transitionInZK) throws IOException {
     CloseRegionRequest closeRegionRequest =
       RequestConverter.buildCloseRegionRequest(regionName, transitionInZK);
     try {
       admin.closeRegion(null, closeRegionRequest);
     } catch (ServiceException se) {
       throw getRemoteException(se);
     }
   }
 In other words, it seems that we merged the two interfaces into a single one. 
 Is that the intent?
 I checked: the internal fields in closeRegionRequest are all optional (that's 
 good). Still, it means that the end user could use them, or at least would 
 need to distinguish between the fields that are optional for functional 
 reasons and the fields that are optional because they should not be used.





[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5844:
---

Attachment: 5844.v3.patch

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 Today, if the region server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will start only after a timeout, usually 30s.
 By deleting the znode in the start script, we remove this delay and the 
 recovery starts immediately.





[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5844:
---

Status: Open  (was: Patch Available)

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 Today, if the region server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will start only after a timeout, usually 30s.
 By deleting the znode in the start script, we remove this delay and the 
 recovery starts immediately.





[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5844:
---

Attachment: 5844.v4.patch

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 Today, if the region server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will start only after a timeout, usually 30s.
 By deleting the znode in the start script, we remove this delay and the 
 recovery starts immediately.





[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5844:
---

Status: Patch Available  (was: Reopened)

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 Today, if the region server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will start only after a timeout, usually 30s.
 By deleting the znode in the start script, we remove this delay and the 
 recovery starts immediately.





[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5844:
---

Status: Patch Available  (was: Open)

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 Today, if the region server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will start only after a timeout, usually 30s.
 By deleting the znode in the start script, we remove this delay and the 
 recovery starts immediately.





[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13267332#comment-13267332
 ] 

nkeywal commented on HBASE-5844:


v4 should be ok.
I will do another jira for the master.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 Today, if the region server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will start only after a timeout, usually 30s.
 By deleting the znode in the start script, we remove this delay and the 
 recovery starts immediately.





[jira] [Created] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.

2012-05-03 Thread nkeywal (JIRA)
nkeywal created HBASE-5924:
--

 Summary: In the client code, don't wait for all the requests to be 
executed before resubmitting a request in error.
 Key: HBASE-5924
 URL: https://issues.apache.org/jira/browse/HBASE-5924
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor


The client (in the function HConnectionManager#processBatchCallback) works in 
two steps:
 - make the requests
 - collect the failures and successes and prepare for retry

It means that when there is an immediate error (region moved, split, dead 
server, ...) we still wait for all the initial requests to be executed before 
resubmitting the failed request. If we have a scenario with all the 
requests taking 5 seconds, we have a final execution time of: 5 (initial 
requests) + 1 (wait time) + 5 (final request) = 11s.

We could improve this by analyzing the results immediately. This would lead us, 
for the scenario mentioned above, to 6 seconds: 1 (wait time) + 5 (final 
request), overlapping with the initial requests.

So we could have a performance improvement of nearly 50% in many cases, and 
much more than 50% if the request execution times differ.
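
The arithmetic of the scenario above can be written down as a small timing model. This is only a sketch under assumptions, not client code: the class and method names are hypothetical, and the model assumes that the failure is detected at a known time and that the retried request runs in parallel with the remaining initial requests.

```java
class RetryTimingSketch {
  // Two-step strategy: wait for every initial request, then pause, then retry.
  static double twoStepSeconds(double slowestInitial, double pause, double retry) {
    return slowestInitial + pause + retry;
  }

  // Immediate strategy: the pause and the retry start as soon as the failure
  // is seen, so the total is whichever finishes last, the slowest initial
  // request or the retried one.
  static double immediateSeconds(double slowestInitial, double failureSeenAt,
      double pause, double retry) {
    return Math.max(slowestInitial, failureSeenAt + pause + retry);
  }

  public static void main(String[] args) {
    // Scenario from the description: 5s requests, 1s wait, immediate error.
    System.out.println(twoStepSeconds(5, 1, 5));
    System.out.println(immediateSeconds(5, 0, 1, 5));
  }
}
```

For the values of the description, the model gives 11 seconds for the two-step strategy and 6 seconds for the immediate one.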






[jira] [Updated] (HBASE-5902) Some scripts are not executable

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5902:
---

Status: Patch Available  (was: Open)

 Some scripts are not executable
 ---

 Key: HBASE-5902
 URL: https://issues.apache.org/jira/browse/HBASE-5902
 Project: HBase
  Issue Type: Bug
  Components: scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Trivial
 Attachments: 5902.v1.patch


 -rw-rw-r--  graceful_stop.sh
 -rw-rw-r--  hbase-config.sh
 -rw-rw-r--  local-master-backup.sh
 -rw-rw-r--  local-regionservers.sh





[jira] [Created] (HBASE-5926) Delete the master znode after a znode crash

2012-05-03 Thread nkeywal (JIRA)
nkeywal created HBASE-5926:
--

 Summary: Delete the master znode after a znode crash
 Key: HBASE-5926
 URL: https://issues.apache.org/jira/browse/HBASE-5926
 Project: HBase
  Issue Type: Improvement
  Components: master, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor


This is the continuation of the work done in HBASE-5844.
But we can't apply exactly the same strategy: for the region server, there is a 
znode per region server, while for the master and backup master there is a 
single znode for both.

So if we apply the same strategy as for a regionserver, we may have this 
scenario:
1) Master starts
2) Backup master starts
3) Master dies
4) ZK detects it
5) Backup master receives the update from ZK
6) Backup master creates the new master node and becomes the main master
7) Previous master script continues
8) Previous master script delete the master node in ZK
9) = issue: we deleted the node just created by the new master

This should not happen often (usually the znode will be delete soon enough), 
but it can happen.
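
The race in steps 6 to 9 can be reproduced with a small in-memory model. This is only a sketch under assumptions: it does not use the ZooKeeper API, the znode path and all names are hypothetical, and the guarded delete is one possible mitigation for illustration, not necessarily what a patch for this issue would do.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the single master znode shared by the master
// and the backup master.
class MasterZnodeSketch {
  static final String MASTER_PATH = "/hbase/master"; // assumed path

  private final ConcurrentMap<String, String> znodes = new ConcurrentHashMap<>();

  // A master registers by creating the znode; fails if one already exists.
  boolean register(String masterId) {
    return znodes.putIfAbsent(MASTER_PATH, masterId) == null;
  }

  // Simulates the removal of the znode once the owner's session is gone.
  void sessionExpired(String masterId) {
    znodes.remove(MASTER_PATH, masterId);
  }

  // Blind delete: what a script does if it reuses the region server strategy.
  void deleteBlindly() {
    znodes.remove(MASTER_PATH);
  }

  // Guarded delete: removes the znode only if it still belongs to masterId.
  boolean deleteIfOwnedBy(String masterId) {
    return znodes.remove(MASTER_PATH, masterId);
  }

  String currentMaster() {
    return znodes.get(MASTER_PATH);
  }
}
```

In this model, a blind delete issued by the script of the dead master removes the entry written by the backup master, while the guarded delete leaves it in place.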





[jira] [Updated] (HBASE-5926) Delete the master znode after a znode crash

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Description: 
This is the continuation of the work done in HBASE-5844.
But we can't apply exactly the same strategy: for the region server, there is a 
znode per region server, while for the master and backup master there is a 
single znode for both.

So if we apply the same strategy as for a regionserver, we may have this 
scenario:
1) Master starts
2) Backup master starts
3) Master dies
4) ZK detects it
5) Backup master receives the update from ZK
6) Backup master creates the new master node and becomes the main master
7) Previous master script continues
8) Previous master script deletes the master node in ZK
9) = issue: we deleted the node just created by the new master

This should not happen often (usually the znode will be deleted soon enough), 
but it can happen.

  was:
This is the continuation of the work done in HBASE-5844.
But we can't apply exactly the same strategy: for the region server, there is a 
znode per region server, while for the master and backup master there is a 
single znode for both.

So if we apply the same strategy as for a regionserver, we may have this 
scenario:
1) Master starts
2) Backup master starts
3) Master dies
4) ZK detects it
5) Backup master receives the update from ZK
6) Backup master creates the new master node and becomes the main master
7) Previous master script continues
8) Previous master script delete the master node in ZK
9) = issue: we deleted the node just created by the new master

This should not happen often (usually the znode will be delete soon enough), 
but it can happen.


 Delete the master znode after a znode crash
 ---

 Key: HBASE-5926
 URL: https://issues.apache.org/jira/browse/HBASE-5926
 Project: HBase
  Issue Type: Improvement
  Components: master, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 This is the continuation of the work done in HBASE-5844.
 But we can't apply exactly the same strategy: for the region server, there is 
 a znode per region server, while for the master & backup master there is a 
 single znode for both.
 So if we apply the same strategy as for a regionserver, we may have this 
 scenario:
 1) Master starts
 2) Backup master starts
 3) Master dies
 4) ZK detects it
 5) Backup master receives the update from ZK
 6) Backup master creates the new master node and becomes the main master
 7) Previous master script continues
 8) Previous master script deletes the master node in ZK
 9) => issue: we deleted the node just created by the new master
 This should not happen often (usually the znode will be deleted soon enough), 
 but it can happen.
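
One way to avoid step 9 could be to make the delete conditional: the script removes the master znode only if it still holds the name of the master that crashed. The sketch below only illustrates that idea and is not what any patch on this issue does. `FakeZk`, `deleteIfOwnedBy`, `MasterZnodeCleanup` and the `/hbase/master` path are hypothetical names, and the in-memory map stands in for ZooKeeper.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for ZooKeeper (znode path -> content). Hypothetical, for illustration only. */
class FakeZk {
    private final Map<String, String> znodes = new HashMap<>();

    void create(String path, String content) {
        znodes.put(path, content);
    }

    String read(String path) {
        return znodes.get(path);
    }

    /**
     * Deletes the znode only if it still contains the expected owner.
     * Returns true if the znode was deleted.
     */
    boolean deleteIfOwnedBy(String path, String expectedOwner) {
        // Map#remove(key, value) removes the entry only when it maps to that value.
        return znodes.remove(path, expectedOwner);
    }
}

class MasterZnodeCleanup {
    static final String MASTER_ZNODE = "/hbase/master"; // assumed path

    /** What the start script of the crashed master would call once the process has exited. */
    static boolean cleanupAfterCrash(FakeZk zk, String crashedMaster) {
        return zk.deleteIfOwnedBy(MASTER_ZNODE, crashedMaster);
    }
}
```

With this check, a backup master that has already taken over (step 6) keeps its znode, because the content no longer matches the crashed master. Against a real ZooKeeper the read and the delete are two separate calls, so a version-checked delete would presumably be needed to close the remaining gap.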





[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Summary: Delete the master znode after a master crash  (was: Delete the 
master znode after a znode crash)

 Delete the master znode after a master crash
 

 Key: HBASE-5926
 URL: https://issues.apache.org/jira/browse/HBASE-5926
 Project: HBase
  Issue Type: Improvement
  Components: master, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor





[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-03 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13267751#comment-13267751
 ] 

nkeywal commented on HBASE-5877:


v12, should be final.

1) ServerName is used everywhere in the interface, thanks to protobuf
2) hadoop.ipc serialization of exceptions is based on #getMessage. So we 
have to parse it internally. It's not visible to the exception user.
3) The code to manage the error in the client package is quite complex. We have 
the exception at the very beginning, and then it's checked again, but we don't 
have the real exception anymore. I used a new historyList to make it work. 
There is another JIRA for other improvements, in which I could get rid of this 
(HBASE-5924)
4) Generated with protobuf 2.4.1
5) The destination in the closeRegion interface is a kind of interface 
hijacking. Other options would be:
- sharing the region state in zookeeper
- letting the regionserver call the master to get the new server. On paper 
this would be more efficient than a client - master call. In both cases we 
could consider that the client should not connect to the master except for 
cluster administration (create table, split region, ...). That would increase 
global reliability. That's for another discussion as well I think.
6) RegionServerServices has been modified to set a destination when removing a 
region from the online regions.
7) In another JIRA I will manage the case when the destination is not specified 
when calling the move function.
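
Point 2 can be illustrated with a small sketch: since only the message string survives the serialization, the destination host and port are written into the message on the server side and parsed back on the client side. The class name, the message format and the accessors below are assumptions made for this example, not the code from the patch.

```java
import java.io.IOException;

/** Hypothetical sketch: carries the new location of a region inside the exception message. */
class RegionMovedSketchException extends IOException {
    private static final String HOST_FIELD = "hostname=";
    private static final String PORT_FIELD = "port=";

    private final String hostname;
    private final int port;

    /** Server side: build the exception from the destination of the region. */
    RegionMovedSketchException(String hostname, int port) {
        super("Region moved to: " + HOST_FIELD + hostname + " " + PORT_FIELD + port);
        this.hostname = hostname;
        this.port = port;
    }

    /** Client side: rebuild the exception from the message that crossed the wire. */
    static RegionMovedSketchException fromMessage(String message) {
        if (message == null) {
            return null;
        }
        int hostAt = message.indexOf(HOST_FIELD);
        int portAt = message.indexOf(PORT_FIELD);
        if (hostAt < 0 || portAt < hostAt) {
            return null; // not a message produced by this class
        }
        try {
            String host = message.substring(hostAt + HOST_FIELD.length(), portAt).trim();
            int port = Integer.parseInt(message.substring(portAt + PORT_FIELD.length()).trim());
            return new RegionMovedSketchException(host, port);
        } catch (NumberFormatException e) {
            return null; // malformed port: treat it as an ordinary failure
        }
    }

    String getHostname() {
        return hostname;
    }

    int getPort() {
        return port;
    }
}
```

Keeping the format and the parser in the same class means that a change to one is visible next to the other, and the caller only ever sees `getHostname()` and `getPort()`.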

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch


 This is mainly useful when we do a rolling restart. This will decrease the 
 load on the master and the network load.
 Note that a region is not immediately opened after a close. So:
 - it seems preferable to wait before retrying on the other server. An 
 optimisation would be to have a heuristic depending on when the region was 
 closed.
 - during a rolling restart, the server moves the regions then stops. So we 
 may have failures when the server is stopped, and this patch won't help.
 The implementation in the first patch does:
 - on the region move, there is an added parameter on the regionserver#close 
 to say where we are sending the region
 - the regionserver keeps a list of what was moved. Each entry is kept 100 
 seconds.
 - the regionserver sends a specific exception when it receives a query on a 
 moved region. This exception contains the new address.
 - the client analyses the exceptions and updates its cache accordingly...
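
The server-side bookkeeping described above can be sketched as a small map with time-based expiry. This is an illustration only: `MovedRegions`, its method names and the injected clock are assumptions made for the example, and the 100-second retention is the value quoted in the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the "list of what was moved" kept by a region server. */
class MovedRegions {
    static final long RETENTION_MS = 100_000L; // "each entry is kept 100 seconds"

    private static final class Entry {
        final String destination;
        final long movedAtMs;

        Entry(String destination, long movedAtMs) {
            this.destination = destination;
            this.movedAtMs = movedAtMs;
        }
    }

    private final Map<String, Entry> moved = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injected so that expiry can be tested

    MovedRegions(LongSupplier clock) {
        this.clock = clock;
    }

    /** Called when a region is closed because it is being moved to another server. */
    void recordMove(String regionName, String destination) {
        moved.put(regionName, new Entry(destination, clock.getAsLong()));
    }

    /** Returns the destination to report to the client, or null if unknown or too old. */
    String destinationOf(String regionName) {
        Entry e = moved.get(regionName);
        if (e == null) {
            return null;
        }
        if (clock.getAsLong() - e.movedAtMs > RETENTION_MS) {
            moved.remove(regionName, e); // lazy cleanup of the expired entry
            return null;
        }
        return e.destination;
    }
}
```

A region server would call `recordMove` when closing a region with a known destination, and consult `destinationOf` before answering a query on a region it no longer hosts. Expired entries are only dropped when they are looked up, so a periodic cleaner would be needed to bound the size of the map.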





[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Attachment: 5877.v12.patch

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch






[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Open  (was: Patch Available)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch






[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-03 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Patch Available  (was: Open)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch






[jira] [Commented] (HBASE-5902) Some scripts are not executable

2012-05-04 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268186#comment-13268186
 ] 

nkeywal commented on HBASE-5902:


 @stack: yes, thank you!

 Some scripts are not executable
 ---

 Key: HBASE-5902
 URL: https://issues.apache.org/jira/browse/HBASE-5902
 Project: HBase
  Issue Type: Bug
  Components: scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Trivial
 Attachments: 5902.v1.patch, 5902v2.txt


 -rw-rw-r--  graceful_stop.sh
 -rw-rw-r--  hbase-config.sh
 -rw-rw-r--  local-master-backup.sh
 -rw-rw-r--  local-regionservers.sh





[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-04 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268199#comment-13268199
 ] 

nkeywal commented on HBASE-5877:


@stack

bq. You don't want to have RegionMovedException carry a ServerName#toString 
instead of host and port?
I think it's safer this way, as I have to parse the string afterward. 
Otherwise, if someone modifies ServerName#toString he will break the parsing in 
RegionMovedException, a class he may never have heard of (yes, it will break 
the unit test :-))

bq. Is this a bug fix?
Unfortunately, it's a feature. The error management is duplicated, and I have 
to manage both cases, because we don't have the exception when we come back to 
the result later in the code.

bq. Put the history of moved regions out into its own class?
You're right, it would be better. Will do.

bq. Don't presize this I'd say: private static final long TIMEOUT_REGION_MOVED 
= (2L * 60L * 1000L);
You would prefer a configurable value? 


bq. Stuff is lazily cleared from movedRegions? Should we have a cleaner come 
visit occasionally?
Agreed, it would be better with a cleaner. Will do as well.

bq. Why you say the above? When we protobuf it, it'll just be an option so it 
shouldn't be too bad?
Yeah, if it was too bad I would not have proposed it :-). It's an imperfection 
to accept I think. We would not have it if we share the regions state within 
the cluster with ZK.

@ted
bq. Under what condition would newHrl be null above ?
Oops. Refactoring error. Removed.

bq. Please remove the space between newHrl and ')' below:
Done.

bq. Would the above code result in NPE since I see the following in javadoc:
It should not happen because we test hrl value before. But I added a check on 
the arguments to make it safer.

bq. Since updateCachedLocations() is used to handle exception, the presizing 
above may not be needed.
Yeah, I sized it thinking: if we're doing a rolling restart we may have 100 
regions with a wrong location if we're really unlucky. As it's small, any 
solution would work here, but I prefer to have the size explicitly set, as it 
says I thought about it and that it's a reasonable size. I added a comment however. 

bq. The indentation of CloseRegionHandler() above is off.
Fixed.


bq. 'will contains' - 'will contain'. 'keep a too old' - 'keep too old'.
Fixed.



 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch






[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-04 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268224#comment-13268224
 ] 

nkeywal commented on HBASE-5844:


Ready to be committed imho.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.





[jira] [Created] (HBASE-5939) Add an autorestart option in the start scripts

2012-05-04 Thread nkeywal (JIRA)
nkeywal created HBASE-5939:
--

 Summary: Add an autorestart option in the start scripts
 Key: HBASE-5939
 URL: https://issues.apache.org/jira/browse/HBASE-5939
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor


When a binary dies on a server, we don't try to restart it, even though it would be 
possible in most cases.

We could have something like:
loop
 start
 wait
 if cleanStop then exit
 if already stopped less than 5 minutes ago sleep 1 minute
endloop

This is simple for the master & backup master, a little bit more complex for the 
region server as it can be stopped by a script or by the shutdown procedure.

In the very long term it could allow a restart with exactly the same assignments.
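
The loop proposed in this issue can be sketched as follows. This is a minimal illustration, not the script from any patch: `AutoRestart`, `runUntilCleanStop` and the injected clock and sleeper are hypothetical names, and the sketch reads "already stopped less than 5 minutes ago" as "the previous unclean stop happened less than 5 minutes before this one".

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the autorestart loop. */
class AutoRestart {
    static final long CRASH_WINDOW_MS = 5 * 60_000L; // "less than 5 minutes ago"
    static final long BACKOFF_MS = 60_000L;          // "sleep 1 minute"

    /**
     * Runs the daemon until it stops cleanly and returns the number of restarts.
     *
     * @param startAndWait starts the process, waits for it to exit, and returns
     *                     true if the stop was a clean one
     * @param clock        current time in milliseconds
     * @param sleeper      blocks for the given number of milliseconds
     */
    static int runUntilCleanStop(BooleanSupplier startAndWait, LongSupplier clock,
                                 LongConsumer sleeper) {
        int restarts = 0;
        long lastCrashMs = Long.MIN_VALUE; // no crash seen yet
        while (true) {
            boolean cleanStop = startAndWait.getAsBoolean();
            if (cleanStop) {
                return restarts;
            }
            long now = clock.getAsLong();
            // Two crashes within a short window: back off to avoid a tight restart loop.
            if (lastCrashMs != Long.MIN_VALUE && now - lastCrashMs < CRASH_WINDOW_MS) {
                sleeper.accept(BACKOFF_MS);
            }
            lastCrashMs = now;
            restarts++;
        }
    }
}
```

Passing the clock and the sleeper as parameters keeps the decision logic testable without waiting. Telling a clean stop from a crash is the part the description calls more complex for the region server, and it is left entirely to the `startAndWait` callback here.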









[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-04 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Open  (was: Patch Available)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch






[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-04 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Attachment: 5877.v15.patch

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 
 5877.v6.patch






[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-04 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Patch Available  (was: Open)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 
 5877.v6.patch






[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-04 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268701#comment-13268701
 ] 

nkeywal commented on HBASE-5877:


Thank you Ted. It's strange, I thought it worked locally. Maybe this test
hung and was not reported as an error. Thank you anyway.




 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877.v1.patch, 5877.v12.patch, 
 5877.v15.patch, 5877.v6.patch






[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-07 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13270212#comment-13270212
 ] 

nkeywal commented on HBASE-5877:


I didn't find the test in error. Could be committed imho.

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v6.patch






[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-08 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13270263#comment-13270263
 ] 

nkeywal commented on HBASE-5877:


Ok... I will do the merge and provide another patch today...

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v6.patch


 This is mainly useful when we do a rolling restart. This will decrease the 
 load on the master and the network load.
 Note that a region is not immediately opened after a close. So:
 - it seems preferable to wait before retrying on the other server. An 
 optimisation would be to have a heuristic depending on when the region was 
 closed.
 - during a rolling restart, the server moves the regions then stops. So we 
 may have failures when the server is stopped, and this patch won't help.
 The implementation in the first patch does the following:
 - on the region move, there is an added parameter on the regionserver#close 
 to say where we are sending the region
 - the regionserver keeps a list of what was moved. Each entry is kept 100 
 seconds.
 - the regionserver sends a specific exception when it receives a query on a 
 moved region. This exception contains the new address.
 - the client analyses the exceptions and updates its cache accordingly...
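
As an illustration of the four steps above, here is a minimal Python sketch. It is not the code of the patch: the class layout and the names MovedRegions, record_move, check and query are assumptions made for this example. Only the 100-second retention and the exception carrying the new address come from the description.

```python
import time


class RegionMovedException(Exception):
    """Raised by the old server; carries the region's new address."""

    def __init__(self, region, new_address):
        super().__init__(f"{region} moved to {new_address}")
        self.region = region
        self.new_address = new_address


class MovedRegions:
    """Hypothetical server-side record of recently moved regions."""

    def __init__(self, ttl_seconds=100.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._moved = {}  # region -> (new_address, time of the move)

    def record_move(self, region, new_address):
        # Called when the region is closed with a known destination.
        self._moved[region] = (new_address, self._clock())

    def check(self, region):
        """Raise RegionMovedException if the region moved recently."""
        entry = self._moved.get(region)
        if entry is None:
            return
        new_address, moved_at = entry
        if self._clock() - moved_at > self._ttl:
            del self._moved[region]  # entry expired, forget it
            return
        raise RegionMovedException(region, new_address)


def query(region, location_cache, servers):
    """Client side: send one query, update the cache on a moved region."""
    address = location_cache[region]
    try:
        servers[address].check(region)
        return address
    except RegionMovedException as moved:
        # Remember the new address so the retry goes to the right server.
        location_cache[region] = moved.new_address
        raise
```

Injecting the clock keeps the expiry testable without sleeping. Once an entry has expired, check() returns silently in this sketch; a real server would answer with its usual error for a region it does not serve.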





[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-08 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Open  (was: Patch Available)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-08 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Attachment: 5877.v18.patch

Hopefully I didn't break anything while doing the merge. Local tests are ok. I 
included Ted's fix.

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-08 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Patch Available  (was: Open)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Open  (was: Patch Available)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Attachment: 5877.v18.patch

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-09 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Patch Available  (was: Open)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 5877.v6.patch







[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271590#comment-13271590
 ] 

nkeywal commented on HBASE-5877:


Seems to be ready for a commit :-]

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 5877.v6.patch







[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-09 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271651#comment-13271651
 ] 

nkeywal commented on HBASE-5877:


bq. Now how does the updateCachelocation help here. If for some reason the 
opening of the region is not yet done and if the client gets 
RegionMovedException the client will try to contact the RS thinking the region 
got moved to it.

Yes, exactly. That's why I kept the sleep in the client code even for this 
RegionMoved case. We could optimize this by adding a timestamp, with a 
heuristic like: we give two seconds for the region to move after it's closed 
on the origin server. Sharing the region state in ZK would be a simpler 
option, as we would know if the region has moved or not.
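
The timestamp heuristic could look like the following sketch. The function name, its parameters, and the assumption that the client learns when the region was closed are all hypothetical; the two-second grace period is the value suggested above.

```python
def wait_before_retry(closed_at, now, grace_seconds=2.0):
    """Seconds the client should still sleep before trying the new server.

    closed_at: time at which the region was closed on the origin server.
    now: current time, on the same clock as closed_at (an assumption:
         both timestamps must be comparable for the heuristic to work).
    """
    elapsed = now - closed_at
    # Nothing left to wait once the grace period has fully elapsed.
    return max(0.0, grace_seconds - elapsed)
```

With this, a client that receives the exception half a second after the close would sleep 1.5 seconds, and one that receives it three seconds later would retry immediately.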

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5939) Add an autorestart option in the start scripts

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5939:
---

Fix Version/s: 0.96.0
   Status: Patch Available  (was: Open)

 Add an autorestart option in the start scripts
 --

 Key: HBASE-5939
 URL: https://issues.apache.org/jira/browse/HBASE-5939
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5939.v4.patch


 When a binary dies on a server, we don't try to restart it, even though it 
 would be possible in most cases.
 We could have something like:
 loop
  start
  wait
  if cleanStop then exit
  if already stopped less than 5 minutes ago then sleep 1 minute
 endloop
 This is simple for master & backup master, a little bit more complex for the 
 region server as it can be stopped by a script or by the shutdown procedure.
 In the very long term, it could allow a restart with exactly the same 
 assignments.
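
The loop above could be sketched as follows in Python. This is only an illustration, not the patch: the actual change targets the start scripts. Here run_once is a hypothetical callable that starts the process, waits for it to end, and returns True on a clean stop; the 5-minute and 1-minute values are the ones from the pseudocode.

```python
import time


def autorestart(run_once, clock=time.monotonic, sleep=time.sleep,
                min_interval=300.0, backoff=60.0):
    """Restart run_once() until it reports a clean stop.

    Returns the number of restarts that were needed.
    """
    last_stop = None  # time of the previous unclean stop, if any
    restarts = 0
    while True:
        clean_stop = run_once()  # start, then wait for the process to end
        if clean_stop:
            return restarts
        now = clock()
        # Already stopped less than 5 minutes ago: back off for 1 minute.
        if last_stop is not None and now - last_stop < min_interval:
            sleep(backoff)
        last_stop = now
        restarts += 1
```

Passing the clock and the sleep function as parameters makes it possible to exercise the back-off rule without waiting for real time to pass.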





[jira] [Updated] (HBASE-5939) Add an autorestart option in the start scripts

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5939:
---

Attachment: 5939.v4.patch

 Add an autorestart option in the start scripts
 --

 Key: HBASE-5939
 URL: https://issues.apache.org/jira/browse/HBASE-5939
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5939.v4.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Open  (was: Patch Available)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 
 5877.v18.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Status: Patch Available  (was: Open)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 
 5877.v18.patch, 5877.v6.patch







[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Attachment: 5877.v18.patch

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 
 5877.v18.patch, 5877.v6.patch







[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-11 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273151#comment-13273151
 ] 

nkeywal commented on HBASE-5877:


Strange, the status is failure, but actually the code is available in the 
trunk (which is good, it saves another merge)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 
 5877.v18.patch, 5877.v6.patch







[jira] [Commented] (HBASE-4955) Use the official versions of surefire & junit

2012-05-11 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273153#comment-13273153
 ] 

nkeywal commented on HBASE-4955:


Update: Still waiting for JUnit 4.11 & Surefire 2.13, still no release date.

 Use the official versions of surefire & junit
 -

 Key: HBASE-4955
 URL: https://issues.apache.org/jira/browse/HBASE-4955
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 We currently use private versions for Surefire & JUnit since HBASE-4763.
 This JIRA tracks what we need to move to official versions.
 Surefire 2.11 is just out, but, after some tests, it does not contain all 
 that we need.
 JUnit: Could be for JUnit 4.11. Issue to monitor:
 https://github.com/KentBeck/junit/issues/359: fixed in our version, no 
 feedback for an integration on trunk
 Surefire: Could be for Surefire 2.12. Issues to monitor are:
 329 (category support): fixed, we use the official implementation from the 
 trunk
 786 (@Category with forkMode=always): fixed, we use the official 
 implementation from the trunk
 791 (incorrect elapsed time on test failure): fixed, we use the official 
 implementation from the trunk
 793 (incorrect time in the XML report): not fixed (reopened) on trunk, fixed 
 on our version.
 760 (does not take into account the test method): fixed in trunk, not fixed 
 in our version
 798 (print immediately the test class name): not fixed in trunk, not fixed in 
 our version
 799 (Allow test parallelization when forkMode=always): not fixed in trunk, 
 not fixed in our version
 800 (redirectTestOutputToFile not taken into account): not yet fixed on trunk, 
 fixed on our version
 800 & 793 are the most important to monitor: they are the only ones that are 
 fixed in our version but not on trunk.





[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5877:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 When a query fails because the region has moved, let the regionserver return 
 the new address to the client
 --

 Key: HBASE-5877
 URL: https://issues.apache.org/jira/browse/HBASE-5877
 Project: HBase
  Issue Type: Improvement
  Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 
 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v18.patch, 
 5877.v18.patch, 5877.v6.patch


 This is mainly useful when we do a rolling restart. This will decrease the 
 load on the master and the network load.
 Note that a region is not immediately opened after a close. So:
 - it seems preferable to wait before retrying on the other server. An 
 optimisation would be to have a heuristic depending on when the region was 
 closed.
 - during a rolling restart, the server moves the regions then stops. So we 
 may have failures when the server is stopped, and this patch won't help.
 The implementation in the first patch does the following:
 - on the region move, there is an added parameter on regionserver#close 
 to say where we are sending the region
 - the regionserver keeps a list of what was moved. Each entry is kept for 100 
 seconds.
 - the regionserver sends a specific exception when it receives a query on a 
 moved region. This exception contains the new address.
 - the client analyses the exceptions and updates its cache accordingly...
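
The mechanism listed above can be illustrated with a minimal Python sketch. It is not the HBase implementation: `RegionMovedError`, `MovedRegions` and `query_with_cache_update` are hypothetical names, servers are plain in-memory objects, and the wait before retrying that the description recommends is left out for brevity. Only the 100-second retention comes from the description.

```python
import time


class RegionMovedError(Exception):
    """Hypothetical error carrying the new location of a moved region."""

    def __init__(self, region, new_server):
        super().__init__(f"region {region} moved to {new_server}")
        self.region = region
        self.new_server = new_server


class MovedRegions:
    """Server-side record of recently moved regions, with expiring entries."""

    def __init__(self, ttl_seconds=100.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._moved = {}  # region -> (destination server, time of the move)

    def record_move(self, region, destination):
        # Called when the region is closed with a known destination.
        self._moved[region] = (destination, self._clock())

    def check(self, region):
        """Raise RegionMovedError if the region was moved recently."""
        entry = self._moved.get(region)
        if entry is None:
            return
        destination, moved_at = entry
        if self._clock() - moved_at > self._ttl:
            del self._moved[region]  # entry expired: forget the move
            return
        raise RegionMovedError(region, destination)


def query_with_cache_update(location_cache, servers, region):
    """Client side: on RegionMovedError, update the cache and retry once."""
    try:
        servers[location_cache[region]].check(region)
    except RegionMovedError as moved:
        location_cache[region] = moved.new_server
        servers[moved.new_server].check(region)
    return location_cache[region]
```

The point of the design is that the client learns the new location from the failed call itself, so it does not have to go back to the master or the catalog to find the region.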





[jira] [Commented] (HBASE-5939) Add an autorestart option in the start scripts

2012-05-11 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273158#comment-13273158
 ] 

nkeywal commented on HBASE-5939:


Changes are not related to the failed test. Patch could be committed imho.

 Add an autorestart option in the start scripts
 --

 Key: HBASE-5939
 URL: https://issues.apache.org/jira/browse/HBASE-5939
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5939.v4.patch


 When a binary dies on a server, we don't try to restart it, although it would 
 be possible in most cases.
 We can have something like:
 loop
  start
  wait
  if cleanStop then exit
  if already stopped less than 5 minutes ago then sleep 1 minute
 endloop
 This is simple for master & backup master, a little bit more complex for the 
 region server, as it can be stopped by a script or by the shutdown procedure.
 In the very long term it could allow a restart with exactly the same 
 assignments.
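
The loop above could look roughly like the following Python sketch. It is only an illustration, not the start script from the patch: `run_once` is a hypothetical callable that starts the process, waits for it to end and reports whether the stop was clean, and the check is read here as "the process died less than `min_uptime` seconds after it was started". The default delays follow the pseudocode and are parameters so they can be changed.

```python
import time


def autorestart(run_once, min_uptime=300.0, backoff=60.0,
                clock=time.monotonic, sleep=time.sleep):
    """Run `run_once` until it reports a clean stop.

    `run_once` starts the process, waits for it to end, and returns True
    when the stop was clean (hypothetical callable, supplied by the caller).
    Returns the number of times the process was started.
    """
    starts = 0
    while True:
        started_at = clock()
        clean_stop = run_once()      # start + wait
        starts += 1
        if clean_stop:
            return starts            # if cleanStop then exit
        if clock() - started_at < min_uptime:
            sleep(backoff)           # died too quickly: back off before restarting
```

Passing `clock` and `sleep` as arguments keeps the loop testable without real waiting.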





[jira] [Created] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-11 Thread nkeywal (JIRA)
nkeywal created HBASE-5992:
--

 Summary: Generalization of region move implementation + manage 
draining servers in bulk assign
 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor


The region move implementation now behaves the same way whether or not the 
destination server is specified. This allows:
 - benefiting from the improvement in HBASE-5877
 - as a side effect, having the coprocessor calls made when the destination 
server is not specified
 
This includes various fixes around draining servers. Draining servers were not 
excluded during a bulk assign. This is now fixed.
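
As an illustration of the draining-server fix, here is a minimal Python sketch of a bulk assignment that skips draining servers. The function name and the round-robin placement are assumptions made for the example; the actual assignment logic in the master is more involved.

```python
def bulk_assign(regions, servers, draining):
    """Round-robin regions over the servers that are not draining.

    `regions`, `servers` and `draining` are plain lists/sets of names
    (hypothetical inputs for this sketch).
    """
    candidates = [s for s in servers if s not in draining]
    if not candidates:
        raise ValueError("no server available for assignment")
    plan = {s: [] for s in candidates}
    for i, region in enumerate(regions):
        plan[candidates[i % len(candidates)]].append(region)
    return plan
```

With servers `a`, `b`, `c` and `b` draining, every region lands on `a` or `c` and `b` receives nothing.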





[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5992:
---

Attachment: 5992.v2.patch

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch






[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5992:
---

Status: Open  (was: Patch Available)

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch






[jira] [Commented] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-11 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273703#comment-13273703
 ] 

nkeywal commented on HBASE-5992:


bq. What does the big refactoring of TestDrainingServer do?
Hopefully it's now non-flaky. Before, both the test and the core code were a 
little bit flaky! :-)

bq. In hbase code base, we have spaces around operators... i.e. this needs 
spaces 
Done.

bq. does removeDeadNotExpiredServers also need to be?
It does compile if it's package protected. I didn't dare do it in the first 
patch, to maintain the existing interface, but since you're asking :-). Done.

TestDrainingServer failed for an unrelated reason in its setup. But well. Fixed 
as well in the next patch (hopefully; I can't actually reproduce the issue 
locally).


 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch






[jira] [Commented] (HBASE-5939) Add an autorestart option in the start scripts

2012-05-11 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273716#comment-13273716
 ] 

nkeywal commented on HBASE-5939:


It would make sense to make it the default. It's just that the developers or 
admins used to a simple kill will be surprised to see the process coming back. 
As you like.

For the release notes, I'm ok. I was planning to update the reference guide 
(including a part on the forgotten-but-useful local-region.sh script), but I 
will write a release note for this one as well.

 Add an autorestart option in the start scripts
 --

 Key: HBASE-5939
 URL: https://issues.apache.org/jira/browse/HBASE-5939
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5939.v4.patch






[jira] [Updated] (HBASE-5939) Add an autorestart option in the start scripts

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5939:
---

 Description: 
When a binary dies on a server, we don't try to restart it, although it would 
be possible in most cases.

We can have something like:
loop
 start
 wait
 if cleanStop then exit
 if already stopped less than 5 minutes ago then sleep 5 minutes
endloop

This is simple for master & backup master, a little bit more complex for the 
region server, as it can be stopped by a script or by the shutdown procedure.

In the very long term it could allow a restart with exactly the same assignments.





  was:
When a binary dies on a server, we don't try to restart it, although it would 
be possible in most cases.

We can have something like:
loop
 start
 wait
 if cleanStop then exit
 if already stopped less than 5 minutes ago then sleep 1 minute
endloop

This is simple for master & backup master, a little bit more complex for the 
region server, as it can be stopped by a script or by the shutdown procedure.

In the very long term it could allow a restart with exactly the same assignments.





Release Note: When launched with autorestart, HBase processes will 
automatically restart unless they were properly terminated, either by a stop 
command or by a cluster stop. To ensure that it does not overload the system 
when the server itself is corrupted and the process cannot be restarted, the 
server sleeps for 5 minutes before restarting if it had already been started 
less than 5 minutes earlier. To use it, launch the process with 
bin/start-hbase autorestart. This option is not fully compatible with the 
existing restart command: if you ask for a restart on a server launched with 
autorestart, the server will restart but the next server instance won't be 
automatically restarted.

 Add an autorestart option in the start scripts
 --

 Key: HBASE-5939
 URL: https://issues.apache.org/jira/browse/HBASE-5939
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5939.v4.patch






[jira] [Commented] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-11 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273724#comment-13273724
 ] 

nkeywal commented on HBASE-5992:


v5 with Stack's and Ted's comments taken into account. Thanks.

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch, 5992.v5.patch






[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5992:
---

Attachment: 5992.v5.patch

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch, 5992.v5.patch






[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-11 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5992:
---

Status: Patch Available  (was: Open)

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch, 5992.v5.patch






[jira] [Commented] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-12 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273997#comment-13273997
 ] 

nkeywal commented on HBASE-5992:


It's org.apache.hadoop.hbase.TestDrainingServer.
I fixed the issue, I'm retesting before uploading the patch.

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch, 5992.v5.patch, 5992.v5.patch






[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-12 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5992:
---

Status: Open  (was: Patch Available)

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v2.patch, 5992.v5.patch, 5992.v5.patch







[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-12 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5992:
---

Attachment: 5992.v11.patch

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v11.patch, 5992.v2.patch, 5992.v5.patch, 
 5992.v5.patch






[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign

2012-05-12 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5992:
---

Status: Patch Available  (was: Open)

 Generalization of region move implementation + manage draining servers in 
 bulk assign
 -

 Key: HBASE-5992
 URL: https://issues.apache.org/jira/browse/HBASE-5992
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5992.v11.patch, 5992.v2.patch, 5992.v5.patch, 
 5992.v5.patch






[jira] [Created] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request

2012-05-14 Thread nkeywal (JIRA)
nkeywal created HBASE-5998:
--

 Summary: Bulk assignment: regionserver optimization by using a 
temporary cache for table descriptors when receiving an open regions request
 Key: HBASE-5998
 URL: https://issues.apache.org/jira/browse/HBASE-5998
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor


During the assignment, on the regionserver, before creating the handlers we 
load the table descriptor. Even if there is a cache, we check the timestamps 
for each region, which is not necessary. The test below uses just one node; 
with more nodes the benefit will increase. By limiting the time spent in 
HRegion#openRegion we increase the parallelization during cluster startup, as 
the master uses a pool of threads to call the RS.
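
The idea of a temporary cache scoped to one open-regions request can be sketched as follows in Python. This is an illustration only, not the code of the patch: `open_regions`, the `(region_name, table_name)` pairs and `load_descriptor` are hypothetical stand-ins for the regionserver request and the descriptor lookup.

```python
def open_regions(regions, load_descriptor):
    """Open a batch of regions, loading each table descriptor at most once.

    `regions` is a list of (region_name, table_name) pairs and
    `load_descriptor` is the expensive lookup (both hypothetical).
    """
    descriptors = {}  # temporary cache, discarded when the request ends
    opened = []
    for region_name, table_name in regions:
        if table_name not in descriptors:
            descriptors[table_name] = load_descriptor(table_name)
        opened.append((region_name, descriptors[table_name]))
    return opened
```

Because the cache lives only for the duration of the request, the per-region timestamp check is skipped without the risk of serving a stale descriptor to a later request.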

-- Without the fix
2012-05-14 11:40:52,501 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Bulk assigning 1193 region(s) to localhost,11003,1336988444043
2012-05-14 11:41:09,947 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Bulk assigning done for localhost,11003,1336988444043

-- With the fix
2012-05-14 11:34:40,444 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Bulk assigning 1193 region(s) to localhost,11003,1336988444043
2012-05-14 11:34:40,929 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Bulk assigning done for localhost,11003,1336988065948
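
For reference, the elapsed times implied by the log lines above can be computed with a few lines of Python. The helper name is made up for the example; the timestamps are copied from the logs.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Seconds between two log timestamps such as '2012-05-14 11:40:52,501'."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Without the fix: about 17.4 seconds for 1193 regions.
without_fix = elapsed_seconds("2012-05-14 11:40:52,501", "2012-05-14 11:41:09,947")
# With the fix: about 0.5 seconds for the same number of regions.
with_fix = elapsed_seconds("2012-05-14 11:34:40,444", "2012-05-14 11:34:40,929")
```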






[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request

2012-05-14 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5998:
---

Attachment: 5998.v2.patch

 Bulk assignment: regionserver optimization by using a temporary cache for 
 table descriptors when receiving an open regions request
 --

 Key: HBASE-5998
 URL: https://issues.apache.org/jira/browse/HBASE-5998
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5998.v2.patch






[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request

2012-05-14 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5998:
---

Fix Version/s: 0.96.0
   Status: Patch Available  (was: Open)

 Bulk assignment: regionserver optimization by using a temporary cache for 
 table descriptors when receiving an open regions request
 --

 Key: HBASE-5998
 URL: https://issues.apache.org/jira/browse/HBASE-5998
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5998.v2.patch






[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request

2012-05-14 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5998:
---

Status: Patch Available  (was: Open)

 Bulk assignment: regionserver optimization by using a temporary cache for 
 table descriptors when receiving an open regions request
 --

 Key: HBASE-5998
 URL: https://issues.apache.org/jira/browse/HBASE-5998
 Project: HBase
  Issue Type: Improvement
  Components: regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5998.v2.patch, 5998.v3.patch






[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receveing an open regions request

2012-05-14 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5998:
---

Attachment: 5998.v3.patch





[jira] [Commented] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receveing an open regions request

2012-05-14 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13274572#comment-13274572
 ] 

nkeywal commented on HBASE-5998:


The failure in TestDrainingServer is not directly related to my change. I will 
have a look and fix the flakiness in another jira.





[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-16 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Attachment: 5926.v6.patch

 Delete the master znode after a master crash
 

 Key: HBASE-5926
 URL: https://issues.apache.org/jira/browse/HBASE-5926
 Project: HBase
  Issue Type: Improvement
  Components: master, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5926.v6.patch


 This is the continuation of the work done in HBASE-5844.
 But we can't apply exactly the same strategy: for the region server, there is 
 a znode per region server, while the master and the backup master share a 
 single znode.
 So if we apply the same strategy as for a regionserver, we may have this 
 scenario:
 1) Master starts
 2) Backup master starts
 3) Master dies
 4) ZK detects it
 5) Backup master receives the update from ZK
 6) Backup master creates the new master node and becomes the main master
 7) Previous master script continues
 8) Previous master script deletes the master node in ZK
 9) => issue: we deleted the node just created by the new master
 This should not happen often (usually the znode will be deleted soon enough), 
 but it can happen.
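
The scenario above can be replayed with a small in-memory model. This is only an illustration in Java under stated assumptions: no ZooKeeper is involved, the single master znode is modelled as one field, and every name (`MasterZnodeRaceSketch`, `createMasterZnode`, `expireSession`, `unconditionalDelete`, `replay`) is hypothetical.

```java
import java.util.Objects;

/** Hypothetical in-memory replay of the master znode scenario; no ZooKeeper involved. */
final class MasterZnodeRaceSketch {

    /** The single master znode: name of the current owner, or null when absent. */
    private String masterZnode;

    void createMasterZnode(String serverName) {
        masterZnode = serverName;
    }

    /** Models ZK dropping the ephemeral node once the owner's session is gone. */
    void expireSession(String serverName) {
        if (Objects.equals(masterZnode, serverName)) {
            masterZnode = null;
        }
    }

    /** What the start script would do if it copied the region server strategy. */
    void unconditionalDelete() {
        masterZnode = null;
    }

    String currentMaster() {
        return masterZnode;
    }

    /** Replays steps 1-9 and returns the znode owner at the end (null if deleted). */
    static String replay() {
        MasterZnodeRaceSketch zk = new MasterZnodeRaceSketch();
        zk.createMasterZnode("master-1");   // 1) master starts; 2) backup master waits
        zk.expireSession("master-1");       // 3) master dies; 4) ZK detects it
        zk.createMasterZnode("master-2");   // 5-6) backup master takes over
        zk.unconditionalDelete();           // 7-8) old master's script deletes the znode
        return zk.currentMaster();          // 9) the new master's znode is gone
    }
}
```

In this model `replay()` ends with no master znode even though `master-2` is alive, which is the issue named in step 9.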





[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-16 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Fix Version/s: 0.96.0
   Status: Patch Available  (was: Open)





[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash

2012-05-16 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277172#comment-13277172
 ] 

nkeywal commented on HBASE-5926:


The race condition is reduced to a production-acceptable minimum, imho. We do 
a compare & delete in the Java code, so the remaining race is between the 
comparison and the delete. We fail if, and only if, in that window the session 
expires, the master node is deleted, and the backup master recreates the node. 
That's unlikely.
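
A compare-then-delete of this kind can be sketched as below. This is a hedged illustration in Java, not the code of the patch: the znode is modelled in memory, and the names `CompareAndDeleteSketch`, `create`, `read`, `delete` and `deleteIfOwnedBy` are hypothetical. The read and the delete are deliberately two separate steps, so the small window described above is visible in the code.

```java
/** Hypothetical sketch of compare-then-delete on an in-memory stand-in for the znode. */
final class CompareAndDeleteSketch {

    /** Content of the master znode: the owner's name, or null when the znode is absent. */
    private String owner;

    synchronized void create(String serverName) {
        owner = serverName;
    }

    synchronized String read() {
        return owner;
    }

    synchronized void delete() {
        owner = null;
    }

    /**
     * Deletes the znode only if it still names expectedOwner.
     * Another server could recreate the znode between read() and delete();
     * that gap is the remaining race.
     */
    boolean deleteIfOwnedBy(String expectedOwner) {
        String current = read();                      // compare ...
        if (current == null || !current.equals(expectedOwner)) {
            return false;                             // absent or owned by someone else
        }
        delete();                                     // ... then delete
        return true;
    }
}
```

If the znode already names the new master, `deleteIfOwnedBy` called with the old master's name returns false and leaves it alone. As a side note, ZooKeeper's own `delete` call takes an expected version, which is one way to make the delete conditional on the server side; whether the patch relies on it is not stated in this thread.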





[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Status: Open  (was: Patch Available)





[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277684#comment-13277684
 ] 

nkeywal commented on HBASE-5926:


v8, with Ted's comments taken into account.





[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Attachment: 5926.v8.patch





[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Status: Patch Available  (was: Open)





[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277702#comment-13277702
 ] 

nkeywal commented on HBASE-5926:


These tests run ok locally.





[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277996#comment-13277996
 ] 

nkeywal commented on HBASE-5926:


bq. javadoc
done.

bq. readMyEphemeralNodeOnDisk() throws IOException but 
writeMyEphemeralNodeOnDisk() doesn't. What was the reason ?
When we write we ignore the results (i.e. we don't stop the master or the 
region server if we can't store the znode, we just continue). When we read, 
we're interested in the exception: the pattern in HMasterCommandLine is to 
return -1 on error.

bq. Please add ' contents' at the end of the above.
ok.

bq. Should the setting be restored before exiting the above method ?
I now clone the conf.
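
The difference between the write path and the read path discussed above can be sketched like this. It is an assumption-laden illustration in Java: the class `ZnodeFileSketch` and the methods `writeZnodePath`, `readZnodePath` and `cleanupExitCode` are hypothetical and only echo the behaviour described in the answer, not the bodies of writeMyEphemeralNodeOnDisk() and readMyEphemeralNodeOnDisk().

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: best-effort write, strict read of a znode path kept on disk. */
final class ZnodeFileSketch {

    /** Best effort: failing to persist the znode path must not stop the server. */
    static boolean writeZnodePath(Path file, String znodePath) {
        try {
            Files.write(file, znodePath.getBytes(StandardCharsets.UTF_8));
            return true;
        } catch (IOException e) {
            // Swallowed on purpose: startup continues without the on-disk copy.
            return false;
        }
    }

    /** The caller needs to know about failures, so the exception propagates. */
    static String readZnodePath(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    /** Mirrors the "return -1 on error" pattern mentioned for HMasterCommandLine. */
    static int cleanupExitCode(Path file) {
        try {
            readZnodePath(file);
            // A real command would go on to delete the znode named in the file.
            return 0;
        } catch (IOException e) {
            return -1;
        }
    }
}
```

The writer reports failure through its return value only, so a caller is free to ignore it, while the reader forces the command-line code to decide what an unreadable file means.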






[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Status: Open  (was: Patch Available)





[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash

2012-05-17 Thread nkeywal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-5926:
---

Attachment: 5926.v9.patch




