[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258788#comment-13258788 ] nkeywal commented on HBASE-5844:

I didn't know this parameter. It's interesting: with ZooKeeper the default timeout is 30 seconds, but with HBase it's now 180s (from hbase-default.xml). It was first increased to 60s in HBASE-1772, apparently because of the GC. This means that deleting the znode immediately represents a huge MTTR improvement for the region server crash case.

Key: HBASE-5844
URL: https://issues.apache.org/jira/browse/HBASE-5844
Project: HBase
Issue Type: Improvement
Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Attachments: 5844.v1.patch

Today, if the region server crashes, its znode is not deleted in ZooKeeper, so the recovery process starts only after a timeout, usually 30s. By deleting the znode in the start script, we remove this delay and the recovery starts immediately.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
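Concretely, the znode being discussed is the region server's ephemeral node in ZooKeeper. A minimal sketch of the idea; the /hbase/rs parent path and the host,port,startcode server name layout are assumptions here, not taken from the patch:

```java
public class RsZnodePath {
    // Hypothetical helper: build the ephemeral znode path a region server
    // registers under, assuming a "/hbase/rs" parent and the usual
    // "host,port,startcode" server name format.
    static String znodeFor(String host, int port, long startcode) {
        return "/hbase/rs/" + host + "," + port + "," + startcode;
    }

    public static void main(String[] args) {
        // A start script could delete this path (e.g. via zkCli) before
        // launching the new process, instead of letting the master wait
        // for the ZooKeeper session timeout to expire.
        System.out.println(znodeFor("rs1.example.com", 60020, 1335000000000L));
    }
}
```

Deleting that path up front is what turns a session-timeout wait (30s to 180s depending on configuration) into an immediate recovery start.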
[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258803#comment-13258803 ] nkeywal commented on HBASE-5844:

For the tracker, it's my private workaround for HBASE-5666; it should not have been included in this patch. Sorry about that. I think it's better to delete the file explicitly, just after the znode deletion. HRegionServer#deleteMyEphemeralNode is called only once, and I added deleteMyEphemeralNodeOnDisk just after this call. If we relied on #deleteOnExit, I fear we could have the file deleted while the znode is still alive. I'm not sure and I have not tried it, but I think it's too easy to enter jvm-specific-behavior territory here. I will fix the Java code and try the whole fix on a real cluster for the v2. Thank you for the review.
[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5844:
Attachment: 5844.v2.patch
[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259692#comment-13259692 ] nkeywal commented on HBASE-5844:

v2 should be ok. It no longer includes the fix for HBASE-5666, so it cannot be tested locally, but I tried it before removing the workaround.
[jira] [Created] (HBASE-5859) Optimize the rolling restart script
nkeywal created HBASE-5859:

Summary: Optimize the rolling restart script
Key: HBASE-5859
URL: https://issues.apache.org/jira/browse/HBASE-5859
Project: HBase
Issue Type: Improvement
Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Priority: Minor

There is a graceful_stop script. Its algorithm is:
{noformat}
for i = 0 to servers.size {
  regionsInServer = servers[i].regions
  move servers[i].regions to random
  stop servers[i]
  start servers[i]
  move regionsInServer to servers[i] // filled back with the same regions
}
{noformat}
It would be possible to optimize it while keeping data locality with:
{noformat}
for i = 0 to servers.size {
  start servers[i*2+1] on the computer of servers[i] // Two RS on the same box
  move servers[i].regions to servers[i*2+1] // The one on the same box
  stop servers[i]
}
{noformat}
There would be an impact with a fixed port configuration. To fix this, we could:
- use a range of ports instead of a single port. This could be an issue for the web port.
- start on a port then reuse the fixed ones when they become available. This is not very elegant if client code is already using the previous port. Moreover, the region server address is written in the meta table.
- do a mix of the two solutions: a range for the server itself, while waiting for the web port to be available.
To be discussed...
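The optimized loop above can be simulated with plain data structures to check that locality is preserved; every name below (restart, the :60021 port for the second process) is a hypothetical stand-in for the real script steps, not code from a patch:

```java
import java.util.*;

public class RollingRestartSketch {
    // Simulation of the optimized rolling restart: a second region server
    // is started on the same host, regions are moved to it locally
    // (preserving data locality), then the old process is stopped.
    public static Map<String, List<String>> restart(Map<String, List<String>> servers) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : servers.entrySet()) {
            String host = e.getKey().split(":")[0];
            String newServer = host + ":60021";            // second RS on the same box
            result.put(newServer, new ArrayList<>(e.getValue())); // local move
            // the old server at e.getKey() is now empty and can be stopped
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> servers = new LinkedHashMap<>();
        servers.put("rs1:60020", Arrays.asList("regionA", "regionB"));
        servers.put("rs2:60020", Arrays.asList("regionC"));
        System.out.println(restart(servers));
    }
}
```

Every region ends up on its original host, which is the data-locality argument above; the fixed-port collision is exactly why a port range is being discussed.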
[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259904#comment-13259904 ] nkeywal commented on HBASE-5844:

You're right. I propose to commit this patch; I will then generalize the solution to the master in another jira.
[jira] [Created] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
nkeywal created HBASE-5877:

Summary: When a query fails because the region has moved, let the regionserver return the new address to the client
Key: HBASE-5877
URL: https://issues.apache.org/jira/browse/HBASE-5877
Project: HBase
Issue Type: Improvement
Components: client, master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

This is mainly useful when we do a rolling restart. This will decrease the load on the master and the network load. Note that a region is not immediately opened after a close. So:
- it seems preferable to wait before retrying on the other server. An optimisation would be to have a heuristic depending on when the region was closed.
- during a rolling restart, the server moves the regions then stops. So we may have failures when the server is stopped, and this patch won't help.
The implementation in the first patch does this:
- on the region move, there is an added parameter on the regionserver#close to say where we are sending the region
- the regionserver keeps a list of what was moved. Each entry is kept 100 seconds.
- the regionserver sends a specific exception when it receives a query on a moved region. This exception contains the new address.
- the client analyses the exceptions and updates its cache accordingly...
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877:
Attachment: 5877.v1.patch
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13261972#comment-13261972 ] nkeywal commented on HBASE-5877:

v1. On an old trunk, so it's just to give an overview. It includes some bits of HBASE-5844 as well. There are 3 workarounds in the implementation:
1) As a ServerName is not serializable, we use the String dedicated to this kind of issue. Acceptable, I think.
2) hadoop.ipc serialization of exceptions is based on #getMessage, so we have to parse it internally. It's not visible to the exception user. Still acceptable (? :-)
3) The code to manage the error in the client package is quite complex. We have the exception at the very beginning, and then it's checked again, but we don't have the real exception anymore. I used a new updateList to make it work; I'm looking for another solution here...
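Workaround 2, carrying the destination server in the exception message and parsing it back on the client, can be sketched as follows. The class name matches the discussion, but the message format and the parse helper are illustrative assumptions, not the actual patch:

```java
import java.io.IOException;

// Sketch of workaround 2: since hadoop.ipc rebuilds remote exceptions from
// getMessage(), the destination server is encoded in the message text and
// parsed back on the client side. The "Region moved to: " prefix is an
// assumption for illustration only.
public class RegionMovedException extends IOException {
    private static final String PREFIX = "Region moved to: ";

    public RegionMovedException(String host, int port) {
        super(PREFIX + host + ":" + port);
    }

    // Client side: recover the new address from the deserialized message.
    public static String parseHostAndPort(String message) {
        if (message == null || !message.startsWith(PREFIX)) {
            return null; // not a region-moved message
        }
        return message.substring(PREFIX.length());
    }

    public static void main(String[] args) {
        RegionMovedException e = new RegionMovedException("rs2.example.com", 60020);
        System.out.println(parseHostAndPort(e.getMessage()));
    }
}
```

The parsing stays internal to the client library, which is why the comment calls it invisible to the exception user.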
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262462#comment-13262462 ] nkeywal commented on HBASE-5877:

bq. Can we mark the failure and make this RegionMovedException behave the same as NotServingRegionException?
Done.
bq. For updateCachedLocations(), please put explanation for parameter on the same line as the parameter:
Done.
bq. 'Failed all' - 'Failed call'
It's an existing comment that we can find again later in the code. It really means "failed all": all the queries on this server failed. I don't mind changing it to something better, but I think we should keep the "all".
bq. 'which the server' - 'which the region'
Done.
bq. Please increase the VERSION of HRegionInterface
Done.
bq. How is the server removed from cache since I see 'continue' above?
That's what makes this code complex and difficult to change: the error is actually managed later, when we don't have the real exception anymore.
bq. For ServerManager.sendRegionClose(), please add javadoc for destServerName param.
Done.
bq. Is it possible that destServerName is null?
Safety checks added.
bq. Please change the above to debug log. Why is the above fatal (regionResult != null)? Step 4 appears in a comment below the above code. Should the above say step 3?
Bad logs fixed.

Fix For: 0.96.0
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262717#comment-13262717 ] nkeywal commented on HBASE-5877:

Note that I'm currently rewriting the patch, as it conflicts with the protobuf stuff that was committed recently... But the logic hasn't changed.

@ted What we're saving in the current implementation is a call to the master. It can be interesting in itself if the moved region is used by a lot of clients. We could do better by letting the client know that the region is now fully available somewhere else and that there is no need to wait before retrying. But right now the region server only knows that the region is closed and moved to another server. It doesn't know if the region is opened yet. We could have this by adding the info in zk, but it would increase the zk load...
[jira] [Reopened] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal reopened HBASE-5844:

There is a regression when the cluster is fully distributed: the start command hangs. I'm on it. In the meantime, would it be possible to undo the commit? Sorry about this.

Fix For: 0.96.0
Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13263140#comment-13263140 ] nkeywal commented on HBASE-5877:

bq. This patch will benefit any move, not just rolling restart, right?
Yes, but as there is a wait time between two tries, I think the benefit will be minimal vs. the wait time for a single client. I could add a heuristic like: if the region was closed more than 2 seconds ago, consider that it's now available on the new server and don't sleep before the next retry. That could lead to having more network messages if the rule is wrong (and the rule will be wrong when the system is overloaded), and it will add some complexity to the client code. Having the real status of the region would solve this. Anyway, with the dev already done to cut the link between master and clients, it can help to save a reconnect to the master. And during a rolling restart with regions moving everywhere, I think it will make a real difference.
bq. I don't see changes to make use of this new functionality? I'd expect the balancer in master to make use of it?
Yes, it's the changes in AssignmentManager: the changes are in the patch, but are quite small in the end, basically:
{noformat}
-unassign(plan.getRegionInfo());
+unassign(plan.getRegionInfo(), false, plan.getDestination());
{noformat}
I still need to manage the case when the destination is not specified at the beginning.
[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13263778#comment-13263778 ] nkeywal commented on HBASE-5844:

I found the issue, and (hopefully) a fix. I will have a new patch by the middle of next week; I will include the master znode in this one...
[jira] [Created] (HBASE-5902) Some scripts are not executable
nkeywal created HBASE-5902:

Summary: Some scripts are not executable
Key: HBASE-5902
URL: https://issues.apache.org/jira/browse/HBASE-5902
Project: HBase
Issue Type: Bug
Components: scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Trivial

-rw-rw-r-- graceful_stop.sh
-rw-rw-r-- hbase-config.sh
-rw-rw-r-- local-master-backup.sh
-rw-rw-r-- local-regionservers.sh
[jira] [Updated] (HBASE-5902) Some scripts are not executable
[ https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5902:
Attachment: 5902.v1.patch
[jira] [Created] (HBASE-5903) Detect the test classes without categories
nkeywal created HBASE-5903:

Summary: Detect the test classes without categories
Key: HBASE-5903
URL: https://issues.apache.org/jira/browse/HBASE-5903
Project: HBase
Issue Type: Improvement
Components: build, test
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

The tests are executed by category. When a test does not have a category, it's run neither on prebuild nor on the central build. This new test checks the test classes and lists the ones without a category; it fails if it finds one. As it's a small test, it will be executed on the developer machine and will fail immediately on the central builds.
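The check being proposed can be sketched with plain reflection. The @Category stand-in below is a local annotation defined for illustration; the real test would look for JUnit's org.junit.experimental.categories.Category on each test class:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;

public class CategoryCheckSketch {
    // Stand-in for JUnit's @Category annotation (illustrative only).
    @Retention(RetentionPolicy.RUNTIME)
    @interface Category { Class<?>[] value(); }

    interface SmallTests {}

    @Category(SmallTests.class)
    static class TestWithCategory {}

    static class TestWithoutCategory {}

    // Returns the names of the test classes missing a @Category annotation;
    // a real check would fail the build if this list is non-empty.
    static List<String> uncategorized(Class<?>... testClasses) {
        List<String> missing = new ArrayList<>();
        for (Class<?> c : testClasses) {
            if (!c.isAnnotationPresent(Category.class)) {
                missing.add(c.getSimpleName());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(uncategorized(TestWithCategory.class, TestWithoutCategory.class));
    }
}
```

Because the scan itself is cheap, it can run as a small test and flag an uncategorized class on the developer machine before the central build ever sees it.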
[jira] [Updated] (HBASE-5903) Detect the test classes without categories
[ https://issues.apache.org/jira/browse/HBASE-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5903:
Attachment: 5903.v3.patch
[jira] [Updated] (HBASE-5903) Detect the test classes without categories
[ https://issues.apache.org/jira/browse/HBASE-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5903:
Fix Version/s: 0.96.0
Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5903) Detect the test classes without categories
[ https://issues.apache.org/jira/browse/HBASE-5903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13264978#comment-13264978 ] nkeywal commented on HBASE-5903:

Considering the actual patch, we can just consider TestAssignmentManager as a little bit flaky ;-)
[jira] [Created] (HBASE-5905) Protobuf interface for Admin: split between the internal and the external/customer interface
nkeywal created HBASE-5905: -- Summary: Protobuf interface for Admin: split between the internal and the external/customer interface Key: HBASE-5905 URL: https://issues.apache.org/jira/browse/HBASE-5905 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal After a short discussion with Stack, I created a jira. -- I'm a little bit confused by the protobuf interface for closeRegion. We have two types of closeRegion today: 1) the external ones, available in client.HBaseAdmin. They take the server and the region identifier as parameters and nothing else. 2) The internal ones, called for example by the master. They have more parameters (like versionOfClosingNode or transitionInZK). When I look at protobuf.ProtobufUtil, I see:

  public static void closeRegion(final AdminProtocol admin, final byte[] regionName,
      final boolean transitionInZK) throws IOException {
    CloseRegionRequest closeRegionRequest =
        RequestConverter.buildCloseRegionRequest(regionName, transitionInZK);
    try {
      admin.closeRegion(null, closeRegionRequest);
    } catch (ServiceException se) {
      throw getRemoteException(se);
    }
  }

In other words, it seems that we merged the two interfaces into a single one. Is that the intent? I checked: the internal fields in closeRegionRequest are all optional (that's good). Still, it means that the end user could use them, or at least would need to distinguish between the fields that are optional for functional reasons and the optional "do not use" ones.
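One way to get the split discussed above, sketched here with invented names (this is not the actual HBase API): keep a single wire-level request, but expose two Java entry points so external callers never see the internal-only fields.

```java
public class AdminFacade {
  // Minimal stand-in for the generated protobuf request; in real protobuf
  // the internal-only field would simply be declared `optional`.
  static class CloseRegionRequest {
    byte[] regionName;
    Boolean transitionInZK; // internal-only; null means "not set"
  }

  /** External/customer entry point: region identifier only. */
  static CloseRegionRequest closeRegion(byte[] regionName) {
    return closeRegion(regionName, null);
  }

  /** Internal entry point (e.g. called by the master): extra fields allowed. */
  static CloseRegionRequest closeRegion(byte[] regionName, Boolean transitionInZK) {
    CloseRegionRequest r = new CloseRegionRequest();
    r.regionName = regionName;
    r.transitionInZK = transitionInZK;
    return r;
  }

  public static void main(String[] args) {
    // External callers can never set the internal-only field.
    System.out.println(closeRegion(new byte[]{1}).transitionInZK == null);
    // prints: true
  }
}
```

The design point is that "optional on the wire" does not by itself tell the end user which fields are theirs to use; a narrower public method (or at least a comment, as suggested in the thread) makes the distinction explicit.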
[jira] [Commented] (HBASE-5905) Protobuf interface for Admin: split between the internal and the external/customer interface
[ https://issues.apache.org/jira/browse/HBASE-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265710#comment-13265710 ] nkeywal commented on HBASE-5905: I don't think so. But we could add a specific comment?
[jira] [Commented] (HBASE-5902) Some scripts are not executable
[ https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266464#comment-13266464 ] nkeywal commented on HBASE-5902: I haven't submitted the patch as it does not change the Java code, but it can be committed. Some scripts are not executable --- Key: HBASE-5902 URL: https://issues.apache.org/jira/browse/HBASE-5902 Project: HBase Issue Type: Bug Components: scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Trivial Attachments: 5902.v1.patch
-rw-rw-r-- graceful_stop.sh
-rw-rw-r-- hbase-config.sh
-rw-rw-r-- local-master-backup.sh
-rw-rw-r-- local-regionservers.sh
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Attachment: 5877.v6.patch When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v6.patch This is mainly useful when we do a rolling restart. This will decrease the load on the master and the network load. Note that a region is not immediately opened after a close. So:
- it seems preferable to wait before retrying on the other server. An optimisation would be to have a heuristic depending on when the region was closed.
- during a rolling restart, the server moves the regions then stops. So we may have failures when the server is stopped, and this patch won't help.
The implementation in the first patch does:
- on the region move, there is an added parameter on the regionserver#close to say where we are sending the region
- the regionserver keeps a list of what was moved. Each entry is kept 100 seconds.
- the regionserver sends a specific exception when it receives a query on a moved region. This exception contains the new address.
- the client analyses the exceptions and updates its cache accordingly...
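The "keep a list of what was moved, each entry for 100 seconds" bookkeeping can be sketched as below. This is a hedged illustration with invented names, not the actual HBASE-5877 classes; time is passed in explicitly so the expiry logic is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MovedRegions {
  private static final class Entry {
    final String newAddress;
    final long expireAtMs;
    Entry(String newAddress, long expireAtMs) {
      this.newAddress = newAddress;
      this.expireAtMs = expireAtMs;
    }
  }

  private final Map<String, Entry> moved = new ConcurrentHashMap<>();
  private final long ttlMs;

  MovedRegions(long ttlMs) { this.ttlMs = ttlMs; }

  /** Called on close: record where the region is being sent. */
  void regionMoved(String regionName, String destination, long nowMs) {
    moved.put(regionName, new Entry(destination, nowMs + ttlMs));
  }

  /** Returns the new address if known and not expired, else null. */
  String newAddressFor(String regionName, long nowMs) {
    Entry e = moved.get(regionName);
    if (e == null) return null;
    if (nowMs >= e.expireAtMs) { moved.remove(regionName); return null; }
    return e.newAddress;
  }

  public static void main(String[] args) {
    MovedRegions cache = new MovedRegions(100_000); // 100 s, as in the description
    cache.regionMoved("region-1", "rs-b:60020", 0);
    System.out.println(cache.newAddressFor("region-1", 50_000));  // prints: rs-b:60020
    System.out.println(cache.newAddressFor("region-1", 150_000)); // prints: null
  }
}
```

When a query hits a region in this map, the regionserver would answer with the specific exception carrying `newAddress`, and the client would update its cache instead of going back to the master.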
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266656#comment-13266656 ] nkeywal commented on HBASE-5877: Generated with protobuf 2.4.1 Here are the things I'm not a big fan of, but for which I don't have a better solution:
- the move management in the client code: I think it's possible to change the way we manage errors (don't wait for all results before retrying), but that would be for another JIRA
- the destination in the closeRegion interface is a kind of interface hijacking. Other options would be: sharing the region state in zookeeper, or letting the regionserver call the master to get the new server. On paper this would be more efficient than a client - master call. In both cases we could consider that the client should not connect to the master except for cluster administration (create table, split region, ...). That would increase global reliability. That's for another discussion as well, I think.
Here is what I plan to do in the final version:
- move the handler functional code into a function in HRegionServer: this would allow us to have the function addToMovedRegion as private instead of public.
- Change all the CloseRegionHandler to take a RegionServer instead of a server? I'm not really keen on adding a class RegionServerServices, but maybe I should?
- Manage the case when the destination is not specified at the beginning of the move (maybe in a different Jira if it's not simple)...
All the previous comments should have been taken into account.
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266827#comment-13266827 ] nkeywal commented on HBASE-5877: I have the same locally, so it's likely my patch...
[jira] [Commented] (HBASE-5905) Protobuf interface for Admin: split between the internal and the external/customer interface
[ https://issues.apache.org/jira/browse/HBASE-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266839#comment-13266839 ] nkeywal commented on HBASE-5905: This would make sense if we think that customers should/will use the protobuf interface.
[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5844: --- Attachment: 5844.v3.patch
[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5844: --- Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5844: --- Attachment: 5844.v4.patch
[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5844: --- Status: Patch Available (was: Reopened)
[jira] [Updated] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5844: --- Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13267332#comment-13267332 ] nkeywal commented on HBASE-5844: v4 should be ok. I will do another jira for the master.
[jira] [Created] (HBASE-5924) In the client code, don't wait for all the requests to be executed before resubmitting a request in error.
nkeywal created HBASE-5924: -- Summary: In the client code, don't wait for all the requests to be executed before resubmitting a request in error. Key: HBASE-5924 URL: https://issues.apache.org/jira/browse/HBASE-5924 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor The client (in the function HConnectionManager#processBatchCallback) works in two steps:
- make the requests
- collect the failures and successes and prepare for retry
It means that when there is an immediate error (region moved, split, dead server, ...) we still wait for all the initial requests to be executed before resubmitting the failed request. If we have a scenario with all the requests taking 5 seconds, we have a final execution time of: 5 (initial requests) + 1 (wait time) + 5 (final request) = 11s. We could improve this by analyzing the results immediately. For the scenario mentioned above, this would lead us to 6 seconds. So we could have a performance improvement of nearly 50% in many cases, and much more than 50% if the request execution times differ.
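The idea above can be sketched with `ExecutorCompletionService`, which hands back futures as they complete instead of in submission order. This is a hedged illustration of the retry pattern, not the HConnectionManager code; all names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EagerRetry {
  /** Runs the tasks, handling each result as soon as it completes; a failed
   *  task is resubmitted immediately (as the given fallback here), rather
   *  than after the whole batch has drained. */
  static List<String> collectWithRetry(List<Callable<String>> tasks,
                                       Callable<String> fallback) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    CompletionService<String> done = new ExecutorCompletionService<>(pool);
    for (Callable<String> t : tasks) done.submit(t);
    List<String> results = new ArrayList<>();
    int remaining = tasks.size();
    while (remaining-- > 0) {
      try {
        results.add(done.take().get()); // returns as soon as ANY task finishes
      } catch (ExecutionException e) {
        done.submit(fallback); // resubmit right away
        remaining++;
      }
    }
    pool.shutdown();
    Collections.sort(results); // completion order is nondeterministic
    return results;
  }

  public static void main(String[] args) throws Exception {
    List<Callable<String>> tasks = List.of(
        () -> "ok-1",
        () -> { throw new IllegalStateException("region moved"); },
        () -> "ok-3");
    System.out.println(collectWithRetry(tasks, () -> "retried"));
    // prints: [ok-1, ok-3, retried]
  }
}
```

The retry no longer sits behind the slowest request of the batch, which is exactly the 11s-to-6s improvement computed in the description.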
[jira] [Updated] (HBASE-5902) Some scripts are not executable
[ https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5902: --- Status: Patch Available (was: Open)
[jira] [Created] (HBASE-5926) Delete the master znode after a znode crash
nkeywal created HBASE-5926: -- Summary: Delete the master znode after a znode crash Key: HBASE-5926 URL: https://issues.apache.org/jira/browse/HBASE-5926 Project: HBase Issue Type: Improvement Components: master, scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor This is the continuation of the work done in HBASE-5844. But we can't apply exactly the same strategy: there is a znode per region server, while the master and the backup master share a single znode. So if we apply the same strategy as for a regionserver, we may have this scenario:
1) Master starts
2) Backup master starts
3) Master dies
4) ZK detects it
5) Backup master receives the update from ZK
6) Backup master creates the new master node and becomes the main master
7) Previous master script continues
8) Previous master script deletes the master node in ZK
9) = issue: we deleted the node just created by the new master
This should not happen often (usually the znode will be deleted soon enough), but it can happen.
[jira] [Updated] (HBASE-5926) Delete the master znode after a znode crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Description: This is the continuation of the work done in HBASE-5844. But we can't apply exactly the same strategy: there is a znode per region server, while the master and the backup master share a single znode. So if we apply the same strategy as for a regionserver, we may have this scenario:
1) Master starts
2) Backup master starts
3) Master dies
4) ZK detects it
5) Backup master receives the update from ZK
6) Backup master creates the new master node and becomes the main master
7) Previous master script continues
8) Previous master script deletes the master node in ZK
9) = issue: we deleted the node just created by the new master
This should not happen often (usually the znode will be deleted soon enough), but it can happen.
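The race in steps 6-9 disappears if step 8 deletes the znode only when it still belongs to the crashed master. The sketch below illustrates that guard with a tiny in-memory stand-in for ZooKeeper (invented names, not the real ZooKeeper client API, which would express the same idea via the znode's data or version in a conditional delete).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MasterZnode {
  private final Map<String, String> znodes = new ConcurrentHashMap<>();

  void create(String path, String owner) { znodes.put(path, owner); }
  String get(String path) { return znodes.get(path); }

  /** Unsafe cleanup: step 8 of the scenario, deletes unconditionally. */
  void deleteUnconditionally(String path) { znodes.remove(path); }

  /** Safe cleanup: delete only if the znode still names the dead master. */
  boolean deleteIfOwnedBy(String path, String deadMaster) {
    return znodes.remove(path, deadMaster); // atomic compare-and-remove
  }

  public static void main(String[] args) {
    MasterZnode zk = new MasterZnode();
    zk.create("/hbase/master", "backup-master"); // step 6: backup took over
    // Step 8, safe variant: the old master's script sees the node is no
    // longer its own, so it leaves the new master's node alone.
    boolean deleted = zk.deleteIfOwnedBy("/hbase/master", "old-master");
    System.out.println(deleted + " " + zk.get("/hbase/master"));
    // prints: false backup-master
  }
}
```

With `deleteUnconditionally` instead, the script would wipe out the node the backup master just created, which is exactly the step-9 issue.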
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Summary: Delete the master znode after a master crash (was: Delete the master znode after a znode crash)
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13267751#comment-13267751 ] nkeywal commented on HBASE-5877: v12, should be final. 1) ServerName is used everywhere in the interface, thanks to protobuf 2) hadoop.ipc serialization of exceptions is based on #getMessage. So we have to parse it internally. It's not visible to the user of the exception. 3) The code to manage the error in the client package is quite complex. We have the exception at the very beginning, and then it's checked again, but we don't have the real exception anymore. I used a new historyList to make it work. There is another JIRA for other improvements, in which I could get rid of this (HBASE-5924) 4) Generated with protobuf 2.4.1 5) The destination in the closeRegion interface is a kind of interface hijacking. Other options would be: - sharing the region state in zookeeper - letting the regionserver call the master to get the new server. On paper this would be more efficient than a client-to-master call. In both cases we could consider that the client should not connect to the master except for cluster administration (create table, split region, ...). That would increase global reliability. That's for another discussion as well I think. 6) RegionServerServices has been modified to set a destination when removing a region from the online regions. 7) In another JIRA I will manage the case where the destination is not specified when calling the move function. When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch This is mainly useful when we do a rolling restart. 
This will decrease the load on the master and the network load. Note that a region is not immediately opened after a close. So: - it seems preferable to wait before retrying on the other server. An optimisation would be to have a heuristic depending on when the region was closed. - during a rolling restart, the server moves the regions then stops. So we may have failures when the server is stopped, and this patch won't help. The implementation in the first patch does: - on the region move, there is an added parameter on the regionserver#close to say where we are sending the region - the regionserver keeps a list of what was moved. Each entry is kept 100 seconds. - the regionserver sends a specific exception when it receives a query on a moved region. This exception contains the new address. - the client analyses the exceptions and updates its cache accordingly...
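The regionserver-side mechanics described above (a list of recently moved regions, each entry kept about 100 seconds, and an exception carrying the new address) can be sketched as follows. This is an illustrative Python model, not the HBase Java implementation; the class and method names (`MovedRegionsTracker`, `region_closed`, `check`) are hypothetical.

```python
import time

class RegionMovedException(Exception):
    """Carries the new location so the client can update its cache."""
    def __init__(self, region, host, port):
        super().__init__(f"Region {region} moved to {host}:{port}")
        self.region, self.host, self.port = region, host, port

class MovedRegionsTracker:
    """Regionserver-side list of recently moved regions, expired after a TTL."""
    def __init__(self, ttl_seconds=100, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.moved = {}  # region -> (dest_host, dest_port, moved_at)

    def region_closed(self, region, dest_host, dest_port):
        # Called on close with the added "destination" parameter.
        self.moved[region] = (dest_host, dest_port, self.clock())

    def check(self, region):
        # Called when a query arrives for a region this server no longer holds.
        entry = self.moved.get(region)
        if entry is None:
            return
        host, port, moved_at = entry
        if self.clock() - moved_at > self.ttl:
            del self.moved[region]  # lazily drop stale entries
            return
        raise RegionMovedException(region, host, port)

# Usage with a fake clock so the TTL expiry is deterministic.
now = [0.0]
tracker = MovedRegionsTracker(ttl_seconds=100, clock=lambda: now[0])
tracker.region_closed("r1", "rs2.example", 60020)
try:
    tracker.check("r1")
except RegionMovedException as e:
    assert (e.host, e.port) == ("rs2.example", 60020)  # client updates its cache
now[0] = 200.0
tracker.check("r1")  # entry expired: falls through to the normal error path
assert "r1" not in tracker.moved
```

The lazy expiry in `check` matches the review discussion below, where a periodic cleaner is suggested as an improvement over clearing entries only when they are next queried.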
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Attachment: 5877.v12.patch When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Open (was: Patch Available) When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Patch Available (was: Open) When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch
[jira] [Commented] (HBASE-5902) Some scripts are not executable
[ https://issues.apache.org/jira/browse/HBASE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268186#comment-13268186 ] nkeywal commented on HBASE-5902: @stack: yes, thank you! Some scripts are not executable --- Key: HBASE-5902 URL: https://issues.apache.org/jira/browse/HBASE-5902 Project: HBase Issue Type: Bug Components: scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Trivial Attachments: 5902.v1.patch, 5902v2.txt -rw-rw-r-- graceful_stop.sh -rw-rw-r-- hbase-config.sh -rw-rw-r-- local-master-backup.sh -rw-rw-r-- local-regionservers.sh
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268199#comment-13268199 ] nkeywal commented on HBASE-5877: @stack bq. You don't want to have RegionMovedException carry a ServerName#toString instead of host and port? I think it's safer this way, as I have to parse the string afterward. Otherwise, if someone modifies ServerName#toString they will break the parsing in RegionMovedException, a class they may never have heard of (yes, it will break the unit test :-)) bq. Is this a bug fix? Unfortunately, it's a feature. The error management is duplicated, and I have to manage both cases, because we don't have the exception when we come back to the result later in the code. bq. Put the history of moved regions out into its own class? You're right, it would be better. Will do. bq. Don't presize this I'd say: private static final long TIMEOUT_REGION_MOVED = (2L * 60L * 1000L); You would prefer a configurable value? bq. Stuff is lazily cleared from movedRegions? Should we have a cleaner come visit occasionally? Agreed, it would be better with a cleaner. Will do as well. bq. Why you say the above? When we protobuf it, it'll just be an option so it shouldn't be too bad? Yeah, if it was too bad I would not have proposed it :-). It's an imperfection to accept I think. We would not have it if we shared the region state within the cluster with ZK. @ted bq. Under what condition would newHrl be null above ? Oops. Refactoring error. Removed. bq. Please remove the space between newHrl and ')' below: Done. bq. Would the above code result in NPE since I see the following in javadoc: It should not happen because we test the hrl value before. But I added a check on the arguments to make it safer. bq. Since updateCachedLocations() is used to handle exceptions, the presizing above may not be needed. 
Yeah, I sized it thinking: if we're doing a rolling restart we may have 100 regions with a wrong location if we're really unlucky. As it is small, any solution would work here, but I prefer to have the size explicitly set, as it says I thought about it and that it's a reasonable size. I added a comment however. bq. The indentation of CloseRegionHandler() above is off. Fixed. bq. 'will contains' - 'will contain'. 'keep a too old' - 'keep too old'. Fixed. When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch
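The first point of the exchange above (hadoop.ipc serializes an exception via its #getMessage, so the new location must be embedded in the message text and parsed back out) can be illustrated with a small sketch. This is not the HBase RegionMovedException code; the message format and helper names here are made up, and the point is only the design choice being defended: the exception owns its own fixed message format, so it never breaks if some other class's toString changes.

```python
import re

# Fixed, self-contained message format: parsing depends only on this class,
# not on how a ServerName-like object happens to render itself.
_FORMAT = "Region moved to: hostname={host} port={port}"
_PATTERN = re.compile(r"hostname=(\S+) port=(\d+)")

def encode_moved_message(host, port):
    """Build the message string the exception would carry over the wire."""
    return _FORMAT.format(host=host, port=port)

def parse_moved_message(message):
    """Return (host, port) or None if the message carries no location."""
    m = _PATTERN.search(message)
    if m is None:
        return None
    return m.group(1), int(m.group(2))

msg = encode_moved_message("rs2.example.com", 60020)
assert parse_moved_message(msg) == ("rs2.example.com", 60020)
assert parse_moved_message("some other error") is None
```

A round-trip test like the last two lines is what the comment alludes to: if the format drifts, the unit test fails inside the exception class itself rather than in an unrelated caller.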
[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash
[ https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268224#comment-13268224 ] nkeywal commented on HBASE-5844: Ready to be committed imho. Delete the region servers znode after a regions server crash Key: HBASE-5844 URL: https://issues.apache.org/jira/browse/HBASE-5844 Project: HBase Issue Type: Improvement Components: regionserver, scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Fix For: 0.96.0 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 5844.v3.patch, 5844.v4.patch Today, if the region server crashes, its znode is not deleted in ZooKeeper, so the recovery process will start only after a timeout, usually 30s. By deleting the znode in the start script, we remove this delay and the recovery starts immediately.
[jira] [Created] (HBASE-5939) Add an autorestart option in the start scripts
nkeywal created HBASE-5939: -- Summary: Add an autorestart option in the start scripts Key: HBASE-5939 URL: https://issues.apache.org/jira/browse/HBASE-5939 Project: HBase Issue Type: Improvement Components: master, regionserver, scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor When a binary dies on a server, we don't try to restart it, while it would be possible in most cases. We could have something like:
loop
  start
  wait
  if cleanStop then exit
  if already stopped less than 5 minutes ago, sleep 1 minute
endloop
This is simple for the master and backup master, a little more complex for the region server, as it can be stopped by a script or by the shutdown procedure. In the long term it could allow a restart with exactly the same assignments.
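The restart loop sketched in the issue can be modeled as below. This is an illustrative Python model of the control flow, not the eventual shell script; `run_once`, `clean_stop_requested`, and the injected clock/sleep are hypothetical hooks standing in for launching the daemon, checking a clean-stop marker, and real time.

```python
def autorestart(run_once, clean_stop_requested, clock, sleep, min_uptime=300):
    """Keep restarting a process; back off if it died too quickly.

    run_once: blocking call that starts the binary and returns when it exits.
    clean_stop_requested: True if the stop was deliberate (e.g. a marker file
    written by the stop script), so the loop must exit instead of restarting.
    """
    while True:
        started = clock()
        run_once()                      # "start" + "wait" from the pseudocode
        if clean_stop_requested():
            return "clean stop"         # "if cleanStop then exit"
        if clock() - started < min_uptime:
            sleep(60)                   # died within 5 minutes: wait a minute

# Simulation with fake clock/sleep: two fast crashes, then a clean stop.
events = []
now = [0.0]
runs = iter([("crash", 5), ("crash", 5), ("stop", 400)])
outcome = [None]

def run_once():
    outcome[0], duration = next(runs)
    now[0] += duration                  # the process "ran" for this long
    events.append(outcome[0])

autorestart(run_once,
            clean_stop_requested=lambda: outcome[0] == "stop",
            clock=lambda: now[0],
            sleep=lambda s: events.append(f"sleep {s}"))
assert events == ["crash", "sleep 60", "crash", "sleep 60", "stop"]
```

Injecting the clock and sleep keeps the policy testable; a shell version would get the same structure from `wait`, a marker file for cleanStop, and `$SECONDS` arithmetic for the uptime check.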
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Open (was: Patch Available) When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v6.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Attachment: 5877.v15.patch When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 5877.v6.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Patch Available (was: Open) When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 5877.v6.patch
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268701#comment-13268701 ] nkeywal commented on HBASE-5877: Thank you, Ted. It's strange, I thought it worked locally. Maybe this test hung and was not reported as an error. Thank you anyway. When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877-v16.txt, 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 5877.v6.patch
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13270212#comment-13270212 ] nkeywal commented on HBASE-5877: I didn't find the test in error. Could be committed imho. When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877-v16.txt, 5877-v17.txt, 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 5877.v6.patch
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13270263#comment-13270263 ] nkeywal commented on HBASE-5877: Ok... I will do the merge and provide another patch today... When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 5877.v6.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Open (was: Patch Available) When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 5877.v6.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Attachment: 5877.v18.patch Hopefully I didn't break anything while doing the merge. Local tests are ok. I included Ted's fix. When a query fails because the region has moved, let the regionserver return the new address to the client -- Key: HBASE-5877 URL: https://issues.apache.org/jira/browse/HBASE-5877 Project: HBase Issue Type: Improvement Components: client, master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5877-v16.txt, 5877-v17.txt, 5877-v17.txt, 5877.v1.patch, 5877.v12.patch, 5877.v15.patch, 5877.v18.patch, 5877.v6.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Attachment: 5877.v18.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13271590#comment-13271590 ] nkeywal commented on HBASE-5877: Seems to be ready for a commit :-]
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13271651#comment-13271651 ] nkeywal commented on HBASE-5877: bq. Now how does the updateCachelocation help here. If for some reason the opening of the region is not yet done and if the client gets RegionMovedException the client will try to contact the RS thinking the region got moved to it. Yes, exactly. That's why I kept the sleep in the client code even for this RegionMoved case. We could optimize this by adding a timestamp, with a heuristic like: we give two seconds for the region to move after it's closed on the origin server. Sharing the region state in ZK would be a simpler option, as we would know whether the region has moved or not.
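The timestamp heuristic suggested above could look like this. It is a hypothetical sketch: the close timestamp would have to be carried in the RegionMovedException, which the patch does not currently do, and the names are invented for illustration:

```java
// Hypothetical sketch of the heuristic discussed above: give the region
// ~2 seconds to open on the destination after it was closed on the origin.
class RetryDelay {
    static final long GRACE_MS = 2000L;

    // closeTimeMs: when the origin server closed the region (assumed to be
    // carried in the exception); nowMs: the client's current time.
    static long sleepBeforeRetryMs(long closeTimeMs, long nowMs) {
        long elapsed = nowMs - closeTimeMs;
        if (elapsed >= GRACE_MS) {
            return 0L; // the region has likely finished opening; retry at once
        }
        return GRACE_MS - elapsed; // wait out the remainder of the grace period
    }
}
```

With this, a client that receives the exception long after the move retries immediately, while a client racing the region open sleeps only for what is left of the grace period rather than a fixed delay.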
[jira] [Updated] (HBASE-5939) Add an autorestart option in the start scripts
[ https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5939: --- Fix Version/s: 0.96.0 Status: Patch Available (was: Open) Add an autorestart option in the start scripts -- Key: HBASE-5939 URL: https://issues.apache.org/jira/browse/HBASE-5939 Project: HBase Issue Type: Improvement Components: master, regionserver, scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5939.v4.patch
When a binary dies on a server, we don't try to restart it, although it would be possible in most cases. We could have something like:
loop
  start
  wait
  if cleanStop then exit
  if already stopped less than 5 minutes ago then sleep 1 minute
endloop
This is simple for the master and backup master, a little bit more complex for the region server, as it can be stopped by a script or by the shutdown procedure. In the very long term it could allow a restart with exactly the same assignments.
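The loop in the description could be sketched in the start script roughly as below. This is a hypothetical illustration, not the attached 5939.v4.patch: the helper `start_daemon` and the `CLEAN_STOP_FILE` marker are assumed names standing in for however the real script launches the daemon and records a deliberate stop.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the autorestart loop from the issue description.
# start_daemon and CLEAN_STOP_FILE are assumptions, not the patch's code.
CLEAN_STOP_FILE="${CLEAN_STOP_FILE:-/tmp/hbase-clean-stop}"
MIN_UPTIME_SECS=300            # "already stopped less than 5 minutes ago"

start_daemon() {               # stand-in for the real daemon launch
  "$@" &
  wait $!                      # block until the daemon process exits
}

autorestart() {
  while true; do
    local start_ts=$(date +%s)
    start_daemon "$@"
    [ -f "$CLEAN_STOP_FILE" ] && return 0   # clean stop: do not restart
    local uptime=$(( $(date +%s) - start_ts ))
    [ "$uptime" -lt "$MIN_UPTIME_SECS" ] && sleep 60  # crash loop: back off
  done
}
```

The back-off prevents a tight restart loop when the daemon dies immediately after start, while a crash after a long uptime is restarted at once.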
[jira] [Updated] (HBASE-5939) Add an autorestart option in the start scripts
[ https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5939: --- Attachment: 5939.v4.patch
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Attachment: 5877.v18.patch
[jira] [Commented] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273151#comment-13273151 ] nkeywal commented on HBASE-5877: Strange, the status is failure, but actually the code is available in the trunk (which is good, it saves another merge)
[jira] [Commented] (HBASE-4955) Use the official versions of surefire junit
[ https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273153#comment-13273153 ] nkeywal commented on HBASE-4955: Update: still waiting for JUnit 4.11 and Surefire 2.13, still no release date. Use the official versions of surefire junit - Key: HBASE-4955 URL: https://issues.apache.org/jira/browse/HBASE-4955 Project: HBase Issue Type: Improvement Components: test Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor
We currently use private versions of surefire and junit since HBASE-4763. This JIRA tracks what we need to move to the official versions. Surefire 2.11 is just out, but, after some tests, it does not contain all that we need.
JUnit: could be for JUnit 4.11. Issue to monitor: https://github.com/KentBeck/junit/issues/359: fixed in our version, no feedback for an integration on trunk.
Surefire: could be for Surefire 2.12. Issues to monitor are:
- 329 (category support): fixed, we use the official implementation from the trunk
- 786 (@Category with forkMode=always): fixed, we use the official implementation from the trunk
- 791 (incorrect elapsed time on test failure): fixed, we use the official implementation from the trunk
- 793 (incorrect time in the XML report): not fixed (reopened) on trunk, fixed in our version
- 760 (does not take into account the test method): fixed in trunk, not fixed in our version
- 798 (print immediately the test class name): not fixed in trunk, not fixed in our version
- 799 (allow test parallelization when forkMode=always): not fixed in trunk, not fixed in our version
- 800 (redirectTestOutputToFile not taken into account): not yet fixed on trunk, fixed in our version
800 and 793 are the most important to monitor; they are the only ones that are fixed in our version but not on trunk.
[jira] [Updated] (HBASE-5877) When a query fails because the region has moved, let the regionserver return the new address to the client
[ https://issues.apache.org/jira/browse/HBASE-5877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5877: --- Resolution: Fixed Status: Resolved (was: Patch Available)
[jira] [Commented] (HBASE-5939) Add an autorestart option in the start scripts
[ https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273158#comment-13273158 ] nkeywal commented on HBASE-5939: The changes are not related to the failed test. The patch could be committed, imho.
[jira] [Created] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
nkeywal created HBASE-5992: -- Summary: Generalization of region move implementation + manage draining servers in bulk assign Key: HBASE-5992 URL: https://issues.apache.org/jira/browse/HBASE-5992 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor
The region move implementation now has a similar behavior whether or not the destination server is specified. This allows:
- benefiting from the improvement in HBASE-5877
- as a side effect, having the coprocessor calls when the destination server is not specified
This includes various fixes around draining servers. Draining servers were not excluded during a bulk assign. This is now fixed.
[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5992: --- Attachment: 5992.v2.patch
[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5992: --- Status: Open (was: Patch Available)
[jira] [Commented] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273703#comment-13273703 ] nkeywal commented on HBASE-5992: bq. What does the big refactoring of TestDrainingServer do? Hopefully it's now non-flaky. Before, both the test and the core code were a little bit flaky! :-). bq. In hbase code base, we have spaces around operators... i.e. this needs spaces Done. bq. does removeDeadNotExpiredServers also need to be? It does compile if it's package protected. I didn't dare do it in the first patch, to maintain the existing interface, but since you're asking :-). Done. TestDrainingServer also failed for an unrelated reason in its setup. Fixed as well in the next patch (hopefully; I can't really reproduce the issue locally).
[jira] [Commented] (HBASE-5939) Add an autorestart option in the start scripts
[ https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273716#comment-13273716 ] nkeywal commented on HBASE-5939: It would make sense to make it the default. It's just that developers or admins used to a simple kill will be surprised to see the process coming back. As you like. For the release notes, I'm ok. I was planning to update the reference guide (including a part on the forgotten-but-useful local-region.sh script), but I will write a release note for this one as well.
Add an autorestart option in the start scripts -- Key: HBASE-5939 URL: https://issues.apache.org/jira/browse/HBASE-5939 Project: HBase Issue Type: Improvement Components: master, regionserver, scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Fix For: 0.96.0 Attachments: 5939.v4.patch
When a binary dies on a server, we don't try to restart it, while it would be possible in most cases. We can have something like:
loop
  start
  wait
  if cleanStop then exit
  if already stopped less than 5 minutes ago, sleep 1 minute
endloop
This is simple for the master and backup master, a little bit more complex for the region server, as it can be stopped by a script or by the shutdown procedure. In the long term it could allow a restart with exactly the same assignments.
[jira] [Updated] (HBASE-5939) Add an autorestart option in the start scripts
[ https://issues.apache.org/jira/browse/HBASE-5939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5939: --- Description: When a binary dies on a server, we don't try to restart it, while it would be possible in most cases. We can have something like:
loop
  start
  wait
  if cleanStop then exit
  if already stopped less than 5 minutes ago, sleep 5 minutes
endloop
This is simple for the master and backup master, a little bit more complex for the region server, as it can be stopped by a script or by the shutdown procedure. In the long term it could allow a restart with exactly the same assignments.
was: When a binary dies on a server, we don't try to restart it, while it would be possible in most cases. We can have something like:
loop
  start
  wait
  if cleanStop then exit
  if already stopped less than 5 minutes ago, sleep 1 minute
endloop
This is simple for the master and backup master, a little bit more complex for the region server, as it can be stopped by a script or by the shutdown procedure. In the long term it could allow a restart with exactly the same assignments.
Release Note: When launched with autorestart, HBase processes will automatically restart if they are not properly terminated by a stop command or a cluster stop. To ensure this does not overload the system when the server itself is corrupted and the process cannot keep running, the server sleeps for 5 minutes before restarting if it was last started less than 5 minutes ago. To use it, launch the process with bin/start-hbase autorestart. This option is not fully compatible with the existing restart command: if you ask for a restart on a server launched with autorestart, the server will restart, but the next server instance won't be automatically restarted.
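The restart loop and its throttle described above can be sketched as follows. This is a hypothetical illustration of the throttling rule (back off before restarting if the process died less than 5 minutes after its last start); the real feature lives in the bash start scripts, and the names `AutorestartThrottle` and `sleepBeforeRestart` are made up for this sketch:

```java
import java.time.Duration;

public class AutorestartThrottle {
    static final Duration MIN_UPTIME = Duration.ofMinutes(5);

    // How long the supervisor loop should sleep before restarting,
    // given when the process was last started and the current time.
    // Duration.ZERO means it may restart immediately.
    static Duration sleepBeforeRestart(long lastStartMillis, long nowMillis) {
        long upMillis = nowMillis - lastStartMillis;
        if (upMillis >= MIN_UPTIME.toMillis()) {
            return Duration.ZERO;   // ran long enough: restart right away
        }
        return MIN_UPTIME;          // died quickly: back off to avoid a restart storm
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Process ran 10 minutes before dying: restart immediately.
        System.out.println(sleepBeforeRestart(now - 10 * 60_000, now)); // PT0S
        // Process died after 30 seconds: sleep 5 minutes first.
        System.out.println(sleepBeforeRestart(now - 30_000, now)); // PT5M
    }
}
```

The throttle is what keeps a corrupted server from hammering the machine with restart attempts, while a process that ran for a long time before dying is restarted without delay.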
[jira] [Commented] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273724#comment-13273724 ] nkeywal commented on HBASE-5992: v5 with Stack's and Ted's comments taken into account. Thanks.
[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5992: --- Attachment: 5992.v5.patch
[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5992: --- Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273997#comment-13273997 ] nkeywal commented on HBASE-5992: It's org.apache.hadoop.hbase.TestDrainingServer. I fixed the issue; I'm retesting before uploading the patch.
[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5992: --- Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5992: --- Attachment: 5992.v11.patch
[jira] [Updated] (HBASE-5992) Generalization of region move implementation + manage draining servers in bulk assign
[ https://issues.apache.org/jira/browse/HBASE-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5992: --- Status: Patch Available (was: Open)
[jira] [Created] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request
nkeywal created HBASE-5998: -- Summary: Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request Key: HBASE-5998 URL: https://issues.apache.org/jira/browse/HBASE-5998 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor
During the assignment, on the regionserver, before creating the handlers we load the table description. Even though there is a cache, we check the timestamps for each region, while it's not necessary. The test below is with just one node; with more nodes the benefit will be larger. By limiting the time spent in HRegion#openRegion we increase the parallelization during cluster startup, as the master is using a pool of threads to call the RS.
-- Without the fix
2012-05-14 11:40:52,501 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 1193 region(s) to localhost,11003,1336988444043
2012-05-14 11:41:09,947 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done for localhost,11003,1336988444043
-- With the fix
2012-05-14 11:34:40,444 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 1193 region(s) to localhost,11003,1336988444043
2012-05-14 11:34:40,929 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done for localhost,11003,1336988065948
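The idea of a temporary, per-request cache can be illustrated with a small memoizing wrapper: during a single bulk-open request, each table's descriptor is loaded (and its timestamp checked) at most once instead of once per region. This is a hypothetical sketch; `BulkOpenDescriptorCache` and `memoizeForRequest` are made-up names, not HBase's actual table-descriptor plumbing:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class BulkOpenDescriptorCache {
    // Wrap a loader so repeated lookups of the same key hit an in-memory
    // map. The map lives only as long as this wrapper, i.e. one request.
    static <K, V> Function<K, V> memoizeForRequest(Function<K, V> loader) {
        Map<K, V> cache = new HashMap<>();
        return key -> cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        int[] loads = {0};
        Function<String, String> slowLoader = table -> {
            loads[0]++; // stands in for the timestamp check + filesystem read
            return "descriptor-of-" + table;
        };
        Function<String, String> cached = memoizeForRequest(slowLoader);
        // Opening 1193 regions of the same table triggers a single load.
        for (int i = 0; i < 1193; i++) {
            cached.apply("usertable");
        }
        System.out.println(loads[0]); // 1
    }
}
```

Because the cache is discarded after the request, it can safely skip per-region timestamp revalidation: a descriptor change during a single bulk open is not a case the request needs to observe.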
[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request
[ https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5998: --- Attachment: 5998.v2.patch
[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request
[ https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5998: --- Fix Version/s: 0.96.0 Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request
[ https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5998: --- Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request
[ https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5998: --- Attachment: 5998.v3.patch
[jira] [Commented] (HBASE-5998) Bulk assignment: regionserver optimization by using a temporary cache for table descriptors when receiving an open regions request
[ https://issues.apache.org/jira/browse/HBASE-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13274572#comment-13274572 ] nkeywal commented on HBASE-5998: The failure in TestDrainingServer is not directly related to my change. I will have a look and fix the flakiness in another jira.
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Attachment: 5926.v6.patch
Delete the master znode after a master crash Key: HBASE-5926 URL: https://issues.apache.org/jira/browse/HBASE-5926 Project: HBase Issue Type: Improvement Components: master, scripts Affects Versions: 0.96.0 Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5926.v6.patch
This is the continuation of the work done in HBASE-5844, but we can't apply exactly the same strategy: for the region server there is a znode per region server, while for the master and backup master there is a single znode for both. So if we apply the same strategy as for a regionserver, we may have this scenario:
1) Master starts
2) Backup master starts
3) Master dies
4) ZK detects it
5) Backup master receives the update from ZK
6) Backup master creates the new master node and becomes the main master
7) Previous master's script continues
8) Previous master's script deletes the master node in ZK
9) => issue: we deleted the node just created by the new master
This should not happen often (usually the znode will be deleted soon enough), but it can happen.
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Fix Version/s: 0.96.0 Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277172#comment-13277172 ] nkeywal commented on HBASE-5926: The race condition is reduced to a production-acceptable minimum, imho. We do a compare-then-delete in the Java code, so the race window is now between the comparison and the delete: we fail if, and only if, within that window, the session expires, the master znode is deleted, and the backup master recreates the node. That's unlikely.
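The compare-then-delete approach can be sketched as: read the znode, check it still names the dead master, then delete conditionally, so a concurrent recreate by the backup master makes the delete fail instead of removing the new master's node. The sketch below runs against a stubbed in-memory store with ZooKeeper-like conditional-delete semantics; `CompareThenDelete`, `Znode`, and `deleteIfOwnedBy` are hypothetical names, not the actual HBase code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CompareThenDelete {
    // Minimal stand-in for a znode: data plus a version, like ZooKeeper's Stat.
    record Znode(String data, int version) {}

    final Map<String, Znode> store = new ConcurrentHashMap<>();

    // Delete the master znode only if it still belongs to deadMaster.
    // Returns true if we deleted it, false if it was gone or already
    // replaced by the backup master's node.
    boolean deleteIfOwnedBy(String path, String deadMaster) {
        Znode z = store.get(path);               // read data + version
        if (z == null || !z.data().equals(deadMaster)) {
            return false;                        // gone, or not our node
        }
        // Conditional delete: fails if the node changed since the read,
        // mirroring ZooKeeper's delete-with-expected-version.
        return store.remove(path, z);
    }

    public static void main(String[] args) {
        CompareThenDelete zk = new CompareThenDelete();
        zk.store.put("/hbase/master", new Znode("master-a,16000", 3));
        // Crashed master's node is still there: safe to delete.
        System.out.println(zk.deleteIfOwnedBy("/hbase/master", "master-a,16000")); // true
        // Backup master already took over: the delete is refused.
        zk.store.put("/hbase/master", new Znode("master-b,16000", 0));
        System.out.println(zk.deleteIfOwnedBy("/hbase/master", "master-a,16000")); // false
    }
}
```

With the conditional delete, the only remaining failure mode is the read-then-delete window the comment describes, which requires the old node to disappear and be recreated between the two calls.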
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Status: Open (was: Patch Available)
[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277684#comment-13277684 ] nkeywal commented on HBASE-5926: v8, with Ted's comments taken into account.
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Attachment: 5926.v8.patch
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277702#comment-13277702 ] nkeywal commented on HBASE-5926: These tests run OK locally.
[jira] [Commented] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13277996#comment-13277996 ] nkeywal commented on HBASE-5926:
bq. javadoc
done.
bq. readMyEphemeralNodeOnDisk() throws IOException but writeMyEphemeralNodeOnDisk() doesn't. What was the reason?
When we write, we ignore the result (i.e. we don't stop the master or the region server if we can't store the znode name; we just continue). When we read, we're interested in the exception: the pattern in HMasterCommandLine is to return -1 on error.
bq. Please add ' contents' at the end of the above.
ok.
bq. Should the setting be restored before exiting the above method?
I now clone the conf.
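The write/read asymmetry discussed in that exchange might look like this (a sketch under assumptions: the method names come from the comment, but the file location and bodies are hypothetical, not the actual patch): the write is best-effort and swallows the IOException, while the read propagates it so the caller can return -1.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EphemeralNodeFile {
    // Hypothetical location for the on-disk copy of the znode name.
    private static final Path NODE_FILE =
        Paths.get(System.getProperty("java.io.tmpdir"), "hbase-my-ephemeral-node");

    // Best-effort write: failing to record the znode name must not stop the
    // master or region server, so the IOException is swallowed (it would be
    // logged in practice) and the process continues.
    static void writeMyEphemeralNodeOnDisk(String znodeName) {
        try {
            Files.write(NODE_FILE, znodeName.getBytes("UTF-8"));
        } catch (IOException e) {
            // ignore: keep running without the on-disk copy
        }
    }

    // Read must fail loudly: the caller (e.g. HMasterCommandLine) turns the
    // exception into a -1 return code.
    static String readMyEphemeralNodeOnDisk() throws IOException {
        return new String(Files.readAllBytes(NODE_FILE), "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        writeMyEphemeralNodeOnDisk("/hbase/master");
        System.out.println(readMyEphemeralNodeOnDisk()); // prints "/hbase/master"
    }
}
```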
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Status: Open (was: Patch Available)
[jira] [Updated] (HBASE-5926) Delete the master znode after a master crash
[ https://issues.apache.org/jira/browse/HBASE-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nkeywal updated HBASE-5926: --- Attachment: 5926.v9.patch