[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r486877715 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: Why move the takeUserRegionLock to inside "try catch block"? The right fix is to use operation timeout.. ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: I am -1 to use scanner timeout. It is too weird and confused. Operation timeout is not the best choice too but better. Thanks. ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > Reread RpcRetryingCallerImpl#callWithRetries code. If callable.prepare throw the LockTimeoutException, it will not retry. There are a check about call duration and callTimeout in line 156. My point here is that your 15 seconds SLA case is not right. It is still meet your SLA even you use the operation timeout. I didn't mean that "move takeUserRegionLock to try catch block". Thanks. ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > If we wait for operation timeout period and if it can't get the lock after the timeout, it will not have any time remaining for next attempts. Yes. The guarantee is that the operation will fail or success within the "operation timeout". No remaining time to retry and failed the operation is acceptable. > are you suggesting to wait for operation timeout period while trying to get lock Yes. Use the operation timeout period when wait for lock, instead of the scanner timeout now. > I think we are going back and forth on which timeout to use. I thought my point is clearly since we start this discussion. I suggested that use operation timeout instead of scanner timeout. Then you give me a 15 seconds SLA example. Then I checked the code: use operation timeout can meet your SLA requirements, too. So why not use operation timeout? ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > If we wait for operation timeout period and if it can't get the lock after the timeout, it will not have any time remaining for next attempts. Yes. The guarantee is that the operation will fail or success within the "operation timeout". No remaining time to retry and failed the operation is acceptable. > are you suggesting to wait for operation timeout period while trying to get lock Yes. Use the operation timeout period when wait for lock, instead of the scanner timeout now. > I think we are going back and forth on which timeout to use. I thought my point is clearly since we start this discussion. I suggested that use operation timeout instead of scanner timeout. Then you give me a 15 seconds SLA example. Then I checked the code: use operation timeout can meet your SLA requirements, too. So why not use operation timeout? ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: Why move the takeUserRegionLock to inside "try catch block"? The right fix is to use operation timeout.. ## File
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r486877715 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: Why move the takeUserRegionLock to inside "try catch block"? The right fix is to use operation timeout.. ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: I am -1 to use scanner timeout. It is too weird and confused. Operation timeout is not the best choice too but better. Thanks. ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > Reread RpcRetryingCallerImpl#callWithRetries code. If callable.prepare throw the LockTimeoutException, it will not retry. There are a check about call duration and callTimeout in line 156. My point here is that your 15 seconds SLA case is not right. It is still meet your SLA even you use the operation timeout. I didn't mean that "move takeUserRegionLock to try catch block". Thanks. ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > If we wait for operation timeout period and if it can't get the lock after the timeout, it will not have any time remaining for next attempts. Yes. The guarantee is that the operation will fail or success within the "operation timeout". No remaining time to retry and failed the operation is acceptable. > are you suggesting to wait for operation timeout period while trying to get lock Yes. Use the operation timeout period when wait for lock, instead of the scanner timeout now. > I think we are going back and forth on which timeout to use. I thought my point is clearly since we start this discussion. I suggested that use operation timeout instead of scanner timeout. Then you give me a 15 seconds SLA example. Then I checked the code: use operation timeout can meet your SLA requirements, too. So why not use operation timeout? ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > If we wait for operation timeout period and if it can't get the lock after the timeout, it will not have any time remaining for next attempts. Yes. The guarantee is that the operation will fail or success within the "operation timeout". No remaining time to retry and failed the operation is acceptable. > are you suggesting to wait for operation timeout period while trying to get lock Yes. Use the operation timeout period when wait for lock, instead of the scanner timeout now. > I think we are going back and forth on which timeout to use. I thought my point is clearly since we start this discussion. I suggested that use operation timeout instead of scanner timeout. Then you give me a 15 seconds SLA example. Then I checked the code: use operation timeout can meet your SLA requirements, too. So why not use operation timeout? ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: Why move the takeUserRegionLock to inside "try catch block"? The right fix is to use operation timeout.. ## File
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r487349217 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > If we wait for operation timeout period and if it can't get the lock after the timeout, it will not have any time remaining for next attempts. Yes. The guarantee is that the operation will fail or success within the "operation timeout". No remaining time to retry and failed the operation is acceptable. > are you suggesting to wait for operation timeout period while trying to get lock Yes. Use the operation timeout period when wait for lock, instead of the scanner timeout now. > I think we are going back and forth on which timeout to use. I thought my point is clearly since we start this discussion. I suggested that use operation timeout instead of scanner timeout. Then you give me a 15 seconds SLA example. Then I checked the code: use operation timeout can meet your SLA requirements, too. So why not use operation timeout? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r487349217 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > If we wait for operation timeout period and if it can't get the lock after the timeout, it will not have any time remaining for next attempts. Yes. The guarantee is that the operation will fail or success within the "operation timeout". No remaining time to retry and failed the operation is acceptable. > are you suggesting to wait for operation timeout period while trying to get lock Yes. Use the operation timeout period when wait for lock, instead of the scanner timeout now. > I think we are going back and forth on which timeout to use. I thought my point is clearly since we start this discussion. I suggested that use operation timeout instead of scanner timeout. Then you give me a 15 seconds SLA example. Then I checked the code: use operation timeout can meet your SLA requirements, too. So why not use operation timeout? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r486884181 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: > Reread RpcRetryingCallerImpl#callWithRetries code. If callable.prepare throw the LockTimeoutException, it will not retry. There are a check about call duration and callTimeout in line 156. My point here is that your 15 seconds SLA case is not right. It is still meet your SLA even you use the operation timeout. I didn't mean that "move takeUserRegionLock to try catch block". Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r486879689 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: I am -1 to use scanner timeout. It is too weird and confused. Operation timeout is not the best choice too but better. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r486877715 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: Why move the takeUserRegionLock to inside "try catch block"? The right fix is to use operation timeout.. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r486099032 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +974,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); Review comment: Seems not push the latest commit? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r485272081 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +968,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + private void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); + if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) { Review comment: bq. There is one retry loop here which will retry if exception is not TNFE or retries are not exhausted. Please check the code. The LockTimeoutException will be throwed out and terminate the loop. It is not in the try...catch code block. bq. IIUC the code, callable.prepare will never throw LockTimeoutException. callable.prepare will call locateRegionInMeta inside. So locateRegionInMeta throw LockTimeoutException, then it will throw this exception out. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r484809224 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +968,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + private void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); + if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) { Review comment: > then in worst case scenario call will fail in 15 (lock timeout) + 15 (lock timeout 2nd try) + 1 (assuming sleep of 1 second) = 31 seconds Reread RpcRetryingCallerImpl#callWithRetries code. If callable.prepare throw the LockTimeoutException, it will not retry. There are a check about call duration and callTimeout in line 156. See https://github.com/apache/hbase/blob/branch-2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L156 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r483046581 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +968,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + private void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); + if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) { Review comment: Got your case. I thought the problem is if lock timeout, the 2nd retry need to consider this case and should not retry. Let me give a example about the rpc timeout (configed as 4 secs) and operation timeout(configed as 10 secs). The rpc timeout maybe reset when the remaining time is less than rpc timeout. 1. rpc call with 4 secs timeout. 2. retry call with 4 secs timeout. 3. retry call with 2 secs timeout which equals 10 - 4 - 4. For this case, we need same guarantee. If the remaining time is zero, it should not to retry. I thought this is the right way to fix this problem. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r481152983 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +968,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + private void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); + if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) { Review comment: Operation timout is not the best choice but better than scanner timeout. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r481152014 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +968,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + private void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); + if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) { Review comment: > the default value for operation timeout is 20 minutes. Do you want clients to wait for 20 minutes to acquire this lock ? The problem is not how long the default value. The problem here is to use the right timeout. And for the default 20min operation timeout, we set it to 20mins because it is easy for passing ITBLL. And in real producetion cluster, no user will use the default value if it serve for online service. The guarantee of operation timeout is all table call (exclude scan) should return or throw exception when timeout. And when need to locate, locateRegionInMeta is one step of table call. So it should to keep this guarantee. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r479865631 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +968,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + private void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); + if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) { Review comment: I am +1 to use operation timeout here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hbase] infraio commented on a change in pull request #2322: [HBASE-24956] ConnectionManager#locateRegionInMeta waits for user region lock indefinitely.
infraio commented on a change in pull request #2322: URL: https://github.com/apache/hbase/pull/2322#discussion_r478762775 ## File path: hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java ## @@ -968,6 +968,19 @@ private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, bool } } + private void takeUserRegionLock() throws IOException { +try { + long waitTime = connectionConfig.getScannerTimeoutPeriod(); + if (!userRegionLock.tryLock(waitTime, TimeUnit.MILLISECONDS)) { Review comment: > We were wondering what is the right timeout to use for this lock. +1. The scanner timeout is not a good choice here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org