[jira] [Assigned] (HBASE-24454) BucketCache disabled instantly before error duration toleration is reached due to timing issue

2020-05-30 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack reassigned HBASE-24454:
-

Assignee: Jacob LeBlanc  (was: Michael Stack)

> BucketCache disabled instantly before error duration toleration is reached 
> due to timing issue
> --
>
> Key: HBASE-24454
> URL: https://issues.apache.org/jira/browse/HBASE-24454
> Project: HBase
>  Issue Type: Bug
>  Components: BucketCache
>Affects Versions: 1.4.10
> Environment: We saw this in HBase 1.4.10 (EMR 5.28.1), but I believe 
> the newest code still has this problem.
>Reporter: Jacob LeBlanc
>Assignee: Jacob LeBlanc
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0, 1.7.0, 2.2.6
>
>
> We saw an instance where BucketCache was disabled after only two IO error 
> were encountered at nearly the same time. The following shows all errors from 
> a region server log for the 2020-05-26 17:00 hour. Notice that there are no 
> other errors until the 17:14:50 at which time the BucketCache is disabled 
> because it claims duration time has exceeded 6 ms:
> $ grep ERROR hbase-hbase-regionserver-ip-172-20-113-147.log.2020-05-26-17
>  2020-05-26 17:14:50,396 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: Failed reading block 
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,397 ERROR 
> [regionserver/ip-172-20-113-147.us-west-2.compute.internal/172.20.113.147:16020-BucketCacheWriter-0]
>  bucket.BucketCache: Failed syncing IO engine
>  2020-05-26 17:14:50,400 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: Failed reading block 
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,401 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: IO errors duration time has exceeded 6ms, disabling 
> cache, please check your IOEngine
> The region server is very busy so it should be constantly getting successful 
> reads and writes in the bucket cache (it is not as though there was some long 
> ago error and then no successful IO to clear the error flag).
> We are running a busy EMR cluster backed by S3. A bucketcache getting 
> disabled is a huge performance issue because all reads must go to S3.
> Looking at the code, I believe I've found a timing issue. Here is the code 
> for checking and setting the ioErrorStartTime (taken from BucketCache.java):
>  
> {code:java}
>   /**
>* Check whether we tolerate IO error this time. If the duration of IOEngine
>* throwing errors exceeds ioErrorsDurationTimeTolerated, we will disable 
> the
>* cache
>*/
>   private void checkIOErrorIsTolerated() {
> long now = EnvironmentEdgeManager.currentTime();
> if (this.ioErrorStartTime > 0) {
>   if (cacheEnabled && (now - ioErrorStartTime) > 
> this.ioErrorsTolerationDuration) {
> LOG.error("IO errors duration time has exceeded " + 
> ioErrorsTolerationDuration +
>   "ms, disabling cache, please check your IOEngine");
> disableCache();
>   }
> } else {
>   this.ioErrorStartTime = now;
> }
>   }
> {code}
>  
> And here is the code for clearing the ioErrorStartTime when a successful read 
> or write is done:
> {code:java}
>   if (this.ioErrorStartTime > 0) {
> ioErrorStartTime = -1;
>   }
> {code}
> Notice that that if ioErrorStartTime is set to -1 after the first if 
> statement in checkIOErrorIsTolerated but before (now - ioErrorStartTime), 
> then (now - ioErrorStartTime) will evaluate to (now + 1) resulting in the 
> code thinking it has exceeded ioErrorsTolerationDuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-24454) BucketCache disabled instantly before error duration toleration is reached due to timing issue

2020-05-28 Thread Jacob LeBlanc (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob LeBlanc reassigned HBASE-24454:
-

Assignee: Jacob LeBlanc

> BucketCache disabled instantly before error duration toleration is reached 
> due to timing issue
> --
>
> Key: HBASE-24454
> URL: https://issues.apache.org/jira/browse/HBASE-24454
> Project: HBase
>  Issue Type: Bug
>  Components: BucketCache
>Affects Versions: 1.4.10
> Environment: We saw this in HBase 1.4.10 (EMR 5.28.1), but I believe 
> the newest code still has this problem.
>Reporter: Jacob LeBlanc
>Assignee: Jacob LeBlanc
>Priority: Major
>
> We saw an instance where BucketCache was disabled after only two IO error 
> were encountered at nearly the same time. The following shows all errors from 
> a region server log for the 2020-05-26 17:00 hour. Notice that there are no 
> other errors until the 17:14:50 at which time the BucketCache is disabled 
> because it claims duration time has exceeded 6 ms:
> $ grep ERROR hbase-hbase-regionserver-ip-172-20-113-147.log.2020-05-26-17
>  2020-05-26 17:14:50,396 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: Failed reading block 
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,397 ERROR 
> [regionserver/ip-172-20-113-147.us-west-2.compute.internal/172.20.113.147:16020-BucketCacheWriter-0]
>  bucket.BucketCache: Failed syncing IO engine
>  2020-05-26 17:14:50,400 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: Failed reading block 
> 9adfad3a603047cfa0c98b16da0b0974_13786 from bucket cache
>  2020-05-26 17:14:50,401 ERROR [hfile-prefetch-1589117924884] 
> bucket.BucketCache: IO errors duration time has exceeded 6ms, disabling 
> cache, please check your IOEngine
> The region server is very busy so it should be constantly getting successful 
> reads and writes in the bucket cache (it is not as though there was some long 
> ago error and then no successful IO to clear the error flag).
> We are running a busy EMR cluster backed by S3. A bucketcache getting 
> disabled is a huge performance issue because all reads must go to S3.
> Looking at the code, I believe I've found a timing issue. Here is the code 
> for checking and setting the ioErrorStartTime (taken from BucketCache.java):
>  
> {code:java}
>   /**
>* Check whether we tolerate IO error this time. If the duration of IOEngine
>* throwing errors exceeds ioErrorsDurationTimeTolerated, we will disable 
> the
>* cache
>*/
>   private void checkIOErrorIsTolerated() {
> long now = EnvironmentEdgeManager.currentTime();
> if (this.ioErrorStartTime > 0) {
>   if (cacheEnabled && (now - ioErrorStartTime) > 
> this.ioErrorsTolerationDuration) {
> LOG.error("IO errors duration time has exceeded " + 
> ioErrorsTolerationDuration +
>   "ms, disabling cache, please check your IOEngine");
> disableCache();
>   }
> } else {
>   this.ioErrorStartTime = now;
> }
>   }
> {code}
>  
> And here is the code for clearing the ioErrorStartTime when a successful read 
> or write is done:
> {code:java}
>   if (this.ioErrorStartTime > 0) {
> ioErrorStartTime = -1;
>   }
> {code}
> Notice that that if ioErrorStartTime is set to -1 after the first if 
> statement in checkIOErrorIsTolerated but before (now - ioErrorStartTime), 
> then (now - ioErrorStartTime) will evaluate to (now + 1) resulting in the 
> code thinking it has exceeded ioErrorsTolerationDuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)