[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-31 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515436#comment-17515436
 ] 

Lars Hofhansl commented on HBASE-26812:
---

+1

> ShortCircuitingClusterConnection fails to close RegionScanners when making 
> short-circuited calls
> 
>
> Key: HBASE-26812
> URL: https://issues.apache.org/jira/browse/HBASE-26812
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.4.9
>Reporter: Lars Hofhansl
>Priority: Critical
>
> Just ran into this on the Phoenix side.
> We retrieve a Connection via 
> {{RegionCoprocessorEnvironment.createConnection... getTable(...)}}, and 
> then call get on that table. The Get's key happens to be local. Now each call 
> to table.get() leaves an open StoreScanner around forever (verified with a 
> memory profiler).
> The references are held via 
> RegionScannerImpl.storeHeap.scannersForDelayedClose. Eventually the 
> RegionServer goes into a GC of death and can only be ended with kill -9.
> The reason appears to be that in this case there is no currentCall context. 
> Some time in 2.x the Rpc handler/call was made responsible for closing open 
> region scanners, but we forgot to handle {{ShortCircuitingClusterConnection}}.
> It's not immediately clear how to fix this. But it does make 
> ShortCircuitingClusterConnection useless and dangerous. If you use it, you 
> *will* create a giant memory leak.
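
For illustration, the pattern described above boils down to roughly the 
following (a minimal sketch with made-up table and row names, assuming the 
HBase 2.x coprocessor API):

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

public class ShortCircuitLeakSketch {
  // Hypothetical helper invoked from a coprocessor hook on the RegionServer.
  static Result localGet(RegionCoprocessorEnvironment env, byte[] row) throws IOException {
    // createConnection returns a short-circuiting connection: if the row is
    // hosted on this same RegionServer, the call bypasses the RPC layer.
    try (Connection conn = env.createConnection(env.getConfiguration());
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      // With no currentCall context, the RegionScanner backing this Get is
      // never closed; its StoreScanners stay referenced via
      // RegionScannerImpl.storeHeap.scannersForDelayedClose.
      return table.get(new Get(row));
    }
  }
}
{code}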





[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-28 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513522#comment-17513522
 ] 

Lars Hofhansl commented on HBASE-26812:
---

I can see arguments both ways.

Maybe we commit this one (without the {{doAs}} part). The user behavior is 
documented in the Javadoc of {{RegionCoprocessorEnvironment.getConnection()}}, 
and might be fine for HBase 2.

Or we can just remove it. :)



[jira] [Commented] (HBASE-26869) RSRpcServices.scan should deep clone cells when RpcCallContext is null

2022-03-22 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510818#comment-17510818
 ] 

Lars Hofhansl commented on HBASE-26869:
---

You have my +1 for master and 2.x branches :)

> RSRpcServices.scan should deep clone cells when RpcCallContext is null
> --
>
> Key: HBASE-26869
> URL: https://issues.apache.org/jira/browse/HBASE-26869
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 3.0.0-alpha-2, 2.4.11
>Reporter: chenglei
>Assignee: chenglei
>Priority: Major
>
> While inspecting HBASE-26812, I found that if {{RpcCallContext}} is null, 
> {{RSRpcServices.scan}} does not set {{ServerCall.rpcCallback}} and directly 
> closes {{RegionScannerImpl}}, but it does not deep clone the result cells, 
> so these cells may be returned to the {{ByteBuffAllocator}} and may be 
> overwritten before the caller reads them, similar to HBASE-26036. At the 
> same time, if {{RpcCallContext}} is null, when {{RSRpcServices.scan}} returns 
> partial results, it does not invoke {{RegionScannerImpl.shipped}} to release 
> the used resources such as {{HFileScanner}}, pooled {{ByteBuffer}}, etc.
> No matter whether {{ShortCircuitingClusterConnection}} is removed or not, I 
> think this {{RSRpcServices.scan}} problem should be fixed for future 
> maintainability.
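
A minimal sketch of the kind of defensive copy this implies (a hypothetical 
helper; assuming {{KeyValueUtil.copyToNewKeyValue}} is used for the heap copy):

{code:java}
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.KeyValueUtil;

public class DeepCloneSketch {
  // When there is no RpcCallContext, result cells may still reference pooled
  // ByteBuffers owned by the ByteBuffAllocator. Copying each cell onto the
  // heap before the scanner is closed keeps the data valid for the caller.
  static void deepCloneResults(List<Cell> results) {
    results.replaceAll(KeyValueUtil::copyToNewKeyValue);
  }
}
{code}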





[jira] [Commented] (HBASE-26869) RSRpcServices.scan should deep clone cells when RpcCallContext is null

2022-03-21 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510021#comment-17510021
 ] 

Lars Hofhansl commented on HBASE-26869:
---

I can confirm that this fixes the slow scanning issue I have observed in 
Phoenix.



[jira] [Commented] (HBASE-26869) RSRpcServices.scan should deep clone cells when RpcCallContext is null

2022-03-21 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510006#comment-17510006
 ] 

Lars Hofhansl commented on HBASE-26869:
---

This change is not needed in master, only in HBase 2.x, right?



[jira] [Comment Edited] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-20 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509613#comment-17509613
 ] 

Lars Hofhansl edited comment on HBASE-26812 at 3/21/22, 5:52 AM:
-

Thanks [~comnetwork]. 
Getting the user showed up in the profiler, but that might have just obscured the 
actual problem you mentioned.

PHOENIX-6671 is a one-liner that I can easily undo locally, and I'm happy to 
test an HBase fix here. I'll try your patch tomorrow.
(And apologies that I don't have much time to fix the HBase problem myself. I 
used to be much more active in the HBase community. :( )

 


was (Author: lhofhansl):
Thanks [~comnetwork]. 
Getting the user showed up in the profiler, but that might have just obscured the 
actual problem you mentioned.

PHOENIX-6671 is a one-liner that I can easily undo locally and I'm happy to 
test an HBase fix here.
(And apologies that I don't have much time to fix the HBase problem myself. I 
used to be much more active in the HBase community. :( )

 



[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-20 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509613#comment-17509613
 ] 

Lars Hofhansl commented on HBASE-26812:
---

Thanks [~comnetwork]. 
Getting the user showed up in the profiler, but that might have just obscured the 
actual problem you mentioned.

PHOENIX-6671 is a one-liner that I can easily undo locally and I'm happy to 
test an HBase fix here.
(And apologies that I don't have much time to fix the HBase problem myself. I 
used to be much more active in the HBase community. :( )

 



[jira] [Comment Edited] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-19 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509304#comment-17509304
 ] 

Lars Hofhansl edited comment on HBASE-26812 at 3/19/22, 4:46 PM:
-

[~comnetwork] It might be. In the slow case it finishes eventually, though.
In the profiler I see that much time is spent getting the quota, and within 
that getting the user, and within that in the Java access controller (I think; 
can't easily check right now).
When conditions are such that all data is read during openScanner the problem 
does not occur, but when further calls to next() are needed it becomes slow. 
Sorry for being so vague - little time for this right now.

See also PHOENIX-6671, which fixes the problem.


was (Author: lhofhansl):
[~comnetwork] It might be. In the slow case it finishes eventually, though.
In the profiler I see that much time is spent getting the quota, and within 
that getting the user, and within that in the Java access controller (I think; 
can't easily check right now).
When conditions are such that all data is read during openScanner the problem 
does not occur, but when further calls to next() are needed it becomes slow. 
Sorry for being so vague.

See also PHOENIX-6671, which fixes the problem.



[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-19 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509304#comment-17509304
 ] 

Lars Hofhansl commented on HBASE-26812:
---

[~comnetwork] It might be. In the slow case it finishes eventually, though.
In the profiler I see that much time is spent getting the quota, and within 
that getting the user, and within that in the Java access controller (I think; 
can't easily check right now).
When conditions are such that all data is read during openScanner the problem 
does not occur, but when further calls to next() are needed it becomes slow. 
Sorry for being so vague.

See also PHOENIX-6671, which fixes the problem.



[jira] [Comment Edited] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-18 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509099#comment-17509099
 ] 

Lars Hofhansl edited comment on HBASE-26812 at 3/18/22, 10:32 PM:
--

As an example of the severity: With PHOENIX-6501 and scanners that need at 
least two roundtrips the query in question did not finish after 10 minutes. 
When I just replace Connection in Phoenix with the standard one retrieved from 
{{org.apache.hadoop.hbase.client.ConnectionFactory}} the same query - without 
any other changes - finishes in 6s.

In that case the scan calls already have the rpc context set to null.



was (Author: lhofhansl):
As an example of the severity: With PHOENIX-6501 and scanners that need at 
least two roundtrips the query in question did not finish after 10 minutes. 
When I just replace Connection in Phoenix with the standard one retrieved from 
`org.apache.hadoop.hbase.client.ConnectionFactory` the same query - without any 
other changes - finishes in 6s.

In that case the scan calls already have the rpc context set to null.




[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-18 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509099#comment-17509099
 ] 

Lars Hofhansl commented on HBASE-26812:
---

As an example of the severity: With PHOENIX-6501 and scanners that need at 
least two roundtrips the query in question did not finish after 10 minutes. 
When I just replace Connection in Phoenix with the standard one retrieved from 
`org.apache.hadoop.hbase.client.ConnectionFactory` the same query - without any 
other changes - finishes in 6s.

In that case the scan calls already have the rpc context set to null.




[jira] [Comment Edited] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-18 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509088#comment-17509088
 ] 

Lars Hofhansl edited comment on HBASE-26812 at 3/18/22, 10:28 PM:
--

FYI... There's a problem with scanning too (I guess we knew). So PHOENIX-6501 
is also currently broken with HBase 2.

BTW, in that case wrapping with a null call context did *not* fix all problems. 
The context is used to resolve the right user for scanner.next(...); if that is 
missing, it gets the system user, which is incredibly slow.

The savings from this are questionable anyway. Even on a local machine with all 
regions locally I hardly see a perf difference with disabling this.

I'd vote for ripping it out.



was (Author: lhofhansl):
FYI... There's a problem with scanning too (I guess we knew). So PHOENIX-6501 
is also currently broken with HBase 2.

BTW, in that case wrapping with a null call context did *not* help. The context 
is used to resolve the right user for scanner.next(...); if that is missing, it 
gets the system user, which is incredibly slow.

The savings from this are questionable anyway. Even on a local machine with all 
regions locally I hardly see a perf difference with disabling this.

I'd vote for ripping it out.




[jira] [Comment Edited] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-18 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509088#comment-17509088
 ] 

Lars Hofhansl edited comment on HBASE-26812 at 3/18/22, 10:14 PM:
--

FYI... There's a problem with scanning too (I guess we knew). So PHOENIX-6501 
is also currently broken with HBase 2.

BTW, in that case wrapping with a null call context did *not* help. The context 
is used to resolve the right user for scanner.next(...); if that is missing, it 
gets the system user, which is incredibly slow.

The savings from this are questionable anyway. Even on a local machine with all 
regions locally I hardly see a perf difference with disabling this.

I'd vote for ripping it out.



was (Author: lhofhansl):
FYI... There's a problem with scanning too (I guess we knew). So PHOENIX-6501 
is also currently broken with HBase 2.
The savings from this are questionable anyway. Even on a local machine I see 
hardly a perf difference with disabling this.

I'd vote for ripping it out.



[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-18 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509088#comment-17509088
 ] 

Lars Hofhansl commented on HBASE-26812:
---

FYI... There's a problem with scanning too (I guess we knew). So PHOENIX-6501 
is also currently broken with HBase 2.
The savings from this are questionable anyway. Even on a local machine I 
hardly see a perf difference with disabling this.

I'd vote for ripping it out.



[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-15 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507205#comment-17507205
 ] 

Lars Hofhansl commented on HBASE-26812:
---

Thanks [~comnetwork]. That would indeed work for this instance.

There is still a time-bomb in HBase, though. Each time you call Get - or 
anything that creates a RegionScanner - via the short-circuited client, you will 
leave a StoreScanner and all its references around forever.

So IMHO this should be fixed in HBase.

After reflecting a bit, maybe the best option is to remove the short-circuit 
optimization from HBase, since it is not currently working correctly.




[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-11 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17505092#comment-17505092
 ] 

Lars Hofhansl commented on HBASE-26812:
---

[~comnetwork] I came to the same conclusion. So what do we do about it?

The RegionScannerImpl should be closed after the local client's Get operation 
has returned. It seems we need a "fake" ServerCall for this, but that's tricky, 
since we need to close only those RegionScanners involved in the local 
operation, not all of them. Perhaps we need some other API that wraps 
RSRpcServices in this case.
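
One hypothetical shape for that, purely as a sketch (none of these names exist 
in HBase): a per-call scope that tracks the scanners a short-circuited 
operation opens and closes exactly those on completion, mimicking what the RPC 
handler does for real calls.

{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a scope that collects the scanners opened by one
// short-circuited call and closes just those, leaving other scanners alone.
public class LocalCallScope implements Closeable {
  private final List<Closeable> scanners = new ArrayList<>();

  <T extends Closeable> T register(T scanner) {
    scanners.add(scanner);
    return scanner;
  }

  @Override
  public void close() throws IOException {
    // Close only the scanners registered within this scope.
    for (Closeable scanner : scanners) {
      scanner.close();
    }
  }
}
{code}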

I can think of some fragile ways of fixing this, like putting another 
threadlocal marker on the current thread, but I do not like that.

As is, we have a time-bomb in HBase. It might be best to disable any local 
optimization until we have a fix.




[jira] [Commented] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-08 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503128#comment-17503128
 ] 

Lars Hofhansl commented on HBASE-26812:
---

See PHOENIX-6458 and PHOENIX-6501. We (will) have a better solution in Phoenix.



[jira] [Updated] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-07 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-26812:
--
Description: 
Just ran into this on the Phoenix side.
We retrieve a Connection via 
{{RegionCoprocessorEnvironment.createConnection... getTable(...)}}, and 
then call get on that table. The Get's key happens to be local. Now each call 
to table.get() leaves an open StoreScanner around forever (verified with a 
memory profiler).

The references are held via 
RegionScannerImpl.storeHeap.scannersForDelayedClose. Eventually the 
RegionServer goes into a GC of death and can only be ended with kill -9.

The reason appears to be that in this case there is no currentCall context. 
Some time in 2.x the Rpc handler/call was made responsible for closing open 
region scanners, but we forgot to handle {{ShortCircuitingClusterConnection}}.

It's not immediately clear how to fix this. But it does make 
ShortCircuitingClusterConnection useless and dangerous. If you use it, you 
*will* create a giant memory leak.

  was:
Just ran into this on the Phoenix side.
We retrieve a Connection via {{RegionCoprocessorEnvironment.createConnection... 
getTable(...)}}. And then call get on that table. The Get's key happens to 
local. Now each call to table.get() leaves an open StoreScanner around forever. 
(verified with a memory profiler).

There references are held via 
RegionScannerImpl.storeHeap.scannersForDelayedClose. Eventially the 
RegionServer goes a GC of death.

The reason appears to be that in this case there is currentCall context. Some 
time in 2.x the Rpc handler/call was made responsible for closing open region 
scanners, but we forgot to handle {{ShortCircuitingClusterConnection}}

It's not immediately clear how to fix this. But it does make 
ShortCircuitingClusterConnection useless and dangerous. If you use it, you 
*will* create a giant memory leak.




[jira] [Created] (HBASE-26812) ShortCircuitingClusterConnection fails to close RegionScanners when making short-circuited calls

2022-03-07 Thread Lars Hofhansl (Jira)
Lars Hofhansl created HBASE-26812:
-

 Summary: ShortCircuitingClusterConnection fails to close 
RegionScanners when making short-circuited calls
 Key: HBASE-26812
 URL: https://issues.apache.org/jira/browse/HBASE-26812
 Project: HBase
  Issue Type: Bug
Affects Versions: 2.4.9
Reporter: Lars Hofhansl


Just ran into this on the Phoenix side.
We retrieve a Connection via {{RegionCoprocessorEnvironment.createConnection... 
getTable(...)}}, and then call get on that table. The Get's key happens to be 
local. Now each call to table.get() leaves an open StoreScanner around forever 
(verified with a memory profiler).

The references are held via 
RegionScannerImpl.storeHeap.scannersForDelayedClose. Eventually the 
RegionServer goes into a GC of death.

The reason appears to be that in this case there is no currentCall context. Some 
time in 2.x the Rpc handler/call was made responsible for closing open region 
scanners, but we forgot to handle {{ShortCircuitingClusterConnection}}.

It's not immediately clear how to fix this. But it does make 
ShortCircuitingClusterConnection useless and dangerous. If you use it, you 
*will* create a giant memory leak.





[jira] [Commented] (HBASE-25505) ZK watcher threads are daemonized; reconsider

2021-01-13 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264588#comment-17264588
 ] 

Lars Hofhansl commented on HBASE-25505:
---

In case someone wants to - as I said above - I tagged the thread pool with the 
identifier of the watcher and then checked the hung thread, which shows this to 
be a ZK watcher on behalf of the ReplicationLogCleaner.
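
The tagging itself is trivial; a minimal sketch of the idea (hypothetical 
names, any ThreadFactory-based pool works the same way):

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TaggedWatcherPool {
  // Name each pool thread after the owning watcher so a thread dump of a
  // hung shutdown immediately shows which ZKWatcher the thread belongs to.
  static ExecutorService newWatcherPool(String watcherId) {
    return Executors.newSingleThreadExecutor(
      runnable -> new Thread(runnable, "zk-event-worker-" + watcherId));
  }
}
{code}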

 

> ZK watcher threads are daemonized; reconsider
> -
>
> Key: HBASE-25505
> URL: https://issues.apache.org/jira/browse/HBASE-25505
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Andrew Kyle Purtell
>Priority: Major
>
> On HBASE-25279 there was some discussion and difference of opinion about 
> having ZK watcher pool threads be daemonized. This is not necessarily a 
> problem but should be reconsidered. 
> Daemon threads are subject to abrupt termination during JVM shutdown and 
> therefore may be interrupted before state changes are complete or resources 
> are released. 
> As long as ZK watchers are properly closed by shutdown logic the pool threads 
> will be terminated in a controlled manner and the JVM will exit. 





[jira] [Commented] (HBASE-25279) Non-daemon thread in ZKWatcher

2021-01-13 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264587#comment-17264587
 ] 

Lars Hofhansl commented on HBASE-25279:
---

I agree the failure to close the watcher is the bug. We're papering over the 
actual problem, and I bet that nobody will fix HBASE-25505 :)

> Non-daemon thread in ZKWatcher
> --
>
> Key: HBASE-25279
> URL: https://issues.apache.org/jira/browse/HBASE-25279
> Project: HBase
>  Issue Type: Bug
>  Components: Zookeeper
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.1
>
>
> ZKWatcher spawns an ExecutorService which doesn't mark its threads as daemons, 
> which will prevent clean shutdowns.





[jira] [Commented] (HBASE-25279) Non-daemon thread in ZKWatcher

2021-01-13 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264527#comment-17264527
 ] 

Lars Hofhansl commented on HBASE-25279:
---

I just came across this as well. 

Currently, on my 2.4.1, the HMaster still hangs upon shutdown. When I annotated 
the thread names with the identifier of the ZKWatcher owning that pool, I saw 
that it's on behalf of the ReplicationLogCleaner. Following the life-cycle of 
HFileLogCleaner, CleanerChore, and ScheduledChore I can't find anything 
obviously wrong. (If you called setConf more than once the previous ZKWatcher 
would not get closed, but that turned out not to be the problem.)

[~apurtell] also tried but could not reproduce.

 

[~elserj], [~vjasani] are you still seeing this problem?

> Non-daemon thread in ZKWatcher
> --
>
> Key: HBASE-25279
> URL: https://issues.apache.org/jira/browse/HBASE-25279
> Project: HBase
>  Issue Type: Bug
>Reporter: Josh Elser
>Assignee: Josh Elser
>Priority: Critical
> Fix For: 3.0.0-alpha-1, 2.4.1
>
>
> ZKWatcher spawns an ExecutorService which doesn't mark its threads as daemons, 
> which will prevent clean shutdowns.





[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-08-02 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169628#comment-17169628
 ] 

Lars Hofhansl commented on HBASE-24742:
---

Re: Tests. This is purely an internal optimization of another optimization (no 
kidding) with no functional impact. For the SEEK optimization we have tests 
that assert the number of SEEKs vs SKIPs during scanning.

I cannot think of any useful additional tests. Lemme perhaps check if there are 
SEEK vs SKIP tests with ROWCOL BFs enabled. Or [~bharathv], could you perhaps 
have a look, as I'm off for the next few weeks.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.1, 1.7.0, 2.4.0, 2.1.10, 2.2.6
>
> Attachments: 24742-master.txt, hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]





[jira] [Commented] (HBASE-24637) Reseek regression related to filter SKIP hinting

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159618#comment-17159618
 ] 

Lars Hofhansl commented on HBASE-24637:
---

I see the logic in UserScanQueryMatcher (in mergeFilterResponse()) has changed 
to do exactly the logic I described above.

It tries to be smarter in the case where the Filter said SKIP but the SQM said 
SEEK.
In theory SEEK'ing is better, but it looks like it's causing exactly this 
change of behavior.
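
From memory, the merge is roughly shaped like this (a simplified sketch, not 
the exact HBase code):

{code:java}
import org.apache.hadoop.hbase.filter.Filter.ReturnCode;
import org.apache.hadoop.hbase.regionserver.querymatcher.ScanQueryMatcher.MatchCode;

public class MergeFilterResponseSketch {
  // Simplified: when the filter says SKIP but the matcher already knows the
  // column/row is exhausted, the merged answer is upgraded to a SEEK hint.
  static MatchCode merge(MatchCode matchCode, ReturnCode filterResponse) {
    if (filterResponse == ReturnCode.SKIP) {
      if (matchCode == MatchCode.INCLUDE_AND_SEEK_NEXT_COL) {
        return MatchCode.SEEK_NEXT_COL;
      }
      if (matchCode == MatchCode.INCLUDE_AND_SEEK_NEXT_ROW) {
        return MatchCode.SEEK_NEXT_ROW;
      }
    }
    return matchCode;
  }
}
{code}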


> Reseek regression related to filter SKIP hinting
> 
>
> Key: HBASE-24637
> URL: https://issues.apache.org/jira/browse/HBASE-24637
> Project: HBase
>  Issue Type: Bug
>  Components: Filters, Performance, Scanners
>Affects Versions: 2.2.5
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Attachments: W-7665966-FAST_DIFF-FILTER_ALL.pdf, 
> W-7665966-Instrument-low-level-scan-details-branch-1.patch, 
> W-7665966-Instrument-low-level-scan-details-branch-2.2.patch, 
> parse_call_trace.pl
>
>
> I have been looking into reported performance regressions in HBase 2 relative 
> to HBase 1. Depending on the test scenario, HBase 2 can demonstrate 
> significantly better microbenchmarks in a number of cases, and usually shows 
> improvement in whole cluster benchmarks like YCSB.
> To assist in debugging I added methods to RpcServer for updating per-call 
> metrics that leverage the fact it puts a reference to the current Call into a 
> thread local and that all activity for a given RPC is processed by a single 
> thread context. I then instrumented ScanQueryMatcher (in branch-1) and its 
> various friends (in branch-2.2), StoreScanner, HFileReaderV2 and 
> HFileReaderV3 (in branch-1) and HFileReaderImpl (in branch-2.2), HFileBlock, 
> and DefaultMemStore (branch-1) and SegmentScanner (branch-2.2). Test tables 
> with one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per 
> row were created, snapshot, dropped, and cloned from the snapshot. Both 1.6 
> and 2.2 versions under test operated on identical data files in HDFS. For 
> tests with 1.6 and 2.2 on the server side the same 1.6 PE client was used, to 
> ensure only the server side differed.
> The results for pe --filterAll were revealing. See attached. 
> It appears a refactor to ScanQueryMatcher and friends has disabled the 
> ability of filters to provide meaningful SKIP hints, which disables an 
> optimization that avoids reseeking, leading to a serious and proportional 
> regression in reseek activity and time spent in that code path. So for 
> queries that use filters, there can be a substantial regression.
> Other test cases that did not use filters did not show this regression. If 
> filters are not used the behavior of ScanQueryMatcher between 1.6 and 2.2 was 
> almost identical, as measured by counts of the hint types returned, whether 
> or not column or version trackers are called, and counts of store seeks or 
> reseeks. Regarding micro-timings, there was a 10% variance in my testing and 
> results generally fell within this range, except for the filter all case of 
> course. 
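
For illustration, such per-call instrumentation generally has this shape (a 
hypothetical sketch, not the actual patch):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class PerCallMetricsSketch {
  // Counters keyed off a thread-local map, exploiting the fact that one
  // handler thread services a given RPC from start to finish.
  private static final ThreadLocal<Map<String, Long>> CALL_COUNTERS =
      ThreadLocal.withInitial(HashMap::new);

  static void increment(String counter) {
    CALL_COUNTERS.get().merge(counter, 1L, Long::sum);
  }

  // Read and reset at the end of the call, e.g. when the response is sent.
  static Map<String, Long> drain() {
    Map<String, Long> snapshot = CALL_COUNTERS.get();
    CALL_COUNTERS.remove();
    return snapshot;
  }
}
{code}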





[jira] [Commented] (HBASE-24637) Filter SKIP hinting regression

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159486#comment-17159486
 ] 

Lars Hofhansl commented on HBASE-24637:
---

Yep. And unfortunately whether a SEEK is an advantage depends on many factors. 
If there are many versions then seeking to the next column and row is cheaper 
than skipping, and having these hints enables that.
However, if there are few versions then SKIP'ing is better, and the optimization 
can only figure out so much.
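
As a sketch of the trade-off (illustrative pseudo-logic, not HBase's actual 
optimize() code):

{code:java}
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;

public class SeekVsSkipSketch {
  // A SEEK repositions the scanner via the block index and pays a fixed cost;
  // a SKIP merely advances to the next cell. Seeking only wins when the seek
  // target lies beyond the current block, i.e. many cells would be skipped.
  static boolean seekIsWorthIt(CellComparator comparator, Cell seekTarget, Cell nextIndexedKey) {
    // nextIndexedKey approximates the first key of the next block; if the
    // target is still before it, plain SKIPs within this block are cheaper.
    return nextIndexedKey != null && comparator.compare(seekTarget, nextIndexedKey) >= 0;
  }
}
{code}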

I agree that we should restore the previous behavior. It pains me a bit, since 
getting more information from the SQM is a good thing - looks like this was too 
much of a good thing :) And likely just introduced by accident anyway.




[jira] [Comment Edited] (HBASE-24637) Filter SKIP hinting regression

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159477#comment-17159477
 ] 

Lars Hofhansl edited comment on HBASE-24637 at 7/16/20, 8:46 PM:
-

I see. SKIP is not a hint as such, though; it's the default. The hint (which 
can be ignored) is the SEEK hint. Both are implemented with return codes.

Looks like the SQM is now marking transitions from Column to Column with a 
SEEK-to-next-column hint, and for each row with a SEEK-to-next-row.

Also looking at the numbers, the optimization I mentioned is turning the vast 
majority of SEEKs back into SKIPs (and that check is not free).

As I said, it's not wrong per se (need to look at the code more), but that does 
not mean that there isn't a performance regression - as I have described in the 
previous comment - that we need to fix, possibly by restoring the old behavior.

Edit: Grammar :)


was (Author: lhofhansl):
I see. SKIP is not a hint as such, though; it's the default. The hint (which 
can be ignored) is the SEEK hint. Both are implemented with return codes.

Looks like the SQM is now marking transitions from Column to Column with a 
SEEK-to-next-column hint, and for each row with a SEEK-to-next-row.

Also look at the numbers the optimization I mentioned is turning the vast 
majority back into SKIPs (and that check is not free).

As I said, it's not wrong per se (need to look at the code more), but that does 
not mean that there isn't a performance regression - as I have described in the 
previous comment - that we need to fix, possibly by restoring the old behavior.




[jira] [Commented] (HBASE-24637) Filter SKIP hinting regression

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159480#comment-17159480
 ] 

Lars Hofhansl commented on HBASE-24637:
---

That is to say: a SEEK is a hint that can be turned into a series of SKIPs, but 
a SKIP carries no extra information.
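
As a concrete illustration of these return codes, a filter decides per cell; a 
minimal sketch of a FilterBase subclass method, where wanted() is a hypothetical 
stand-in for real matching logic:

{code:java}
// Sketch: emit a SEEK hint (NEXT_COL) for unwanted columns. Returning SKIP
// instead would advance cell by cell - the default discussed above. The
// scanner may follow the NEXT_COL hint or ignore it.
@Override
public ReturnCode filterCell(Cell c) {
  if (wanted(c)) {
    return ReturnCode.INCLUDE;
  }
  return ReturnCode.NEXT_COL;
}
{code}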

> Filter SKIP hinting regression
> --
>
> Key: HBASE-24637
> URL: https://issues.apache.org/jira/browse/HBASE-24637
> Project: HBase
>  Issue Type: Bug
>  Components: Filters, Performance, Scanners
>Affects Versions: 2.2.5
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Attachments: W-7665966-FAST_DIFF-FILTER_ALL.pdf, 
> W-7665966-Instrument-low-level-scan-details-branch-1.patch, 
> W-7665966-Instrument-low-level-scan-details-branch-2.2.patch, 
> parse_call_trace.pl
>
>
> I have been looking into reported performance regressions in HBase 2 relative 
> to HBase 1. Depending on the test scenario, HBase 2 can demonstrate 
> significantly better microbenchmarks in a number of cases, and usually shows 
> improvement in whole cluster benchmarks like YCSB.
> To assist in debugging I added methods to RpcServer for updating per-call 
> metrics that leverage the fact it puts a reference to the current Call into a 
> thread local and that all activity for a given RPC is processed by a single 
> thread context. I then instrumented ScanQueryMatcher (in branch-1) and its 
> various friends (in branch-2.2), StoreScanner, HFileReaderV2 and 
> HFileReaderV3 (in branch-1) and HFileReaderImpl (in branch-2.2), HFileBlock, 
> and DefaultMemStore (branch-1) and SegmentScanner (branch-2.2). Test tables 
> with one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per 
> row were created, snapshot, dropped, and cloned from the snapshot. Both 1.6 
> and 2.2 versions under test operated on identical data files in HDFS. For 
> tests with 1.6 and 2.2 on the server side the same 1.6 PE client was used, to 
> ensure only the server side differed.
> The results for pe --filterAll were revealing. See attached. 
> It appears a refactor to ScanQueryMatcher and friends has disabled the 
> ability of filters to provide meaningful SKIP hints, which disables an 
> optimization that avoids reseeking, leading to a serious and proportional 
> regression in reseek activity and time spent in that code path. So for 
> queries that use filters, there can be a substantial regression.
> Other test cases that did not use filters did not show this regression. If 
> filters are not used the behavior of ScanQueryMatcher between 1.6 and 2.2 was 
> almost identical, as measured by counts of the hint types returned, whether 
> or not column or version trackers are called, and counts of store seeks or 
> reseeks. Regarding micro-timings, there was a 10% variance in my testing and 
> results generally fell within this range, except for the filter all case of 
> course. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24637) Filter SKIP hinting regression

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159477#comment-17159477
 ] 

Lars Hofhansl commented on HBASE-24637:
---

I see. SKIP is not a hint as such, though; it's the default. The hint (which 
can be ignored) is the SEEK hint. Both are implemented with return codes.

Looks like the SQM is now marking transitions from Column to Column with a 
SEEK-to-next-column hint, and each row transition with a SEEK-to-next-row hint.

Also, looking at the numbers, the optimization I mentioned is turning the vast 
majority back into SKIPs (and that check is not free).

As I said, it's not wrong per se (I need to look at the code more), but that 
does not mean that there isn't a performance regression - as I described in the 
previous comment - that we need to fix, possibly by restoring the old behavior.


> Filter SKIP hinting regression
> --
>
> Key: HBASE-24637
> URL: https://issues.apache.org/jira/browse/HBASE-24637
> Project: HBase
>  Issue Type: Bug
>  Components: Filters, Performance, Scanners
>Affects Versions: 2.2.5
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Attachments: W-7665966-FAST_DIFF-FILTER_ALL.pdf, 
> W-7665966-Instrument-low-level-scan-details-branch-1.patch, 
> W-7665966-Instrument-low-level-scan-details-branch-2.2.patch, 
> parse_call_trace.pl
>
>
> I have been looking into reported performance regressions in HBase 2 relative 
> to HBase 1. Depending on the test scenario, HBase 2 can demonstrate 
> significantly better microbenchmarks in a number of cases, and usually shows 
> improvement in whole cluster benchmarks like YCSB.
> To assist in debugging I added methods to RpcServer for updating per-call 
> metrics that leverage the fact it puts a reference to the current Call into a 
> thread local and that all activity for a given RPC is processed by a single 
> thread context. I then instrumented ScanQueryMatcher (in branch-1) and its 
> various friends (in branch-2.2), StoreScanner, HFileReaderV2 and 
> HFileReaderV3 (in branch-1) and HFileReaderImpl (in branch-2.2), HFileBlock, 
> and DefaultMemStore (branch-1) and SegmentScanner (branch-2.2). Test tables 
> with one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per 
> row were created, snapshot, dropped, and cloned from the snapshot. Both 1.6 
> and 2.2 versions under test operated on identical data files in HDFS. For 
> tests with 1.6 and 2.2 on the server side the same 1.6 PE client was used, to 
> ensure only the server side differed.
> The results for pe --filterAll were revealing. See attached. 
> It appears a refactor to ScanQueryMatcher and friends has disabled the 
> ability of filters to provide meaningful SKIP hints, which disables an 
> optimization that avoids reseeking, leading to a serious and proportional 
> regression in reseek activity and time spent in that code path. So for 
> queries that use filters, there can be a substantial regression.
> Other test cases that did not use filters did not show this regression. If 
> filters are not used the behavior of ScanQueryMatcher between 1.6 and 2.2 was 
> almost identical, as measured by counts of the hint types returned, whether 
> or not column or version trackers are called, and counts of store seeks or 
> reseeks. Regarding micro-timings, there was a 10% variance in my testing and 
> results generally fell within this range, except for the filter all case of 
> course. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24637) Filter SKIP hinting regression

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159467#comment-17159467
 ] 

Lars Hofhansl commented on HBASE-24637:
---

Maybe I misunderstood the data in the pdf...? Looks like the SQM is hinting 
seeks way more in branch-2 than branch-1.

> Filter SKIP hinting regression
> --
>
> Key: HBASE-24637
> URL: https://issues.apache.org/jira/browse/HBASE-24637
> Project: HBase
>  Issue Type: Bug
>  Components: Filters, Performance, Scanners
>Affects Versions: 2.2.5
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Attachments: W-7665966-FAST_DIFF-FILTER_ALL.pdf, 
> W-7665966-Instrument-low-level-scan-details-branch-1.patch, 
> W-7665966-Instrument-low-level-scan-details-branch-2.2.patch, 
> parse_call_trace.pl
>
>
> I have been looking into reported performance regressions in HBase 2 relative 
> to HBase 1. Depending on the test scenario, HBase 2 can demonstrate 
> significantly better microbenchmarks in a number of cases, and usually shows 
> improvement in whole cluster benchmarks like YCSB.
> To assist in debugging I added methods to RpcServer for updating per-call 
> metrics that leverage the fact it puts a reference to the current Call into a 
> thread local and that all activity for a given RPC is processed by a single 
> thread context. I then instrumented ScanQueryMatcher (in branch-1) and its 
> various friends (in branch-2.2), StoreScanner, HFileReaderV2 and 
> HFileReaderV3 (in branch-1) and HFileReaderImpl (in branch-2.2), HFileBlock, 
> and DefaultMemStore (branch-1) and SegmentScanner (branch-2.2). Test tables 
> with one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per 
> row were created, snapshot, dropped, and cloned from the snapshot. Both 1.6 
> and 2.2 versions under test operated on identical data files in HDFS. For 
> tests with 1.6 and 2.2 on the server side the same 1.6 PE client was used, to 
> ensure only the server side differed.
> The results for pe --filterAll were revealing. See attached. 
> It appears a refactor to ScanQueryMatcher and friends has disabled the 
> ability of filters to provide meaningful SKIP hints, which disables an 
> optimization that avoids reseeking, leading to a serious and proportional 
> regression in reseek activity and time spent in that code path. So for 
> queries that use filters, there can be a substantial regression.
> Other test cases that did not use filters did not show this regression. If 
> filters are not used the behavior of ScanQueryMatcher between 1.6 and 2.2 was 
> almost identical, as measured by counts of the hint types returned, whether 
> or not column or version trackers are called, and counts of store seeks or 
> reseeks. Regarding micro-timings, there was a 10% variance in my testing and 
> results generally fell within this range, except for the filter all case of 
> course. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24637) Filter SKIP hinting regression

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159459#comment-17159459
 ] 

Lars Hofhansl edited comment on HBASE-24637 at 7/16/20, 8:31 PM:
-

Hmm... The SQM giving more precise SEEK hints is not necessarily wrong. It's a 
hint that a SEEK is *possible*.

The SKIP vs SEEK optimization I put in place a while ago then decides at the 
StoreScanner level whether to follow that hint or not. Now, that optimization 
itself is not free: it adds 1 or 2 extra compares.

In HBASE-24742 I managed to remove one compare in most cases. So it might be 
better now, but it's still not good if we issue too many SEEK hints, for each 
of which we then have to decide whether to follow it or not.



was (Author: lhofhansl):
Hmm... The SQM giving more precise SEEK hints is not necessarily wrong. It's a 
hint that a SEEK is *possible*.

With the SKIP vs SEEK optimization I put in place a while ago then decides at 
the StoreScanner to follow that hint or not. Now, that itself optimization is 
not free, it adds one compare per Cell-version + 1 or 2 extra compares (# 
versions + 1 or 2 in total).

In HBASE-24742 I managed to remove one compare in most cases. So it might 
better now, but it's still not good if we issue too many SEEK hints, for each 
of which we then have to decide to follow it or not.


> Filter SKIP hinting regression
> --
>
> Key: HBASE-24637
> URL: https://issues.apache.org/jira/browse/HBASE-24637
> Project: HBase
>  Issue Type: Bug
>  Components: Filters, Performance, Scanners
>Affects Versions: 2.2.5
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Attachments: W-7665966-FAST_DIFF-FILTER_ALL.pdf, 
> W-7665966-Instrument-low-level-scan-details-branch-1.patch, 
> W-7665966-Instrument-low-level-scan-details-branch-2.2.patch, 
> parse_call_trace.pl
>
>
> I have been looking into reported performance regressions in HBase 2 relative 
> to HBase 1. Depending on the test scenario, HBase 2 can demonstrate 
> significantly better microbenchmarks in a number of cases, and usually shows 
> improvement in whole cluster benchmarks like YCSB.
> To assist in debugging I added methods to RpcServer for updating per-call 
> metrics that leverage the fact it puts a reference to the current Call into a 
> thread local and that all activity for a given RPC is processed by a single 
> thread context. I then instrumented ScanQueryMatcher (in branch-1) and its 
> various friends (in branch-2.2), StoreScanner, HFileReaderV2 and 
> HFileReaderV3 (in branch-1) and HFileReaderImpl (in branch-2.2), HFileBlock, 
> and DefaultMemStore (branch-1) and SegmentScanner (branch-2.2). Test tables 
> with one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per 
> row were created, snapshot, dropped, and cloned from the snapshot. Both 1.6 
> and 2.2 versions under test operated on identical data files in HDFS. For 
> tests with 1.6 and 2.2 on the server side the same 1.6 PE client was used, to 
> ensure only the server side differed.
> The results for pe --filterAll were revealing. See attached. 
> It appears a refactor to ScanQueryMatcher and friends has disabled the 
> ability of filters to provide meaningful SKIP hints, which disables an 
> optimization that avoids reseeking, leading to a serious and proportional 
> regression in reseek activity and time spent in that code path. So for 
> queries that use filters, there can be a substantial regression.
> Other test cases that did not use filters did not show this regression. If 
> filters are not used the behavior of ScanQueryMatcher between 1.6 and 2.2 was 
> almost identical, as measured by counts of the hint types returned, whether 
> or not column or version trackers are called, and counts of store seeks or 
> reseeks. Regarding micro-timings, there was a 10% variance in my testing and 
> results generally fell within this range, except for the filter all case of 
> course. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24637) Filter SKIP hinting regression

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159459#comment-17159459
 ] 

Lars Hofhansl commented on HBASE-24637:
---

Hmm... The SQM giving more precise SEEK hints is not necessarily wrong. It's a 
hint that a SEEK is *possible*.

With the SKIP vs SEEK optimization I put in place a while ago then decides at 
the StoreScanner to follow that hint or not. Now, that itself optimization is 
not free, it adds one compare per Cell-version + 1 or 2 extra compares (# 
versions + 1 or 2 in total).

In HBASE-24742 I managed to remove one compare in most cases. So it might 
better now, but it's still not good if we issue too many SEEK hints, for each 
of which we then have to decide to follow it or not.
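
A minimal sketch of the compare-saving idea from HBASE-24742, with hypothetical 
names (the actual patch differs in detail):

{code:java}
// Sketch: remember the last nextIndexedKey by object identity so a costly
// check runs once per indexed block rather than once per cell. This is
// purely an optimization; correctness does not depend on the cache.
private Cell previousIndexedKey;
private boolean previousWasFake;

private boolean isFakeIndexedKey(Cell nextIndexedKey) {
  if (nextIndexedKey != previousIndexedKey) { // cheap identity compare
    previousIndexedKey = nextIndexedKey;
    previousWasFake = expensiveFakeKeyCheck(nextIndexedKey);
  }
  return previousWasFake;
}

private boolean expensiveFakeKeyCheck(Cell key) {
  return false; // stand-in for the real comparison against the fake-key marker
}
{code}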


> Filter SKIP hinting regression
> --
>
> Key: HBASE-24637
> URL: https://issues.apache.org/jira/browse/HBASE-24637
> Project: HBase
>  Issue Type: Bug
>  Components: Filters, Performance, Scanners
>Affects Versions: 2.2.5
>Reporter: Andrew Kyle Purtell
>Priority: Major
> Attachments: W-7665966-FAST_DIFF-FILTER_ALL.pdf, 
> W-7665966-Instrument-low-level-scan-details-branch-1.patch, 
> W-7665966-Instrument-low-level-scan-details-branch-2.2.patch, 
> parse_call_trace.pl
>
>
> I have been looking into reported performance regressions in HBase 2 relative 
> to HBase 1. Depending on the test scenario, HBase 2 can demonstrate 
> significantly better microbenchmarks in a number of cases, and usually shows 
> improvement in whole cluster benchmarks like YCSB.
> To assist in debugging I added methods to RpcServer for updating per-call 
> metrics that leverage the fact it puts a reference to the current Call into a 
> thread local and that all activity for a given RPC is processed by a single 
> thread context. I then instrumented ScanQueryMatcher (in branch-1) and its 
> various friends (in branch-2.2), StoreScanner, HFileReaderV2 and 
> HFileReaderV3 (in branch-1) and HFileReaderImpl (in branch-2.2), HFileBlock, 
> and DefaultMemStore (branch-1) and SegmentScanner (branch-2.2). Test tables 
> with one family and 1, 5, 10, 20, 50, and 100 distinct column-qualifiers per 
> row were created, snapshot, dropped, and cloned from the snapshot. Both 1.6 
> and 2.2 versions under test operated on identical data files in HDFS. For 
> tests with 1.6 and 2.2 on the server side the same 1.6 PE client was used, to 
> ensure only the server side differed.
> The results for pe --filterAll were revealing. See attached. 
> It appears a refactor to ScanQueryMatcher and friends has disabled the 
> ability of filters to provide meaningful SKIP hints, which disables an 
> optimization that avoids reseeking, leading to a serious and proportional 
> regression in reseek activity and time spent in that code path. So for 
> queries that use filters, there can be a substantial regression.
> Other test cases that did not use filters did not show this regression. If 
> filters are not used the behavior of ScanQueryMatcher between 1.6 and 2.2 was 
> almost identical, as measured by counts of the hint types returned, whether 
> or not column or version trackers are called, and counts of store seeks or 
> reseeks. Regarding micro-timings, there was a 10% variance in my testing and 
> results generally fell within this range, except for the filter all case of 
> course. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-24742.
---
Resolution: Fixed

Also pushed to branch-2 and master.

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: 24742-master.txt, hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-24742:
--
Attachment: 24742-master.txt

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: 24742-master.txt, hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159444#comment-17159444
 ] 

Lars Hofhansl commented on HBASE-24742:
---

Master (and branch-2) patch. Will just apply as they're the same as the 
branch-1 patch.

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: 24742-master.txt, hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-24742:
--
Fix Version/s: 2.4.0
   3.0.0-alpha-1

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.4.0
>
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl reopened HBASE-24742:
---

Lemme put this into branch-2 and master as well.

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159385#comment-17159385
 ] 

Lars Hofhansl edited comment on HBASE-24742 at 7/16/20, 7:50 PM:
-

Merged into branch-1.

I'll look into master/branch-2, but my feeling is that things are quite 
different there.

[~apurtell] (before you yell at me for not looking at branch-2/master) :)


was (Author: lhofhansl):
Merged into branch-1.

I'll look into master/branch-2, but my feeling is that things are quite 
different there.

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-24742.
---
Resolution: Fixed

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159385#comment-17159385
 ] 

Lars Hofhansl commented on HBASE-24742:
---

Merged into branch-1.

I'll look into master/branch-2, but my feeling is that things are quite 
different there.

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-16 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-24742:
--
Fix Version/s: 1.7.0

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-15 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158497#comment-17158497
 ] 

Lars Hofhansl commented on HBASE-24742:
---

Created a PR for observation #1 above.

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157895#comment-17157895
 ] 

Lars Hofhansl commented on HBASE-24742:
---

Upon second thought: perhaps a seek (not a reseek, but a seek that could 
actually go backwards) could make it so that previousIndexedKey and 
nextIndexedKey are accidentally the same, and we *still* would have to do the 
compare.

In my test the majority of the improvement came from the first change.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157890#comment-17157890
 ] 

Lars Hofhansl commented on HBASE-24742:
---

[~zhangduo] I'm with you there. The second part was harder to reason about, and 
I feel a bit less easy about it.
In the end it's an optimization to save a comparison; previousIndexedKey and 
nextIndexedKey will never accidentally be the same (as in identical), so I 
*think* it should be OK.

I'm not sure how we can avoid passing the fake keys up, since it is designed to 
handle things in the "upper" heap. At least not without a lot of refactoring.

[~apurtell] I'll take a look at HBASE-24637 (I'm a bit thinly spread, though)


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>  Components: Performance, regionserver
>Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.4.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-1.6-regression-flame-graph.png, 
> hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157774#comment-17157774
 ] 

Lars Hofhansl edited comment on HBASE-24742 at 7/15/20, 12:43 AM:
--

There are two observations:
1. We do not need to check for "fake" keys inserted by the ROWCOL BF logic if 
there are no ROWCOL BFs (or if they are not used) - see the sketch below.
2. We can extend the identity-compare of the nextIndexedKey across multiple 
calls. It's just an optimization and not needed for correctness.
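
A minimal sketch of observation 1, assuming access to the store's column family 
descriptor; the flag name and surrounding code are hypothetical:

{code:java}
// Sketch: only pay for the fake-key check when a ROWCOL bloom filter could
// actually have inserted fake keys into the scanner's key stream.
boolean mayContainFakeKeys =
    store.getColumnFamilyDescriptor().getBloomFilterType() == BloomType.ROWCOL;

if (mayContainFakeKeys && isFakeIndexedKey(nextIndexedKey)) {
  // handle the fake key as before; this branch is skipped entirely when the
  // family has no ROWCOL bloom filter
}
{code}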



was (Author: lhofhansl):
There are two observations:
1. We do not need to check for "fake" keys inserted by the ROWCOL BF logic if 
there are not ROWCOL BFs (or if they are not used)
2. We can extend the identify compare of the nextIndexedKey across multiple 
calls. It's just an optimization and not for correctness.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157774#comment-17157774
 ] 

Lars Hofhansl edited comment on HBASE-24742 at 7/15/20, 12:42 AM:
--

There are two observations:
1. We do not need to check for "fake" keys inserted by the ROWCOL BF logic if 
there are not ROWCOL BFs (or if they are not used)
2. We can extend the identify compare of the nextIndexedKey across multiple 
calls. It's just an optimization and not for correctness.



was (Author: lhofhansl):
There are two observations:
1. We do not need for "fake" keys inserted by the ROWCOL BF logic if there are 
not ROWCOL BFs (or if they are not used)
2. We can extend the identify compare of the nextIndexedKey across multiple 
calls. It's just an optimization and not for correctness.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157779#comment-17157779
 ] 

Lars Hofhansl edited comment on HBASE-24742 at 7/15/20, 12:42 AM:
--

Passes the test added in HBASE-19863 and brings the runtime of a test Phoenix 
query from 5.8s to 4.2s.
(This is for a fully compacted table and VERSIONS=1, which represents the worst 
case, where the two linked jiras triple the number of comparisons per Cell).

I'll post a PR tomorrow.


was (Author: lhofhansl):
Passes the test added in HBASE-19863 and brings the runtime of a test Phoenix 
query from 5.8s to 4.2s.
(This is for a fully compacted table, which represents the worst case, where 
the two linked jiras triple the number of comparisons per Cell).

I'll post a PR tomorrow.

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157779#comment-17157779
 ] 

Lars Hofhansl edited comment on HBASE-24742 at 7/15/20, 12:40 AM:
--

Passes the test added in HBASE-19863 and brings the runtime of a test Phoenix 
query from 5.8s to 4.2s.
(This is for a fully compacted table, which represents the worst case, where 
the two linked jiras triple the number of comparisons per Cell).

I'll post a PR tomorrow.


was (Author: lhofhansl):
Passes the test added in HBASE-19863 and brings the runtime of a test Phoenix 
query from 5.8s to 4.2s.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157779#comment-17157779
 ] 

Lars Hofhansl commented on HBASE-24742:
---

Passes the test added in HBASE-19863 and brings the runtime of a test Phoenix 
query from 5.8s to 4.2s.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl reassigned HBASE-24742:
-

Assignee: Lars Hofhansl

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157775#comment-17157775
 ] 

Lars Hofhansl commented on HBASE-24742:
---

Here's a patch.
Please have a careful look, especially at the part that turns 
previousIndexedKey into a member.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-24742:
--
Attachment: hbase-24742-branch-1.txt

> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: hbase-24742-branch-1.txt
>
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157774#comment-17157774
 ] 

Lars Hofhansl commented on HBASE-24742:
---

There are two observations:
1. We do not need for "fake" keys inserted by the ROWCOL BF logic if there are 
not ROWCOL BFs (or if they are not used)
2. We can extend the identify compare of the nextIndexedKey across multiple 
calls. It's just an optimization and not for correctness.


> Improve performance of SKIP vs SEEK logic
> -
>
> Key: HBASE-24742
> URL: https://issues.apache.org/jira/browse/HBASE-24742
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Priority: Major
>
> In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
> slowdown in scanning scenarios.
> We tracked it back to HBASE-17958 and HBASE-19863.
> Both add comparisons to one of the tightest loops HBase has.
> [~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-24742) Improve performance of SKIP vs SEEK logic

2020-07-14 Thread Lars Hofhansl (Jira)
Lars Hofhansl created HBASE-24742:
-

 Summary: Improve performance of SKIP vs SEEK logic
 Key: HBASE-24742
 URL: https://issues.apache.org/jira/browse/HBASE-24742
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl


In our testing of HBase 1.3 against the current tip of branch-1 we saw a 30% 
slowdown in scanning scenarios.

We tracked it back to HBASE-17958 and HBASE-19863.
Both add comparisons to one of the tightest loops HBase has.

[~bharathv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23349) Low refCount preventing archival of compacted away files

2020-01-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015365#comment-17015365
 ] 

Lars Hofhansl commented on HBASE-23349:
---

Minor nit: This is not lock coarsening. That was the failed attempt I had to 
reduce the frequency of taking memory barriers (the locks were almost never 
contended) by pushing the locking up the stack into the region scanner.
[~ram_krish] and [~anoop.hbase] then came up with an actual solution :), but 
that then required the reference counting.
Note that the numbers on HBASE-13082 were with the lock coarsening, not with 
reference counting.

At this point my concern is just about correctness and the issues we have seen 
with reference counting. It is generally very hard to retrofit reference 
counting into a large, complex system. Ram and Anoop did an awesome job! 
Perhaps HBase is just too complex to add this reliably.

> Low refCount preventing archival of compacted away files
> 
>
> Key: HBASE-23349
> URL: https://issues.apache.org/jira/browse/HBASE-23349
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
>
> We have observed that a refCount as low as 1 on compacted away store files is 
> preventing archival.
> {code:java}
> regionserver.HStore - Can't archive compacted file 
> hdfs://{{root-dir}}/hbase/data/default/t1/12a9e1112e0371955b3db8d3ebb2d298/cf1/73b72f5ddfce4a34a9e01afe7b83c1f9
>  because of either isCompactedAway=true or file has reference, 
> isReferencedInReads=true, refCount=1, skipping for now.
> {code}
> We should come up with core code (run as part of the discharger thread) to 
> gracefully resolve the reader lock issue by resetting ongoing scanners to 
> point to new store files instead of compacted away store files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23602) TTL Before Which No Data is Purged

2020-01-08 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011414#comment-17011414
 ] 

Lars Hofhansl commented on HBASE-23602:
---

When you set a TTL and KEEP_DELETED_CELLS=TTL and *MIN_VERSIONS* you get that.

Now HBase will keep everything (up to VERSIONS) until the TTL expires; after 
that it keeps MIN_VERSIONS.

At least that's what I had in mind when I added MIN_VERSIONS and 
KEEP_DELETED_CELLS to HBase back in the day. Granted it's a bit convoluted, but 
pretty flexible this way.

Say VERSIONS=MAX_INT, TTL=5 days, KEEP_DELETED_CELLS=TTL, MIN_VERSIONS=2. Now 
within 5 days you have everything - all Puts, all Deletes, etc. - and you can do 
correct point-in-time queries. After 5 days HBase retains only 2 versions.
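
For illustration, that setup could be expressed with the client API roughly 
like this (a sketch; table and family names are placeholders and 'admin' is an 
open Admin instance):

{code:java}
// Sketch: keep the complete history for 5 days, then retain 2 versions.
TableDescriptor td = TableDescriptorBuilder.newBuilder(TableName.valueOf("t1"))
  .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf1"))
    .setMaxVersions(Integer.MAX_VALUE)               // VERSIONS=MAX_INT
    .setTimeToLive((int) TimeUnit.DAYS.toSeconds(5)) // TTL=5 days, in seconds
    .setKeepDeletedCells(KeepDeletedCells.TTL)       // KEEP_DELETED_CELLS=TTL
    .setMinVersions(2)                               // MIN_VERSIONS=2
    .build())
  .build();
admin.createTable(td);
{code}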

> TTL Before Which No Data is Purged
> --
>
> Key: HBASE-23602
> URL: https://issues.apache.org/jira/browse/HBASE-23602
> Project: HBase
>  Issue Type: New Feature
>Reporter: Geoffrey Jacoby
>Assignee: Geoffrey Jacoby
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
>
> HBase currently offers operators a choice. They can set 
> KEEP_DELETED_CELLS=true and VERSIONS to max value, plus no TTL, and they will 
> always have a complete history of all changes (but high storage costs and 
> penalties to read performance). Or they can have KEEP_DELETED_CELLS=false and 
> VERSIONS/TTL set to some reasonable values, but that means that major 
> compactions can destroy the ability to do a consistent snapshot read of any 
> prior time. (This limits the usefulness and correctness of, for example, 
> Phoenix's SCN lookback feature.) 
> I propose having a new TTL property to give a minimum age that an expired or 
> deleted Cell would have to achieve before it could be purged. (I see that 
> HBASE-10118 already does something similar for the delete markers 
> themselves.) 
> This would allow operators to have a consistent history for some finite 
> amount of recent time while still purging out the "long tail" of obsolete / 
> deleted versions. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23602) TTL Before Which No Data is Purged

2020-01-08 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011089#comment-17011089
 ] 

Lars Hofhansl commented on HBASE-23602:
---

See HBASE-12363 for a (looong) discussion.

> TTL Before Which No Data is Purged
> --
>
> Key: HBASE-23602
> URL: https://issues.apache.org/jira/browse/HBASE-23602
> Project: HBase
>  Issue Type: New Feature
>Reporter: Geoffrey Jacoby
>Assignee: Geoffrey Jacoby
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
>
> HBase currently offers operators a choice. They can set 
> KEEP_DELETED_CELLS=true and VERSIONS to max value, plus no TTL, and they will 
> always have a complete history of all changes (but high storage costs and 
> penalties to read performance). Or they can have KEEP_DELETED_CELLS=false and 
> VERSIONS/TTL set to some reasonable values, but that means that major 
> compactions can destroy the ability to do a consistent snapshot read of any 
> prior time. (This limits the usefulness and correctness of, for example, 
> Phoenix's SCN lookback feature.) 
> I propose having a new TTL property to give a minimum age that an expired or 
> deleted Cell would have to achieve before it could be purged. (I see that 
> HBASE-10118 already does something similar for the delete markers 
> themselves.) 
> This would allow operators to have a consistent history for some finite 
> amount of recent time while still purging out the "long tail" of obsolete / 
> deleted versions. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23602) TTL Before Which No Data is Purged

2020-01-08 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17011087#comment-17011087
 ] 

Lars Hofhansl commented on HBASE-23602:
---

You can set KEEP_DELETED_CELLS=TTL (it can be set to true, false, or TTL) and 
get what you want, I think.

> TTL Before Which No Data is Purged
> --
>
> Key: HBASE-23602
> URL: https://issues.apache.org/jira/browse/HBASE-23602
> Project: HBase
>  Issue Type: New Feature
>Reporter: Geoffrey Jacoby
>Assignee: Geoffrey Jacoby
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
>
> HBase currently offers operators a choice. They can set 
> KEEP_DELETED_CELLS=true and VERSIONS to max value, plus no TTL, and they will 
> always have a complete history of all changes (but high storage costs and 
> penalties to read performance). Or they can have KEEP_DELETED_CELLS=false and 
> VERSIONS/TTL set to some reasonable values, but that means that major 
> compactions can destroy the ability to do a consistent snapshot read of any 
> prior time. (This limits the usefulness and correctness of, for example, 
> Phoenix's SCN lookback feature.) 
> I propose having a new TTL property to give a minimum age that an expired or 
> deleted Cell would have to achieve before it could be purged. (I see that 
> HBASE-10118 already does something similar for the delete markers 
> themselves.) 
> This would allow operators to have a consistent history for some finite 
> amount of recent time while still purging out the "long tail" of obsolete / 
> deleted versions. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23349) Low refCount preventing archival of compacted away files

2020-01-01 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006485#comment-17006485
 ] 

Lars Hofhansl commented on HBASE-23349:
---

Sure.

[~ram_krish], [~anoop.hbase], FYI. I know you guys invested a lot of time in 
this. In light of the issues I'm in favor of removing the refcounting code and 
restoring the old behavior. Let's have a discussion.

 

> Low refCount preventing archival of compacted away files
> 
>
> Key: HBASE-23349
> URL: https://issues.apache.org/jira/browse/HBASE-23349
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
>
> We have observed that a refCount as low as 1 on compacted away store files is 
> preventing archival.
> {code:java}
> regionserver.HStore - Can't archive compacted file 
> hdfs://{{root-dir}}/hbase/data/default/t1/12a9e1112e0371955b3db8d3ebb2d298/cf1/73b72f5ddfce4a34a9e01afe7b83c1f9
>  because of either isCompactedAway=true or file has reference, 
> isReferencedInReads=true, refCount=1, skipping for now.
> {code}
> We should come up with core code (run as part of the discharger thread) to 
> gracefully resolve the reader lock issue by resetting ongoing scanners to 
> point to new store files instead of compacted away store files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-6970) hbase-deamon.sh creates/updates pid file even when that start failed.

2019-12-24 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-6970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-6970.
--
Resolution: Won't Fix

> hbase-deamon.sh creates/updates pid file even when that start failed.
> -
>
> Key: HBASE-6970
> URL: https://issues.apache.org/jira/browse/HBASE-6970
> Project: HBase
>  Issue Type: Bug
>  Components: Usability
>Reporter: Lars Hofhansl
>Priority: Major
>
> We just ran into a strange issue where we could neither start nor stop 
> services with hbase-deamon.sh.
> The problem is this:
> {code}
> nohup nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase \
> --config "${HBASE_CONF_DIR}" \
> $command "$@" $startStop > "$logout" 2>&1 < /dev/null &
> echo $! > $pid
> {code}
> So the pid file is created or updated even when the start of the service 
> failed. The next stop command will then fail, because the pid file has the 
> wrong pid in it.
> Edit: Spelling and more spelling errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-9272) A parallel, unordered scanner

2019-12-24 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-9272.
--
Resolution: Won't Fix

> A parallel, unordered scanner
> -
>
> Key: HBASE-9272
> URL: https://issues.apache.org/jira/browse/HBASE-9272
> Project: HBase
>  Issue Type: New Feature
>Reporter: Lars Hofhansl
>Priority: Minor
> Attachments: 9272-0.94-v2.txt, 9272-0.94-v3.txt, 9272-0.94-v4.txt, 
> 9272-0.94.txt, 9272-trunk-v2.txt, 9272-trunk-v3.txt, 9272-trunk-v3.txt, 
> 9272-trunk-v4.txt, 9272-trunk.txt, ParallelClientScanner.java, 
> ParallelClientScanner.java
>
>
> The contract of ClientScanner is to return rows in sort order. That limits 
> the order in which regions can be scanned.
> I propose a simple ParallelScanner that does not have this requirement and 
> queries regions in parallel, returning whatever gets returned first.
> This is generally useful for scans that filter a lot of data on the server, 
> or in cases where the client can very quickly react to the returned data.
> I have a simple prototype (it doesn't do error handling right, and might be a 
> bit heavy on the synchronization side - it uses a BlockingQueue to hand data 
> between the client using the scanner and the threads doing the scanning; it 
> also could potentially starve some scanners long enough to time out at the 
> server).
> On the plus side, it's only about 130 lines of code. :)
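
A minimal sketch of the queue hand-off described above, assuming the per-region 
Scan objects are already computed; error propagation and the scanner-timeout 
concern are deliberately left out:

{code:java}
// Sketch: scan several regions in parallel, handing Results to the consumer
// in arrival order (i.e. unordered) through a BlockingQueue.
ExecutorService pool = Executors.newFixedThreadPool(scans.size());
BlockingQueue<Result> queue = new LinkedBlockingQueue<>(1024);
AtomicInteger running = new AtomicInteger(scans.size());

for (Scan scan : scans) { // one Scan per region, precomputed by the caller
  pool.submit(() -> {
    try (ResultScanner rs = table.getScanner(scan)) {
      for (Result r : rs) {
        queue.put(r); // blocks when the consumer falls behind
      }
    } catch (Exception e) {
      // real code needs to propagate this to the consumer
    } finally {
      running.decrementAndGet();
    }
    return null;
  });
}

// Consumer: poll until all producers are done and the queue is drained.
while (running.get() > 0 || !queue.isEmpty()) {
  Result r = queue.poll(100, TimeUnit.MILLISECONDS);
  if (r != null) {
    process(r); // hypothetical consumer callback
  }
}
pool.shutdown();
{code}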



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-13751) Refactoring replication WAL reading logic as WAL Iterator

2019-12-24 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-13751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-13751.
---
Resolution: Won't Fix

> Refactoring replication WAL reading logic as WAL Iterator
> -
>
> Key: HBASE-13751
> URL: https://issues.apache.org/jira/browse/HBASE-13751
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
>
> The current replication code is all over the place.
> A simple refactoring that we could consider is to factor out the part that 
> reads from the WALs. Could be a simple iterator interface with one additional 
> wrinkle: The iterator needs to be able to provide the position (file and 
> offset) of the last read edit.
> Once we have this, we can use it as a building block for many other changes 
> in the replication code.
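
The wrinkle described above amounts to a small interface; a minimal sketch with 
hypothetical names:

{code:java}
// Sketch: a WAL reading iterator that also exposes the replication position
// (file and offset) of the last returned edit.
interface WALEntryIterator extends Iterator<WAL.Entry>, Closeable {
  /** Path of the WAL file the last returned entry came from. */
  Path lastPath();

  /** Byte offset just past the last returned entry. */
  long lastOffset();
}
{code}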



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-14014) Explore row-by-row grouping options

2019-12-24 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-14014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-14014.
---
Resolution: Won't Fix

> Explore row-by-row grouping options
> ---
>
> Key: HBASE-14014
> URL: https://issues.apache.org/jira/browse/HBASE-14014
> Project: HBase
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Lars Hofhansl
>Priority: Major
>
> See discussion in parent.
> We need to consider the following attributes of WALKey:
> * The cluster ids
> * Table Name
> * write time (here we could use the latest of any batch)
> * seqNum
> As long as we preserve these we can rearrange the cells between WALEdits. 
> Since seqNum is unique this will be a challenge. Currently it is not used, 
> but we shouldn't design anything that prevents us from providing better 
> ordering guarantees using seqNum.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-14509) Configurable sparse indexes?

2019-12-24 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-14509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-14509.
---
Resolution: Won't Fix

> Configurable sparse indexes?
> 
>
> Key: HBASE-14509
> URL: https://issues.apache.org/jira/browse/HBASE-14509
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
>
> This idea just popped up today and I wanted to record it for discussion:
> What if we kept sparse column indexes per region or HFile or per configurable 
> range?
> I.e. for any given CQ we record the lowest and highest value for a particular 
> range (HFile, Region, or a custom range like the Phoenix guide post).
> By tweaking the size of these ranges we can control the size of the index vs. 
> its selectivity.
> For example, if we kept it by HFile we could almost instantly decide whether 
> we need to scan a particular HFile at all to find a particular value in a 
> Cell.
> We can also collect min/max values for each n MB of data, for example when we 
> scan the region the first time. Assuming ranges are large enough we can 
> always keep the index in memory together with the region.
> Kind of a sparse local index. Might be much easier than the buddy region 
> stuff we've been discussing.
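
A purely illustrative sketch of the bookkeeping (hypothetical class, not 
existing HBase code): record min/max per range, then prune any range whose 
min/max window cannot contain the value being searched for.
{code:java}
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.hbase.util.Bytes;

public class SparseMinMaxIndex {

  private static final class MinMax {
    final byte[] min;
    final byte[] max;
    MinMax(byte[] min, byte[] max) { this.min = min; this.max = max; }
  }

  // Range identifier (e.g. an HFile name) -> min/max seen in that range.
  private final Map<String, MinMax> ranges = new TreeMap<>();

  public void record(String range, byte[] value) {
    ranges.merge(range, new MinMax(value, value), (old, cur) -> new MinMax(
        Bytes.compareTo(cur.min, old.min) < 0 ? cur.min : old.min,
        Bytes.compareTo(cur.max, old.max) > 0 ? cur.max : old.max));
  }

  /** False means the range provably cannot contain the value: skip the scan. */
  public boolean mightContain(String range, byte[] value) {
    MinMax mm = ranges.get(range);
    if (mm == null) {
      return true; // no stats recorded, must scan
    }
    return Bytes.compareTo(value, mm.min) >= 0
        && Bytes.compareTo(value, mm.max) <= 0;
  }
}
{code}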



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23349) Low refCount preventing archival of compacted away files

2019-12-22 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001958#comment-17001958
 ] 

Lars Hofhansl commented on HBASE-23349:
---

I think we should step back and remember why we have the ref counting in the 
first place. This came from a discussion started in HBASE-13082 and 
HBASE-10060, namely too much synchronization.

If any change we make now needs new synchronization in the scanner.next(...) 
path, we're back to where we started, and in that case we should remove the 
ref counting and bring back the old notification and scanner switching we had 
before.

My apologies that I triggered the original discussion and then completely 
dropped off (worked on other stuff) when we attempted to fix it. Reference 
counting is bad (I've never seen it successfully implemented); if we can avoid 
it we should - a bit of performance drop is acceptable.

Long story short: if we bring back scanner notification then let's get rid of 
ref counting completely.

 

> Low refCount preventing archival of compacted away files
> 
>
> Key: HBASE-23349
> URL: https://issues.apache.org/jira/browse/HBASE-23349
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
>
> We have observed that a refCount as low as 1 on compacted-away store files is 
> preventing archival.
> {code:java}
> regionserver.HStore - Can't archive compacted file 
> hdfs://{{root-dir}}/hbase/data/default/t1/12a9e1112e0371955b3db8d3ebb2d298/cf1/73b72f5ddfce4a34a9e01afe7b83c1f9
>  because of either isCompactedAway=true or file has reference, 
> isReferencedInReads=true, refCount=1, skipping for now.
> {code}
> We should come up with core code (run as part of the discharger thread) that 
> gracefully resolves the reader lock issue by resetting ongoing scanners to 
> point to new store files instead of compacted-away store files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-05 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl resolved HBASE-23364.
---
Fix Version/s: 1.6.0
   2.3.0
   3.0.0
   Resolution: Fixed

Committed to branch-1, branch-2, and master.

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Fix For: 3.0.0, 2.3.0, 1.6.0
>
> Attachments: 23364-branch-1.txt
>
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-04 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988540#comment-16988540
 ] 

Lars Hofhansl commented on HBASE-23364:
---

Thanks for looking [~vjasani] .

I'll fix the long line on commit, and commit to the above-mentioned branches.

 

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 23364-branch-1.txt
>
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-04 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16988456#comment-16988456
 ] 

Lars Hofhansl commented on HBASE-23364:
---

Here's a simple patch. Seems to fix the problem. Please have a look.

This is a problem in branch-1, branch-2, and master, but not on any of the 
other branches.

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 23364-branch-1.txt
>
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-04 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl reassigned HBASE-23364:
-

Assignee: Lars Hofhansl

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 23364-branch-1.txt
>
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-04 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-23364:
--
Attachment: 23364-branch-1.txt

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: 23364-branch-1.txt
>
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-04 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-23364:
--
Affects Version/s: 2.3.0
   3.0.0

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0, 2.3.0, 1.6.0
>Reporter: Lars Hofhansl
>Priority: Major
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-04 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-23364:
--
Affects Version/s: 1.6.0

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Lars Hofhansl
>Priority: Major
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-03 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987585#comment-16987585
 ] 

Lars Hofhansl edited comment on HBASE-23364 at 12/4/19 6:54 AM:


So back to the theory above. I think I tracked it down to HBASE-23210.

[~apurtell], FYI.

I think it's this change:
{code:java}
 +executor = Executors.newSingleThreadExecutor({code}
That causes the problem. That executor neither has a daemon thread factory, 
nor is it shut down (as far as I can see from looking briefly).

I'll look more tomorrow and provide a fix - if I get time.
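
For reference, a minimal sketch of the kind of fix (hypothetical helper name; 
see the attached patch for the actual change): give the pool a daemon thread 
factory, and/or shut it down explicitly when the owning component stops.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DaemonExecutorSketch {

  // A single-thread executor whose worker is a daemon thread, so it can no
  // longer keep the JVM alive on shutdown.
  static ExecutorService newDaemonSingleThreadExecutor(String name) {
    return Executors.newSingleThreadExecutor(runnable -> {
      Thread thread = new Thread(runnable, name);
      thread.setDaemon(true);
      return thread;
    });
  }
}
{code}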


was (Author: lhofhansl):
So back to the theory above. I think I tracked it down to HBASE-23210.

[~apurtell], FYI.

I think it's this change:
{code:java}
 +executor = Executors.newSingleThreadExecutor({code}
That causes the problem. That executor neither has a daemon thread factory, 
nor is it shut down (as far as I can see from looking briefly).

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Priority: Major
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23210) Backport HBASE-15519 (Add per-user metrics) to branch-1

2019-12-03 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987587#comment-16987587
 ] 

Lars Hofhansl commented on HBASE-23210:
---

See HBASE-23364 ... I think this causes the region server to "hang" upon 
shutdown.

> Backport HBASE-15519 (Add per-user metrics) to branch-1
> ---
>
> Key: HBASE-23210
> URL: https://issues.apache.org/jira/browse/HBASE-23210
> Project: HBase
>  Issue Type: New Feature
>Reporter: Andrew Kyle Purtell
>Assignee: Andrew Kyle Purtell
>Priority: Major
> Fix For: 1.6.0
>
>
> We will need HBASE-15519 in branch-1 for eventual backport of HBASE-23065.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-03 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987585#comment-16987585
 ] 

Lars Hofhansl edited comment on HBASE-23364 at 12/4/19 6:50 AM:


So back to the theory above. I think I tracked it down to HBASE-23210.

[~apurtell], FYI.

I think it's this change:
{code:java}
 +executor = Executors.newSingleThreadExecutor({code}
That causes the problem. That executor neither has a daemon thread factory, 
nor is it shut down (as far as I can see from looking briefly).


was (Author: lhofhansl):
So back to the theory above. I think I tracked it down to HBASE-23210.

 

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Priority: Major
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-03 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-23364:
--
Description: 
Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
to HBase.



I noticed this only recently. Latest build from HBase's branch-1 and latest 
build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix or 
an HBase issue.

Just filing it here for later reference.

jstack shows this thread as the only non-daemon thread:
{code:java}
"pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
waiting on condition [0x7f213ad68000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00058eafece8> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}
No other information. Somebody created a thread pool somewhere and forgot to 
set the threads to daemon or is not shutting down the pool properly.

Edit: I looked for other references of the locked objects in the stack dump, 
but didn't find any.

 

 

  was:
I noticed this only recently. Latest build from HBase's branch-1 and latest 
build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix or 
an HBase issue.

Just filing it here for later reference.

jstack shows this thread as the only non-daemon thread:
{code:java}
"pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
waiting on condition [0x7f213ad68000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00058eafece8> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}
No other information. Somebody created a thread pool somewhere and forgot to 
set the threads to daemon or is not shutting down the pool properly.

Edit: I looked for other references of the locked objects in the stack dump, 
but didn't find any.

 

 


> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Priority: Major
>
> Note that I initially assumed this to be a Phoenix bug. But I tracked it down 
> to HBase.
> 
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 

[jira] [Moved] (HBASE-23364) HRegionServer sometimes does not shut down.

2019-12-03 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl moved PHOENIX-5579 to HBASE-23364:


Key: HBASE-23364  (was: PHOENIX-5579)
Project: HBase  (was: Phoenix)

> HRegionServer sometimes does not shut down.
> ---
>
> Key: HBASE-23364
> URL: https://issues.apache.org/jira/browse/HBASE-23364
> Project: HBase
>  Issue Type: Bug
>Reporter: Lars Hofhansl
>Priority: Major
>
> I noticed this only recently. Latest build from HBase's branch-1 and latest 
> build from Phoenix' 4.x-HBase-1.5. I don't know yet whether it's a Phoenix 
> or an HBase issue.
> Just filing it here for later reference.
> jstack shows this thread as the only non-daemon thread:
> {code:java}
> "pool-11-thread-1" #470 prio=5 os_prio=0 tid=0x558a709a4800 nid=0x238e 
> waiting on condition [0x7f213ad68000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00058eafece8> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> No other information. Somebody created a thread pool somewhere and forgot to 
> set the threads to daemon or is not shutting down the pool properly.
> Edit: I looked for other references of the locked objects in the stack dump, 
> but didn't find any.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-12-02 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl reassigned HBASE-22457:
-

Assignee: (was: Lars Hofhansl)

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem is that any coprocessor hook that replaces a passed scanner 
> without closing it can cause an incorrect reference count.
> This was bad and wrong before of course, but now it has pretty bad 
> consequences, since an incorrect reference count will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with the 
> reader)
> I sampled the Phoenix and also Tephra code, and found a few instances where 
> this is happening.
> And for those I filed issues: TEPHRA-300, PHOENIX-5291
> (We're not using Tephra)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues - a path where not all exceptions are caught 
> and a scanner is left open that way, perhaps. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) ).
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]
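
To illustrate the pattern (a simplified stand-in interface, not the real 
InternalScanner API): a hook that returns its own scanner should delegate to - 
and close - the scanner it was passed; silently dropping the delegate leaks the 
underlying reader's refCount.
{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

// Stand-in for InternalScanner, to keep the sketch self-contained.
interface Scanner extends Closeable {
  boolean next(List<Object> results) throws IOException;
}

class DelegatingScanner implements Scanner {
  private final Scanner delegate; // the scanner the hook was passed

  DelegatingScanner(Scanner delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean next(List<Object> results) throws IOException {
    return delegate.next(results); // a real hook would transform results here
  }

  @Override
  public void close() throws IOException {
    delegate.close(); // the crucial part: releases the reader reference
  }
}
{code}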



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23279) Switch default block encoding to ROW_INDEX_V1

2019-11-23 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980983#comment-16980983
 ] 

Lars Hofhansl commented on HBASE-23279:
---

Patch looks good generally. Are the changes for the tests (keeping NONE as the 
encoding) required to have them pass? That would be a bit scary.

Also the size difference is unexpectedly high... I guess it depends on the size 
of the keys relative to the total size of the Cells; in my tests I've seen 
about 3%. I tested with Phoenix:

{{CREATE TABLE  (pk INTEGER PRIMARY key, v1 FLOAT, v2 FLOAT, v3 
INTEGER)}}

So there would be 3 cells per "row" with 4 bytes as the row key, each with a 4 
byte value. I'd expect that to be a pretty bad case for row indexing. Are those 
heap sizes, file sizes, or bucket cache sizes?


> Switch default block encoding to ROW_INDEX_V1
> -
>
> Key: HBASE-23279
> URL: https://issues.apache.org/jira/browse/HBASE-23279
> Project: HBase
>  Issue Type: Wish
>Affects Versions: 3.0.0, 2.3.0
>Reporter: Lars Hofhansl
>Assignee: Viraj Jasani
>Priority: Minor
> Fix For: 3.0.0, 2.3.0
>
> Attachments: HBASE-23279.master.000.patch, 
> HBASE-23279.master.001.patch, HBASE-23279.master.002.patch, 
> HBASE-23279.master.003.patch
>
>
> Currently we set both block encoding and compression to NONE.
> ROW_INDEX_V1 has many advantages and (almost) no disadvantages (the hfiles 
> are slightly larger, about 3% or so). I think that would be a better default 
> than NONE.
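
For reference, opting a column family into the encoding explicitly looks 
roughly like this (HBase 2 style; a sketch):
{code:java}
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.Bytes;

public class RowIndexEncodingSketch {

  // Enable ROW_INDEX_V1 on a column family instead of the current default
  // of DataBlockEncoding.NONE.
  static ColumnFamilyDescriptor rowIndexedFamily(String family) {
    return ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes(family))
        .setDataBlockEncoding(DataBlockEncoding.ROW_INDEX_V1)
        .build();
  }
}
{code}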



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23309) Add support in ChainWalEntryFilter to filter Entry if all cells get filtered through WalCellFilter

2019-11-21 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979690#comment-16979690
 ] 

Lars Hofhansl commented on HBASE-23309:
---

Let me ask a more radical question: Is it a bug to return an empty WALEdit 
after all Cells have been removed? In other words, should we just change the 
behavior?

> Add support in ChainWalEntryFilter to filter Entry if all cells get filtered 
> through WalCellFilter
> --
>
> Key: HBASE-23309
> URL: https://issues.apache.org/jira/browse/HBASE-23309
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 1.3.6, 2.3.3
>Reporter: Sandeep Pal
>Assignee: Sandeep Pal
>Priority: Major
> Attachments: HBASE-23309.branch-1.patch, HBASE-23309.branch-2.patch, 
> HBASE-23309.patch
>
>
> ChainWalEntryFilter applies the filter on the entry followed by the filter on 
> cells. If the filter on cells removes all the cells from the entry, we should 
> add an option in ChainWalEntryFilter to filter the entry as well. 
> Here is the snippet for the ChainWalEntryFilter filter. After filterCells we 
> should check whether there are any cells remaining in the entry. 
> {code:java}
> @Override
> public Entry filter(Entry entry) {
>   for (WALEntryFilter filter : filters) {
>     if (entry == null) {
>       return null;
>     }
>     entry = filter.filter(entry);
>   }
>   filterCells(entry);
>   return entry;
> }{code}
> Custom replication endpoints may use this flag.
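
For illustration, the check could look roughly like this on top of the snippet 
above ({{filterEmptyEntry}} is a hypothetical option name, not part of the 
current API):
{code:java}
@Override
public Entry filter(Entry entry) {
  for (WALEntryFilter filter : filters) {
    if (entry == null) {
      return null;
    }
    entry = filter.filter(entry);
  }
  filterCells(entry);
  // Hypothetical option: drop the entry entirely once the cell filter has
  // removed every cell, instead of replicating an empty WALEdit.
  if (filterEmptyEntry && entry != null && entry.getEdit().isEmpty()) {
    return null;
  }
  return entry;
}
{code}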



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23279) Switch default block encoding to ROW_INDEX_V1

2019-11-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974596#comment-16974596
 ] 

Lars Hofhansl edited comment on HBASE-23279 at 11/15/19 2:44 AM:
-

Thanks all.

So based on the discussion here ... Is it already a problem if I went and 
changed the block encoding to anything other than NONE? Most block encodings 
(like FAST_DIFF, etc) will decrease the size, but even there, there are 
abnormal cases where the size might be increased.

Or in other words any encoding (or compression) will cause the actual size of a 
block to not be a constant.

Other cases are large key values. The block is extended at the end to hold the 
last key value, right?

NM: Read the above again. I think we do not have to change the formula. How 
much bigger the index-encoded file is depends on the type of the data.


was (Author: lhofhansl):
Thanks all.

So based on the discussion here ... Is it already a problem if I went and 
changed the block encoding to anything other than NONE? Most block encodings 
(like FAST_DIFF, etc) will decrease the size, but even there, there are 
abnormal cases where the size might be increased.

Or in other words any encoding (or compression) will cause the actual size of a 
block to not be a constant.

Other cases are large key values. The block is extended at the end to hold the 
last key value, right?

> Switch default block encoding to ROW_INDEX_V1
> -
>
> Key: HBASE-23279
> URL: https://issues.apache.org/jira/browse/HBASE-23279
> Project: HBase
>  Issue Type: Wish
>Reporter: Lars Hofhansl
>Assignee: Viraj Jasani
>Priority: Minor
>
> Currently we set both block encoding and compression to NONE.
> ROW_INDEX_V1 has many advantages and (almost) no disadvantages (the hfiles 
> are slightly larger, about 3% or so). I think that would be a better default 
> than NONE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23279) Switch default block encoding to ROW_INDEX_V1

2019-11-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974596#comment-16974596
 ] 

Lars Hofhansl edited comment on HBASE-23279 at 11/14/19 8:56 PM:
-

Thanks all.

So based on the discussion here ... Is it already a problem if I went and 
changed the block encoding to anything other than NONE? Most block encodings 
(like FAST_DIFF, etc) will decrease the size, but even there, there are 
abnormal cases where the size might be increased.

Or in other words any encoding (or compression) will cause the actual size of a 
block to not be a constant.

Other cases are large key values. The block is extended at the end to hold the 
last key value, right?


was (Author: lhofhansl):
Thanks all.

So based on the discussion here ... Is it already a problem if I went and 
changed the block encoding to anything other than NONE? Most block encodings 
(like FAST_DIFF, etc) will decrease the size, but even there, there are 
abnormal cases where the size might be increased.

> Switch default block encoding to ROW_INDEX_V1
> -
>
> Key: HBASE-23279
> URL: https://issues.apache.org/jira/browse/HBASE-23279
> Project: HBase
>  Issue Type: Wish
>Reporter: Lars Hofhansl
>Assignee: Viraj Jasani
>Priority: Minor
>
> Currently we set both block encoding and compression to NONE.
> ROW_INDEX_V1 has many advantages and (almost) no disadvantages (the hfiles 
> are slightly larger, about 3% or so). I think that would be a better default 
> than NONE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23279) Switch default block encoding to ROW_INDEX_V1

2019-11-14 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974596#comment-16974596
 ] 

Lars Hofhansl commented on HBASE-23279:
---

Thanks all.

So based on the discussion here ... Is it already a problem if I went and 
changed the block encoding to anything other than NONE? Most block encodings 
(like FAST_DIFF, etc) will decrease the size, but even there, there are 
abnormal cases where the size might be increased.

> Switch default block encoding to ROW_INDEX_V1
> -
>
> Key: HBASE-23279
> URL: https://issues.apache.org/jira/browse/HBASE-23279
> Project: HBase
>  Issue Type: Wish
>Reporter: Lars Hofhansl
>Assignee: Viraj Jasani
>Priority: Minor
>
> Currently we set both block encoding and compression to NONE.
> ROW_INDEX_V1 has many advantages and (almost) no disadvantages (the hfiles 
> are slightly larger, about 3% or so). I think that would be a better default 
> than NONE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23279) Switch default block encoding to ROW_INDEX_V1

2019-11-11 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-23279:
--
Issue Type: Wish  (was: Improvement)

> Switch default block encoding to ROW_INDEX_V1
> -
>
> Key: HBASE-23279
> URL: https://issues.apache.org/jira/browse/HBASE-23279
> Project: HBase
>  Issue Type: Wish
>Reporter: Lars Hofhansl
>Priority: Minor
>
> Currently we set both block encoding and compression to NONE.
> ROW_INDEX_V1 has many advantages and (almost) no disadvantages (the hfiles 
> are slightly larger, about 3% or so). I think that would be a better default 
> than NONE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23279) Switch default block encoding to ROW_INDEX_V1

2019-11-11 Thread Lars Hofhansl (Jira)
Lars Hofhansl created HBASE-23279:
-

 Summary: Switch default block encoding to ROW_INDEX_V1
 Key: HBASE-23279
 URL: https://issues.apache.org/jira/browse/HBASE-23279
 Project: HBase
  Issue Type: Improvement
Reporter: Lars Hofhansl


Currently we set both block encoding and compression to NONE.

ROW_INDEX_V1 has many advantages and (almost) no disadvantages (the hfiles are 
slightly larger, about 3% or so). I think that would be a better default than 
NONE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23240) branch-1 master and regionservers do not start when compiled against Hadoop 3.2.1

2019-11-09 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970942#comment-16970942
 ] 

Lars Hofhansl edited comment on HBASE-23240 at 11/9/19 8:40 PM:


Yeah I had a brief look but ran out of time for this.

I guess Hadoop is not as strict as we are w.r.t. backwards compatibility in 
minor and patch releases.


was (Author: lhofhansl):
Yeah I had a brief look but ran out of time for this.

> branch-1 master and regionservers do not start when compiled against Hadoop 
> 3.2.1
> -
>
> Key: HBASE-23240
> URL: https://issues.apache.org/jira/browse/HBASE-23240
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Lars Hofhansl
>Priority: Major
> Fix For: 1.6.0, 1.5.1
>
>
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
>  at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:572)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>  at 
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23240) branch-1 master and regionservers do not start when compiled against Hadoop 3.2.1

2019-11-09 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970942#comment-16970942
 ] 

Lars Hofhansl commented on HBASE-23240:
---

Yeah I had a brief look but ran out of time for this.

> branch-1 master and regionservers do not start when compiled against Hadoop 
> 3.2.1
> -
>
> Key: HBASE-23240
> URL: https://issues.apache.org/jira/browse/HBASE-23240
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Lars Hofhansl
>Priority: Major
> Fix For: 1.6.0, 1.5.1
>
>
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
>  at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:572)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>  at 
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23240) branch-1 master and regionservers do not start when compiled against Hadoop 3.2.1

2019-10-31 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-23240:
--
Affects Version/s: 1.5.0

> branch-1 master and regionservers do not start when compiled against Hadoop 
> 3.2.1
> -
>
> Key: HBASE-23240
> URL: https://issues.apache.org/jira/browse/HBASE-23240
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Lars Hofhansl
>Priority: Major
>
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
>  at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:572)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>  at 
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HBASE-23240) branch-1 master and regionservers do not start when compiled against Hadoop 3.2.1

2019-10-31 Thread Lars Hofhansl (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-23240:
--
Fix Version/s: 1.5.1
   1.6.0

> branch-1 master and regionservers do not start when compiled against Hadoop 
> 3.2.1
> -
>
> Key: HBASE-23240
> URL: https://issues.apache.org/jira/browse/HBASE-23240
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Lars Hofhansl
>Priority: Major
> Fix For: 1.6.0, 1.5.1
>
>
> Exception in thread "main" java.lang.NoSuchMethodError: 
> com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
>  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
>  at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:572)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
>  at 
> org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>  at 
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HBASE-23240) branch-1 master and regionservers do not start when compiled against Hadoop 3.2.1

2019-10-31 Thread Lars Hofhansl (Jira)
Lars Hofhansl created HBASE-23240:
-

 Summary: branch-1 master and regionservers do not start when 
compiled against Hadoop 3.2.1
 Key: HBASE-23240
 URL: https://issues.apache.org/jira/browse/HBASE-23240
 Project: HBase
  Issue Type: Improvement
Reporter: Lars Hofhansl


Exception in thread "main" java.lang.NoSuchMethodError: 
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
 at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
 at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
 at 
org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
 at 
org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:572)
 at 
org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
 at 
org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 at 
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:127)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-21856) Consider Causal Replication Ordering

2019-10-10 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948976#comment-16948976
 ] 

Lars Hofhansl commented on HBASE-21856:
---

[~bharathv] See description (I mention "Serial Replication" there) :) ... The 
thought is that serial replication (i.e. a global ordering) is very expensive 
and not always needed.
My claim is that for most use cases the ordering proposed here is sufficient. 
In the end we should have a discussion about what the specific problem is that 
we want to solve.

> Consider Causal Replication Ordering
> 
>
> Key: HBASE-21856
> URL: https://issues.apache.org/jira/browse/HBASE-21856
> Project: HBase
>  Issue Type: Brainstorming
>  Components: Replication
>Reporter: Lars Hofhansl
>Priority: Major
>  Labels: Replication
>
> We've had various efforts to improve the ordering guarantees for HBase 
> replication, most notably Serial Replication.
> I think in many cases guaranteeing a Total Replication Order is not required, 
> but a simpler Causal Replication Order is sufficient.
> Specifically we would guarantee causal ordering for a single Rowkey. Any 
> changes to a Row - Puts, Deletes, etc - would be replicated in the exact 
> order in which they occurred in the source system.
> Unlike total ordering this can be accomplished with only local region server 
> control.
> I don't have a full design in mind - let's discuss here. It should be 
> sufficient to do the following:
> # RegionServers only adopt the replication queues from other RegionServers 
> for regions they (now) own. This requires log splitting for replication.
> # RegionServers ship all edits for queues adopted from other servers before 
> any of their "own" edits are shipped.
> It's probably a bit more involved, but should be much cheaper than the total 
> ordering provided by serial replication.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23015) branch-1 hbase-server, testing util, and shaded testing util need jackson

2019-09-20 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934529#comment-16934529
 ] 

Lars Hofhansl commented on HBASE-23015:
---

> If you're up for RM'ing a 1.5 release in parallel once this blocker closes 
> that'd be wonderful.

I am. But I'll also check in with [~apurtell], who had volunteered before, 
when he's back from vacation.

> branch-1 hbase-server, testing util, and  shaded testing util need jackson
> --
>
> Key: HBASE-23015
> URL: https://issues.apache.org/jira/browse/HBASE-23015
> Project: HBase
>  Issue Type: Bug
>  Components: Client, shading
>Affects Versions: 1.5.0, 1.3.6, 1.4.11
>Reporter: Sean Busbey
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-23015.branch-1.3.000.patch, 
> HBASE-23015.branch-1.3.001.patch
>
>
> HBASE-22728 moved out jackson transitive dependencies. Mostly good, but 
> moving jackson2 to provided in hbase-server broke a few things:
> testing-util needs a transitive jackson 2 in order to start the minicluster, 
> currently fails with CNFE for {{com.fasterxml.jackson.databind.ObjectMapper}} 
> when trying to initialize the master.
> shaded-testing-util needs a relocated jackson 2 for the same reason.
> It's not used for any of the mapreduce stuff in hbase-server, so 
> {{hbase-shaded-server}} for that purpose should be fine. But it is used by 
> {{WALPrettyPrinter}} and some folks might expect that to work from that 
> artifact since it is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-23015) branch-1 hbase-server, testing util, and shaded testing util need jackson

2019-09-19 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934004#comment-16934004
 ] 

Lars Hofhansl edited comment on HBASE-23015 at 9/20/19 3:22 AM:


I meant apply *locally* above. In any case it does not apply cleanly in my 
setup, so it'd be a chunk of work anyway.

And I totally agree that non-released branches are at a downstream project's 
own risk.

I will say that we have been dragging our feet on 1.5, trying to make it 
perfect instead of just releasing (as long as there are *no new* problems) and 
fixing outstanding issues in the next release (i.e. 1.5.1), which should only 
be a month away.

 


was (Author: lhofhansl):
I meant apply *locally* above. In any case it does not apply cleanly in my 
setup, so it'd be a chunk of work anyway.

> branch-1 hbase-server, testing util, and shaded testing util need jackson
> --
>
> Key: HBASE-23015
> URL: https://issues.apache.org/jira/browse/HBASE-23015
> Project: HBase
>  Issue Type: Bug
>  Components: Client, shading
>Affects Versions: 1.5.0, 1.3.6, 1.4.11
>Reporter: Sean Busbey
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-23015.branch-1.3.000.patch
>
>
> HBASE-22728 moved out jackson transitive dependencies. Mostly good, but 
> moving jackson2 to provided scope in hbase-server broke a few things:
> testing-util needs a transitive jackson 2 in order to start the 
> minicluster; it currently fails with a CNFE for 
> {{com.fasterxml.jackson.databind.ObjectMapper}} when trying to initialize 
> the master.
> shaded-testing-util needs a relocated jackson 2 for the same reason.
> It's not used for any of the mapreduce stuff in hbase-server, so 
> {{hbase-shaded-server}} for that purpose should be fine. But it is used by 
> {{WALPrettyPrinter}}, and some folks might expect that to work from that 
> artifact since it is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23015) branch-1 hbase-server, testing util, and shaded testing util need jackson

2019-09-19 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934004#comment-16934004
 ] 

Lars Hofhansl commented on HBASE-23015:
---

I meant apply *locally* above. In any case it does not apply cleanly in my 
setup, so it'd be a chunk of work anyway.

> branch-1 hbase-server, testing util, and shaded testing util need jackson
> --
>
> Key: HBASE-23015
> URL: https://issues.apache.org/jira/browse/HBASE-23015
> Project: HBase
>  Issue Type: Bug
>  Components: Client, shading
>Affects Versions: 1.5.0, 1.3.6, 1.4.11
>Reporter: Sean Busbey
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-23015.branch-1.3.000.patch
>
>
> HBASE-22728 moved out jackson transitive dependencies. Mostly good, but 
> moving jackson2 to provided scope in hbase-server broke a few things:
> testing-util needs a transitive jackson 2 in order to start the 
> minicluster; it currently fails with a CNFE for 
> {{com.fasterxml.jackson.databind.ObjectMapper}} when trying to initialize 
> the master.
> shaded-testing-util needs a relocated jackson 2 for the same reason.
> It's not used for any of the mapreduce stuff in hbase-server, so 
> {{hbase-shaded-server}} for that purpose should be fine. But it is used by 
> {{WALPrettyPrinter}}, and some folks might expect that to work from that 
> artifact since it is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-23015) branch-1 hbase-server, testing util, and shaded testing util need jackson

2019-09-19 Thread Lars Hofhansl (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933851#comment-16933851
 ] 

Lars Hofhansl commented on HBASE-23015:
---

Lemme apply this and see if it fixes the problem I've been seeing.

> branch-1 hbase-server, testing util, and shaded testing util need jackson
> --
>
> Key: HBASE-23015
> URL: https://issues.apache.org/jira/browse/HBASE-23015
> Project: HBase
>  Issue Type: Bug
>  Components: Client, shading
>Affects Versions: 1.5.0, 1.3.6, 1.4.11
>Reporter: Sean Busbey
>Assignee: Viraj Jasani
>Priority: Blocker
> Fix For: 1.5.0, 1.3.6, 1.4.11
>
> Attachments: HBASE-23015.branch-1.3.000.patch
>
>
> HBASE-22728 moved out jackson transitive dependencies. Mostly good, but 
> moving jackson2 to provided scope in hbase-server broke a few things:
> testing-util needs a transitive jackson 2 in order to start the 
> minicluster; it currently fails with a CNFE for 
> {{com.fasterxml.jackson.databind.ObjectMapper}} when trying to initialize 
> the master.
> shaded-testing-util needs a relocated jackson 2 for the same reason.
> It's not used for any of the mapreduce stuff in hbase-server, so 
> {{hbase-shaded-server}} for that purpose should be fine. But it is used by 
> {{WALPrettyPrinter}}, and some folks might expect that to work from that 
> artifact since it is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HBASE-21158) Empty qualifier cell should not be returned if it does not match QualifierFilter

2019-05-28 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849352#comment-16849352
 ] 

Lars Hofhansl edited comment on HBASE-21158 at 5/28/19 6:16 AM:


This causes *very* subtle changes, and actually breaks Phoenix secondary 
indexing.

Furthermore, the behavior differs between
 * branch-1.3 (check removed)
 * branch-1.4 (check still present)
 * branch-1 (check removed; took me 3h to track this down)
 * branch-2 (check still present) and
 * master (check removed)

This is bad. And it is bad that we managed to leave this in different states 
in different HBase branches.

What happened here?

[~apurtell], we should check whether we have this in our HBase. If so, it can 
cause subtle index-out-of-sync problems (see the linked Phoenix jira).

Phoenix in this case relies on a family delete marker (which does not have a 
qualifier) flowing through this filter along with all the other K/Vs it might 
affect (but limited to a known set of qualifiers).
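For illustration, here is the kind of filter involved, as a minimal sketch 
against the stock client API (the wrapper class and method are made up; the 
QualifierFilter construction itself is the standard API):

{code:java}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.util.Bytes;

class QualifierFilterSketch {
  // Scan for one known qualifier. On branches where the empty-qualifier
  // check was removed, cells with an empty qualifier (e.g. family delete
  // markers) still flow through this filter - which Phoenix's index
  // maintenance relies on. On branches with the check present they are
  // filtered out, hence the subtle divergence described above.
  static Scan scanForQualifier(String qualifier) {
    Scan scan = new Scan();
    scan.setFilter(new QualifierFilter(CompareFilter.CompareOp.EQUAL,
        new BinaryComparator(Bytes.toBytes(qualifier))));
    return scan;
  }
}
{code}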


was (Author: lhofhansl):
This causes *very* subtle changes, and actually breaks Phoenix secondary 
indexing.

Furthermore, the behavior differs between
 * branch-1.3 (check removed)
 * branch-1.4 (check still present)
 * branch-1 (check removed; took me 3h to track this down)
 * branch-2 (check still present) and
 * master (check removed)

This is bad. And it is bad that we managed to leave this in different states 
in different HBase branches.

What happened here?

[~apurtell],

> Empty qualifier cell should not be returned if it does not match 
> QualifierFilter
> 
>
> Key: HBASE-21158
> URL: https://issues.apache.org/jira/browse/HBASE-21158
> Project: HBase
>  Issue Type: Bug
>  Components: Filters
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Guangxu Cheng
>Assignee: Guangxu Cheng
>Priority: Critical
> Fix For: 3.0.0, 1.3.3, 1.2.8, 2.2.0, 1.4.8, 2.1.1, 2.0.3
>
> Attachments: HBASE-21158.branch-1.001.patch, 
> HBASE-21158.master.001.patch, HBASE-21158.master.002.patch, 
> HBASE-21158.master.003.patch, HBASE-21158.master.004.patch
>
>
> {code}
> hbase(main):002:0> put 'testTable','testrow','f:testcol1','testvalue1'
> 0 row(s) in 0.0040 seconds
> hbase(main):003:0> put 'testTable','testrow','f:','testvalue2'
> 0 row(s) in 0.0070 seconds
> # get row with empty column f:, result is correct.
> hbase(main):004:0> scan 'testTable',{FILTER => "QualifierFilter (=, 'binary:')"}
> ROW        COLUMN+CELL
>  testrow   column=f:, timestamp=1536218563581, value=testvalue2
> 1 row(s) in 0.0460 seconds
> # get row with column f:testcol1, result is incorrect.
> hbase(main):005:0> scan 'testTable',{FILTER => "QualifierFilter (=, 'binary:testcol1')"}
> ROW        COLUMN+CELL
>  testrow   column=f:, timestamp=1536218563581, value=testvalue2
>  testrow   column=f:testcol1, timestamp=1536218550827, value=testvalue1
> 1 row(s) in 0.0070 seconds
> {code}
> As the operations above show, when a row contains an empty-qualifier 
> column, the empty-qualifier cell is always returned when using 
> QualifierFilter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21158) Empty qualifier cell should not be returned if it does not match QualifierFilter

2019-05-27 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849352#comment-16849352
 ] 

Lars Hofhansl commented on HBASE-21158:
---

This causes *very* subtle changes, and actually breaks Phoenix secondary 
indexing.

Furthermore, the behavior differs between
 * branch-1.3 (check removed)
 * branch-1.4 (check still present)
 * branch-1 (check removed; took me 3h to track this down)
 * branch-2 (check still present) and
 * master (check removed)

This is bad. And it is bad that we managed to leave this in different states 
in different HBase branches.

What happened here?

[~apurtell],

> Empty qualifier cell should not be returned if it does not match 
> QualifierFilter
> 
>
> Key: HBASE-21158
> URL: https://issues.apache.org/jira/browse/HBASE-21158
> Project: HBase
>  Issue Type: Bug
>  Components: Filters
>Affects Versions: 3.0.0, 2.2.0
>Reporter: Guangxu Cheng
>Assignee: Guangxu Cheng
>Priority: Critical
> Fix For: 3.0.0, 1.3.3, 1.2.8, 2.2.0, 1.4.8, 2.1.1, 2.0.3
>
> Attachments: HBASE-21158.branch-1.001.patch, 
> HBASE-21158.master.001.patch, HBASE-21158.master.002.patch, 
> HBASE-21158.master.003.patch, HBASE-21158.master.004.patch
>
>
> {code}
> hbase(main):002:0> put 'testTable','testrow','f:testcol1','testvalue1'
> 0 row(s) in 0.0040 seconds
> hbase(main):003:0> put 'testTable','testrow','f:','testvalue2'
> 0 row(s) in 0.0070 seconds
> # get row with empty column f:, result is correct.
> hbase(main):004:0> scan 'testTable',{FILTER => "QualifierFilter (=, 'binary:')"}
> ROW        COLUMN+CELL
>  testrow   column=f:, timestamp=1536218563581, value=testvalue2
> 1 row(s) in 0.0460 seconds
> # get row with column f:testcol1, result is incorrect.
> hbase(main):005:0> scan 'testTable',{FILTER => "QualifierFilter (=, 'binary:testcol1')"}
> ROW        COLUMN+CELL
>  testrow   column=f:, timestamp=1536218563581, value=testvalue2
>  testrow   column=f:testcol1, timestamp=1536218550827, value=testvalue1
> 1 row(s) in 0.0070 seconds
> {code}
> As the operations above show, when a row contains an empty-qualifier 
> column, the empty-qualifier cell is always returned when using 
> QualifierFilter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846258#comment-16846258
 ] 

Lars Hofhansl edited comment on HBASE-22457 at 5/22/19 10:17 PM:
-

> No, but this is no more or less fast than the close-then-open we do for 
> 'alter' processing. It would be implemented the same way, ideally. 

We came to that conclusion as well in a discussion in the office: just alter 
some minor thing on the Table/ColumnDescriptor so that all regions are 
closed/reopened, which prompted "Wouldn't it be nice if we had a tool that 
could do that without forcing us to change something." :)

> Scanner wrapping is a key thing. Without it I don't think Phoenix works. 

Oh, totally agree. Though perhaps it is structurally possible to ensure that 
the passed scanner is either wrapped or closed. I can't think of anything, 
though.
(I.e. we can check after the hook invocation whether the returned scanner is 
different from the passed one... but of course we cannot tell whether it 
wrapped the passed scanner and will eventually close it.)
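For reference, a hedged sketch of the "safe" shape for a replacement scanner 
(ClosingDelegateScanner is a made-up name, and InternalScanner's exact method 
set differs slightly between branches); the essential part is that close() 
reaches the wrapped delegate so the reader reference count is released:

{code:java}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.ScannerContext;

public class ClosingDelegateScanner implements InternalScanner {
  private final InternalScanner delegate;

  public ClosingDelegateScanner(InternalScanner delegate) {
    this.delegate = delegate;
  }

  @Override
  public boolean next(List<Cell> result) throws IOException {
    return delegate.next(result); // transform or filter cells here if needed
  }

  @Override
  public boolean next(List<Cell> result, ScannerContext context) throws IOException {
    return delegate.next(result, context);
  }

  @Override
  public void close() throws IOException {
    delegate.close(); // crucial: releases the underlying HFile reader reference
  }
}
{code}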


was (Author: lhofhansl):
> No, but this is no more or less fast than the close-then-open we do for 
> 'alter' processing. It would be implemented the same way, ideally. 

We came to that conclusion as well in a discussion in the office: just alter 
some minor thing on the Table/ColumnDescriptor so that all regions are 
closed/reopened, which prompted "Wouldn't it be nice if we had a tool that 
could do that without forcing us to change something." :)

> Scanner wrapping is a key thing. Without it I don't think Phoenix works. 

Oh, totally agree. Though perhaps it is structurally possible to ensure that 
the passed scanner is either wrapped or closed. I can't think of anything, 
though.

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem is that any coprocessor hook that replaces a passed scanner 
> without closing it can cause an incorrect reference count.
> This was bad and wrong before, of course, but now it has pretty bad 
> consequences, since an incorrect reference count will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with 
> the reader)
> I sampled the Phoenix and also the Tephra code, and found a few instances 
> where this is happening.
> For those I filed issues: TEPHRA-300, PHOENIX-5291.
> (We're not using Tephra.)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, e.g. a path where not all exceptions are 
> caught and a scanner is left open that way. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) )
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]
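To make the failure mode concrete, a hedged sketch of the suspect pattern 
against the branch-1 API (LeakyObserver and buildCustomScanner are made-up 
names standing in for real coprocessor logic):

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.regionserver.ScanType;
import org.apache.hadoop.hbase.regionserver.Store;

public class LeakyObserver extends BaseRegionObserver {
  @Override
  public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> c,
      Store store, InternalScanner scanner, ScanType scanType) throws IOException {
    // BUG: returns a fresh scanner while silently dropping 'scanner';
    // the underlying reader's refCount is never decremented, so the
    // HFile can never be archived.
    return buildCustomScanner(store);
  }

  private InternalScanner buildCustomScanner(Store store) {
    throw new UnsupportedOperationException("illustration only");
  }
}
{code}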



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846258#comment-16846258
 ] 

Lars Hofhansl commented on HBASE-22457:
---

> No, but this is no more or less fast than the close-then-open we do for 
> 'alter' processing. It would be implemented the same way, ideally. 

We came to that conclusion as well in a discussion in the office: just alter 
some minor thing on the Table/ColumnDescriptor so that all regions are 
closed/reopened, which prompted "Wouldn't it be nice if we had a tool that 
could do that without forcing us to change something." :)

> Scanner wrapping is a key thing. Without it I don't think Phoenix works. 

Oh, totally agree. Though perhaps it is structurally possible to ensure that 
the passed scanner is either wrapped or closed. I can't think of anything, 
though.

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Assignee: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem is that any coprocessor hook that replaces a passed scanner 
> without closing it can cause an incorrect reference count.
> This was bad and wrong before, of course, but now it has pretty bad 
> consequences, since an incorrect reference count will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with 
> the reader)
> I sampled the Phoenix and also the Tephra code, and found a few instances 
> where this is happening.
> For those I filed issues: TEPHRA-300, PHOENIX-5291.
> (We're not using Tephra.)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, e.g. a path where not all exceptions are 
> caught and a scanner is left open that way. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) )
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Hofhansl updated HBASE-22457:
--
Attachment: 22457-random-1.5.txt

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem is that any coprocessor hook that replaces a passed scanner 
> without closing it can cause an incorrect reference count.
> This was bad and wrong before, of course, but now it has pretty bad 
> consequences, since an incorrect reference count will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with 
> the reader)
> I sampled the Phoenix and also the Tephra code, and found a few instances 
> where this is happening.
> For those I filed issues: TEPHRA-300, PHOENIX-5291.
> (We're not using Tephra.)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, e.g. a path where not all exceptions are 
> caught and a scanner is left open that way. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) )
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22457) Harden the HBase HFile reader reference counting

2019-05-22 Thread Lars Hofhansl (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846172#comment-16846172
 ] 

Lars Hofhansl commented on HBASE-22457:
---

Some random things that didn't quite look right.

> Harden the HBase HFile reader reference counting
> 
>
> Key: HBASE-22457
> URL: https://issues.apache.org/jira/browse/HBASE-22457
> Project: HBase
>  Issue Type: Brainstorming
>Reporter: Lars Hofhansl
>Priority: Major
> Attachments: 22457-random-1.5.txt
>
>
> The problem is that any coprocessor hook that replaces a passed scanner 
> without closing it can cause an incorrect reference count.
> This was bad and wrong before, of course, but now it has pretty bad 
> consequences, since an incorrect reference count will prevent HFiles from 
> being archived indefinitely.
> All hooks that are passed a scanner and return a scanner are suspect, since 
> the returned scanner may or may not close the passed scanner:
> * preCompact
> * preCompactScannerOpen
> * preFlush
> * preFlushScannerOpen
> * preScannerOpen
> * preStoreScannerOpen
> * preStoreFileReaderOpen...? (not sure about this one, it could mess with 
> the reader)
> I sampled the Phoenix and also the Tephra code, and found a few instances 
> where this is happening.
> For those I filed issues: TEPHRA-300, PHOENIX-5291.
> (We're not using Tephra.)
> The Phoenix ones should be rare. In our case we are seeing readers with 
> refCount > 1000.
> Perhaps there are other issues, e.g. a path where not all exceptions are 
> caught and a scanner is left open that way. (Generally I am not a fan of 
> reference counting in complex systems - it's too easy to miss something. But 
> that's a different discussion. :) )
> Let's brainstorm some way in which we can harden this.
> [~ram_krish], [~anoop.hbase], [~apurtell]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   3   4   5   6   7   8   9   10   >