[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering with SASL implementation

2022-07-06 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563499#comment-17563499 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 7/7/22 1:09 AM:
-

bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. I know 
it is not the default but still represents an important fallback option. We 
should be accommodating to users here. Deprecation can be done now, that seems 
ok. Removal can be done in 3.0.  So it should be fixed first, removed later. 
Can land SimpleRpcServer specific things on HBASE-27097 after this issue is 
done.
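
For illustration, a minimal sketch of how such a fallback could be selected through configuration. It assumes the hbase.rpc.server.impl key read by RpcServerFactory and the fully qualified SimpleRpcServer class name; verify both against the HBase release in use before relying on them.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RpcServerFallbackSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Override the default NettyRpcServer with the SimpleRpcServer fallback.
    // The key and class name are assumptions; check RpcServerFactory in the
    // deployed HBase version before relying on them.
    conf.set("hbase.rpc.server.impl",
        "org.apache.hadoop.hbase.ipc.SimpleRpcServer");
    System.out.println("rpc server impl = " + conf.get("hbase.rpc.server.impl"));
  }
}
{code}

In a real deployment this would normally be set in hbase-site.xml on the servers rather than programmatically.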


was (Author: apurtell):
bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. I know 
it is not the default but still represents an important fallback option. We 
should be accommodating to users here. Deprecation can be done now, that seems 
ok. Removal can be done in 3.0.  -So it should be fixed first, removed later. 
Can land SimpleRpcServer specific things on HBASE-27097 after this issue is 
done.-

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering with SASL implementation
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> 
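
The access records quoted above come from Netty's ResourceLeakDetector. As a rough illustration only, assuming plain (non-shaded) Netty rather than the org.apache.hbase.thirdparty relocation, the sketch below shows the sampling level that produces such reports and the release() discipline that avoids them.

{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.ResourceLeakDetector;

public class LeakDetectionSketch {
  public static void main(String[] args) {
    // Track every allocation (equivalent to -Dio.netty.leakDetection.level=paranoid),
    // so a ByteBuf that is garbage-collected without release() is reported.
    ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

    ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
    try {
      buf.writeBytes(new byte[128]); // stand-in for buffered request bytes
    } finally {
      // Dropping the reference without this release() is the kind of mistake
      // the "leak detected" records above point at.
      buf.release();
    }
  }
}
{code}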

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering with SASL implementation

2022-07-06 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563499#comment-17563499 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 7/7/22 1:08 AM:
-

bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. I know 
it is not the default but still represents an important fallback option. We 
should be accommodating to users here. Deprecation can be done now, that seems 
ok. Removal can be done in 3.0.  -So it should be fixed first, removed later. 
Can land SimpleRpcServer specific things on HBASE-27097 after this issue is 
done.-


was (Author: apurtell):
bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. I know 
it is not the default but still represents an important fallback option. We 
should be accommodating to users here. Deprecation can be done now, that seems 
ok. Removal can be done in 3.0.  So it should be fixed first, removed later. 
Can land SimpleRpcServer specific things on HBASE-27097 after this issue is 
done.

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering with SASL implementation
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering with SASL implementation

2022-07-06 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563499#comment-17563499 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 7/6/22 11:53 PM:
--

bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. I know 
it is not the default but still represents an important fallback option. We 
should be accommodating to users here. Deprecation can be done now, that seems 
ok. Removal can be done in 3.0.  So it should be fixed first, removed later. 
Can land SimpleRpcServer specific things on HBASE-27097 after this issue is 
done.


was (Author: apurtell):
bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. I know 
it is not the default but still represents an important fallback option. We 
should be accommodating to users here. Deprecation can be done now, that seems 
ok. Removal can be done in 3.0. 

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering with SASL implementation
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering with SASL implementation

2022-07-06 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563499#comment-17563499 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 7/6/22 11:43 PM:
--

bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. I know 
it is not the default but still represents an important fallback option. We 
should be accommodating to users here. Deprecation can be done now, that seems 
ok. Removal can be done in 3.0. 


was (Author: apurtell):
bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. 
Deprecation can be done now, that seems ok. Removal can be done in 3.0. 

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering with SASL implementation
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering with SASL implementation

2022-07-06 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563499#comment-17563499 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 7/6/22 11:41 PM:
--

bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.

SimpleRpcServer is currently used as a fallback by Cloudera customers (and I 
presume others) with 2.2 when the Netty implementation has issues. I would also 
want it as a fallback option for our production. Anyway this is the kind of 
major operational change which should have a deprecation before removal. 
Deprecation can be done now, that seems ok. Removal can be done in 3.0. 


was (Author: apurtell):
bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.


This will not be possible in 2.x. We need a fix. SimpleRpcServer is currently 
used as a fallback by Cloudera customers (and I presume others) with 2.2 when 
the Netty implementation has issues. I would also want it as a fallback option 
for our production. Anyway this is the kind of major operational change which 
should have a deprecation before removal. Deprecation can be done now, that 
seems ok. Removal can be done in 3.0. 

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering with SASL implementation
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering with SASL implementation

2022-07-06 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17563499#comment-17563499 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 7/6/22 11:41 PM:
--

bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.


This will not be possible in 2.x. We need a fix. SimpleRpcServer is currently 
used as a fallback by Cloudera customers (and I presume others) with 2.2 when 
the Netty implementation has issues. I would also want it as a fallback option 
for our production. Anyway this is the kind of major operational change which 
should have a deprecation before removal. Deprecation can be done now, that 
seems ok. Removal can be done in 3.0. 


was (Author: apurtell):
bq. In general, I prefer we just remove the SimpleRpcServer implementation and 
rewrite the decode and encode part with netty, to make the code more clear.


This will not be possible in 2.x. We need a fix. SimpleRpcServer is currently 
used as a fallback by Cloudera customers (and I presume others) with 2.2 when 
the Netty implementation has issues, and anyway this is the kind of major 
operational change which should have a deprecation before removal. Deprecation 
can be done now, that seems ok. Removal can be done in 3.0. 

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering with SASL implementation
> 
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Assignee: Duo Zhang
>Priority: Blocker
> Fix For: 2.5.0, 3.0.0-alpha-4, 2.4.14
>
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:15 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.
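
For concreteness, a small sketch of the server-side bounds described above. hbase.regionserver.handler.count is the standard handler pool knob; the two call queue keys are written from memory and should be treated as assumptions to verify against the release in use.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RpcQueueLimitsSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Handler pool bound (the "default of 30" mentioned above).
    conf.setInt("hbase.regionserver.handler.count", 30);
    // Assumed keys: cap on queued call count and on total queued call data size.
    conf.setInt("hbase.ipc.server.max.callqueue.length", 300);
    conf.setLong("hbase.ipc.server.max.callqueue.size", 1024L * 1024 * 1024);
    System.out.println("handlers = " + conf.get("hbase.regionserver.handler.count"));
  }
}
{code}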

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10), but 
now netty may be able to queue up a lot more, in comparison, because netty has 
been designed for concurrency. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_in_flight_max x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks probably should not be INT_MAX, but that may matter less.
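
A minimal sketch, using plain (non-shaded) Netty, of what bounding those two knobs could look like. The availableProcessors() sizing is the suggestion above, not current HBase behavior, and the property name is Netty's io.netty.eventLoop.maxPendingTasks.

{code:java}
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;

public class BoundedEventLoopSketch {
  public static void main(String[] args) throws Exception {
    // Cap pending tasks per event loop before any event loop is created;
    // Netty reads this property and otherwise defaults it to Integer.MAX_VALUE.
    System.setProperty("io.netty.eventLoop.maxPendingTasks", "65536");

    // One event loop thread per core instead of the config default of 0 discussed above.
    int threads = Runtime.getRuntime().availableProcessors();
    EventLoopGroup group = new NioEventLoopGroup(threads);
    System.out.println("event loop threads = " + threads);
    group.shutdownGracefully().sync();
  }
}
{code}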

The goal would be to limit concurrency at the netty layer in such a way that:
1. Performance is still good
2. Under load, we don't balloon resource usage at the netty layer

I could be looking at something that isn't the real issue but it is notable.


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10), but 
now netty may be able to queue up a lot more, in comparison, because netty has 
been designed for concurrency. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:14 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10), but 
now netty may be able to queue up a lot more, in comparison, because netty has 
been designed for concurrency. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks probably should not be INT_MAX, but that may matter less.

The goal would be to limit concurrency at the netty layer in such a way that:
1. Performance is still good
2. Under load, we don't balloon resource usage at the netty layer

I could be looking at something that isn't the real issue but it is notable.


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10), but 
now netty may be able to queue up a lot more, in comparison*. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:13 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10), but 
now netty may be able to queue up a lot more, in comparison*. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks probably should not be INT_MAX, but that may matter less.

The goal would be to limit concurrency at the netty layer in such a way that:
1. Performance is still good
2. Under load, we don't balloon resource usage at the netty layer

I could be looking at something that isn't the real issue but it is notable.


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10) and was 
not async, but now netty is able to queue up a lot of work asynchronously*. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:12 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10) and was 
not async, but now netty is able to queue up a lot of work asynchronously*. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks probably should not be INT_MAX, but that may matter less.

The goal would be to limit concurrency at the netty layer in such a way that:
1. Performance is still good
2. Under load, we don't balloon resource usage at the netty layer

I could be looking at something that isn't the real issue but it is notable.


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10) and was 
not async, but now netty is able to queue up a lot of work asynchronously*. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:09 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits (hbase.ipc.server.read.threadpool.size", default 10) and was 
not async, but now netty is able to queue up a lot of work asynchronously*. 
This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here.

And this may also explain why security makes it worse, because when security is 
active we wrap (encrypt) and unwrap (decrypt) up in the call layer, beyond 
netty, and that takes additional time there, which would back things up at the 
netty layer more than if call handling would complete more quickly without 
encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks probably should not be INT_MAX, but that may matter less.

The goal would be to limit concurrency at the netty layer in such a way that:
1. Performance is still good
2. Under load, we don't balloon resource usage at the netty layer


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, but now netty is able to 
queue up a lot of work asynchronously. This is going to be somewhat application 
dependent too. If the application interacts synchronously with calls and has 
its own bound, then in flight requests or their network level handling will be 
bounded by the aggregate (client_limit x number_of_clients). If the application 
is highly async, write-mostly, or a load test client – which is typically 
write-mostly, async, and configured with large bounds :) – then this can 
explain the findings reported here.

And this may also explain why security makes it worse, because when security is 
active we wrap (encrypt) and unwrap (decrypt) up in the call layer, beyond 
netty, and that takes additional time there, which would back things up at the 
netty layer more than if call handling would complete more quickly without 
encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:09 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10) and was 
not async, but now netty is able to queue up a lot of work asynchronously*. 

This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here. It may also explain why security makes it worse, because when 
security is active we wrap (encrypt) and unwrap (decrypt) up in the call layer, 
beyond netty, and that takes additional time there, which would back things up 
at the netty layer more than if call handling would complete more quickly 
without encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks probably should not be INT_MAX, but that may matter less.

The goal would be to limit concurrency at the netty layer in such a way that:
1. Performance is still good
2. Under load, we don't balloon resource usage at the netty layer
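
A rough sketch of that direction, written against plain (unshaded) netty for 
readability; the 64k task cap is only an example value, and with the relocated 
netty in hbase-thirdparty the package and property names gain the 
org.apache.hbase.thirdparty prefix:

{code:java}
// Illustrative only: bound event loop threads to the core count and cap the
// per-event-loop pending task queue via netty's own system property. The
// property must be set before the first event loop is constructed.
import io.netty.channel.nio.NioEventLoopGroup;

public final class BoundedEventLoopSketch {
  public static NioEventLoopGroup newBoundedGroup() {
    // Example cap; the point is just "not Integer.MAX_VALUE".
    System.setProperty("io.netty.eventLoop.maxPendingTasks", String.valueOf(64 * 1024));
    int threads = Runtime.getRuntime().availableProcessors();
    return new NioEventLoopGroup(threads);
  }

  private BoundedEventLoopSketch() {}
}
{code}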


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load, can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, *because SimpleRpcServer 
had thread limits ("hbase.ipc.server.read.threadpool.size", default 10) and was 
not async, but now netty is able to queue up a lot of work asynchronously*. 
This is going to be somewhat application dependent too. If the application 
interacts synchronously with calls and has its own bound, then in flight 
requests or their network level handling will be bounded by the aggregate 
(client_limit x number_of_clients). If the application is highly async, 
write-mostly, or a load test client – which is typically write-mostly, async, 
and configured with large bounds :) – then this can explain the findings 
reported here.

And this may also explain why security makes it worse, because when security is 
active we wrap (encrypt) and unwrap (decrypt) up in the call layer, beyond 
netty, and that takes additional time there, which would back things up at the 
netty layer more than if call handling would complete more quickly without 
encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:06 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load, can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, but now netty is able to 
queue up a lot of work asynchronously. This is going to be somewhat application 
dependent too. If the application interacts synchronously with calls and has 
its own bound, then in flight requests or their network level handling will be 
bounded by the aggregate (client_limit x number_of_clients). If the application 
is highly async, write-mostly, or a load test client – which is typically 
write-mostly, async, and configured with large bounds :) – then this can 
explain the findings reported here.

And this may also explain why security makes it worse, because when security is 
active we wrap (encrypt) and unwrap (decrypt) up in the call layer, beyond 
netty, and that takes additional time there, which would back things up at the 
netty layer more than if call handling would complete more quickly without 
encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks probably should not be INT_MAX, but that may matter less.

The goal would be to limit concurrency at the netty layer in such a way that:
1. Performance is still good
2. Under load, we don't balloon resource usage at the netty layer


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load, can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, but now netty is able to 
queue up a lot of work asynchronously. This is going to be somewhat application 
dependent too. If the application interacts synchronously with calls and has 
its own bound, then in flight requests or their network level handling will be 
bounded by the aggregate (client_limit x number_of_clients). If the application 
is highly async, write-mostly, or a load test client – which is typically 
write-mostly, async, and configured with large bounds :) – then this can 
explain the findings reported here.

And this may also explain why security makes it worse, because when security is 
active we wrap (encrypt) and unwrap (decrypt) up in the call layer, beyond 
netty, and that takes additional time there, which would back things up at the 
netty layer more than if call handling would complete more quickly without 
encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks should not be INT_MAX, that's not a sane default.

> Netty "leak detected" and 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:04 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load, can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, but now netty is able to 
queue up a lot of work asynchronously. This is going to be somewhat application 
dependent too. If the application interacts synchronously with calls and has 
its own bound, then in flight requests or their network level handling will be 
bounded by the aggregate (client_limit x number_of_clients). If the application 
is highly async, write-mostly, or a load test client – which is typically 
write-mostly, async, and configured with large bounds :) – then this can 
explain the findings reported here.

And this may also explain why security makes it worse, because when security is 
active we wrap (encrypt) and unwrap (decrypt) up in the call layer, beyond 
netty, and that takes additional time there, which would back things up at the 
netty layer more than if call handling would complete more quickly without 
encryption.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks should not be INT_MAX, that's not a sane default.


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load, can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, but now netty is able to 
queue up a lot of work asynchronously. This is going to be somewhat application 
dependent too. If the application interacts synchronously with calls and has 
its own bound, then in flight requests or their network level handling will be 
bounded by the aggregate (client_limit x number_of_clients). If the application 
is highly async, write-mostly, or a load test client – which is typically 
write-mostly, async, and configured with large bounds :) – then this can 
explain the findings reported here.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks should not be INT_MAX, that's not a sane default.

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> ---
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Priority: Critical
>
> Under constant data ingestion, using default Netty based RpcServer and 
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552464#comment-17552464
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 11:01 PM:
--

On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load, can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, but now netty is able to 
queue up a lot of work asynchronously. This is going to be somewhat application 
dependent too. If the application interacts synchronously with calls and has 
its own bound, then in flight requests or their network level handling will be 
bounded by the aggregate (client_limit x number_of_clients). If the application 
is highly async, write-mostly, or a load test client – which is typically 
write-mostly, async, and configured with large bounds :) – then this can 
explain the findings reported here.

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks should not be INT_MAX, that's not a sane default.


was (Author: apurtell):
On the subject of configuration and NettyRpcServer, we leave netty level 
resource limits unbounded. The number of threads to use for the event loop is 
default 0 (unbounded). The default for io.netty.eventLoop.maxPendingTasks is 
INT_MAX. We don't do this for our own RPC handlers. We have a notion of maximum 
handler pool size, with a default of 30, typically raised in production by the 
user. We constrain the depth of the request queue in multiple ways... limits on 
number of queued calls, limits on total size of calls data that can be queued 
(to avoid memory usage overrun, just like this case), CoDel conditioning of the 
call queues if it is enabled, and so on.

Under load, can we pile up an excess of pending request state, such as direct 
buffers containing request bytes, at the netty layer because of downstream 
resource limits? Those limits will act as a bottleneck, as intended, and before 
would have also applied backpressure through RPC too, but now netty is able to 
queue up a lot of work asynchronously. 

Consider the hbase.netty.eventloop.rpcserver.thread.count default. It is 0 
(unbounded). I don't know what it can actually get up to in production, because 
we lack the metric, but there are diminishing returns when threads > cores so a 
reasonable default here could be Runtime.getRuntime().availableProcessors() 
instead of unbounded?

maxPendingTasks should not be INT_MAX, that's not a sane default.

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> ---
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Priority: Critical
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552340#comment-17552340
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 5:30 PM:
-

[~bbeaudreault]  I was/am confused by that because HBASE-2 is a child of 
HBASE-26553 which describes itself as "OAuth Bearer authentication mech plugin 
for SASL". Can you or someone clean this up so we can clearly see what is going 
on? Is it really a full TLS RPC stack? Because it looks to me like some TLS 
fiddling to get a token that then sets up the usual wrapped SASL connection, 
possibly why I am confused. That would not be native TLS support in the sense I 
mean and the sense that is really required, possibly why it has not gotten 
enough attention. 

Oh, the PR itself describes the work as "HBASE-2 Add native TLS encryption 
support to RPC server/client ". That is much different. 

Let's clean up the situation with HBASE-2 and HBASE-26553 and take the 
conversation there so as not to distract from this JIRA.


was (Author: apurtell):
[~bbeaudreault]  I was/am confused by that because HBASE-2 is a child of 
HBASE-26553 which describes itself as "OAuth Bearer authentication mech plugin 
for SASL". Can you or someone clean this up so we can clearly see what is going 
on? Is it really a full TLS RPC stack? Because it looks to me like some TLS 
fiddling to get a token that then sets up the usual wrapped SASL connection, 
possibly why I am confused. That would not be native TLS support in the sense I 
mean and the sense that is really required, possibly why it has not gotten 
enough attention. 

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> ---
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Priority: Critical
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552340#comment-17552340
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 5:28 PM:
-

[~bbeaudreault]  I was/am confused by that because HBASE-2 is a child of 
HBASE-26553 which describes itself as "OAuth Bearer authentication mech plugin 
for SASL". Can you or someone clean this up so we can clearly see what is going 
on? Is it really a full TLS RPC stack? Because it looks to me like some TLS 
fiddling to get a token that then sets up the usual wrapped SASL connection, 
possibly why I am confused. That would not be native TLS support in the sense I 
mean and the sense that is really required, possibly why it has not gotten 
enough attention. 


was (Author: apurtell):
[~bbeaudreault]  I was/am confused by that because HBASE-2 is a child of 
HBASE-26553 which describes itself as "OAuth Bearer authentication mech plugin 
for SASL". Can you or someone clean this up so we can clearly see what is going 
on? Is it really a full TLS RPC stack? Because it looks to me like some TLS 
fiddling to get a token that then sets up the usual wrapped SASL connection. It 
is not native TLS support in the sense I mean and the sense that is really 
required, which is TLS and only TLS end to end, possibly why it has not gotten 
enough attention. 

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> ---
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Priority: Critical
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552322#comment-17552322
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 5:15 PM:
-

[~zhangduo] Our current requirements would be _auth-conf_ but Viraj may have 
been testing with {_}auth{_}, which was the previous setting.

[~vjasani] I am curious if you apply my patch and set 
hbase.netty.rpcserver.allocator=unpooled if the direct memory allocation still 
gets up to > 50 GB. My guess is yes, that it is the concurrent demand for 
buffers at load driving the usage, and not excessive cache retention in the 
pooled allocator. Let's see if experimental results confirm the hypothesis. If 
it helps then I am wrong and pooling configuration tweaks – read on below – 
should be considered. 

If I am correct then we should investigate how to get direct IO buffers freed 
faster and/or limits or pacing applied to their allocation; using a custom 
allocator, possibly. As [~zhangduo] mentioned, we set up a certain number of 
buffers, and more when SASL is used. This should be tunable: people with large 
RAM servers/instances can tune it up, and people with more memory-constrained 
options can tune it down.

Looking at our PooledByteBufAllocator in hbase-thirdparty it is clear an issue 
people may be facing is confusion about system property names. I can see in the 
sources, via my IDE, that the shader rewrote the string constants containing 
the property keys too. Various resources on the Internet will offer 
documentation and suggestions, but because we relocated Netty into thirdparty, 
the names have changed, and so naively following the advice on StackOverflow 
and other places will have no effect. The key ones here are the recommendations 
for when you want to prefer heap instead of direct memory.

Let me list them in terms of relevancy for addressing this issue.

Highly relevant:
 - io.netty.allocator.cacheTrimInterval -> 
org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval
 -- This is the threshold number of allocations after which cached entries will 
be freed up if not frequently used. Lowering it from the default of 8192 may 
reduce the overall amount of direct memory retained in steady state, because 
the evaluation will be performed more often, as often as you specify.
 - io.netty.noPreferDirect -> 
org.apache.hbase.thirdparty.io.netty.noPreferDirect
 -- This will prefer heap arena allocations regardless of PlatformDependent 
ideas on preference if set to 'true'.
 - io.netty.allocator.numDirectArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas
 -- Various advice on the Internet suggests setting numDirectArenas=0 and 
noPreferDirect=true as the way to prefer heap based buffers.

Less relevant:
 - io.netty.allocator.maxCachedBufferCapacity -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxCachedBufferCapacity
 -- This is the size-based retention policy for buffers; individual buffers 
larger than this will not be cached.
 - io.netty.allocator.numHeapArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numHeapArenas
 - io.netty.allocator.pageSize -> 
org.apache.hbase.thirdparty.io.netty.allocator.pageSize
 - io.netty.allocator.maxOrder -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxOrder
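
To spell out the translation, a sketch (example values only) of applying the 
heap-preferring combination above with the relocated names; the same keys can be 
passed as -D JVM options instead, as long as they are set before the shaded 
allocator classes initialize:

{code:java}
// Sketch only: the tunables above, using the relocated property names that the
// shaded netty in hbase-thirdparty reads. Values shown are examples.
public final class ShadedNettyAllocatorTuning {
  public static void preferHeapAndTrimAggressively() {
    // Prefer heap arenas over direct memory.
    System.setProperty("org.apache.hbase.thirdparty.io.netty.noPreferDirect", "true");
    System.setProperty("org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas", "0");
    // Trim thread-local caches more often than the default of 8192 allocations.
    System.setProperty("org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval", "1024");
  }

  private ShadedNettyAllocatorTuning() {}
}
{code}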

On [https://github.com/apache/hbase/pull/4505] I have a draft PR that allows 
the user to tweak the Netty bytebuf allocation policy. This may be a good idea 
to do in general. We may want to provide support for some of the above Netty 
tunables in HBase site configuration as well, as a way to eliminate confusion 
about them... Our documentation on it would describe the HBase site config 
property names.
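
As a rough sketch of what such a switch could look like -- the key 
hbase.netty.rpcserver.allocator is from this discussion, the rest is illustrative 
and not the PR's actual code, written against plain netty plus Hadoop 
Configuration -- the server would pick its ByteBufAllocator from site 
configuration and apply it via childOption(ChannelOption.ALLOCATOR, ...):

{code:java}
// Illustrative allocator-policy switch; not the draft PR's actual implementation.
import org.apache.hadoop.conf.Configuration;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.UnpooledByteBufAllocator;

final class RpcServerAllocatorSketch {
  static ByteBufAllocator select(Configuration conf) {
    String policy = conf.get("hbase.netty.rpcserver.allocator", "pooled");
    switch (policy) {
      case "unpooled":
        return UnpooledByteBufAllocator.DEFAULT;
      case "heap":
        // Pooled, but preferring heap buffers over direct memory.
        return new PooledByteBufAllocator(false /* preferDirect */);
      case "pooled":
      default:
        return PooledByteBufAllocator.DEFAULT;
    }
  }
}
{code}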

On a side note, we might spike on an alternative to SASL RPC that is a TLS 
based implementation instead. I know this has been discussed and even partially 
attempted, repeatedly, over our history but nonetheless the operational and 
performance issues with SASL remain. We were here once before on HBASE-17721. 
[~bbeaudreault]  posted HBASE-26548 more recently.


was (Author: apurtell):
[~zhangduo] Our current requirements would be _auth-conf_ but Viraj may have 
been testing with {_}auth{_}, which was the previous setting.

[~vjasani] I am curious if you apply my patch and set 
hbase.netty.rpcserver.allocator=unpooled if the direct memory allocation still 
gets up to > 50 GB. My guess is yes, that it is the concurrent demand for 
buffers at load driving the usage, and not excessive cache retention in the 
pooled allocator. Let's see if experimental results confirm the hypothesis. If 
it helps then I am wrong and pooling configuration tweaks – read on below – 
should be considered. If I am correct then we should investigate how to get 
direct IO buffers freed faster and/or limits or pacing applied to their 
allocation; using a custom allocator, possibly.

Looking at our 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552322#comment-17552322
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 5:07 PM:
-

[~zhangduo] Our current requirements would be _auth-conf_ but Viraj may have 
been testing with {_}auth{_}, which was the previous setting.

[~vjasani] I am curious if you apply my patch and set 
hbase.netty.rpcserver.allocator=unpooled if the direct memory allocation still 
gets up to > 50 GB. My guess is yes, that it is the concurrent demand for 
buffers at load driving the usage, and not excessive cache retention in the 
pooled allocator. Let's see if experimental results confirm the hypothesis. If 
it helps then I am wrong and pooling configuration tweaks – read on below – 
should be considered. If I am correct then we should investigate how to get 
direct IO buffers freed faster and/or limits or pacing applied to their 
allocation; using a custom allocator, possibly.

Looking at our PooledByteBufAllocator in hbase-thirdparty it is clear an issue 
people may be facing is confusion about system property names. I can see in the 
sources, via my IDE, that the shader rewrote the string constants containing 
the property keys too. Various resources on the Internet will offer 
documentation and suggestions, but because we relocated Netty into thirdparty, 
the names have changed, and so naively following the advice on StackOverflow 
and other places will have no effect. The key ones here are the recommendations 
for when you want to prefer heap instead of direct memory.

Let me list them in terms of relevancy for addressing this issue.

Highly relevant:
 - io.netty.allocator.cacheTrimInterval -> 
org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval
 -- This is the threshold number of allocations after which cached entries will 
be freed up if not frequently used. Lowering it from the default of 8192 may 
reduce the overall amount of direct memory retained in steady state, because 
the evaluation will be performed more often, as often as you specify.
 - io.netty.noPreferDirect -> 
org.apache.hbase.thirdparty.io.netty.noPreferDirect
 -- This will prefer heap arena allocations regardless of PlatformDependent 
ideas on preference if set to 'true'.
 - io.netty.allocator.numDirectArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas
 -- Various advice on the Internet suggests setting numDirectArenas=0 and 
noPreferDirect=true as the way to prefer heap based buffers.

Less relevant:
 - io.netty.allocator.maxCachedBufferCapacity -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxCachedBufferCapacity
 -- This is the size-based retention policy for buffers; individual buffers 
larger than this will not be cached.
 - io.netty.allocator.numHeapArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numHeapArenas
 - io.netty.allocator.pageSize -> 
org.apache.hbase.thirdparty.io.netty.allocator.pageSize
 - io.netty.allocator.maxOrder -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxOrder

On [https://github.com/apache/hbase/pull/4505] I have a draft PR that allows 
the user to tweak the Netty bytebuf allocation policy. This may be a good idea 
to do in general. We may want to provide support for some of the above Netty 
tunables in HBase site configuration as well, as a way to eliminate confusion 
about them... Our documentation on it would describe the HBase site config 
property names.

On a side note, we might spike on an alternative to SASL RPC that is a TLS 
based implementation instead. I know this has been discussed and even partially 
attempted, repeatedly, over our history but nonetheless the operational and 
performance issues with SASL remain. We were here once before on HBASE-17721. 
[~bbeaudreault]  posted HBASE-26548 more recently.


was (Author: apurtell):
[~zhangduo] Our current requirements would be _auth-conf_ but Viraj may have 
been testing with {_}auth{_}, which was the previous setting.

[~vjasani] I am curious if you apply my patch and set 
hbase.netty.rpcserver.allocator=unpooled if the direct memory allocation still 
gets up to > 50 GB. My guess is yes, that it is the concurrent demand for 
buffers at load driving the usage, and not excessive cache retention in the 
pooled allocator. Let's see if experimental results confirm the hypothesis. If 
it helps then I am wrong and pooling configuration tweaks – read on below – 
should be considered. If I am correct then we should investigate how to get 
direct IO buffers freed faster and/or limits or pacing applied to their 
allocation; using a custom allocator, possibly.

Looking at our PooledByteBufAllocator in hbase-thirdparty it is clear an issue 
people may be facing is confusion about system property names. I can see in the 
sources, via my IDE, that the shader rewrote the string constants containing 
the property keys too. Various 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-09 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17552322#comment-17552322
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 4:54 PM:
-

[~zhangduo] Our current requirements would be _auth-conf_ but Viraj may have 
been testing with {_}auth{_}, which was the previous setting.

[~vjasani] I am curious if you apply my patch and set 
hbase.netty.rpcserver.allocator=unpooled if the direct memory allocation still 
gets up to > 50 GB. My guess is yes, that it is the concurrent demand for 
buffers at load driving the usage, and not excessive cache retention in the 
pooled allocator. Let's see if experimental results confirm the hypothesis. If 
it helps then I am wrong and pooling configuration tweaks – read on below – 
should be considered. If I am correct then we should investigate how to get 
direct IO buffers freed faster and/or limits or pacing applied to their 
allocation; using a custom allocator, possibly.

Looking at our PooledByteBufAllocator in hbase-thirdparty it is clear an issue 
people may be facing is confusion about system property names. I can see in the 
sources, via my IDE, that the shader rewrote the string constants containing 
the property keys too. Various resources on the Internet will offer 
documentation and suggestions, but because we relocated Netty into thirdparty, 
the names have changed, and so naively following the advice on StackOverflow 
and other places will have no effect. The key ones here are the recommendations 
for when you want to prefer heap instead of direct memory.

Let me list them in terms of relevancy for addressing this issue.

Highly relevant:
 - io.netty.allocator.cacheTrimInterval -> 
org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval
 -- This is the threshold number of allocations after which cached entries will 
be freed up if not frequently used. Lowering it from the default of 8192 may 
reduce the overall amount of direct memory retained in steady state, because 
the evaluation will be performed more often, as often as you specify.
 - io.netty.noPreferDirect -> 
org.apache.hbase.thirdparty.io.netty.noPreferDirect
 -- This will prefer heap arena allocations regardless of PlatformDependent 
ideas on preference if set to 'true'.
 - io.netty.allocator.numDirectArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas
 -- Various advice on the Internet suggests setting numDirectArenas=0 and 
noPreferDirect=true as the way to prefer heap based buffers.

Less relevant:
 - io.netty.allocator.maxCachedBufferCapacity -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxCachedBufferCapacity
 -- This is the size-based retention policy for buffers; individual buffers 
larger than this will not be cached.
 - io.netty.allocator.numHeapArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numHeapArenas
 - io.netty.allocator.pageSize -> 
org.apache.hbase.thirdparty.io.netty.allocator.pageSize
 - io.netty.allocator.maxOrder -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxOrder

On [https://github.com/apache/hbase/pull/4505] I have a draft PR that allows 
the user to tweak the Netty bytebuf allocation policy. This may be a good idea 
to do in general. We may want to provide support for some of the above Netty 
tunables in HBase site configuration as well, as a way to eliminate confusion 
about them... Our documentation on it would describe the HBase site config 
property names.

On a side note, we might spike on an alternative to SASL RPC that is a TLS 
based implementation instead. I know this has been discussed and even partially 
attempted, repeatedly, over our history but nonetheless the operational and 
performance issues with SASL remain.


was (Author: apurtell):
[~zhangduo] Our current requirements would be _auth-conf_ but Viraj may have 
been testing with _auth_, which was the previous setting. 

[~vjasani] I am curious if you apply my patch and set 
hbase.netty.rpcserver.allocator=unpooled if the direct memory allocation still 
gets up to > 50 GB. My guess is yes, that it is the concurrent demand for 
buffers at load driving the usage, and not excessive cache retention in the 
pooled allocator. Let's see if experimental results confirm the hypothesis. If 
it helps then I am wrong and pooling configuration tweaks -- read on below -- 
should be considered. If I am correct then we should investigate how to get 
direct IO buffers freed faster and/or limits or pacing applied to their 
allocation; using a custom allocator, possibly. 

Looking at our PooledByteBufAllocator in hbase-thirdparty it is clear an issue 
people may be facing is confusion about system property names. Various 
resources on the Internet will offer documentation and suggestions, but because 
we relocated Netty into thirdparty, the names have changed, and so naively 
following the advice on StackOverflow and other places 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-08 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551700#comment-17551700
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/8/22 4:47 PM:
-

[~zhangduo] 

bq. Does increase MaxDirectMemorySize can solve the problem?

Yes, this avoids the failures, but it remains a cost-to-serve problem because it 
requires selecting a larger (e.g. AWS) instance type to get the additional RAM. 
Still, it is an effective workaround for us.
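
(A trivial check, assuming a netty version that exposes these PlatformDependent 
accessors, to confirm what ceiling the shaded netty actually sees after raising 
-XX:MaxDirectMemorySize:)

{code:java}
// Quick check of the direct memory ceiling and current usage as seen by the
// shaded netty; useful when experimenting with -XX:MaxDirectMemorySize.
import org.apache.hbase.thirdparty.io.netty.util.internal.PlatformDependent;

public class DirectMemoryCheck {
  public static void main(String[] args) {
    System.out.println("max direct memory:  " + PlatformDependent.maxDirectMemory());
    System.out.println("used direct memory: " + PlatformDependent.usedDirectMemory());
  }
}
{code}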

I get your point... We should update this issue, because this maybe isn't a 
_leak_. It is an excessive buffer retention issue, certainly.


was (Author: apurtell):
[~zhangduo] 

bq. Does increase MaxDirectMemorySize can solve the problem?

Yes, this avoids the failures, but it remains a cost-to-serve problem because it 
requires selecting a larger (e.g. AWS) instance type to get the additional RAM. 
But it is an effective workaround, for sure.

I get your point... We should update this issue, because this maybe isn't a 
_leak_. It is an excessive buffer retention issue.

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> ---
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Priority: Major
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> 

[jira] [Comment Edited] (HBASE-26708) Netty "leak detected" and OutOfDirectMemoryError due to direct memory buffering

2022-06-08 Thread Andrew Kyle Purtell (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551700#comment-17551700
 ] 

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/8/22 4:46 PM:
-

[~zhangduo] 

bq. Does increase MaxDirectMemorySize can solve the problem?

Yes, this avoids the failures, but it remains a cost-to-serve problem because it 
requires selecting a larger (e.g. AWS) instance type to get the additional RAM. 
But it is an effective workaround, for sure.

I get your point... We should update this issue, because this maybe isn't a 
_leak_. It is an excessive buffer retention issue.


was (Author: apurtell):
[~zhangduo] 

bq. Does increase MaxDirectMemorySize can solve the problem?

Yes, this avoids the failures, but it remains a cost-to-serve problem because it 
requires selecting a larger (e.g. AWS) instance type to get the additional RAM. 
But it is an effective workaround, for sure.

I get your point... We should update this issue, because this isn't a _leak_. 
It is an excessive (IMHO) buffer retention issue.

> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> ---
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
>  Issue Type: Bug
>  Components: rpc
>Affects Versions: 2.5.0, 2.4.6
>Reporter: Viraj Jasani
>Priority: Major
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
>