[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2024-01-02 Thread Becker Ewing (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801971#comment-17801971 ]

Becker Ewing commented on HBASE-27730:
--

Sorry about that! I've pushed up the code to this branch: 
[https://github.com/jbewing/hbase/tree/HBASE-27730-prelim] 

> Optimize reference counting in off-heap ByteBuff
> 
>
>                 Key: HBASE-27730
>                 URL: https://issues.apache.org/jira/browse/HBASE-27730
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Becker Ewing
>            Priority: Major
>         Attachments: HBASE-27730-prelim.patch
>
>
> In HBASE-27710 we uncovered a performance regression in reference counting of 
> ByteBuff. This was especially prominent in on-heap buffers when doing a 
> simple HFile.Reader iteration of a file. For that case, we saw a 4x 
> regression when reference counting was in play.
> It stands to reason that this same regression exists in off-heap buffers, and 
> I've run a microbenchmark which indeed shows the same issue. With existing 
> reference counting, scanning a 20GB HFile takes 40s. With an optimized 
> version, scanning the same file takes 20s. We don't typically see this when 
> profiling live regionservers, where so much else is going on, but optimizing 
> this would eliminate some CPU cycles.
> It's worth noting that netty saw this same regression a few years ago: 
> [https://github.com/netty/netty/pull/8895]. Hat tip to [~zhangduo] for 
> pointing this out.
> One of the fixes there was to copy some internal code from deeper in the ref 
> counting, so that the call stack was smaller and inlining was possible. We 
> can't really do that.
> Another thing they did was add a boolean field to their CompositeByteBuffer, 
> which gets set to true when the buffer is recycled. So they don't need to do 
> reference counting on every operation; instead, they can just check a boolean.
> I tried adding a boolean to our RefCnt.java, and it indeed fixes the 
> regression. The problem is that, due to object alignment in Java, adding this 
> boolean field increases the heap size of a RefCnt instance from 24 to 32 
> bytes. This seems non-trivial given RefCnt is used in the bucket cache, where 
> there could be many millions of them.
> I think we can get around this by simply nulling out the recycler in RefCnt 
> after it has been called. Then, instead of doing a boolean check we can do a 
> null check. This performs similarly to the boolean, but without any extra 
> memory.
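
For anyone who wants to double-check that size claim, OpenJDK's JOL tool can print object layouts. The sketch below is illustrative only and assumes the org.openjdk.jol:jol-core dependency; the nested classes are hypothetical stand-ins for the two RefCnt shapes under discussion, not the real RefCnt (which extends netty's AbstractReferenceCounted).

{code:java}
import org.openjdk.jol.info.ClassLayout;

public class RefCntLayoutCheck {

  // Hypothetical stand-in: 12-byte header + int refCnt + two references = 24 bytes
  // (with compressed oops enabled, the JVM default for small heaps).
  static class RefCntLike {
    volatile int refCnt;
    Object recycler;
    Object leak;
  }

  // Adding one boolean pushes the instance past 24 bytes, and 8-byte object
  // alignment rounds it up to 32 bytes.
  static class RefCntLikeWithFlag extends RefCntLike {
    volatile boolean recycled;
  }

  public static void main(String[] args) {
    System.out.println(ClassLayout.parseClass(RefCntLike.class).toPrintable());
    System.out.println(ClassLayout.parseClass(RefCntLikeWithFlag.class).toPrintable());
  }
}
{code}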





[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2024-01-02 Thread Bryan Beaudreault (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801968#comment-17801968 ]

Bryan Beaudreault commented on HBASE-27730:
---

Can you attach the full benchmark? It looks like only a diff. You could also push 
a branch to make it easier to read on GitHub rather than as a txt attachment.



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2024-01-02 Thread Becker Ewing (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801953#comment-17801953 ]

Becker Ewing commented on HBASE-27730:
--

{noformat}
Benchmark              (shouldUseNoneRecycler)  (useOffHeapBuffer)  Mode  Cnt     Score     Error  Units
ByteBuffBenchmark.get                     true                true  avgt    5  8715.026 ± 196.680  ns/op
ByteBuffBenchmark.get                     true               false  avgt    5  9156.038 ± 879.701  ns/op
ByteBuffBenchmark.get                    false                true  avgt    5  8746.112 ±  91.118  ns/op
ByteBuffBenchmark.get                    false               false  avgt    5  9067.323 ± 190.359  ns/op


Benchmark                                          (shouldUseNoneRecycler)               (vlong)  Mode  Cnt   Score   Error  Units
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                      true                     9  avgt    5   2.254 ± 0.077  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                      true                   512  avgt    5   5.305 ± 0.169  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                      true                     8  avgt    5   7.014 ± 0.929  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                      true          548755813887  avgt    5   6.730 ± 0.037  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                      true         1700104028981  avgt    5   7.497 ± 0.133  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                      true   9123372036854775807  avgt    5  10.984 ± 0.445  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     false                     9  avgt    5   2.242 ± 0.031  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     false                   512  avgt    5   5.349 ± 0.075  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     false                     8  avgt    5   6.677 ± 0.041  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     false          548755813887  avgt    5   6.772 ± 0.528  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     false         1700104028981  avgt    5   7.436 ± 0.128  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     false   9123372036854775807  avgt    5  11.173 ± 0.217  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                       true                     9  avgt    5   2.495 ± 0.268  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                       true                   512  avgt    5   5.335 ± 0.166  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                       true                     8  avgt    5   7.014 ± 0.177  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                       true          548755813887  avgt    5   6.990 ± 0.083  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                       true         1700104028981  avgt    5   7.085 ± 0.093  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                       true   9123372036854775807  avgt    5  10.319 ± 0.229  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      false                     9  avgt    5   2.420 ± 0.126  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      false                   512  avgt    5   5.508 ± 0.500  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      false                     8  avgt    5   7.137 ± 0.108  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      false          548755813887  avgt    5   6.943 ± 0.040  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      false         1700104028981  avgt    5   6.921 ± 0.734  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      false   9123372036854775807  avgt    5  10.250 ± 0.364  ns/op
{noformat}
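
For anyone who wants to reproduce this kind of measurement, a minimal JMH harness in the same spirit might look like the sketch below. It is an illustrative stand-in only (a plain ByteBuffer plus an AtomicInteger in place of ByteBuff/RefCnt, with made-up class and parameter names), not the ByteBuffBenchmark from the attached patch.

{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
public class RefCheckedGetBenchmark {

  @Param({ "true", "false" })
  public boolean useOffHeapBuffer;

  @Param({ "true", "false" })
  public boolean checkRefCountPerGet;

  // Stand-ins for ByteBuff and RefCnt: a plain ByteBuffer plus an AtomicInteger.
  private ByteBuffer buffer;
  private final AtomicInteger refCnt = new AtomicInteger(1);

  @Setup
  public void setUp() {
    buffer = useOffHeapBuffer ? ByteBuffer.allocateDirect(8192) : ByteBuffer.allocate(8192);
  }

  @Benchmark
  public void get(Blackhole bh) {
    for (int i = 0; i < buffer.capacity(); i++) {
      // Mimic a per-access reference-count check on the read hot path.
      if (checkRefCountPerGet && refCnt.get() <= 0) {
        throw new IllegalStateException("buffer already released");
      }
      bh.consume(buffer.get(i));
    }
  }
}
{code}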


[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2024-01-02 Thread Bryan Beaudreault (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801952#comment-17801952 ]

Bryan Beaudreault commented on HBASE-27730:
---

Could you try running my exact patch with your benchmark? 



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2024-01-02 Thread Becker Ewing (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801950#comment-17801950 ]

Becker Ewing commented on HBASE-27730:
--

Yeah, what I have is pretty similar. I just attached a patch file 
([^HBASE-27730-prelim.patch]) which has everything. The ByteBuffBenchmark class 
has proven to be pretty useless so far (no real meaningful numbers), but the 
readVLong performance is as I described. It's a subset of the benchmarks in 
HBASE-28256, so I used the results in 
https://issues.apache.org/jira/browse/HBASE-28256?focusedCommentId=17799506&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17799506
as my point of reference.

 

I attached some comments in RefCnt.java that let you toggle between the "v1" 
and "v2" implementations. I'll also note that both "v1" and "v2" are somewhat 
flawed, as it's possible to call Recycler#free twice. To fix that bug, you'd 
need to switch to an AtomicReference so that you could use its compareAndSet 
method when setting "recycler" to null.

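For illustration, the gist of that fix is sketched below with a hypothetical guard class in which a Runnable stands in for Recycler#free; this is not the actual RefCnt code.

{code:java}
import java.util.concurrent.atomic.AtomicReference;

// At most one caller can win the compareAndSet, so free() runs at most once.
final class SingleReleaseGuard {
  private final AtomicReference<Runnable> recycler;

  SingleReleaseGuard(Runnable recycler) {
    this.recycler = new AtomicReference<>(recycler);
  }

  /** Returns true only for the single caller that actually performed the release. */
  boolean release() {
    Runnable r = recycler.get();
    if (r == null || !recycler.compareAndSet(r, null)) {
      return false; // someone else already released (or is releasing) it
    }
    r.run();
    return true;
  }

  boolean isReleased() {
    return recycler.get() == null;
  }
}
{code}
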


[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2024-01-02 Thread Bryan Beaudreault (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801947#comment-17801947 ]

Bryan Beaudreault commented on HBASE-27730:
---

Unfortunately, I can't find the benchmark I had. However, I did find my initial 
implementation of the idea and pushed it up to a branch: 
[https://github.com/apache/hbase/compare/master...HubSpot:hbase:HBASE-27730?expand=1]

Can you see if what you have looks like that? Let's also see what your 
benchmark code looks like. 



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2024-01-02 Thread Becker Ewing (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801944#comment-17801944 ]

Becker Ewing commented on HBASE-27730:
--

[~zhangduo] / [~bbeaudreault] any chance that y'all have the microbenchmarks 
lying around from the initial work here, which determined that the null-check 
solution is as fast as copying the AbstractReferenceCounted code? I've got some 
microbenchmarks that I wrote for this that are giving me very little signal (I 
fear the Java compiler is doing something sneaky). I'm able to get a lot of 
mileage out of the microbenchmark I wrote for readVLong in HBASE-28256, using 
the numbers there as a control, but it generally shows that the null-check 
route results in one of two outcomes:
 * a regression in the NONE recycler case when implemented as suggested above, 
with little performance improvement elsewhere. Going this route essentially 
nullifies the gains we made in HBASE-27710 for no real benefit. I believe this 
is because HBASE-27710 made it so that no volatile read is needed when using a 
NONE recycler, whereas the suggested implementation (i.e. the null-check 
solution) reintroduces a volatile read into the read hot path.
 * if you instead use the "leak" field being null as a proxy for whether the 
NONE recycler is in use (so no volatile read is needed in the NONE recycler 
case), then the NONE recycler regression is fixed, but there's no real 
performance gain (see the sketch below).
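
To make the second option concrete, here is a rough sketch of what that check could look like. The field names mirror the discussion in this issue (recycler, leak, with leak only tracked for non-NONE recyclers), but this is an assumption about the shape of the code, not the actual patch.

{code:java}
// Sketch only: "leak" is null exactly when the NONE recycler is in use, so the
// hot path can skip the volatile read of "recycler" for unpooled buffers.
public void checkRefCount() {
  if (leak == null) {
    // NONE recycler: the buffer is never recycled, so there is nothing to check.
    return;
  }
  if (recycler == null) { // volatile read, only paid by pooled (recyclable) buffers
    throw new IllegalStateException("ByteBuff has already been released back to the pool");
  }
}
{code}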

 

I can attach my benchmarking code/example patches if it would help, but given 
the above I think I'm going to try going down the route of copying the 
AbstractReferenceCounted code and see if that yields any performance 
improvement.



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2023-03-17 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701791#comment-17701791 ]

Duo Zhang commented on HBASE-27730:
---

Let's go with the null check solution first. Copying the 
AbstractReferenceCounted code is about reducing the inheritance depth and 
making it easier to inline the small methods. We can do that later, as it 
requires more advanced microbenchmarks.

 

Thanks.



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2023-03-17 Thread Bryan Beaudreault (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701773#comment-17701773 ]

Bryan Beaudreault commented on HBASE-27730:
---

[~zhangduo] I tested out copying the AbstractReferenceCounted code. It performs 
similarly to the above, though it involves more operations than just the null 
check. I'm open to both approaches. Let me know what you think of my idea above.



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2023-03-17 Thread Bryan Beaudreault (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701774#comment-17701774 ]

Bryan Beaudreault commented on HBASE-27730:
---

Good idea, that makes sense for sure.



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2023-03-17 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701767#comment-17701767 ]

Duo Zhang commented on HBASE-27730:
---

I prefer to implement deallocate like this
{code}
Recycler  r = this.recycler;
if (r == null) {
  return;
}
// set to null before actually releasing to minimize the time window that we 
could use a recycled instance
this.recycler = null;
r.free();
if (leak != null) {
  this.leak.close(this);
}
{code}



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2023-03-17 Thread Bryan Beaudreault (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701755#comment-17701755 ]

Bryan Beaudreault commented on HBASE-27730:
---

I wasn't sure if we wanted to go down the path of copying netty code. In some 
ways I wish we could do the opposite and use netty buffers at the lower levels 
rather than our own. Perhaps that's not possible; I just imagine they will 
continue to optimize things over time.

What do you think about this:
{code:java}
private volatile Recycler recycler;

public RefCnt(Recycler recycler) {
  this.recycler = Preconditions.checkNotNull(recycler, "recycler cannot be 
null, pass NONE instead");
  this.leak = recycler == ByteBuffAllocator.NONE ? null : detector.track(this);
}

@Override
protected final void deallocate() {
  if (recycler == null) {
return;
  }
  this.recycler.free();
  this.recycler = null; // NOTE THIS
  if (leak != null) {
this.leak.close(this);
  }
}

// Call this in ByteBuff.java checkRefCount instead
public boolean isRecycled() {
  return recycler == null;
}{code}
No extra memory, fast, safe. We don't have to worry about synchronizing 
deallocate, because it will only be called once by AbstractReferenceCounted. 
That is handled by the AtomicIntegerFieldUpdater usage in ReferenceCountUpdater.
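
For context, here is a stripped-down sketch of how that updater pattern guarantees a single deallocate() call even with concurrent releases. It is an illustration only, not netty's actual ReferenceCountUpdater, which is considerably more optimized.

{code:java}
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

abstract class SimpleRefCounted {
  private static final AtomicIntegerFieldUpdater<SimpleRefCounted> REF_CNT_UPDATER =
    AtomicIntegerFieldUpdater.newUpdater(SimpleRefCounted.class, "refCnt");

  private volatile int refCnt = 1;

  public final boolean release() {
    for (;;) {
      int cnt = refCnt;
      if (cnt <= 0) {
        throw new IllegalStateException("refCnt: " + cnt + ", already released");
      }
      if (REF_CNT_UPDATER.compareAndSet(this, cnt, cnt - 1)) {
        if (cnt == 1) {
          // Only the caller that moves the count from 1 to 0 reaches this branch,
          // so deallocate() is invoked exactly once.
          deallocate();
          return true;
        }
        return false;
      }
    }
  }

  protected abstract void deallocate();
}
{code}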

 



[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff

2023-03-17 Thread Duo Zhang (Jira)


[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701740#comment-17701740 ]

Duo Zhang commented on HBASE-27730:
---

I think we could copy the code from AbstractReferenceCounted into RefCnt 
directly and use the same trick with the updater field, so we would not need an 
extra boolean field?
