[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801971#comment-17801971 ]

Becker Ewing commented on HBASE-27730:
--------------------------------------

Sorry about that! I've pushed up the code to this branch: [https://github.com/jbewing/hbase/tree/HBASE-27730-prelim]

> Optimize reference counting in off-heap ByteBuff
> ------------------------------------------------
>
>                 Key: HBASE-27730
>                 URL: https://issues.apache.org/jira/browse/HBASE-27730
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Becker Ewing
>            Priority: Major
>         Attachments: HBASE-27730-prelim.patch
>
> In HBASE-27710 we uncovered a performance regression in the reference counting of ByteBuff. This was especially prominent in on-heap buffers when doing a simple HFile.Reader iteration of a file. For that case, we saw a 4x regression when reference counting was in play.
>
> It stands to reason that this same regression exists in off-heap buffers, and I've run a microbenchmark which indeed shows the same issue. With existing reference counting, scanning a 20GB HFile takes 40s. With an optimized version, scanning the same file takes 20s. We don't typically see this when profiling live regionservers, where so much else goes on, but optimizing this would eliminate some CPU cycles.
>
> It's worth noting that Netty saw this same regression a few years ago: [https://github.com/netty/netty/pull/8895]. Hat tip to [~zhangduo] for pointing this out.
>
> One of the fixes there was to copy some internal code from deeper in the ref counting, so that the call stack was smaller and inlining was possible. We can't really do that.
>
> Another thing they did was add a boolean field to their CompositeByteBuffer, which gets set to true when the buffer is recycled. So they don't need to do reference counting on every operation; instead they can just check a boolean.
>
> I tried adding a boolean to our RefCnt.java, and it indeed fixes the regression. The problem is that, due to object alignment in Java, adding this boolean field increases the heap size of RefCnt from 24 to 32 bytes. This seems non-trivial given that RefCnt is used in the bucket cache, where there could be many millions of them.
>
> I think we can get around this by simply nulling out the recycler in RefCnt after it has been called. Then, instead of doing a boolean check, we can do a null check. This performs similarly to the boolean, but without any extra memory.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
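The 24-to-32-byte jump from adding a single boolean follows from the JVM's 8-byte object alignment. A minimal sketch of the arithmetic, assuming a 12-byte object header, compressed oops, and a field set of one int plus two references (typical HotSpot defaults and a guess at RefCnt's layout, not verified; real layouts should be checked with a tool like JOL):

```java
public class RefCntLayoutSketch {
    // Assumed HotSpot defaults: 12-byte header (64-bit, compressed oops)
    // and 8-byte object alignment. These are illustrative assumptions.
    static final int HEADER_BYTES = 12;
    static final int ALIGNMENT = 8;

    /** Rounds header + field bytes up to the next multiple of the alignment. */
    static int instanceSize(int fieldBytes) {
        int raw = HEADER_BYTES + fieldBytes;
        return ((raw + ALIGNMENT - 1) / ALIGNMENT) * ALIGNMENT;
    }

    public static void main(String[] args) {
        // Hypothetical current RefCnt fields: an int refCnt plus two
        // compressed references (recycler, leak), 4 bytes each = 12 bytes.
        System.out.println(instanceSize(12)); // 12 + 12 = 24, already aligned

        // One extra boolean (1 byte) crosses the 24-byte boundary, so
        // alignment padding rounds the instance up to 32 bytes.
        System.out.println(instanceSize(13));
    }
}
```

This is why nulling an existing reference field is attractive: it reuses bytes the object already pays for.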
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801968#comment-17801968 ]

Bryan Beaudreault commented on HBASE-27730:
-------------------------------------------

Can you attach the full benchmark? It looks like only a diff. You could also push a branch to make it easier to read on GitHub, rather than as txt.
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801953#comment-17801953 ]

Becker Ewing commented on HBASE-27730:
--------------------------------------

{noformat}
Benchmark              (shouldUseNoneRecycler)  (useOffHeapBuffer)  Mode  Cnt     Score     Error  Units
ByteBuffBenchmark.get                     true                true  avgt    5  8715.026 ± 196.680  ns/op
ByteBuffBenchmark.get                     true               false  avgt    5  9156.038 ± 879.701  ns/op
ByteBuffBenchmark.get                    false                true  avgt    5  8746.112 ±  91.118  ns/op
ByteBuffBenchmark.get                    false               false  avgt    5  9067.323 ± 190.359  ns/op

Benchmark                                         (shouldUseNoneRecycler)              (vlong)  Mode  Cnt   Score   Error  Units
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     true                    9  avgt    5   2.254 ± 0.077  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     true                  512  avgt    5   5.305 ± 0.169  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     true                    8  avgt    5   7.014 ± 0.929  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     true         548755813887  avgt    5   6.730 ± 0.037  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     true        1700104028981  avgt    5   7.497 ± 0.133  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                     true  9123372036854775807  avgt    5  10.984 ± 0.445  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                    false                    9  avgt    5   2.242 ± 0.031  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                    false                  512  avgt    5   5.349 ± 0.075  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                    false                    8  avgt    5   6.677 ± 0.041  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                    false         548755813887  avgt    5   6.772 ± 0.528  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                    false        1700104028981  avgt    5   7.436 ± 0.128  ns/op
ReadVLongBenchmark.readVLongHBase14186_OffHeapBB                    false  9123372036854775807  avgt    5  11.173 ± 0.217  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      true                    9  avgt    5   2.495 ± 0.268  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      true                  512  avgt    5   5.335 ± 0.166  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      true                    8  avgt    5   7.014 ± 0.177  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      true         548755813887  avgt    5   6.990 ± 0.083  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      true        1700104028981  avgt    5   7.085 ± 0.093  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                      true  9123372036854775807  avgt    5  10.319 ± 0.229  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                     false                    9  avgt    5   2.420 ± 0.126  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                     false                  512  avgt    5   5.508 ± 0.500  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                     false                    8  avgt    5   7.137 ± 0.108  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                     false         548755813887  avgt    5   6.943 ± 0.040  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                     false        1700104028981  avgt    5   6.921 ± 0.734  ns/op
ReadVLongBenchmark.readVLongHBase14186_OnHeapBB                     false  9123372036854775807  avgt    5  10.250 ± 0.364  ns/op
{noformat}
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801952#comment-17801952 ]

Bryan Beaudreault commented on HBASE-27730:
-------------------------------------------

Could you try running my exact patch with your benchmark?
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801950#comment-17801950 ]

Becker Ewing commented on HBASE-27730:
--------------------------------------

Yeah, what I have is pretty similar. I just attached a patch file ([^HBASE-27730-prelim.patch]) which has everything. The ByteBuffBenchmark class has proven to be pretty useless so far (no real meaningful numbers), but the readVLong performance is as I described. It's a subset of the benchmarks in HBASE-28256, so I used the results in https://issues.apache.org/jira/browse/HBASE-28256?focusedCommentId=17799506&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17799506 as my point of reference.

I added some comments in RefCnt.java that let you toggle between the "v1" and "v2" implementations. I'll also note that both "v1" and "v2" are somewhat flawed, as it's possible to call Recycler#free twice (to fix the bug, you'd need to switch to using an AtomicReference so you could use the "compareAndSet" method when setting "recycler" to null).
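The double-free noted above could be closed with a compare-and-set along these lines (a hedged sketch with illustrative names, not the actual HBase RefCnt code; the Recycler interface here is a stand-in):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch only: AtomicReference#compareAndSet guarantees that Recycler#free
// runs at most once, even if deallocate races with a concurrent caller.
public class CasRefCntSketch {
    public interface Recycler {
        void free();
    }

    private final AtomicReference<Recycler> recycler;

    public CasRefCntSketch(Recycler recycler) {
        this.recycler = new AtomicReference<>(recycler);
    }

    /** Frees at most once; returns true only for the caller that wins the CAS. */
    public boolean deallocate() {
        Recycler r = recycler.get();
        // Only one thread can swap the non-null recycler to null.
        if (r != null && recycler.compareAndSet(r, null)) {
            r.free();
            return true;
        }
        return false;
    }

    /** The nulled-out recycler doubles as the "already recycled" flag. */
    public boolean isRecycled() {
        return recycler.get() == null;
    }
}
```

The trade-off is that AtomicReference reintroduces an extra object and volatile accesses, which is exactly the cost the benchmarks here are trying to measure.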
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801947#comment-17801947 ]

Bryan Beaudreault commented on HBASE-27730:
-------------------------------------------

Unfortunately, I can't find the benchmark I had. However, I did find my initial implementation of the idea and pushed it up to a branch: [https://github.com/apache/hbase/compare/master...HubSpot:hbase:HBASE-27730?expand=1]

Can you see if what you have looks like that? Let's also see what your benchmark code looks like.
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801944#comment-17801944 ]

Becker Ewing commented on HBASE-27730:
--------------------------------------

[~zhangduo] / [~bbeaudreault] any chance that y'all have the microbenchmarks lying around from the initial work here, for determining that the null-check solution is as fast as copying the AbstractReferenceCounted code? I've got some microbenchmarks that I wrote for this that are giving me very little signal (I fear the JIT compiler is doing something sneaky). I'm able to get a lot of mileage out of the microbenchmark I wrote for readVLong in HBASE-28256, using the numbers there as a control, but it generally shows that implementing this via the null check results in one of two outcomes:
* A regression in the NONE recycler case when implemented as suggested above (and not much performance improvement otherwise). Going this route essentially nullifies the gains we made in HBASE-27710 for no real benefit. I believe this is because HBASE-27710 made it so that a volatile variable read isn't needed when using a NONE recycler, whereas the suggested implementation (i.e. the null-check solution) introduces a volatile read into the read hot path.
* If you instead use the "leak" field being null as a proxy for whether the NONE recycler is in use (so a volatile read isn't needed in the NONE recycler case), the NONE recycler regression is fixed, but there's no real performance gain.

I can attach my benchmarking code/example patches if it would help, but given this I'm going to try going down the route of copying the AbstractReferenceCounted code and see if that yields any performance improvement.
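The "leak"-as-proxy variant described above can be sketched as follows; the field names and the leak-tracker stand-in are illustrative assumptions, not HBase's actual classes:

```java
// Sketch of the "v2" idea: a plain (non-volatile) final `leak` field is null
// exactly when the NONE recycler is in use, so the read hot path can detect
// the NONE case without the volatile read that checking `recycler` would cost.
public class LeakProxyRefCntSketch {
    public interface Recycler {
        void free();
    }

    public static final Recycler NONE = () -> { };

    private final Object leak;          // plain read; null iff recycler == NONE
    private volatile Recycler recycler; // volatile, only touched off the hot path

    public LeakProxyRefCntSketch(Recycler recycler) {
        this.recycler = recycler;
        // Stand-in for detector.track(this); any non-null marker works here.
        this.leak = recycler == NONE ? null : new Object();
    }

    /** Hot-path check: a plain field read, no volatile access for NONE. */
    public boolean isNoneRecycler() {
        return leak == null;
    }

    public void deallocate() {
        if (isNoneRecycler()) {
            return; // nothing to free, and the volatile read is skipped entirely
        }
        Recycler r = this.recycler;
        if (r != null) {
            this.recycler = null; // null-out before free, as proposed above
            r.free();
        }
    }
}
```

Note this sketch keeps the plain read/write on `recycler` inside `deallocate`, so like "v1"/"v2" it is still not safe against racing releases; it only illustrates where the volatile cost moves.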
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701791#comment-17701791 ]

Duo Zhang commented on HBASE-27730:
-----------------------------------

Let's go with the null-check solution first. Copying the AbstractReferenceCounted code is about reducing the inheritance levels and making it easier to inline the small functions. We can do that later, as it requires advanced micro benchmarks. Thanks.
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701773#comment-17701773 ]

Bryan Beaudreault commented on HBASE-27730:
-------------------------------------------

[~zhangduo] I tested copying the AbstractReferenceCounted code. That performs similarly to the above, though it involves more operations than just the null check. I'm open to both approaches. Let me know what you think of my idea above.
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701774#comment-17701774 ]

Bryan Beaudreault commented on HBASE-27730:
-------------------------------------------

Good idea, that makes sense for sure.
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701767#comment-17701767 ]

Duo Zhang commented on HBASE-27730:
-----------------------------------

I prefer to implement deallocate like this:

{code}
Recycler r = this.recycler;
if (r == null) {
  return;
}
// set to null before actually releasing, to minimize the time window
// in which we could use a recycled instance
this.recycler = null;
r.free();
if (leak != null) {
  this.leak.close(this);
}
{code}
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701755#comment-17701755 ] Bryan Beaudreault commented on HBASE-27730: --- I wasn't sure if we wanted to go down the path of copying netty code. In some ways I wish we could do the opposite: use netty buffers at the lower levels rather than our own. Perhaps that's not possible; I just imagine they will continue to optimize things over time. What do you think about this:
{code:java}
private volatile Recycler recycler;

public RefCnt(Recycler recycler) {
  this.recycler =
    Preconditions.checkNotNull(recycler, "recycler cannot be null, pass NONE instead");
  this.leak = recycler == ByteBuffAllocator.NONE ? null : detector.track(this);
}

@Override
protected final void deallocate() {
  if (recycler == null) {
    return;
  }
  this.recycler.free();
  this.recycler = null; // NOTE THIS
  if (leak != null) {
    this.leak.close(this);
  }
}

// Call this in ByteBuff.java checkRefCount instead
public boolean isRecycled() {
  return recycler == null;
}
{code}
No extra memory, fast, safe. We don't have to worry about synchronizing deallocate, because it will only be called once by AbstractReferenceCounted; that is handled by the AtomicIntegerFieldUpdater usage in ReferenceCountUpdater.
[jira] [Commented] (HBASE-27730) Optimize reference counting in off-heap ByteBuff
[ https://issues.apache.org/jira/browse/HBASE-27730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17701740#comment-17701740 ] Duo Zhang commented on HBASE-27730: --- I think we could copy the code in AbstractReferenceCounted into RefCnt directly and use the same trick on the updater field, so we do not need an extra boolean field?
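The "trick on the updater field" mentioned here is netty's pattern of keeping the count in a plain {{volatile int}} and driving it through a single shared {{AtomicIntegerFieldUpdater}}, so no per-instance {{AtomicInteger}} object (and no extra boolean) is needed. A minimal standalone illustration of just that pattern, with made-up class names:

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Illustrative sketch of the field-updater pattern used by netty's
// AbstractReferenceCounted: one static updater is shared by all instances
// and performs atomic updates on an ordinary volatile int field.
public class UpdaterTrick {
  private static final AtomicIntegerFieldUpdater<UpdaterTrick> REFCNT_UPDATER =
    AtomicIntegerFieldUpdater.newUpdater(UpdaterTrick.class, "refCnt");
  private volatile int refCnt = 1;

  public int retain() { return REFCNT_UPDATER.incrementAndGet(this); }

  // Returns true when the count reaches zero, i.e. the caller should deallocate.
  public boolean release() { return REFCNT_UPDATER.decrementAndGet(this) == 0; }

  public static void main(String[] args) {
    UpdaterTrick t = new UpdaterTrick();
    t.retain();                      // refCnt: 1 -> 2
    System.out.println(t.release()); // prints "false" (refCnt: 2 -> 1)
    System.out.println(t.release()); // prints "true"  (refCnt: 1 -> 0)
  }
}
```

Per the JMH-style measurements elsewhere in this issue, the win netty got from this route came largely from shrinking the call stack enough for inlining, which is the part the issue description notes HBase can't easily replicate.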