I used PE (PerformanceEvaluation) to generate 10M row tables with one family
and either 1, 10, 20, 50, or 100 values per row (unique column qualifiers).
An increase in wall clock time was noticeable, for example:

1.6.0

time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20
--nomapred scan 2
real 1m20.179s
user 0m45.470s
sys 0m16.572s

2.2.5

time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20
--nomapred scan 2
real 1m31.016s
user 0m48.658s
sys 0m16.965s

It didn't really make a difference whether I used 1 thread or 4 or 10; the
delta was proportionally about the same. I picked two threads in the end so
I'd have enough time to launch async-profiler twice in another shell, to
capture flamegraph and call tree output, respectively. async-profiler
captured 10 seconds at steady state per test case. Upon first inspection,
what jumps out is an increasing proportion of CPU time spent in GC in 2.2.5
vs 1.6.0. The difference grows as the number of columns per row increases.
There is little apparent difference at 1 column, but a 2x or more
difference at 20 columns, and a 10x or more difference at 100 columns,
eyeballing the charts while flipping back and forth between browser
windows. This seems more than coincidental but obviously calls for capture
and analysis of a GC trace, with JFR. Will do that next.
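For reference, the profiling step above can be sketched roughly as below. This is
a hedged illustration, not the exact commands I ran: the regionserver PID lookup
and output paths are hypothetical, and it assumes async-profiler's profiler.sh
driver with its -d (duration), -o (output format), and -f (output file) options.

```shell
# Hypothetical: find the regionserver PID (adjust to your environment)
RS_PID=$(jps | awk '/HRegionServer/ {print $1}')

# Capture 10 seconds of CPU samples at steady state, once as a
# flamegraph (SVG) and once as a call tree, per test case
./profiler.sh -d 10 -f /tmp/rs-cpu-flame.svg "$RS_PID"
./profiler.sh -d 10 -o tree -f /tmp/rs-cpu-tree.html "$RS_PID"
```

The two runs per test case are what made two threads convenient: the scan had
to stay in steady state long enough for both captures.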

JVM:
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (Zulu 8.42.0.21-CA-macosx) (build 1.8.0_232-b18)
OpenJDK 64-Bit Server VM (Zulu 8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)

Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC -XX:+AlwaysPreTouch
-XX:+UseNUMA -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
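For the planned GC trace, a minimal capture setup might look like the sketch
below. The flag names assume HotSpot 8 GC-logging syntax, and the jcmd
recording assumes the JVM build ships Flight Recorder; the paths and duration
are hypothetical.

```shell
# GC logging flags to append to the regionserver JVM (HotSpot 8 syntax):
#   -Xloggc:/tmp/rs-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
#   -XX:+PrintGCApplicationStoppedTime

# Or a time-bounded JFR recording via jcmd, if Flight Recorder is available:
jcmd "$RS_PID" JFR.start duration=120s filename=/tmp/rs.jfr settings=profile
```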


On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <[email protected]> wrote:

> This is promising, thanks a lot. Testing with hbase 2.2.5 shows an
> improvement, but we're not there yet.
>
> As reported earlier, hbase 2.1.0 was about 60% slower than hbase 1.2.0
> in a test that simply scans all the regions in parallel without any
> filter. A test with hbase 2.2.5 shows it to be about 40% slower than
> 1.2.0. So that is better than 2.1.0, but still substantially slower
> than what hbase 1.2.0 was.
>
> As before, I tested this both on a 3 node cluster as well as with a
> unit test using HBaseTestingUtility. Both tests show very similar
> relative differences.
>
> Jan
>
> On Thu, Jun 11, 2020 at 2:16 PM Anoop John <[email protected]> wrote:
> >
> > In another mail thread Zheng Hu brought up an important Jira fix:
> > https://issues.apache.org/jira/browse/HBASE-21657
> > Can you please check with this once?
> >
> > Anoop
> >
> >
> > On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <[email protected]> wrote:
> >
> > > On Sun, Jun 7, 2020 at 7:49 AM Anoop John <[email protected]>
> wrote:
> > > > As per the above configs, it looks like Bucket Cache is not being
> used.
> > > > Only on heap LRU cache in use.
> > >
> > > True (but it is large enough to hold everything, so I don't think it
> > > matters).
> > >
> > > > @Jan - Is it possible for you to test with off heap Bucket Cache?
> > >  Config
> > > > bucket cache off heap mode with size ~7.5 GB
> > >
> > > I did a quick test but it seems not to make a measurable difference.
> > > If anything, it is actually slightly slower even. I see 100% hit ratio
> > > in the L1
> > > LruBlockCache and effectively also 100% in the L2 BucketCache (hit
> > > ratio is not yet at 100% but hits increase with every test and misses
> > > do not).
> > >
> > > Given that the LruBlockCache was already large enough to cache all the
> > > data anyway, I did not expect this to help either, to be honest.
> > >
> > > > Do you have any DataBlockEncoding enabled on the CF?
> > >
> > > Yes, FAST_DIFF. But this is of course true in both the tests with
> > > hbase2 and hbase1.
> > >
> > > Jan
> > >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
