Thanks a lot for doing this test. Its results are encouraging. My
non-cluster testing was more focused on full table scans, which YCSB does
not do. Full table scans are only done by batch jobs, so if they are a bit
slower it is not much of a problem, but in our case they seemed a lot
slower.

I agree that testing overall performance in a non-cluster environment is
not a good idea, but it doesn't seem unreasonable when focusing on a
specific algorithm? I only started testing this way after noticing
problems in cluster-based tests.

In the meantime I tried a variant of my test that uses the same number of
cells spread over far fewer rows: 1,500 rows of 10K (small) cells each. In
that test the difference in scan speed is much smaller (HBase 2.2.4 being
only about 10% slower). This suggests that the slowdown in HBase 2 might
be due to work that happens per scanned row.
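
For reference, the wide-row variant is populated along these lines (the
table and family names here are just illustrative, not our real schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    byte[] family = Bytes.toBytes("f"); // illustrative family name
    try (Connection conn = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator =
             conn.getBufferedMutator(TableName.valueOf("widetest"))) {
      // 1,500 rows of 10K small cells each = 15M cells in total
      for (int row = 0; row < 1500; row++) {
        Put put = new Put(Bytes.toBytes(String.format("row%05d", row)));
        for (int col = 0; col < 10_000; col++) {
          put.addColumn(family, Bytes.toBytes(String.format("c%05d", col)),
              Bytes.toBytes("x")); // small cell value
        }
        mutator.mutate(put);
      }
    }
  }
}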

Anyway, we'll do some further testing, also with our normal workloads on
clusters, and try to further analyse it.


On Fri, May 22, 2020 at 1:52 AM Andrew Purtell <apurt...@apache.org> wrote:

> It depends on what you are measuring and how. I test every so often with
> YCSB, which admittedly is not representative of real-world workloads but
> is widely used for apples-to-apples testing among datastores, and we can
> apply the same test tool and methodology to different versions to get
> meaningful results. I also test on real clusters. The single all-in-one
> process zk+master+regionserver "minicluster" cannot provide you meaningful
> performance data; only distributed clusters can. Some defaults are also
> important to change, like the number of RPC handlers you plan to use in
> production.
>
> After reading this thread I tested 1.6.0 and 2.2.4 using my standard
> methodology, described below. 2.2.4 is better, often significantly better,
> in most measures in most cases.
>
> Cluster: AWS Amazon Linux AMI, 1 x master, 5 x regionserver, 1 x client,
> m5d.4xlarge
> Hadoop: 2.10.0, ZK: 3.4.14
> JVM: 8u252 Shenandoah (provided by AMI)
> GC: -XX:+UseShenandoahGC -Xms31g -Xmx31g -XX:+AlwaysPreTouch -XX:+UseNUMA
> -XX:-UseBiasedLocking
> Non-default settings: hbase.regionserver.handler.count=256
> hbase.ipc.server.callqueue.type=codel dfs.client.read.shortcircuit=true
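>
> For anyone reproducing this: these non-default settings go into
> hbase-site.xml on the servers (short-circuit reads also need the matching
> HDFS-side configuration); roughly:
>
>   <property>
>     <name>hbase.regionserver.handler.count</name>
>     <value>256</value>
>   </property>
>   <property>
>     <name>hbase.ipc.server.callqueue.type</name>
>     <value>codel</value>
>   </property>
>   <property>
>     <name>dfs.client.read.shortcircuit</name>
>     <value>true</value>
>   </property>
>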
> Methodology:
>
>   1. Create 100M row base table (ROW_INDEX_V1 encoding, ZSTANDARD
> compression)
>   2. Snapshot base table
>   3. Enable balancer
>   4. Clone test table from base table snapshot
>   5. Balance, then disable balancer
>   6. Run YCSB 0.18 workload --operationcount 1000000 (1M operations)
> -threads 200 -target 100000 (100K ops/sec)
>   7. Drop test table
>   8. Back to step 3 until all workloads complete
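>
> For anyone scripting the per-workload steps (3-8), the table handling can
> be driven through the HBase Admin API; a rough sketch, with made-up table
> and snapshot names:
>
>   // steps 3-8 for one workload; "base_snapshot" / "test" are made up
>   try (Connection conn =
>           ConnectionFactory.createConnection(HBaseConfiguration.create());
>        Admin admin = conn.getAdmin()) {
>     TableName test = TableName.valueOf("test");
>     admin.balancerSwitch(true, false);           // 3. enable balancer
>     admin.cloneSnapshot("base_snapshot", test);  // 4. clone test table
>     admin.balance();                             // 5. balance, then...
>     admin.balancerSwitch(false, false);          //    ...disable balancer
>     // 6. run the YCSB workload externally against table "test"
>     admin.disableTable(test);                    // 7. drop test table
>     admin.deleteTable(test);
>   }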
>
> Workload A 1.6.0 2.2.4 Difference
> [OVERALL], RunTime(ms) 20552 20655 100.50%
> [OVERALL], Throughput(ops/sec) 97314 96829 99.50%
> [READ], AverageLatency(us) 591 418 70.75%
> [READ], MinLatency(us) 191 201 105.24%
> [READ], MaxLatency(us) 146047 80895 55.39%
> [READ], 95thPercentileLatency(us) 3013 542 17.99%
> [READ], 99thPercentileLatency(us) 5427 2559 47.15%
> [UPDATE], AverageLatency(us) 833 460 55.23%
> [UPDATE], MinLatency(us) 348 230 66.09%
> [UPDATE], MaxLatency(us) 149887 80959 54.01%
> [UPDATE], 95thPercentileLatency(us) 3403 607 17.84%
> [UPDATE], 99thPercentileLatency(us) 5751 3045 52.95%
>
> Workload B 1.6.0 2.2.4 Difference
> [OVERALL], RunTime(ms) 20555 20679 100.60%
> [OVERALL], Throughput(ops/sec) 97300 96716 99.40%
> [READ], AverageLatency(us) 417 427 102.54%
> [READ], MinLatency(us) 179 194 108.38%
> [READ], MaxLatency(us) 124095 76799 61.89%
> [READ], 95thPercentileLatency(us) 498 564 113.25%
> [READ], 99thPercentileLatency(us) 3679 3785 102.88%
> [UPDATE], AverageLatency(us) 665 488 73.28%
> [UPDATE], MinLatency(us) 380 237 62.37%
> [UPDATE], MaxLatency(us) 95167 76287 80.16%
> [UPDATE], 95thPercentileLatency(us) 718 629 87.60%
> [UPDATE], 99thPercentileLatency(us) 4015 4023 100.20%
>
> Workload C 1.6.0 2.2.4 Difference
> [OVERALL], RunTime(ms) 20525 20648 100.60%
> [OVERALL], Throughput(ops/sec) 97442 96862 99.40%
> [READ], AverageLatency(us) 385 382 99.07%
> [READ], MinLatency(us) 178 198 111.24%
> [READ], MaxLatency(us) 74943 76415 101.96%
> [READ], 95thPercentileLatency(us) 437 477 109.15%
> [READ], 99thPercentileLatency(us) 3349 2219 66.26%
>
> Workload D 1.6.0 2.2.4 Difference
> [OVERALL], RunTime(ms) 20538 20644 100.52%
> [OVERALL], Throughput(ops/sec) 97380 96880 99.49%
> [READ], AverageLatency(us) 372 393 105.49%
> [READ], MinLatency(us) 116 137 118.10%
> [READ], MaxLatency(us) 107391 73215 68.18%
> [READ], 95thPercentileLatency(us) 916 983 107.31%
> [READ], 99thPercentileLatency(us) 3183 2473 77.69%
> [INSERT], AverageLatency(us) 732 526 71.86%
> [INSERT], MinLatency(us) 418 289 69.14%
> [INSERT], MaxLatency(us) 109183 80255 73.51%
> [INSERT], 95thPercentileLatency(us) 823 724 87.97%
> [INSERT], 99thPercentileLatency(us) 3961 3003 75.81%
>
> Workload E 1.6.0 2.2.4 Difference
> [OVERALL], RunTime(ms) 120157 119728 99.64%
> [OVERALL], Throughput(ops/sec) 16645 16705 100.36%
> [INSERT], AverageLatency(us) 11787 11102 94.19%
> [INSERT], MinLatency(us) 459 296 64.49%
> [INSERT], MaxLatency(us) 172927 131583 76.09%
> [INSERT], 95thPercentileLatency(us) 32143 28911 89.94%
> [INSERT], 99thPercentileLatency(us) 36063 31423 87.13%
> [SCAN], AverageLatency(us) 11891 11875 99.87%
> [SCAN], MinLatency(us) 219 255 116.44%
> [SCAN], MaxLatency(us) 179071 188671 105.36%
> [SCAN], 95thPercentileLatency(us) 32639 29615 90.74%
> [SCAN], 99thPercentileLatency(us) 36671 32175 87.74%
>
> Workload F 1.6.0 2.2.4 Difference
> [OVERALL], RunTime(ms) 20766 20655 99.47%
> [OVERALL], Throughput(ops/sec) 96311 96829 100.54%
> [READ], AverageLatency(us) 1242 591 47.61%
> [READ], MinLatency(us) 183 212 115.85%
> [READ], MaxLatency(us) 80959 90111 111.30%
> [READ], 95thPercentileLatency(us) 3397 1511 44.48%
> [READ], 99thPercentileLatency(us) 4515 3063 67.84%
> [READ-MODIFY-WRITE], AverageLatency(us) 2768 1193 43.10%
> [READ-MODIFY-WRITE], MinLatency(us) 596 496 83.22%
> [READ-MODIFY-WRITE], MaxLatency(us) 128639 112191 87.21%
> [READ-MODIFY-WRITE], 95thPercentileLatency(us) 7071 3263 46.15%
> [READ-MODIFY-WRITE], 99thPercentileLatency(us) 9919 6547 66.00%
> [UPDATE], AverageLatency(us) 1522 601 39.46%
> [UPDATE], MinLatency(us) 369 241 65.31%
> [UPDATE], MaxLatency(us) 89855 35775 39.81%
> [UPDATE], 95thPercentileLatency(us) 3691 1659 44.95%
> [UPDATE], 99thPercentileLatency(us) 5003 3513 70.22%
>
>
> On Wed, May 20, 2020 at 9:10 AM Bruno Dumon <bru...@ngdata.com> wrote:
>
> > Hi,
> >
> > I think that (idle) background threads would not make much of a
> > difference to the raw speed of iterating over the cells of a single
> > region served from the block cache. I started testing this way after
> > noticing a slowdown on a real installation. I can imagine that various
> > improvements in HBase 2 in other areas partly compensate for the impact
> > of what I see in this narrow test, but I still found these results
> > remarkable enough.
> >
> > On Wed, May 20, 2020 at 4:33 PM 张铎(Duo Zhang) <palomino...@gmail.com>
> > wrote:
> >
> > > Just saw that your tests were run in local mode...
> > >
> > > Local mode is not for production, so I do not see much value in
> > > improving the performance of HBase in local mode. Maybe we just have
> > > more threads in HBase 2 by default, which makes it slow on a single
> > > machine, not sure...
> > >
> > > Could you please test it on a distributed cluster? If it is still a
> > > problem, you can open an issue and I believe committers will offer to
> > > help verify the problem.
> > >
> > > Thanks.
> > >
> > > Bruno Dumon <bru...@ngdata.com> wrote on Wed, May 20, 2020 at 4:45 PM:
> > >
> > > > For the scan test, there is only minimal RPC involved; I verified
> > > > through ScanMetrics that there are only 2 RPC calls for the scan. It
> > > > is essentially testing how fast the region server is able to iterate
> > > > over the cells. There are no delete cells, the table is fully
> > > > compacted (1 storage file), and all data fits into the block cache.
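> > > >
> > > > For reference, the RPC count check looks roughly like this (assuming
> > > > an open Table; on 2.x the ScanMetrics come from the ResultScanner):
> > > >
> > > >   Scan scan = new Scan();
> > > >   scan.setScanMetricsEnabled(true);
> > > >   try (ResultScanner scanner = table.getScanner(scan)) {
> > > >     for (Result r : scanner) {
> > > >       // consume all results
> > > >     }
> > > >     ScanMetrics metrics = scanner.getScanMetrics();
> > > >     System.out.println("RPC calls: " + metrics.countOfRPCcalls.get());
> > > >   }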
> > > >
> > > > For the sequential gets (i.e. one get after the other,
> > > > non-multi-threaded), I tried the BlockingRpcClient. It is about 13%
> > > > faster than the Netty RPC client, but the same code on 1.6 is still
> > > > 90% faster. Concretely, my test code does 100K gets of the same row
> > > > in a loop. On HBase 2.2.4 with the BlockingRpcClient this takes 9
> > > > seconds on average; with HBase 1.6 it takes 4.75 seconds.
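> > > >
> > > > Switching the client is just a configuration setting; my get loop is
> > > > essentially this (table and row names are illustrative):
> > > >
> > > >   Configuration conf = HBaseConfiguration.create();
> > > >   conf.set("hbase.rpc.client.impl",
> > > >       "org.apache.hadoop.hbase.ipc.BlockingRpcClient");
> > > >   try (Connection conn = ConnectionFactory.createConnection(conf);
> > > >        Table table = conn.getTable(TableName.valueOf("testtable"))) {
> > > >     Get get = new Get(Bytes.toBytes("row1"));
> > > >     long start = System.nanoTime();
> > > >     for (int i = 0; i < 100_000; i++) {
> > > >       table.get(get); // 100K sequential gets of the same row
> > > >     }
> > > >     System.out.println("100K gets took "
> > > >         + (System.nanoTime() - start) / 1_000_000 + " ms");
> > > >   }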
> > > >
> > > > On Wed, May 20, 2020 at 9:27 AM Debraj Manna
> > > > <subharaj.ma...@gmail.com> wrote:
> > > >
> > > > > I cross-posted this in the Slack channel as I was also observing
> > > > > something quite similar. This is the suggestion I received;
> > > > > reposting here for completeness.
> > > > >
> > > > > zhangduo  12:15 PM
> > > > > Does get also have the same performance drop, or only scan?
> > > > > zhangduo  12:18 PM
> > > > > For the RPC layer, hbase2 defaults to Netty while hbase1 uses plain
> > > > > Java sockets. You can set the RPC client to BlockingRpcClient to
> > > > > see if the performance comes back.
> > > > >
> > > > > On Mon, May 18, 2020 at 7:58 PM Bruno Dumon <bru...@ngdata.com>
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > We are looking into migrating from HBase 1.2.x to HBase 2.1.x
> > > > > > (on Cloudera CDH).
> > > > > >
> > > > > > It seems like HBase 2 is slower than HBase 1 for both reading and
> > > > > > writing.
> > > > > >
> > > > > > I did a simple test, using HBase 1.6.0 and HBase 2.2.4 (the
> > > > > > standard OSS versions), running in local mode (no HDFS) on my
> > > > > > computer:
> > > > > >
> > > > > >  * ingested 15M single-KV rows
> > > > > >  * full table scan over them
> > > > > >  * to remove RPC latency as much as possible, the scan had the
> > > > > > filter 'new RandomRowFilter(0.0001f)', caching set to 10K (more
> > > > > > than the number of rows returned), and
> > > > > > hbase.cells.scanned.per.heartbeat.check set to 100M. This scan
> > > > > > returns about 1500 rows/KVs (see the sketch below the list).
> > > > > >  * HBase configured with hbase.regionserver.regionSplitLimit=1 to
> > > > > > remove influence from region splitting
> > > > > >
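> > > > > > Roughly, the scan side of the test looks like this (assuming an
> > > > > > open Table; the heartbeat-check setting is applied to the
> > > > > > Configuration of the local-mode instance before it is started):
> > > > > >
> > > > > >   // conf.setLong("hbase.cells.scanned.per.heartbeat.check",
> > > > > >   //     100_000_000L);  // set on the (local-mode) server side
> > > > > >   Scan scan = new Scan();
> > > > > >   scan.setFilter(new RandomRowFilter(0.0001f)); // ~1 in 10K rows
> > > > > >   scan.setCaching(10_000); // more than the rows returned
> > > > > >   long start = System.nanoTime();
> > > > > >   int rows = 0;
> > > > > >   try (ResultScanner scanner = table.getScanner(scan)) {
> > > > > >     for (Result r : scanner) {
> > > > > >       rows++;
> > > > > >     }
> > > > > >   }
> > > > > >   System.out.println(rows + " rows in "
> > > > > >       + (System.nanoTime() - start) / 1_000_000 + " ms");
> > > > > >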
> > > > > > In this test, scanning seems over 50% slower on HBase 2 compared
> > > > > > to HBase 1.
> > > > > >
> > > > > > I tried flushing & major-compacting before doing the scan, in
> > > > > > which case the scan finishes faster, but the difference between
> > > > > > the two HBase versions stays about the same.
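> > > > > >
> > > > > > (That was done through the Admin API, roughly as follows; the
> > > > > > table name is illustrative:)
> > > > > >
> > > > > >   try (Admin admin = conn.getAdmin()) {
> > > > > >     TableName name = TableName.valueOf("testtable");
> > > > > >     admin.flush(name);        // flush memstore to store files
> > > > > >     admin.majorCompact(name); // asynchronous; poll until done
> > > > > >     while (admin.getCompactionState(name)
> > > > > >         != CompactionState.NONE) {
> > > > > >       Thread.sleep(1000);
> > > > > >     }
> > > > > >   }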
> > > > > >
> > > > > > The test code is written in Java, using the client libraries
> > > > > > from the corresponding HBase versions.
> > > > > >
> > > > > > Besides the above scan test, I also tested write performance
> > > > > > through BufferedMutator, scans without the filter (thus passing
> > > > > > much more data over the RPC), and sequential random Get requests.
> > > > > > They all seem quite a bit slower on HBase 2. Interestingly, using
> > > > > > the HBase 1.6 client to talk to the HBase 2.2.4 server is faster
> > > > > > than using the HBase 2.2.4 client, so it seems the RPC latency of
> > > > > > the new client is worse.
> > > > > >
> > > > > > So my question is: is such a large performance drop to be
> > > > > > expected when migrating to HBase 2? Are there any special
> > > > > > settings we need to be aware of?
> > > > > >
> > > > > > Thanks!
> > > > >
> > > >
> > > >
> > > > --
> > > > Bruno Dumon
> > > > NGDATA
> > > > http://www.ngdata.com/
> > > >
> > >
> >
> >
> > --
> > Bruno Dumon
> > NGDATA
> > http://www.ngdata.com/
> >
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>


-- 
Bruno Dumon
NGDATA
http://www.ngdata.com/
