Thank you, Lars. I suppose it is not possible to characterize the problem in enough anonymized detail to provide some clues for follow-up, or you would have done it.
> On May 22, 2020, at 6:01 AM, Lars Francke <[email protected]> wrote:
>
> I've refrained from commenting here so far because I cannot share much/any
> data but I can also report that we've seen worse performance with HBase 2
> (similar/same settings and same workload, same hardware). This is on a 40+
> node cluster.
> Unfortunately, I wasn't tasked with debugging. The customer decided to stay
> on 1.x for this reason.
>
>> On Fri, May 22, 2020 at 1:52 AM Andrew Purtell <[email protected]> wrote:
>>
>> It depends what you are measuring and how. I test every so often with YCSB,
>> which admittedly is not representative of real world workloads but is
>> widely used for apples to apples testing among datastores, and we can apply
>> the same test tool and test methodology to different versions to get
>> meaningful results. I also test on real clusters. The single all-in-one
>> process zk+master+regionserver "minicluster" cannot provide you meaningful
>> performance data. Only distributed clusters can provide meaningful results.
>> Some defaults are also important to change, like the number of RPC handlers
>> you plan to use in production.
>>
>> After reading this thread I tested 1.6.0 and 2.2.4 using my standard
>> methodology, described below. 2.2.4 is better, often significantly better,
>> in most measures in most cases.
>>
>> Cluster: AWS Amazon Linux AMI, 1 x master, 5 x regionserver, 1 x client, m5d.4xlarge
>> Hadoop: 2.10.0, ZK: 3.4.14
>> JVM: 8u252 shenandoah (provided by AMI)
>> GC: -XX:+UseShenandoahGC -Xms31g -Xmx31g -XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking
>> Non-default settings: hbase.regionserver.handler.count=256 hbase.ipc.server.callqueue.type=codel dfs.client.read.shortcircuit=true
>>
>> Methodology:
>>
>> 1. Create 100M row base table (ROW_INDEX_V1 encoding, ZSTANDARD compression)
>> 2. Snapshot base table
>> 3. Enable balancer
>> 4. Clone test table from base table snapshot
>> 5. Balance, then disable balancer
>> 6. Run YCSB 0.18 workload --operationcount 1000000 (1M rows) -threads 200 -target 100000 (100k ops/sec)
>> 7. Drop test table
>> 8. Back to step 3 until all workloads complete
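(For anyone who wants to script steps 1-5 rather than drive them from the shell, a rough sketch against the HBase 2 Admin API could look like the following. The table, column family, and snapshot names are placeholders, not the ones used in the run above, and loading the 100M rows is assumed to happen separately, e.g. via the YCSB load phase.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrepareTestTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          // Step 1: base table with ROW_INDEX_V1 encoding and ZSTANDARD compression
          // ("ycsb_base" and family "f" are placeholder names; load the 100M rows separately).
          TableName base = TableName.valueOf("ycsb_base");
          admin.createTable(TableDescriptorBuilder.newBuilder(base)
              .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f"))
                  .setDataBlockEncoding(DataBlockEncoding.ROW_INDEX_V1)
                  .setCompressionType(Compression.Algorithm.ZSTD)
                  .build())
              .build());
          // Step 2: snapshot the loaded base table.
          admin.snapshot("ycsb_base_snap", base);
          // Step 3: enable the balancer.
          admin.balancerSwitch(true, true);
          // Step 4: clone the test table from the snapshot.
          TableName test = TableName.valueOf("ycsb_test");
          admin.cloneSnapshot("ycsb_base_snap", test);
          // Step 5: balance, then disable the balancer before the timed YCSB run.
          admin.balance();
          admin.balancerSwitch(false, true);
        }
      }
    }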
>>
>> Workload A                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20552    20655    100.50%
>> [OVERALL], Throughput(ops/sec)          97314    96829    99.50%
>> [READ], AverageLatency(us)              591      418      70.75%
>> [READ], MinLatency(us)                  191      201      105.24%
>> [READ], MaxLatency(us)                  146047   80895    55.39%
>> [READ], 95thPercentileLatency(us)       3013     542      17.99%
>> [READ], 99thPercentileLatency(us)       5427     2559     47.15%
>> [UPDATE], AverageLatency(us)            833      460      55.23%
>> [UPDATE], MinLatency(us)                348      230      66.09%
>> [UPDATE], MaxLatency(us)                149887   80959    54.01%
>> [UPDATE], 95thPercentileLatency(us)     3403     607      17.84%
>> [UPDATE], 99thPercentileLatency(us)     5751     3045     52.95%
>>
>> Workload B                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20555    20679    100.60%
>> [OVERALL], Throughput(ops/sec)          97300    96716    99.40%
>> [READ], AverageLatency(us)              417      427      102.54%
>> [READ], MinLatency(us)                  179      194      108.38%
>> [READ], MaxLatency(us)                  124095   76799    61.89%
>> [READ], 95thPercentileLatency(us)       498      564      113.25%
>> [READ], 99thPercentileLatency(us)       3679     3785     102.88%
>> [UPDATE], AverageLatency(us)            665      488      73.28%
>> [UPDATE], MinLatency(us)                380      237      62.37%
>> [UPDATE], MaxLatency(us)                95167    76287    80.16%
>> [UPDATE], 95thPercentileLatency(us)     718      629      87.60%
>> [UPDATE], 99thPercentileLatency(us)     4015     4023     100.20%
>>
>> Workload C                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20525    20648    100.60%
>> [OVERALL], Throughput(ops/sec)          97442    96862    99.40%
>> [READ], AverageLatency(us)              385      382      99.07%
>> [READ], MinLatency(us)                  178      198      111.24%
>> [READ], MaxLatency(us)                  74943    76415    101.96%
>> [READ], 95thPercentileLatency(us)       437      477      109.15%
>> [READ], 99thPercentileLatency(us)       3349     2219     66.26%
>>
>> Workload D                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20538    20644    100.52%
>> [OVERALL], Throughput(ops/sec)          97380    96880    99.49%
>> [READ], AverageLatency(us)              372      393      105.49%
>> [READ], MinLatency(us)                  116      137      118.10%
>> [READ], MaxLatency(us)                  107391   73215    68.18%
>> [READ], 95thPercentileLatency(us)       916      983      107.31%
>> [READ], 99thPercentileLatency(us)       3183     2473     77.69%
>> [INSERT], AverageLatency(us)            732      526      71.86%
>> [INSERT], MinLatency(us)                418      289      69.14%
>> [INSERT], MaxLatency(us)                109183   80255    73.51%
>> [INSERT], 95thPercentileLatency(us)     823      724      87.97%
>> [INSERT], 99thPercentileLatency(us)     3961     3003     75.81%
>>
>> Workload E                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  120157   119728   99.64%
>> [OVERALL], Throughput(ops/sec)          16645    16705    100.36%
>> [INSERT], AverageLatency(us)            11787    11102    94.19%
>> [INSERT], MinLatency(us)                459      296      64.49%
>> [INSERT], MaxLatency(us)                172927   131583   76.09%
>> [INSERT], 95thPercentileLatency(us)     32143    28911    89.94%
>> [INSERT], 99thPercentileLatency(us)     36063    31423    87.13%
>> [SCAN], AverageLatency(us)              11891    11875    99.87%
>> [SCAN], MinLatency(us)                  219      255      116.44%
>> [SCAN], MaxLatency(us)                  179071   188671   105.36%
>> [SCAN], 95thPercentileLatency(us)       32639    29615    90.74%
>> [SCAN], 99thPercentileLatency(us)       36671    32175    87.74%
>>
>> Workload F                                      1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                          20766    20655    99.47%
>> [OVERALL], Throughput(ops/sec)                  96311    96829    100.54%
>> [READ], AverageLatency(us)                      1242     591      47.61%
>> [READ], MinLatency(us)                          183      212      115.85%
>> [READ], MaxLatency(us)                          80959    90111    111.30%
>> [READ], 95thPercentileLatency(us)               3397     1511     44.48%
>> [READ], 99thPercentileLatency(us)               4515     3063     67.84%
>> [READ-MODIFY-WRITE], AverageLatency(us)         2768     1193     43.10%
>> [READ-MODIFY-WRITE], MinLatency(us)             596      496      83.22%
>> [READ-MODIFY-WRITE], MaxLatency(us)             128639   112191   87.21%
>> [READ-MODIFY-WRITE], 95thPercentileLatency(us)  7071     3263     46.15%
>> [READ-MODIFY-WRITE], 99thPercentileLatency(us)  9919     6547     66.00%
>> [UPDATE], AverageLatency(us)                    1522     601      39.46%
>> [UPDATE], MinLatency(us)                        369      241      65.31%
>> [UPDATE], MaxLatency(us)                        89855    35775    39.81%
>> [UPDATE], 95thPercentileLatency(us)             3691     1659     44.95%
>> [UPDATE], 99thPercentileLatency(us)             5003     3513     70.22%
>>
>>> On Wed, May 20, 2020 at 9:10 AM Bruno Dumon <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I think that (idle) background threads would not make much of a difference
>>> to the raw speed of iterating over cells of a single region served from the
>>> block cache. I started testing this way after noticing a slowdown on a real
>>> installation. I can imagine that there have been various improvements in
>>> hbase 2 in other areas which will partly compensate for the impact of what I
>>> notice in this narrow test, but still I found these results remarkable
>>> enough.
>>>
>>> On Wed, May 20, 2020 at 4:33 PM 张铎(Duo Zhang) <[email protected]> wrote:
>>>
>>>> Just saw that your tests were on local mode...
>>>>
>>>> Local mode is not for production so I do not see any related issues for
>>>> improving the performance for hbase in local mode. Maybe we just have more
>>>> threads in HBase 2 by default which makes it slow on a single machine, not
>>>> sure...
>>>>
>>>> Could you please test it on a distributed cluster? If it is still a
>>>> problem, you can open an issue and I believe there will be committers
>>>> offering to help verify the problem.
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, May 20, 2020 at 4:45 PM Bruno Dumon <[email protected]> wrote:
>>>>
>>>>> For the scan test, there is only minimal rpc involved; I verified through
>>>>> ScanMetrics that there are only 2 rpc calls for the scan. It is essentially
>>>>> testing how fast the region server is able to iterate over the cells. There
>>>>> are no delete cells, and the table is fully compacted (1 storage file), and
>>>>> all data fits into the block cache.
>>>>>
>>>>> For the sequential gets (i.e. one get after the other, non-multi-threaded),
>>>>> I tried the BlockingRpcClient. It is about 13% faster than the netty rpc
>>>>> client. But the same code on 1.6 is still 90% faster. Concretely, my test
>>>>> code does 100K gets of the same row in a loop. On HBase 2.2.4 with the
>>>>> BlockingRpcClient this takes on average 9 seconds, with HBase 1.6 it takes
>>>>> 4.75 seconds.
>>>>>
>>>>> On Wed, May 20, 2020 at 9:27 AM Debraj Manna <[email protected]> wrote:
>>>>>
>>>>>> I cross-posted this in the slack channel as I was also observing something
>>>>>> quite similar. This is the suggestion I received. Reposting here for
>>>>>> completeness.
>>>>>>
>>>>>> zhangduo 12:15 PM
>>>>>> Does get also have the same performance drop, or only scan?
>>>>>> zhangduo 12:18 PM
>>>>>> For the rpc layer, hbase2 defaults to netty while hbase1 is pure java
>>>>>> socket. You can set the rpc client to BlockingRpcClient to see if the
>>>>>> performance is back.
>>>>>>
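(Switching the client back to the blocking RPC implementation that zhangduo mentions is a one-line client-side configuration change. A minimal sketch of that switch, combined with a repeated-Get loop like the 100K-gets test Bruno describes above, might look like the following; the hbase.rpc.client.impl key is the standard switch as far as I know, and the table and row key names are placeholders.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BlockingGetLoop {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Use the pre-2.0 blocking socket RPC client instead of the Netty default.
        conf.set("hbase.rpc.client.impl",
            "org.apache.hadoop.hbase.ipc.BlockingRpcClient");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("testtable"))) { // placeholder table
          byte[] row = Bytes.toBytes("row-0");                              // placeholder row key
          long start = System.nanoTime();
          for (int i = 0; i < 100_000; i++) {
            table.get(new Get(row)); // 100K sequential gets of the same row
          }
          System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
        }
      }
    }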
>>>>>> On Mon, May 18, 2020 at 7:58 PM Bruno Dumon <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are looking into migrating from HBase 1.2.x to HBase 2.1.x (on Cloudera
>>>>>>> CDH).
>>>>>>>
>>>>>>> It seems like HBase 2 is slower than HBase 1 for both reading and writing.
>>>>>>>
>>>>>>> I did a simple test, using HBase 1.6.0 and HBase 2.2.4 (the standard OSS
>>>>>>> versions), running in local mode (no HDFS) on my computer:
>>>>>>>
>>>>>>> * ingested 15M single-KV rows
>>>>>>> * full table scan over them
>>>>>>> * to remove rpc latency as much as possible, the scan had a filter 'new
>>>>>>>   RandomRowFilter(0.0001f)', caching set to 10K (more than the number of
>>>>>>>   rows returned) and hbase.cells.scanned.per.heartbeat.check set to 100M.
>>>>>>>   This scan returns about 1500 rows/KVs.
>>>>>>> * HBase configured with hbase.regionserver.regionSplitLimit=1 to remove
>>>>>>>   influence from region splitting
>>>>>>>
>>>>>>> In this test, scanning seems over 50% slower on HBase 2 compared to HBase 1.
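(A scan set up along those lines would look roughly like this sketch; the table name is a placeholder, and hbase.cells.scanned.per.heartbeat.check and hbase.regionserver.regionSplitLimit are server-side settings rather than something configured on the Scan itself.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.RandomRowFilter;

    public class FilteredScanTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("scantest"))) { // placeholder table
          Scan scan = new Scan()
              .setFilter(new RandomRowFilter(0.0001f)) // pass roughly 1 in 10,000 rows
              .setCaching(10_000);                     // more than the ~1500 rows returned
          long rows = 0;
          long start = System.nanoTime();
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              rows++; // the server iterates all cells; only the sampled rows cross the wire
            }
          }
          System.out.println(rows + " rows in " + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
      }
    }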
>>>>>>>
>>>>>>> I tried flushing & major-compacting before doing the scan, in which case
>>>>>>> the scan finishes faster, but the difference between the two HBase versions
>>>>>>> stays about the same.
>>>>>>>
>>>>>>> The test code is written in Java, using the client libraries from the
>>>>>>> corresponding HBase versions.
>>>>>>>
>>>>>>> Besides the above scan test, I also tested write performance through
>>>>>>> BufferedMutator, scans without the filter (thus passing much more data over
>>>>>>> the rpc), and sequential random Get requests. They all seem quite a bit
>>>>>>> slower on HBase 2. Interestingly, using the HBase 1.6 client to talk to the
>>>>>>> HBase 2.2.4 server is faster than using the HBase 2.2.4 client. So it seems
>>>>>>> the rpc latency of the new client is worse.
>>>>>>>
>>>>>>> So my question is, is such a large performance drop to be expected when
>>>>>>> migrating to HBase 2? Are there any special settings we need to be aware of?
>>>>>>>
>>>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Bruno Dumon
>>>>> NGDATA
>>>>> http://www.ngdata.com/
>>>>>
>>>
>>> --
>>> Bruno Dumon
>>> NGDATA
>>> http://www.ngdata.com/
>>>
>>
>> --
>> Best regards,
>> Andrew
>>
>> Words like orphans lost among the crosstalk, meaning torn from truth's
>> decrepit hands
>>    - A23, Crosstalk
