Thank you, Lars. I suppose it is not possible to characterize the problem in enough anonymized detail to provide some clues for follow-up, or you would have done it.
> On May 22, 2020, at 6:01 AM, Lars Francke <[email protected]> wrote:
>
> I've refrained from commenting here so far because I cannot share much/any
> data but I can also report that we've seen worse performance with HBase 2
> (similar/same settings and same workload, same hardware). This is on a 40+
> node cluster.
> Unfortunately, I wasn't tasked with debugging. The customer decided to stay
> on 1.x for this reason.
>
>> On Fri, May 22, 2020 at 1:52 AM Andrew Purtell <[email protected]> wrote:
>>
>> It depends what you are measuring and how. I test every so often with YCSB,
>> which admittedly is not representative of real world workloads but is
>> widely used for apples to apples testing among datastores, and we can apply
>> the same test tool and test methodology to different versions to get
>> meaningful results. I also test on real clusters. The single all-in-one
>> process zk+master+regionserver "minicluster" cannot provide you meaningful
>> performance data. Only distributed clusters can provide meaningful results.
>> Some defaults are also important to change, like the number of RPC handlers
>> you plan to use in production.
>>
>> After reading this thread I tested 1.6.0 and 2.2.4 using my standard
>> methodology, described below. 2.2.4 is better, often significantly better,
>> in most measures in most cases.
>>
>> Cluster: AWS Amazon Linux AMI, 1 x master, 5 x regionserver, 1 x client, m5d.4xlarge
>> Hadoop: 2.10.0, ZK: 3.4.14
>> JVM: 8u252 shenandoah (provided by AMI)
>> GC: -XX:+UseShenandoahGC -Xms31g -Xmx31g -XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking
>> Non-default settings: hbase.regionserver.handler.count=256 hbase.ipc.server.callqueue.type=codel dfs.client.read.shortcircuit=true
>>
>> Methodology:
>>
>> 1. Create 100M row base table (ROW_INDEX_V1 encoding, ZSTANDARD compression)
>> 2. Snapshot base table
>> 3. Enable balancer
>> 4. Clone test table from base table snapshot
>> 5. Balance, then disable balancer
>> 6. Run YCSB 0.18 workload --operationcount 1000000 (1M rows) -threads 200 -target 100000 (100k ops/sec)
>> 7. Drop test table
>> 8. Back to step 3 until all workloads complete
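(For anyone who wants to script steps 1-5 rather than drive them from the shell, a rough sketch against the HBase 2 Admin API could look like the following. The table, column family, and snapshot names are placeholders, not the ones used in the run above, and loading the 100M rows is assumed to happen separately, e.g. via the YCSB load phase.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrepareTestTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          // Step 1: base table with ROW_INDEX_V1 encoding and ZSTANDARD compression
          // ("ycsb_base" and family "f" are placeholder names; load the 100M rows separately).
          TableName base = TableName.valueOf("ycsb_base");
          admin.createTable(TableDescriptorBuilder.newBuilder(base)
              .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("f"))
                  .setDataBlockEncoding(DataBlockEncoding.ROW_INDEX_V1)
                  .setCompressionType(Compression.Algorithm.ZSTD)
                  .build())
              .build());
          // Step 2: snapshot the loaded base table.
          admin.snapshot("ycsb_base_snap", base);
          // Step 3: enable the balancer.
          admin.balancerSwitch(true, true);
          // Step 4: clone the test table from the snapshot.
          TableName test = TableName.valueOf("ycsb_test");
          admin.cloneSnapshot("ycsb_base_snap", test);
          // Step 5: balance, then disable the balancer before the timed YCSB run.
          admin.balance();
          admin.balancerSwitch(false, true);
        }
      }
    }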
>>
>> Workload A                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20552    20655    100.50%
>> [OVERALL], Throughput(ops/sec)          97314    96829    99.50%
>> [READ], AverageLatency(us)              591      418      70.75%
>> [READ], MinLatency(us)                  191      201      105.24%
>> [READ], MaxLatency(us)                  146047   80895    55.39%
>> [READ], 95thPercentileLatency(us)       3013     542      17.99%
>> [READ], 99thPercentileLatency(us)       5427     2559     47.15%
>> [UPDATE], AverageLatency(us)            833      460      55.23%
>> [UPDATE], MinLatency(us)                348      230      66.09%
>> [UPDATE], MaxLatency(us)                149887   80959    54.01%
>> [UPDATE], 95thPercentileLatency(us)     3403     607      17.84%
>> [UPDATE], 99thPercentileLatency(us)     5751     3045     52.95%
>>
>> Workload B                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20555    20679    100.60%
>> [OVERALL], Throughput(ops/sec)          97300    96716    99.40%
>> [READ], AverageLatency(us)              417      427      102.54%
>> [READ], MinLatency(us)                  179      194      108.38%
>> [READ], MaxLatency(us)                  124095   76799    61.89%
>> [READ], 95thPercentileLatency(us)       498      564      113.25%
>> [READ], 99thPercentileLatency(us)       3679     3785     102.88%
>> [UPDATE], AverageLatency(us)            665      488      73.28%
>> [UPDATE], MinLatency(us)                380      237      62.37%
>> [UPDATE], MaxLatency(us)                95167    76287    80.16%
>> [UPDATE], 95thPercentileLatency(us)     718      629      87.60%
>> [UPDATE], 99thPercentileLatency(us)     4015     4023     100.20%
>>
>> Workload C                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20525    20648    100.60%
>> [OVERALL], Throughput(ops/sec)          97442    96862    99.40%
>> [READ], AverageLatency(us)              385      382      99.07%
>> [READ], MinLatency(us)                  178      198      111.24%
>> [READ], MaxLatency(us)                  74943    76415    101.96%
>> [READ], 95thPercentileLatency(us)       437      477      109.15%
>> [READ], 99thPercentileLatency(us)       3349     2219     66.26%
>>
>> Workload D                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  20538    20644    100.52%
>> [OVERALL], Throughput(ops/sec)          97380    96880    99.49%
>> [READ], AverageLatency(us)              372      393      105.49%
>> [READ], MinLatency(us)                  116      137      118.10%
>> [READ], MaxLatency(us)                  107391   73215    68.18%
>> [READ], 95thPercentileLatency(us)       916      983      107.31%
>> [READ], 99thPercentileLatency(us)       3183     2473     77.69%
>> [INSERT], AverageLatency(us)            732      526      71.86%
>> [INSERT], MinLatency(us)                418      289      69.14%
>> [INSERT], MaxLatency(us)                109183   80255    73.51%
>> [INSERT], 95thPercentileLatency(us)     823      724      87.97%
>> [INSERT], 99thPercentileLatency(us)     3961     3003     75.81%
>>
>> Workload E                              1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                  120157   119728   99.64%
>> [OVERALL], Throughput(ops/sec)          16645    16705    100.36%
>> [INSERT], AverageLatency(us)            11787    11102    94.19%
>> [INSERT], MinLatency(us)                459      296      64.49%
>> [INSERT], MaxLatency(us)                172927   131583   76.09%
>> [INSERT], 95thPercentileLatency(us)     32143    28911    89.94%
>> [INSERT], 99thPercentileLatency(us)     36063    31423    87.13%
>> [SCAN], AverageLatency(us)              11891    11875    99.87%
>> [SCAN], MinLatency(us)                  219      255      116.44%
>> [SCAN], MaxLatency(us)                  179071   188671   105.36%
>> [SCAN], 95thPercentileLatency(us)       32639    29615    90.74%
>> [SCAN], 99thPercentileLatency(us)       36671    32175    87.74%
>>
>> Workload F                                      1.6.0    2.2.4    Difference
>> [OVERALL], RunTime(ms)                          20766    20655    99.47%
>> [OVERALL], Throughput(ops/sec)                  96311    96829    100.54%
>> [READ], AverageLatency(us)                      1242     591      47.61%
>> [READ], MinLatency(us)                          183      212      115.85%
>> [READ], MaxLatency(us)                          80959    90111    111.30%
>> [READ], 95thPercentileLatency(us)               3397     1511     44.48%
>> [READ], 99thPercentileLatency(us)               4515     3063     67.84%
>> [READ-MODIFY-WRITE], AverageLatency(us)         2768     1193     43.10%
>> [READ-MODIFY-WRITE], MinLatency(us)             596      496      83.22%
>> [READ-MODIFY-WRITE], MaxLatency(us)             128639   112191   87.21%
>> [READ-MODIFY-WRITE], 95thPercentileLatency(us)  7071     3263     46.15%
>> [READ-MODIFY-WRITE], 99thPercentileLatency(us)  9919     6547     66.00%
>> [UPDATE], AverageLatency(us)                    1522     601      39.46%
>> [UPDATE], MinLatency(us)                        369      241      65.31%
>> [UPDATE], MaxLatency(us)                        89855    35775    39.81%
>> [UPDATE], 95thPercentileLatency(us)             3691     1659     44.95%
>> [UPDATE], 99thPercentileLatency(us)             5003     3513     70.22%
>>
>>> On Wed, May 20, 2020 at 9:10 AM Bruno Dumon <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I think that (idle) background threads would not make much of a difference
>>> to the raw speed of iterating over cells of a single region served from the
>>> block cache. I started testing this way after noticing a slowdown on a real
>>> installation. I can imagine that there have been various improvements in
>>> hbase 2 in other areas which will partly compensate for the impact of what I
>>> notice in this narrow test, but still I found these results remarkable
>>> enough.
>>>
>>> On Wed, May 20, 2020 at 4:33 PM 张铎(Duo Zhang) <[email protected]> wrote:
>>>
>>>> Just saw that your tests were on local mode...
>>>>
>>>> Local mode is not for production so I do not see any related issues for
>>>> improving the performance for hbase in local mode. Maybe we just have more
>>>> threads in HBase 2 by default which makes it slow on a single machine, not
>>>> sure...
>>>>
>>>> Could you please test it on a distributed cluster? If it is still a
>>>> problem, you can open an issue and I believe there will be committers
>>>> offering to help verify the problem.
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, May 20, 2020 at 4:45 PM Bruno Dumon <[email protected]> wrote:
>>>>
>>>>> For the scan test, there is only minimal rpc involved; I verified through
>>>>> ScanMetrics that there are only 2 rpc calls for the scan. It is essentially
>>>>> testing how fast the region server is able to iterate over the cells. There
>>>>> are no delete cells, and the table is fully compacted (1 storage file), and
>>>>> all data fits into the block cache.
>>>>>
>>>>> For the sequential gets (i.e. one get after the other, non-multi-threaded),
>>>>> I tried the BlockingRpcClient. It is about 13% faster than the netty rpc
>>>>> client. But the same code on 1.6 is still 90% faster. Concretely, my test
>>>>> code does 100K gets of the same row in a loop. On HBase 2.2.4 with the
>>>>> BlockingRpcClient this takes on average 9 seconds, with HBase 1.6 it takes
>>>>> 4.75 seconds.
>>>>>
>>>>> On Wed, May 20, 2020 at 9:27 AM Debraj Manna <[email protected]> wrote:
>>>>>
>>>>>> I cross-posted this in the slack channel as I was also observing something
>>>>>> quite similar. This is the suggestion I received. Reposting here for
>>>>>> completeness.
>>>>>>
>>>>>> zhangduo 12:15 PM
>>>>>> Does get also have the same performance drop, or only scan?
>>>>>> zhangduo 12:18 PM
>>>>>> For the rpc layer, hbase2 defaults to netty while hbase1 is pure java
>>>>>> socket. You can set the rpc client to BlockingRpcClient to see if the
>>>>>> performance is back.
>>>>>>
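(Switching the client back to the blocking RPC implementation that zhangduo mentions is a one-line client-side configuration change. A minimal sketch of that switch, combined with a repeated-Get loop like the 100K-gets test Bruno describes above, might look like the following; the hbase.rpc.client.impl key is the standard switch as far as I know, and the table and row key names are placeholders.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BlockingGetLoop {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Use the pre-2.0 blocking socket RPC client instead of the Netty default.
        conf.set("hbase.rpc.client.impl",
            "org.apache.hadoop.hbase.ipc.BlockingRpcClient");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("testtable"))) { // placeholder table
          byte[] row = Bytes.toBytes("row-0");                              // placeholder row key
          long start = System.nanoTime();
          for (int i = 0; i < 100_000; i++) {
            table.get(new Get(row)); // 100K sequential gets of the same row
          }
          System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
        }
      }
    }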
>>>>>> On Mon, May 18, 2020 at 7:58 PM Bruno Dumon <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are looking into migrating from HBase 1.2.x to HBase 2.1.x (on Cloudera
>>>>>>> CDH).
>>>>>>>
>>>>>>> It seems like HBase 2 is slower than HBase 1 for both reading and writing.
>>>>>>>
>>>>>>> I did a simple test, using HBase 1.6.0 and HBase 2.2.4 (the standard OSS
>>>>>>> versions), running in local mode (no HDFS) on my computer:
>>>>>>>
>>>>>>> * ingested 15M single-KV rows
>>>>>>> * full table scan over them
>>>>>>> * to remove rpc latency as much as possible, the scan had a filter 'new
>>>>>>>   RandomRowFilter(0.0001f)', caching set to 10K (more than the number of
>>>>>>>   rows returned) and hbase.cells.scanned.per.heartbeat.check set to 100M.
>>>>>>>   This scan returns about 1500 rows/KVs.
>>>>>>> * HBase configured with hbase.regionserver.regionSplitLimit=1 to remove
>>>>>>>   influence from region splitting
>>>>>>>
>>>>>>> In this test, scanning seems over 50% slower on HBase 2 compared to HBase 1.
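(A scan set up along those lines would look roughly like this sketch; the table name is a placeholder, and hbase.cells.scanned.per.heartbeat.check and hbase.regionserver.regionSplitLimit are server-side settings rather than something configured on the Scan itself.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.RandomRowFilter;

    public class FilteredScanTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("scantest"))) { // placeholder table
          Scan scan = new Scan()
              .setFilter(new RandomRowFilter(0.0001f)) // pass roughly 1 in 10,000 rows
              .setCaching(10_000);                     // more than the ~1500 rows returned
          long rows = 0;
          long start = System.nanoTime();
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              rows++; // the server iterates all cells; only the sampled rows cross the wire
            }
          }
          System.out.println(rows + " rows in " + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
      }
    }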
>>>>>>>
>>>>>>> I tried flushing & major-compacting before doing the scan, in which case
>>>>>>> the scan finishes faster, but the difference between the two HBase versions
>>>>>>> stays about the same.
>>>>>>>
>>>>>>> The test code is written in Java, using the client libraries from the
>>>>>>> corresponding HBase versions.
>>>>>>>
>>>>>>> Besides the above scan test, I also tested write performance through
>>>>>>> BufferedMutator, scans without the filter (thus passing much more data over
>>>>>>> the rpc), and sequential random Get requests. They all seem quite a bit
>>>>>>> slower on HBase 2. Interestingly, using the HBase 1.6 client to talk to the
>>>>>>> HBase 2.2.4 server is faster than using the HBase 2.2.4 client. So it seems
>>>>>>> the rpc latency of the new client is worse.
>>>>>>>
>>>>>>> So my question is, is such a large performance drop to be expected when
>>>>>>> migrating to HBase 2? Are there any special settings we need to be aware of?
>>>>>>>
>>>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Bruno Dumon
>>>>> NGDATA
>>>>> http://www.ngdata.com/
>>>>>
>>>
>>> --
>>> Bruno Dumon
>>> NGDATA
>>> http://www.ngdata.com/
>>>
>>
>> --
>> Best regards,
>> Andrew
>>
>> Words like orphans lost among the crosstalk, meaning torn from truth's
>> decrepit hands
>>    - A23, Crosstalk
