HBase scans seem slow, compute bound. How to improve?

Marcell Ortutay Mon, 16 Apr 2018 14:51:06 -0700

I'm new to HBase and looking at some performance testing for my use case.
I've noticed that HBase scans seem "slow" compared to machine capabilities.


Here is a bit more detail on the testing I am running. I have loaded 3 test
tables into HBase and sqlite3 for comparison. I'm using sqlite3 as a
stand-in for what the "peak" performance can be for this operation. For
HBase, I am running a 2-node cluster (1 name node / 1 data node) on EMR with
m3.2xlarge instances. The test tables each have 1 million rows with data
like this:

(1) 1 bigint column, 1 float column
(2) 1 bigint column, 1 float column, 100 bytes of filler data
(3) 1 bigint column, 1 float column, 1000 bytes of filler data

I randomized the filler data to attempt to limit the effects of
compression, and also ran tests with compression turned off, but that
didn't seem to have much impact.

I ran the following test queries on both HBase and sqlite3:

(a) SELECT count(*) FROM table WHERE val > .5
(b) SELECT count(*) FROM table WHERE filler like '%x%'

In each case I ran the query twice to account for block cache in HBase.
Below are the performance numbers on the 2nd (block cached) run:
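
For anyone who wants to reproduce the sqlite3 side of this comparison, here is
a minimal sketch of the setup described above. It is not the exact script used
for the numbers in this post: the table name `t`, the column names `id`, `val`
and `filler`, the in-memory database, the lowercase-letter filler and the
scaled-down default row count are all assumptions made for illustration.

```python
import random
import sqlite3
import string
import time


def build_table(conn, n_rows, filler_len):
    """Create and fill a test table: integer id, float val, random filler text."""
    conn.execute("DROP TABLE IF EXISTS t")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val REAL, filler TEXT)")
    rng = random.Random(0)  # fixed seed so repeated runs build the same data
    rows = (
        (i, rng.random(), "".join(rng.choices(string.ascii_lowercase, k=filler_len)))
        for i in range(n_rows)
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    conn.commit()


def timed_count(conn, where_clause):
    """Run SELECT count(*) with the given WHERE clause; return (count, seconds)."""
    start = time.perf_counter()
    (count,) = conn.execute(f"SELECT count(*) FROM t WHERE {where_clause}").fetchone()
    return count, time.perf_counter() - start


def run_benchmark(n_rows=10_000, filler_len=100):
    """Build one table and time queries (a) and (b), keeping the second run."""
    conn = sqlite3.connect(":memory:")  # assumption: in-memory, not an on-disk file
    build_table(conn, n_rows, filler_len)
    results = {}
    for name, clause in (("a", "val > .5"), ("b", "filler LIKE '%x%'")):
        timed_count(conn, clause)                  # first run, discarded as warm-up
        results[name] = timed_count(conn, clause)  # second run is the one reported
    conn.close()
    return results


if __name__ == "__main__":
    for name, (count, secs) in run_benchmark().items():
        print(f"query {name}: count={count} time={secs:.6f}s")
```

Calling `run_benchmark(1_000_000, 1000)` would approximate table (3); a
`filler_len` of 0 gives an empty filler column, which roughly corresponds to
table (1).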

HBase:
1a: 1.373s
2a: 1.538s
2b: 3.582s
3a: 0.98s
3b: 11.354s

sqlite3:
1a: 0.156000s
2a: 0.212000s
2b: 0.660000s
3a: 0.252000s
3b: 4.364000s

In each case, sqlite3 performs much better (roughly 2.6x-8.8x faster) than
HBase for an equivalent operation.
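
The per-test ratios behind that range can be checked directly from the timings
listed above. The snippet below is only arithmetic on those reported numbers;
the dictionary and function names are made up for illustration.

```python
# Second-run timings in seconds, copied from the results listed above.
hbase = {"1a": 1.373, "2a": 1.538, "2b": 3.582, "3a": 0.98, "3b": 11.354}
sqlite = {"1a": 0.156, "2a": 0.212, "2b": 0.660, "3a": 0.252, "3b": 4.364}


def slowdowns(slow, fast):
    """Return, per test, how many times longer `slow` took than `fast`."""
    return {test: slow[test] / fast[test] for test in slow}


ratios = slowdowns(hbase, sqlite)
for test, ratio in ratios.items():
    print(f"{test}: HBase takes {ratio:.1f}x as long as sqlite3")
```

The largest gap is on test 1a and the smallest on test 3b.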

I ran some rudimentary profiling on HBase and it seems like the bulk of the
time is spent in this function
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6485
so I'm guessing the computation in there is taking a long time.

I have two questions that I'm hoping to get some guidance on:

(1) Is this expected performance for HBase range scans? I'm told that HBase
is optimized for random key access, not range scans, so perhaps this
performance is a result of that tradeoff?
(2) Is there anything I can do to improve HBase range scan performance in
terms of configuration, data layout, etc.?

A few other notes:
- I'm using Phoenix for the SQL layer on top of HBase, but my profiling
revealed that the limiting factor is HBase, specifically the function cited
above
- The table sizes are 33MB, 130MB and 990MB in HBase and similar in sqlite3.
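
Combining those table sizes with the timings above gives a rough effective
scan rate. This sketch only uses query (b), since that query has to inspect
the filler bytes, and it assumes the sqlite3 tables are exactly the same size
as the HBase ones (they are only described as "similar"). The names are
hypothetical.

```python
# Approximate table sizes in MB for tables (2) and (3), as reported above.
# Reusing the same figures for sqlite3 is an assumption.
table_mb = {"2": 130, "3": 990}

# Second-run timings in seconds for the filler-scanning query (b).
seconds = {
    "HBase": {"2": 3.582, "3": 11.354},
    "sqlite3": {"2": 0.660, "3": 4.364},
}


def throughput_mb_per_s(size_mb, elapsed_s):
    """Effective scan rate: table size divided by query time."""
    return size_mb / elapsed_s


rates = {
    engine: {t: throughput_mb_per_s(table_mb[t], s) for t, s in times.items()}
    for engine, times in seconds.items()
}
for engine, per_table in rates.items():
    for t, rate in per_table.items():
        print(f"{engine} table {t}, query b: {rate:.0f} MB/s")
```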

Thanks,
Marcell
