I'm new to HBase and am running some performance tests for my use case. I've noticed that HBase scans seem "slow" relative to what the machines should be capable of.
Here is a bit more detail on the testing I am running. I have loaded 3 test tables into both HBase and sqlite3 for comparison. I'm using sqlite3 as a stand-in for what the "peak" performance can be for this operation. For HBase, I am running a 2-node cluster (1 name node / 1 data node) on EMR with m3.2xlarge instances.

The test tables each have 1 million rows with data like this:
(1) 1 bigint column, 1 float column
(2) 1 bigint column, 1 float column, 100 bytes of filler data
(3) 1 bigint column, 1 float column, 1000 bytes of filler data

I randomized the filler data to limit the effects of compression, and also ran tests with compression turned off, but that didn't seem to have much impact.

I ran the following test queries on both HBase and sqlite3:
(a) SELECT count(*) FROM table WHERE val > .5
(b) SELECT count(*) FROM table WHERE filler like '%x%'

In each case I ran the query twice to account for the block cache in HBase. Below are the performance numbers from the 2nd (block cached) run:

HBase:
1a: 1.373s
2a: 1.538s
2b: 3.582s
3a: 0.98s
3b: 11.354s

sqlite3:
1a: 0.156000s
2a: 0.212000s
2b: 0.660000s
3a: 0.252000s
3b: 4.364000s

In each case, sqlite3 performs much better (roughly 2x-9x faster) than HBase for an equivalent operation. I ran some rudimentary profiling on HBase and it seems like the bulk of the time is spent in this function:

https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6485

so I'm guessing the computation in there is taking a long time.

I have two questions that I'm hoping to get some guidance on:
(1) Is this expected performance for HBase range scans? I'm told that HBase is optimized for random key access, not range scans, so perhaps this performance is a result of that tradeoff?
(2) Is there anything I can do to improve HBase range scan performance in terms of configuration, data layout, etc.?
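For anyone who wants to reproduce the sqlite3 side of the comparison, the setup described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact script behind the numbers: the table names (t1, t2, t3), the column names (id, val, filler), the use of Python's built-in sqlite3 module, random lowercase letters as filler, and the in-memory database default are all hypothetical choices made for illustration.

```python
import random
import sqlite3
import string
import time


def build_table(conn, name, n_rows, filler_bytes):
    """Create one test table: integer key, float value, optional random filler."""
    conn.execute(
        f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, val REAL, filler TEXT)"
    )
    rows = (
        (
            i,
            random.random(),
            # Random letters so the filler does not compress well (assumed format).
            "".join(random.choices(string.ascii_lowercase, k=filler_bytes))
            if filler_bytes
            else None,
        )
        for i in range(n_rows)
    )
    conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    conn.commit()


def timed_count(conn, sql):
    """Run a count(*) query and return (count, elapsed seconds)."""
    start = time.perf_counter()
    (count,) = conn.execute(sql).fetchone()
    return count, time.perf_counter() - start


def run_benchmark(n_rows=1_000_000, db_path=":memory:"):
    """Build tables (1)-(3) and time queries (a) and (b) where applicable."""
    conn = sqlite3.connect(db_path)
    results = {}
    for label, filler_bytes in (("1", 0), ("2", 100), ("3", 1000)):
        table = f"t{label}"
        build_table(conn, table, n_rows, filler_bytes)
        results[label + "a"] = timed_count(
            conn, f"SELECT count(*) FROM {table} WHERE val > .5"
        )
        if filler_bytes:  # query (b) only makes sense when a filler column has data
            results[label + "b"] = timed_count(
                conn, f"SELECT count(*) FROM {table} WHERE filler LIKE '%x%'"
            )
    conn.close()
    return results
```

Calling run_benchmark() with the default 1 million rows needs on the order of 1 GB for table (3) when run in memory, so passing a file path as db_path, or a smaller n_rows for a quick check, may be preferable.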
A few other notes:
- I'm using Phoenix for the SQL layer on top of HBase, but my profiling revealed that the limiting factor is HBase, specifically the function cited above.
- The table sizes are 33MB, 130MB and 990MB in HBase and similar in sqlite3.

Thanks,
Marcell