Hi experts, I'm trying to figure out how fast HBase can scan. I'm setting up the region scan inside an endpoint coprocessor so that no network overhead is involved. My average key length is 35 bytes and my average value length is 5 bytes.
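
For reference, here is a minimal sketch of the region-local scan loop inside my endpoint, using the 0.98-style coprocessor API (the protobuf service plumbing around it is omitted, and scanRegion is just a placeholder name):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.RegionScanner;

// Scans the local region and counts rows; called from the endpoint's
// protobuf service method. env is the RegionCoprocessorEnvironment
// handed to the coprocessor at start().
private long scanRegion(RegionCoprocessorEnvironment env) throws IOException {
  Scan scan = new Scan();
  scan.setCacheBlocks(true);            // read through the block cache
  RegionScanner scanner = env.getRegion().getScanner(scan);
  long rows = 0;
  try {
    List<Cell> cells = new ArrayList<Cell>();
    boolean hasMore;
    do {
      cells.clear();
      hasMore = scanner.next(cells);    // fills one row's cells per call
      if (!cells.isEmpty()) {
        rows++;
      }
    } while (hasMore);
  } finally {
    scanner.close();
  }
  return rows;
}

Since the scanner comes straight from the region, the cells never cross the network; the only cost measured is the scan itself.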
My test result is that if I warm all the blocks I'm interested in into the block cache, I'm only able to scan around 300,000 rows per second per region (with an endpoint I assume it's one thread per region; see the client-side sketch after the HFile dump), so at roughly 48 bytes per row that's like getting 15 MB of data per second. I'm not sure if this is already an acceptable number for HBase. Answers from you experts might help me decide whether it's worth digging further into tuning. Thanks!

Other info: my HBase cluster runs on 8 AWS m1.xlarge instances, each with 4 CPU cores and 16 GB RAM. Each region server is configured with a 10 GB heap. The test HTable has 23 regions, one HFile per region (just major compacted). There was no other resource contention when I ran the tests. Attached is the HFile tool output for one of the region's HFiles:

=============================================
hbase org.apache.hadoop.hbase.io.hfile.HFile -m -s -v -f /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
Scanning -> /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
Block index size as per heapsize: 3640
reader=/apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06,
    compression=none,
    cacheConf=CacheConfig:disabled,
    firstKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00\x00\x00\x01\xF4/F1:M/0/Put,
    lastKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9/F1:M/0/Put,
    avgKeyLen=35,
    avgValueLen=5,
    entries=160988965,
    length=1832309188
Trailer:
    fileinfoOffset=1832308623,
    loadOnOpenDataOffset=1832306641,
    dataIndexCount=43,
    metaIndexCount=0,
    totalUncomressedBytes=1831809883,
    entryCount=160988965,
    compressionCodec=NONE,
    uncompressedDataIndexSize=5558733,
    numDataIndexLevels=2,
    firstDataBlockOffset=0,
    lastDataBlockOffset=1832250057,
    comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,
    majorVersion=2,
    minorVersion=3
Fileinfo:
    DATA_BLOCK_ENCODING = FAST_DIFF
    DELETE_FAMILY_COUNT = \x00\x00\x00\x00\x00\x00\x00\x00
    EARLIEST_PUT_TS = \x00\x00\x00\x00\x00\x00\x00\x00
    MAJOR_COMPACTION_KEY = \xFF
    MAX_SEQ_ID_KEY = 4
    TIMERANGE = 0....0
    hfile.AVG_KEY_LEN = 35
    hfile.AVG_VALUE_LEN = 5
    hfile.LASTKEY = \x00\x16\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9\x02F1M\x00\x00\x00\x00\x00\x00\x00\x00\x04
Mid-key: \x00\x12\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1D\x04_\x07\x89\x00\x00\x02l\x00\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00\x00\x00\x007|\xBE$\x00\x00;\x81
Bloom filter:
    Not present
Delete Family Bloom filter:
    Not present
Stats:
    Key length: min = 32.00 max = 37.00 mean = 35.11 stddev = 1.46 median = 35.00 75% <= 37.00 95% <= 37.00 98% <= 37.00 99% <= 37.00 99.9% <= 37.00 count = 160988965
    Row size (bytes): min = 44.00 max = 55.00 mean = 48.17 stddev = 1.43 median = 48.00 75% <= 50.00 95% <= 50.00 98% <= 50.00 99% <= 50.00 99.9% <= 51.97 count = 160988965
    Row size (columns): min = 1.00 max = 1.00 mean = 1.00 stddev = 0.00 median = 1.00 75% <= 1.00 95% <= 1.00 98% <= 1.00 99% <= 1.00 99.9% <= 1.00 count = 160988965
    Val length: min = 4.00 max = 12.00 mean = 5.06 stddev = 0.33 median = 5.00 75% <= 5.00 95% <= 5.00 98% <= 6.00 99% <= 8.00 99.9% <= 9.00 count = 160988965
    Key of biggest row: \x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x04\xDD:\x06\x00U\x00\x00\x00\x8DS\xD2
Scanned kv count -> 160988965
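
For completeness, this is roughly how I drive the endpoint from the client side, using the standard 0.98/1.x coprocessorService batch call (ScanSpeedService, CountRequest, and CountResponse stand in for my generated protobuf classes; the call fans out one RPC per region, which is why I assume one scanning thread per region):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.ipc.ServerRpcController;

// One endpoint invocation per region of the table; results come back
// keyed by region start key. Note table.coprocessorService declares
// throws Throwable, so the caller has to handle that.
Map<byte[], Long> perRegionRows = table.coprocessorService(
    ScanSpeedService.class,
    null, null,                        // null start/end key = every region
    new Batch.Call<ScanSpeedService, Long>() {
      @Override
      public Long call(ScanSpeedService service) throws IOException {
        ServerRpcController controller = new ServerRpcController();
        BlockingRpcCallback<CountResponse> rpcCallback =
            new BlockingRpcCallback<CountResponse>();
        service.count(controller, CountRequest.getDefaultInstance(), rpcCallback);
        if (controller.failedOnException()) {
          throw controller.getFailedOn();
        }
        return rpcCallback.get().getRowCount();
      }
    });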
