Hi experts, I'm trying to figure out how fast HBase can scan. I'm setting up the region scan inside an endpoint coprocessor so that no network overhead is involved. My average key length is 35 bytes and my average value length is 5 bytes.
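
For reference, here is a minimal sketch of the region-local scan loop inside my endpoint, using the 0.98-style coprocessor API (the protobuf service plumbing around it is omitted, and scanRegion is just a placeholder name):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.RegionScanner;

// Scans the local region and counts rows; called from the endpoint's
// protobuf service method. env is the RegionCoprocessorEnvironment
// handed to the coprocessor at start().
private long scanRegion(RegionCoprocessorEnvironment env) throws IOException {
  Scan scan = new Scan();
  scan.setCacheBlocks(true);            // read through the block cache
  RegionScanner scanner = env.getRegion().getScanner(scan);
  long rows = 0;
  try {
    List<Cell> cells = new ArrayList<Cell>();
    boolean hasMore;
    do {
      cells.clear();
      hasMore = scanner.next(cells);    // fills one row's cells per call
      if (!cells.isEmpty()) {
        rows++;
      }
    } while (hasMore);
  } finally {
    scanner.close();
  }
  return rows;
}

Since the scanner comes straight from the region, the cells never cross the network; the only cost measured is the scan itself.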
My test result is that if I warm all the blocks I'm interested in into the block cache, I'm only able to scan around 300,000 rows per second per region (with an endpoint I assume it's one thread per region; see the client-side sketch after the HFile dump), so at roughly 48 bytes per row that's like getting 15 MB of data per second. I'm not sure if this is already an acceptable number for HBase. Answers from you experts might help me decide whether it's worth digging further into tuning. Thanks!

Other info: my HBase cluster runs on 8 AWS m1.xlarge instances, each with 4 CPU cores and 16 GB RAM. Each region server is configured with a 10 GB heap. The test HTable has 23 regions, one HFile per region (just major compacted). There was no other resource contention when I ran the tests. Attached is the HFile tool output for one of the region's HFiles:

=============================================
hbase org.apache.hadoop.hbase.io.hfile.HFile -m -s -v -f /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
Scanning -> /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
Block index size as per heapsize: 3640
reader=/apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06,
    compression=none,
    cacheConf=CacheConfig:disabled,
    firstKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00\x00\x00\x01\xF4/F1:M/0/Put,
    lastKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9/F1:M/0/Put,
    avgKeyLen=35,
    avgValueLen=5,
    entries=160988965,
    length=1832309188
Trailer:
    fileinfoOffset=1832308623,
    loadOnOpenDataOffset=1832306641,
    dataIndexCount=43,
    metaIndexCount=0,
    totalUncomressedBytes=1831809883,
    entryCount=160988965,
    compressionCodec=NONE,
    uncompressedDataIndexSize=5558733,
    numDataIndexLevels=2,
    firstDataBlockOffset=0,
    lastDataBlockOffset=1832250057,
    comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,
    majorVersion=2,
    minorVersion=3
Fileinfo:
    DATA_BLOCK_ENCODING = FAST_DIFF
    DELETE_FAMILY_COUNT = \x00\x00\x00\x00\x00\x00\x00\x00
    EARLIEST_PUT_TS = \x00\x00\x00\x00\x00\x00\x00\x00
    MAJOR_COMPACTION_KEY = \xFF
    MAX_SEQ_ID_KEY = 4
    TIMERANGE = 0....0
    hfile.AVG_KEY_LEN = 35
    hfile.AVG_VALUE_LEN = 5
    hfile.LASTKEY = \x00\x16\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9\x02F1M\x00\x00\x00\x00\x00\x00\x00\x00\x04
Mid-key: \x00\x12\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1D\x04_\x07\x89\x00\x00\x02l\x00\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00\x00\x00\x007|\xBE$\x00\x00;\x81
Bloom filter:
    Not present
Delete Family Bloom filter:
    Not present
Stats:
    Key length: min = 32.00 max = 37.00 mean = 35.11 stddev = 1.46 median = 35.00 75% <= 37.00 95% <= 37.00 98% <= 37.00 99% <= 37.00 99.9% <= 37.00 count = 160988965
    Row size (bytes): min = 44.00 max = 55.00 mean = 48.17 stddev = 1.43 median = 48.00 75% <= 50.00 95% <= 50.00 98% <= 50.00 99% <= 50.00 99.9% <= 51.97 count = 160988965
    Row size (columns): min = 1.00 max = 1.00 mean = 1.00 stddev = 0.00 median = 1.00 75% <= 1.00 95% <= 1.00 98% <= 1.00 99% <= 1.00 99.9% <= 1.00 count = 160988965
    Val length: min = 4.00 max = 12.00 mean = 5.06 stddev = 0.33 median = 5.00 75% <= 5.00 95% <= 5.00 98% <= 6.00 99% <= 8.00 99.9% <= 9.00 count = 160988965
    Key of biggest row: \x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x04\xDD:\x06\x00U\x00\x00\x00\x8DS\xD2
Scanned kv count -> 160988965
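
For completeness, this is roughly how I drive the endpoint from the client side, using the standard 0.98/1.x coprocessorService batch call (ScanSpeedService, CountRequest, and CountResponse stand in for my generated protobuf classes; the call fans out one RPC per region, which is why I assume one scanning thread per region):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.ipc.ServerRpcController;

// One endpoint invocation per region of the table; results come back
// keyed by region start key. Note table.coprocessorService declares
// throws Throwable, so the caller has to handle that.
Map<byte[], Long> perRegionRows = table.coprocessorService(
    ScanSpeedService.class,
    null, null,                        // null start/end key = every region
    new Batch.Call<ScanSpeedService, Long>() {
      @Override
      public Long call(ScanSpeedService service) throws IOException {
        ServerRpcController controller = new ServerRpcController();
        BlockingRpcCallback<CountResponse> rpcCallback =
            new BlockingRpcCallback<CountResponse>();
        service.count(controller, CountRequest.getDefaultInstance(), rpcCallback);
        if (controller.failedOnException()) {
          throw controller.getFailedOn();
        }
        return rpcCallback.get().getRowCount();
      }
    });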
