Hi Thakrar, thanks for your reply.
My settings for the RegionScanner scan are scan.setCaching(1024) and scan.setMaxResultSize(5M). Even if I raise the caching to 100000 I'm still not seeing any improvement. I guess caching helps remote scans that go through RPC, but doesn't help much for region-side scans? I also tried PREFETCH_BLOCKS_ON_OPEN for the whole table, but no improvement was observed.

I'm pursuing pure scan-read performance optimization because our application is essentially read-only. And I observed that even when I do nothing else (only scanning) in my coprocessor, the scan speed is not satisfying, and the CPU seems to be fully utilized. Maybe decoding FAST_DIFF rows is too heavy for the CPU?

How many rows per second would you expect a scan to achieve in a normal setup? I mean the per-region scan speed, not the overall scan speed across all regions.

Thanks

On Thu, Apr 21, 2016 at 10:24 PM, Thakrar, Jayesh <[email protected]> wrote:

> Just curious - have you set the scanner caching to some high value - say
> 1000 (or even higher in your small-value case)?
>
> The parameter is hbase.client.scanner.caching
>
> You can read up on it - https://hbase.apache.org/book.html
>
> Another thing, are you just looking for pure scan-read performance
> optimization?
> Depending upon the table size you can also look into caching the table or
> not caching at all.
>
> -----Original Message-----
> From: hongbin ma [mailto:[email protected]]
> Sent: Thursday, April 21, 2016 5:04 AM
> To: [email protected]
> Subject: Rows per second for RegionScanner
>
> Hi, experts,
>
> I'm trying to figure out how fast HBase can scan. I'm setting up the
> RegionScanner in an endpoint coprocessor so that no network overhead is
> included. My average key length is 35 and my average value length is 5.
>
> My test result is that if I warm all of my interested blocks in the block
> cache, I'm only able to scan around 300,000 rows per second per region
> (with an endpoint I guess it's one thread per region), so it's like
> getting 15M of data per second. I'm not sure if this is already an
> acceptable number for HBase. The answers from you experts might help me
> decide whether it's worth digging further into tuning it.
>
> Thanks!
>
> Other info:
>
> My HBase cluster is on 8 AWS m1.xlarge instances, with 4 CPU cores and
> 16G RAM each. Each region server is configured with a 10G heap. The test
> HTable has 23 regions, one HFile per region (just major compacted). There
> was no other resource contention when I ran the tests.
>
> Attached is the HFile output for one of the region HFiles:
> =============================================
> hbase org.apache.hadoop.hbase.io.hfile.HFile -m -s -v -f \
>   /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
> 2016-04-21 09:16:04,091 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
> 2016-04-21 09:16:04,292 INFO [main] util.ChecksumType: Checksum using org.apache.hadoop.util.PureJavaCrc32
> 2016-04-21 09:16:04,294 INFO [main] util.ChecksumType: Checksum can use org.apache.hadoop.util.PureJavaCrc32C
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/usr/hdp/2.2.9.0-3393/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/2.2.9.0-3393/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> 2016-04-21 09:16:05,654 INFO [main] Configuration.deprecation: fs.default.name is deprecated.
> Instead, use fs.defaultFS
> Scanning -> /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
> Block index size as per heapsize: 3640
> reader=/apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06,
>     compression=none,
>     cacheConf=CacheConfig:disabled,
>     firstKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00\x00\x00\x01\xF4/F1:M/0/Put,
>     lastKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9/F1:M/0/Put,
>     avgKeyLen=35,
>     avgValueLen=5,
>     entries=160988965,
>     length=1832309188
> Trailer:
>     fileinfoOffset=1832308623,
>     loadOnOpenDataOffset=1832306641,
>     dataIndexCount=43,
>     metaIndexCount=0,
>     totalUncomressedBytes=1831809883,
>     entryCount=160988965,
>     compressionCodec=NONE,
>     uncompressedDataIndexSize=5558733,
>     numDataIndexLevels=2,
>     firstDataBlockOffset=0,
>     lastDataBlockOffset=1832250057,
>     comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,
>     majorVersion=2,
>     minorVersion=3
> Fileinfo:
>     DATA_BLOCK_ENCODING = FAST_DIFF
>     DELETE_FAMILY_COUNT = \x00\x00\x00\x00\x00\x00\x00\x00
>     EARLIEST_PUT_TS = \x00\x00\x00\x00\x00\x00\x00\x00
>     MAJOR_COMPACTION_KEY = \xFF
>     MAX_SEQ_ID_KEY = 4
>     TIMERANGE = 0....0
>     hfile.AVG_KEY_LEN = 35
>     hfile.AVG_VALUE_LEN = 5
>     hfile.LASTKEY = \x00\x16\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9\x02F1M\x00\x00\x00\x00\x00\x00\x00\x00\x04
> Mid-key: \x00\x12\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1D\x04_\x07\x89\x00\x00\x02l\x00\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00\x00\x00\x007|\xBE$\x00\x00;\x81
> Bloom filter:
>     Not present
> Delete Family Bloom filter:
>     Not present
> Stats:
>     Key length:
>         min = 32.00
>         max = 37.00
>         mean = 35.11
>         stddev = 1.46
>         median = 35.00
>         75% <= 37.00
>         95% <= 37.00
>         98% <= 37.00
>         99% <= 37.00
>         99.9% <= 37.00
>         count = 160988965
>     Row size (bytes):
>         min = 44.00
>         max = 55.00
>         mean = 48.17
>         stddev = 1.43
>         median = 48.00
>         75% <= 50.00
>         95% <= 50.00
>         98% <= 50.00
>         99% <= 50.00
>         99.9% <= 51.97
>         count = 160988965
>     Row size (columns):
>         min = 1.00
>         max = 1.00
>         mean = 1.00
>         stddev = 0.00
>         median = 1.00
>         75% <= 1.00
>         95% <= 1.00
>         98% <= 1.00
>         99% <= 1.00
>         99.9% <= 1.00
>         count = 160988965
>     Val length:
>         min = 4.00
>         max = 12.00
>         mean = 5.06
>         stddev = 0.33
>         median = 5.00
>         75% <= 5.00
>         95% <= 5.00
>         98% <= 6.00
>         99% <= 8.00
>         99.9% <= 9.00
>         count = 160988965
>     Key of biggest row: \x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x04\xDD:\x06\x00U\x00\x00\x00\x8DS\xD2
> Scanned kv count -> 160988965

--
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
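A quick back-of-the-envelope check of the figures in this thread (plain Java, no HBase dependencies; every constant is copied from the HFile dump quoted above): 300,000 rows/second at a mean decoded row size of 48.17 bytes is indeed roughly the "15M per second" mentioned, while the on-disk FAST_DIFF-encoded rows average only about 11 bytes each, so each row expands roughly 4x during decoding — consistent with the guess that decoding is where the CPU goes.

```java
// Sanity-checking the throughput numbers from the thread.
// All constants are taken from the HFile dump above; nothing here calls HBase.
public class ScanMath {
    public static void main(String[] args) {
        double meanRowSize = 48.17;     // "Row size (bytes): mean = 48.17"
        long rowsPerSecond = 300_000;   // observed per-region scan rate

        // Decoded KeyValue throughput per region:
        double mbPerSecond = rowsPerSecond * meanRowSize / 1_000_000.0;
        System.out.printf("decoded throughput ~ %.1f MB/s%n", mbPerSecond); // ~14.5 MB/s

        // How small FAST_DIFF makes each row on disk:
        long hfileLength = 1_832_309_188L;  // "length=1832309188"
        long entryCount  = 160_988_965L;    // "entryCount=160988965"
        double diskBytesPerRow = (double) hfileLength / entryCount;
        System.out.printf("on-disk bytes/row ~ %.1f%n", diskBytesPerRow);   // ~11.4 vs ~48 decoded
    }
}
```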
