Hi Thakrar, thanks for your reply.
My settings for the RegionScanner scan are scan.setCaching(1024) and scan.setMaxResultSize(5M). Even if I raise the caching to 100000 I'm still not seeing any improvement. I guess caching helps remote scans that go through RPC, but doesn't help much for region-side scans? I also tried PREFETCH_BLOCKS_ON_OPEN for the whole table, but no improvement was observed.

I'm pursuing pure scan-read performance optimization because our application is essentially read-only. And I observed that even when I do nothing else (only scanning) in my coprocessor, the scan speed is not satisfying, and the CPU seems to be fully utilized. Maybe decoding FAST_DIFF rows is too heavy for the CPU?

How many rows per second would you expect a scan to achieve in a normal setup? I mean the per-region scan speed, not the overall scan speed across all regions.

Thanks

On Thu, Apr 21, 2016 at 10:24 PM, Thakrar, Jayesh <[email protected]> wrote:

> Just curious - have you set the scanner caching to some high value - say
> 1000 (or even higher in your small-value case)?
>
> The parameter is hbase.client.scanner.caching
>
> You can read up on it - https://hbase.apache.org/book.html
>
> Another thing, are you just looking for pure scan-read performance
> optimization?
> Depending upon the table size you can also look into caching the table or
> not caching at all.
>
> -----Original Message-----
> From: hongbin ma [mailto:[email protected]]
> Sent: Thursday, April 21, 2016 5:04 AM
> To: [email protected]
> Subject: Rows per second for RegionScanner
>
> Hi, experts,
>
> I'm trying to figure out how fast HBase can scan. I'm setting up the
> RegionScanner in an endpoint coprocessor so that no network overhead is
> included. My average key length is 35 and my average value length is 5.
>
> My test result is that if I warm all of my interested blocks in the block
> cache, I'm only able to scan around 300,000 rows per second per region
> (with an endpoint I guess it's one thread per region), so it's like
> getting 15M of data per second. I'm not sure if this is already an
> acceptable number for HBase. The answers from you experts might help me
> decide whether it's worth digging further into tuning it.
>
> Thanks!
>
> Other info:
>
> My HBase cluster is on 8 AWS m1.xlarge instances, with 4 CPU cores and
> 16G RAM each. Each region server is configured with a 10G heap. The test
> HTable has 23 regions, one HFile per region (just major compacted). There
> was no other resource contention when I ran the tests.
>
> Attached is the HFile output for one of the region HFiles:
> =============================================
> hbase org.apache.hadoop.hbase.io.hfile.HFile -m -s -v -f \
>   /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
> 2016-04-21 09:16:04,091 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
> 2016-04-21 09:16:04,292 INFO [main] util.ChecksumType: Checksum using org.apache.hadoop.util.PureJavaCrc32
> 2016-04-21 09:16:04,294 INFO [main] util.ChecksumType: Checksum can use org.apache.hadoop.util.PureJavaCrc32C
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/usr/hdp/2.2.9.0-3393/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/2.2.9.0-3393/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> 2016-04-21 09:16:05,654 INFO [main] Configuration.deprecation: fs.default.name is deprecated.
> Instead, use fs.defaultFS
> Scanning -> /apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06
> Block index size as per heapsize: 3640
> reader=/apps/hbase/data/data/default/KYLIN_YMSGYYXO12/d42b9faf43eafcc9640aa256143d5be3/F1/30b8a8ff5a82458481846e364974bf06,
>     compression=none,
>     cacheConf=CacheConfig:disabled,
>     firstKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x09\x00\x00\x00\x00\x00\x01\xF4/F1:M/0/Put,
>     lastKey=\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9/F1:M/0/Put,
>     avgKeyLen=35,
>     avgValueLen=5,
>     entries=160988965,
>     length=1832309188
> Trailer:
>     fileinfoOffset=1832308623,
>     loadOnOpenDataOffset=1832306641,
>     dataIndexCount=43,
>     metaIndexCount=0,
>     totalUncomressedBytes=1831809883,
>     entryCount=160988965,
>     compressionCodec=NONE,
>     uncompressedDataIndexSize=5558733,
>     numDataIndexLevels=2,
>     firstDataBlockOffset=0,
>     lastDataBlockOffset=1832250057,
>     comparatorClassName=org.apache.hadoop.hbase.KeyValue$KeyComparator,
>     majorVersion=2,
>     minorVersion=3
> Fileinfo:
>     DATA_BLOCK_ENCODING = FAST_DIFF
>     DELETE_FAMILY_COUNT = \x00\x00\x00\x00\x00\x00\x00\x00
>     EARLIEST_PUT_TS = \x00\x00\x00\x00\x00\x00\x00\x00
>     MAJOR_COMPACTION_KEY = \xFF
>     MAX_SEQ_ID_KEY = 4
>     TIMERANGE = 0....0
>     hfile.AVG_KEY_LEN = 35
>     hfile.AVG_VALUE_LEN = 5
>     hfile.LASTKEY = \x00\x16\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x06-?\x0F"U\x00\x00\x03[^\xD9\x02F1M\x00\x00\x00\x00\x00\x00\x00\x00\x04
> Mid-key: \x00\x12\x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1D\x04_\x07\x89\x00\x00\x02l\x00\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF\x00\x00\x00\x007|\xBE$\x00\x00;\x81
> Bloom filter:
>     Not present
> Delete Family Bloom filter:
>     Not present
> Stats:
>     Key length:
>         min = 32.00
>         max = 37.00
>         mean = 35.11
>         stddev = 1.46
>         median = 35.00
>         75% <= 37.00
>         95% <= 37.00
>         98% <= 37.00
>         99% <= 37.00
>         99.9% <= 37.00
>         count = 160988965
>     Row size (bytes):
>         min = 44.00
>         max = 55.00
>         mean = 48.17
>         stddev = 1.43
>         median = 48.00
>         75% <= 50.00
>         95% <= 50.00
>         98% <= 50.00
>         99% <= 50.00
>         99.9% <= 51.97
>         count = 160988965
>     Row size (columns):
>         min = 1.00
>         max = 1.00
>         mean = 1.00
>         stddev = 0.00
>         median = 1.00
>         75% <= 1.00
>         95% <= 1.00
>         98% <= 1.00
>         99% <= 1.00
>         99.9% <= 1.00
>         count = 160988965
>     Val length:
>         min = 4.00
>         max = 12.00
>         mean = 5.06
>         stddev = 0.33
>         median = 5.00
>         75% <= 5.00
>         95% <= 5.00
>         98% <= 6.00
>         99% <= 8.00
>         99.9% <= 9.00
>         count = 160988965
>     Key of biggest row: \x00\x0B\x00\x00\x00\x00\x00\x00\x00\x1F\x04\xDD:\x06\x00U\x00\x00\x00\x8DS\xD2
> Scanned kv count -> 160988965

--
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
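A quick back-of-the-envelope check of the figures in this thread (plain Java, no HBase dependencies; every constant is copied from the HFile dump quoted above): 300,000 rows/second at a mean decoded row size of 48.17 bytes is indeed roughly the "15M per second" mentioned, while the on-disk FAST_DIFF-encoded rows average only about 11 bytes each, so each row expands roughly 4x during decoding — consistent with the guess that decoding is where the CPU goes.

```java
// Sanity-checking the throughput numbers from the thread.
// All constants are taken from the HFile dump above; nothing here calls HBase.
public class ScanMath {
    public static void main(String[] args) {
        double meanRowSize = 48.17;     // "Row size (bytes): mean = 48.17"
        long rowsPerSecond = 300_000;   // observed per-region scan rate

        // Decoded KeyValue throughput per region:
        double mbPerSecond = rowsPerSecond * meanRowSize / 1_000_000.0;
        System.out.printf("decoded throughput ~ %.1f MB/s%n", mbPerSecond); // ~14.5 MB/s

        // How small FAST_DIFF makes each row on disk:
        long hfileLength = 1_832_309_188L;  // "length=1832309188"
        long entryCount  = 160_988_965L;    // "entryCount=160988965"
        double diskBytesPerRow = (double) hfileLength / entryCount;
        System.out.printf("on-disk bytes/row ~ %.1f%n", diskBytesPerRow);   // ~11.4 vs ~48 decoded
    }
}
```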
