> Great, thank you for the explanation.
> 
>  my table schema is:
> 
>         {NAME => 'URLs_sanity', FAMILIES => [{NAME => 'gs', VERSIONS =>
> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta-data', VERSIONS
> => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'snt', VERSIONS =>
> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> 
> a couple of questions:
>     1) How can I know the optimal value for BLOCKSIZE? What is the
> best practice regarding this setting?

Check the link I sent. There is an explanation on this setting in there.
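As a rough illustration (not a recommendation), the interaction between BLOCKSIZE, row size, and scanner caching can be sketched with the numbers from this thread (the 64K default block size from the schema above, ~4 KB rows, caching = 50); this is pure arithmetic, not measured behavior:

```java
// Back-of-the-envelope sketch of how HBase BLOCKSIZE relates to row size
// and scanner caching. All numbers are taken from this thread.
public class BlockSizeSketch {
    public static void main(String[] args) {
        int blockSize = 64 * 1024; // default BLOCKSIZE, as in the schema above
        int rowSize = 4 * 1024;    // ~4 KB per row

        int rowsPerBlock = blockSize / rowSize; // 16 rows share one block
        // With caching <= rowsPerBlock, every fetch is served from the same
        // block, so raising caching within that range changes nothing on disk.
        int caching = 50;
        int blocksPerFetch = (caching * rowSize) / blockSize; // ~3 full blocks

        System.out.println("rows per block:   " + rowsPerBlock);   // 16
        System.out.println("blocks per fetch: " + blocksPerFetch); // 3
    }
}
```

In other words, with 4 KB rows the jump from caching 1 to 50 only spans about three 64K blocks per fetch, which is why a much larger value may be needed before any difference shows up.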

>     2) Assuming I have 4 KB records and changed the caching to 50 -->
> 4*50 = 200 KB, which is ~3 blocks, so performance should have improved,
> but execution time stayed the same.

There is of course more involved than just this. Also, you may already be 
getting the most out of what your hardware can give you. You should also try to 
find out what your bottleneck is (IO, CPU, or network). Hadoop and HBase have 
many settings; there is no single magic knob that makes things fast or slow.
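One cheap thing to quantify before touching cluster settings is the number of client/server round trips the scan makes, since each next() that is not served from the scanner's client-side cache costs one RPC. A sketch using the ~9 million rows mentioned in this thread (arithmetic only, not measured):

```java
// Rough RPC-count estimate for a full scan: each next() call that is not
// served from the scanner's client-side cache costs one round trip.
public class ScanRpcEstimate {
    public static void main(String[] args) {
        long rows = 9_000_000L; // row count from this thread

        long rpcsAtCaching1 = rows;                   // one RPC per row
        long rpcsAtCaching50 = ceilDiv(rows, 50);     // 180,000 RPCs
        long rpcsAtCaching1000 = ceilDiv(rows, 1000); // 9,000 RPCs

        System.out.println(rpcsAtCaching1 + " / " + rpcsAtCaching50
                + " / " + rpcsAtCaching1000);
    }

    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }
}
```

Whether fewer round trips actually helps depends on which resource is saturated: if the scan is disk- or CPU-bound, shrinking the RPC count will not move the total time much.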

> 
> Oleg.
> 
> 
> On Thu, Nov 11, 2010 at 3:08 PM, Friso van Vollenhoven <
> [email protected]> wrote:
> 
>> Not that block size (that's the HDFS one), but the HBase block size. You
>> set it at table creation or it uses the default of 64K.
>> 
>> The description of hbase.client.scanner.caching says:
>> Number of rows that will be fetched when calling next
>> on a scanner if it is not served from memory. Higher caching values
>> will enable faster scanners but will eat up more memory and some
>> calls of next may take longer and longer times when the cache is empty.
>> 
>> That means that it will pre-fetch that number of rows, if the next row does
>> not come from memory. So if your rows are small enough to fit 100 of them in
>> one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it
>> will only go to disk when it exhausts the whole block, which sticks in block
>> cache. So, it will still fetch the same amount of data from disk every time.
>> If you increase the number to a value that is certain to load multiple
>> blocks at a time from disk, it will increase performance.
>> 
>> 
>> 
>> On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:
>> 
>>> Yes, I thought about a larger number, and as you said it depends on the
>>> block size.
>>> Good point.
>>> 
>>> I have one record of ~4 KB,
>>> block size is:
>>> 
>>> <property>
>>> <name>dfs.block.size</name>
>>> <value>268435456</value>
>>> <description>HDFS blocksize of 256MB for large file-systems.
>>> </description>
>>> </property>
>>> 
>>> what number should I choose? I am afraid that using a number that
>>> equals one block would lead to a SocketTimeoutException. Am I right?
>>> 
>>> Thanks Oleg.
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <
>>> [email protected]> wrote:
>>> 
>>>> How small is small? If it is bytes, then setting the value to 50 is not
>>>> so much different from 1, I suppose. If 50 rows fit in one block, it
>>>> will just fetch one block whether the setting is 1 or 50. You might
>>>> want to try a larger value. It should be fine if the records are small
>>>> and you need them all on the client side anyway.
>>>> 
>>>> It also depends on the block size, of course. When you only ever do
>>>> full scans on a table and little random access, you might want to
>>>> increase that.
>>>> 
>>>> Friso
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:
>>>> 
>>>>> Hi,
>>>>> To improve client performance I changed
>>>>> hbase.client.scanner.caching from 1 to 50.
>>>>> After running the client with the new value
>>>>> (hbase.client.scanner.caching = 50), it didn't improve execution time
>>>>> at all.
>>>>> 
>>>>> I have ~9 million small records.
>>>>> I have to do a full scan, so it brings all 9 million records to the
>>>>> client.
>>>>> My assumption was that this change would bring a significant
>>>>> improvement, but it did not.
>>>>> 
>>>>> Additional information:
>>>>> I scan a table which has 100 regions
>>>>> 5 servers
>>>>> 20 maps
>>>>> 4 concurrent maps
>>>>> The scan process takes 5.5 - 6 hours. That seems like too much time
>>>>> to me. Am I right? And how can I improve it?
>>>>> 
>>>>> 
>>>>> I changed the value in all hbase-site.xml files and restarted HBase.
>>>>> 
>>>>> Any suggestions?
>>>> 
>>>> 
>> 
>> 
