Great, thank you for the explanation.
My table schema is:
{NAME => 'URLs_sanity', FAMILIES => [{NAME => 'gs', VERSIONS =>
'1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta-data', VERSIONS
=> '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'snt', VERSIONS =>
'1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
A couple of questions:
1) How can I know the optimal BLOCKSIZE value? What is the
best practice regarding this?
2) Assuming I have 4 KB records and changed the caching to 50 --> 4 KB * 50 = 200 KB,
which is ~3 blocks, so performance should have improved, but execution
time was the same.
Oleg.
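The arithmetic in question 2 can be sketched as a quick back-of-the-envelope check, assuming the 64 KB default BLOCKSIZE shown in the schema above and the ~4 KB row size and caching value of 50 mentioned in this thread (the class name is hypothetical):

```java
// Back-of-the-envelope math for scanner caching vs. HBase block size.
// Assumes 4 KB rows and the 65536-byte BLOCKSIZE from the schema above.
public class ScanCachingMath {
    public static void main(String[] args) {
        int blockSize = 64 * 1024;   // HBase BLOCKSIZE (65536, per the schema)
        int rowSize = 4 * 1024;      // ~4 KB per record, per the thread
        int caching = 50;            // hbase.client.scanner.caching

        int rowsPerBlock = blockSize / rowSize;                // 16 rows per block
        int bytesPerFetch = rowSize * caching;                 // 200 KB per client fetch
        int blocksPerFetch = (bytesPerFetch + blockSize - 1) / blockSize; // ceil, ~3-4 blocks

        System.out.println("rows per block:   " + rowsPerBlock);
        System.out.println("bytes per fetch:  " + bytesPerFetch);
        System.out.println("blocks per fetch: " + blocksPerFetch);
    }
}
```

So a caching value of 50 does already span a few blocks with 4 KB rows, which is consistent with the observation that raising it from 1 to 50 was not the whole story.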
On Thu, Nov 11, 2010 at 3:08 PM, Friso van Vollenhoven <
[email protected]> wrote:
> Not that block size (that's the HDFS one), but the HBase block size. You
> set it at table creation or it uses the default of 64K.
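As a sketch, the per-family block size can be given at table creation in the HBase shell (the table and family names here are hypothetical; the BLOCKSIZE attribute is the same one shown in the schema output earlier in the thread):

```
create 'mytable', {NAME => 'cf', VERSIONS => '1', BLOCKSIZE => '16384'}
```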
>
> The description of hbase.client.scanner.caching says:
> Number of rows that will be fetched when calling next
> on a scanner if it is not served from memory. Higher caching values
> will enable faster scanners but will eat up more memory and some
> calls of next may take longer and longer times when the cache is empty.
>
> That means that it will pre-fetch that number of rows if the next row does
> not come from memory. So if your rows are small enough that 100 of them fit in
> one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it
> will only go to disk when it exhausts the whole block, which stays in the block
> cache. So it will still fetch the same amount of data from disk every time.
> If you increase the number to a value that is certain to load multiple
> blocks at a time from disk, it will improve performance.
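The point above can be illustrated with a toy model (this is not HBase code; the numbers are the assumed 4 KB rows in 64 KB blocks from the thread): the blocks read from disk during a full scan are fixed by the data and block size, while the caching value only changes how many next() round trips the client makes.

```java
// Toy model of a full scan: disk block reads vs. client RPC round trips.
// Hypothetical simplification; rowsPerBlock follows from 4 KB rows in 64 KB blocks.
public class ScanModel {
    static int ceilDiv(int a, int b) { return (a + b - 1) / b; }

    public static void main(String[] args) {
        int totalRows = 9_000_000;  // ~9 million records, per the thread
        int rowsPerBlock = 16;      // 64 KB block / 4 KB row

        // Disk block reads are the same no matter what the caching value is:
        int diskBlockReads = ceilDiv(totalRows, rowsPerBlock);

        for (int caching : new int[] {1, 50, 1000}) {
            int rpcs = ceilDiv(totalRows, caching); // next() round trips to the server
            System.out.println("caching=" + caching
                + " -> RPCs=" + rpcs + ", disk block reads=" + diskBlockReads);
        }
    }
}
```

Under this model, going from caching=1 to caching=1000 cuts the round trips from 9,000,000 to 9,000 while the disk I/O stays constant, which is where the real win from a larger caching value comes from.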
>
>
>
> On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:
>
> > Yes, I thought about a larger number, and you said it depends on the
> > block size. Good point.
> >
> > One record is ~4 KB, and the block size is:
> >
> > <property>
> > <name>dfs.block.size</name>
> > <value>268435456</value>
> > <description>HDFS blocksize of 256MB for large file-systems.
> > </description>
> > </property>
> >
> > What number should I choose? I am afraid that using a number
> > equal to one block could cause a SocketTimeoutException. Am I right?
> >
> > Thanks Oleg.
> >
> >
> >
> >
> > On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <
> > [email protected]> wrote:
> >
> >> How small is small? If it is bytes, then setting the value to 50 is not
> >> so much different from 1, I suppose. If 50 rows fit in one block, it will
> >> just fetch one block whether the setting is 1 or 50. You might want to try
> >> a larger value. It should be fine if the records are small and you need
> >> them all on the client side anyway.
> >>
> >> It also depends on the block size, of course. When you only ever do full
> >> scans on a table and little random access, you might want to increase
> >> that.
> >>
> >> Friso
> >>
> >>
> >>
> >>
> >> On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:
> >>
> >>> Hi,
> >>> To improve client performance I changed
> >>> hbase.client.scanner.caching from 1 to 50.
> >>> After running the client with the new value (hbase.client.scanner.caching
> >>> = 50), it didn't improve execution time at all.
> >>>
> >>> I have ~9 million small records.
> >>> I have to do a full scan, so it brings all 9 million records to the client.
> >>> My assumption was that this change would bring a significant improvement,
> >>> but it has not.
> >>>
> >>> Additional information:
> >>> I scan a table which has 100 regions,
> >>> 5 servers,
> >>> 20 maps,
> >>> 4 concurrent maps.
> >>> The scan process takes 5.5 - 6 hours. That seems like too much time to me.
> >>> Am I right, and how can I improve it?
> >>>
> >>>
> >>> I changed the value in all hbase-site.xml files and restarted HBase.
> >>>
> >>> Any suggestions?
> >>
> >>
>
>
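As a sanity check on the reported 5.5 - 6 hour scan time, the throughput implied by the numbers in the thread can be estimated (a rough sketch assuming ~4 KB per record; the class name is hypothetical):

```java
// Rough throughput estimate for the full scan described in the thread.
public class ScanThroughput {
    public static void main(String[] args) {
        long rows = 9_000_000L;
        long rowBytes = 4 * 1024L;          // ~4 KB per record (assumption from the thread)
        long totalBytes = rows * rowBytes;  // ~34 GiB total
        long seconds = 6 * 3600L;           // the reported ~6 hour scan

        double mibPerSec = totalBytes / (1024.0 * 1024.0) / seconds;
        System.out.printf("total: %.1f GiB, aggregate rate: %.2f MiB/s%n",
            totalBytes / (1024.0 * 1024.0 * 1024.0), mibPerSec);
    }
}
```

An aggregate rate of under 2 MiB/s across 5 servers is far below what a healthy cluster can stream, which supports the intuition in the thread that something other than raw disk throughput (such as per-row RPC round trips) dominates the scan time.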