RE: scan performance improvement

Michael Segel Thu, 11 Nov 2010 05:11:47 -0800

Correct me if I'm wrong, but isn't HBase's default block size 256MB while 
Hadoop's default block size is 64MB?



> From: [email protected]
> To: [email protected]
> Subject: Re: scan performance improvement
> Date: Thu, 11 Nov 2010 13:08:56 +0000
> 
> Not that block size (that's the HDFS one), but the HBase block size. You set 
> it at table creation or it uses the default of 64K.
> 
> The description of hbase.client.scanner.caching says:
> Number of rows that will be fetched when calling next
> on a scanner if it is not served from memory. Higher caching values
> will enable faster scanners but will eat up more memory and some
> calls of next may take longer and longer times when the cache is empty.
> 
> That means that it will pre-fetch that number of rows, if the next row does 
> not come from memory. So if your rows are small enough to fit 100 of them in 
> one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it 
> will only go to disk when it exhausts the whole block, which sticks in block 
> cache. So, it will still fetch the same amount of data from disk every time. 
> If you increase the number to a value that is certain to load multiple blocks 
> at a time from disk, it will increase performance.
> 
> 
> 
> On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:
> 
> > Yes, I thought about a large number; as you said, it depends on block size.
> > Good point.
> > 
> > One record is ~4k,
> > and the block size is:
> > 
> > <property>
> >  <name>dfs.block.size</name>
> >  <value>268435456</value>
> >  <description>HDFS blocksize of 256MB for large file-systems.
> > </description>
> > </property>
> > 
> > What number should I choose?
> > I am afraid that using a number equal to one block would lead to a
> > SocketTimeoutException. Am I right?
> > 
> > Thanks Oleg.
> > 
> > 
> > 
> > 
> > On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <
> > [email protected]> wrote:
> > 
> >> How small is small? If it is bytes, then setting the value to 50 is not so
> >> much different from 1, I suppose. If 50 rows fit in one block, it will just
> >> fetch one block whether the setting is 1 or 50. You might want to try a
> >> larger value. It should be fine if the records are small and you need them
> >> all on the client side anyway.
> >> 
> >> It also depends on the block size, of course. When you only ever do full
> >> scans on a table and little random access, you might want to increase that.
> >> 
> >> Friso
> >> 
> >> 
> >> 
> >> 
> >> On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:
> >> 
> >>> Hi,
> >>>  To improve client performance I changed
> >>> hbase.client.scanner.caching from 1 to 50.
> >>> After running the client with the new value (hbase.client.scanner.caching =
> >>> 50) it didn't improve execution time at all.
> >>> 
> >>> I have ~9 million small records.
> >>> I have to do a full scan, so it brings all 9 million records to the client.
> >>> My assumption was that this change would bring a significant improvement,
> >>> but it did not.
> >>> 
> >>> Additional information:
> >>> I scan a table which has 100 regions
> >>> 5 servers
> >>> 20 maps
> >>> 4 concurrent maps
> >>> The scan process takes 5.5 - 6 hours. To me that seems like too much time.
> >>> Am I right? And how can I improve it?
> >>> 
> >>> 
> >>> I changed the value in all hbase-site.xml files and restarted HBase.
> >>> 
> >>> Any suggestions?
> >> 
> >> 
>
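
A rough worked example of the arithmetic discussed in this thread, as a
minimal Python sketch. It assumes the ~4k record size and the 64K HBase block
default mentioned above, ignores key/value and index overhead, and treats each
batch of rows fetched by the scanner as one round trip to the region server.
The function names are hypothetical and exist only for illustration; nothing
here calls HBase.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def rows_per_block(block_size_bytes: int, row_size_bytes: int) -> int:
    """Approximate number of whole rows that fit in one HBase block."""
    return max(1, block_size_bytes // row_size_bytes)


def blocks_per_fetch(caching: int, block_size_bytes: int, row_size_bytes: int) -> int:
    """Approximate number of blocks spanned by one batch of `caching` rows."""
    return ceil_div(caching * row_size_bytes, block_size_bytes)


def round_trips(total_rows: int, caching: int) -> int:
    """Approximate number of scanner fetches needed for a full scan."""
    return ceil_div(total_rows, caching)


# Figures taken from the thread; both are approximations.
HBASE_BLOCK_BYTES = 64 * 1024   # HBase block size default of 64K
ROW_BYTES = 4 * 1024            # one record is ~4k
TOTAL_ROWS = 9_000_000          # ~9 million records

if __name__ == "__main__":
    print("rows per block:", rows_per_block(HBASE_BLOCK_BYTES, ROW_BYTES))
    for caching in (1, 50, 500):
        print(
            "caching", caching,
            "blocks per fetch", blocks_per_fetch(caching, HBASE_BLOCK_BYTES, ROW_BYTES),
            "round trips", round_trips(TOTAL_ROWS, caching),
        )
```

Under these assumptions about 16 rows fit in one block, so a caching value of
50 spans roughly 4 blocks per batch and needs 180,000 fetches for the full
scan, against 9 million fetches at a value of 1. Real numbers depend on the
actual row sizes and on what is already in the block cache.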
