Hi

I didn't change the block size (it is still 64k).
I am running a test configured with a caching size of 3600.
The test is still running, but I can already see that there is NO performance
improvement.
    How can I check that HBase is actually using the changed caching size?
Can I see it in the logs or with some debugging?
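
I could also set the caching on the Scan object itself instead of relying only
on hbase-site.xml; the per-scan value overrides the site-wide default, so the
effective setting would not depend on which config file the job picked up.
A minimal sketch (assuming the standard Java client API; the table name
URLs_sanity and the value 3600 are just the ones from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanCachingCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        Scan scan = new Scan();
        scan.setCaching(3600);        // rows fetched per RPC; overrides hbase.client.scanner.caching
        scan.setCacheBlocks(false);   // optional for full scans: avoid churning the block cache
        System.out.println("scanner caching = " + scan.getCaching());  // quick sanity check

        HTable table = new HTable(conf, "URLs_sanity");
        ResultScanner scanner = table.getScanner(scan);
        long rows = 0;
        for (Result r : scanner) {
            rows++;                   // process r ...
        }
        scanner.close();
        table.close();
        System.out.println("scanned " + rows + " rows");
    }
}

For the MapReduce scan, I believe the same Scan can be passed to
TableMapReduceUtil.initTableMapperJob, so each mapper's scanner would use the
explicit caching value as well.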

Thanks
Oleg.

On Thu, Nov 11, 2010 at 8:03 PM, Ryan Rawson <[email protected]> wrote:

> I'd be careful about adjusting the HFile block size; we settled on 64k after
> benchmarking a bunch of things, and it seemed to be a good performance
> point.
>
> As for scanning small rows, I'd go with a caching size of 1000-3000.
> When I set my scanners to that, I can pull 50k+ rows/sec from 1
> client.
>
> On Thu, Nov 11, 2010 at 7:36 AM, Friso van Vollenhoven
> <[email protected]> wrote:
> >> Great, thank you for the explanation.
> >>
> >> My table schema is:
> >>
> >>         {NAME => 'URLs_sanity', FAMILIES => [
> >>           {NAME => 'gs', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647',
> >>            BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
> >>           {NAME => 'meta-data', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647',
> >>            BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
> >>           {NAME => 'snt', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647',
> >>            BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
> >>
> >> A couple of questions:
> >>     1) How can I know the optimal BLOCKSIZE? What is the best practice
> >> regarding this?
> >
> > Check the link I sent. There is an explanation on this setting in there.
> >
> >>     2) Assuming my records are ~4 KB and I changed the caching to 50 --> 4*50 = 200 KB,
> >> which is ~3 blocks, so performance should have improved, but execution
> >> time was the same.
> >
> > There is of course more involved than just this. Also, you may already be
> > getting the most out of what your hardware can give you. You should also
> > try to find out where your bottleneck is (IO, CPU, or network). Hadoop and
> > HBase have many settings; there is no magic single knob that makes things
> > fast or slow.
> >
> >>
> >> Oleg.
> >>
> >>
> >> On Thu, Nov 11, 2010 at 3:08 PM, Friso van Vollenhoven <
> >> [email protected]> wrote:
> >>
> >>> Not that block size (that's the HDFS one), but the HBase block size. You
> >>> set it at table creation, or it uses the default of 64K.
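> >>>
> >>> For illustration, setting it at table creation from the Java client might look
> >>> roughly like this (a sketch only; the 128 KB value and the table/family names
> >>> are hypothetical, chosen just for the example):
> >>>
> >>> import org.apache.hadoop.hbase.HBaseConfiguration;
> >>> import org.apache.hadoop.hbase.HColumnDescriptor;
> >>> import org.apache.hadoop.hbase.HTableDescriptor;
> >>> import org.apache.hadoop.hbase.client.HBaseAdmin;
> >>>
> >>> public class CreateWithBlocksize {
> >>>     public static void main(String[] args) throws Exception {
> >>>         HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
> >>>         HTableDescriptor desc = new HTableDescriptor("mytable");  // hypothetical table
> >>>         HColumnDescriptor cf = new HColumnDescriptor("cf");       // hypothetical family
> >>>         cf.setBlocksize(128 * 1024);  // per-family HBase block size; default is 64K
> >>>         desc.addFamily(cf);
> >>>         admin.createTable(desc);
> >>>     }
> >>> }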
> >>>
> >>> The description of hbase.client.scanner.caching says:
> >>> Number of rows that will be fetched when calling next
> >>> on a scanner if it is not served from memory. Higher caching values
> >>> will enable faster scanners but will eat up more memory and some
> >>> calls of next may take longer and longer times when the cache is empty.
> >>>
> >>> That means that it will pre-fetch that number of rows, if the next row does
> >>> not come from memory. So if your rows are small enough to fit 100 of them in
> >>> one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it
> >>> will only go to disk when it exhausts the whole block, which sits in the block
> >>> cache. So it will still fetch the same amount of data from disk every time.
> >>> If you increase the number to a value that is certain to load multiple
> >>> blocks at a time from disk, it will increase performance.
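> >>>
> >>> To put rough numbers on it (a back-of-the-envelope example, assuming ~4 KB rows
> >>> and the default 64 KB block size): one block holds roughly 16 rows, so a caching
> >>> value of 50 only spans about 3 blocks per fetch, whereas a value in the thousands
> >>> (say 3000, i.e. ~12 MB) would span close to 200 blocks per fetch and amortise the
> >>> per-RPC overhead much better.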
> >>>
> >>>
> >>>
> >>> On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:
> >>>
> >>>> Yes, I thought about a larger number; as you said, it depends on the
> >>>> block size. Good point.
> >>>>
> >>>> Each record is ~4k,
> >>>> and the block size is:
> >>>>
> >>>> <property>
> >>>> <name>dfs.block.size</name>
> >>>> <value>268435456</value>
> >>>> <description>HDFS blocksize of 256MB for large file-systems.
> >>>> </description>
> >>>> </property>
> >>>>
> >>>> What number should I choose? I am afraid that using a number that equals
> >>>> one block will lead to a SocketTimeoutException. Am I right?
> >>>>
> >>>> Thanks Oleg.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> How small is small? If it is bytes, then setting the value to 50 is not so
> >>>>> much different from 1, I suppose. If 50 rows fit in one block, it will just
> >>>>> fetch one block whether the setting is 1 or 50. You might want to try a
> >>>>> larger value. It should be fine if the records are small and you need them
> >>>>> all on the client side anyway.
> >>>>>
> >>>>> It also depends on the block size, of course. When you only ever do full
> >>>>> scans on a table and little random access, you might want to increase
> >>>>> that.
> >>>>>
> >>>>> Friso
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>> To improve client performance I changed
> >>>>>> hbase.client.scanner.caching from 1 to 50.
> >>>>>> After running the client with the new value (hbase.client.scanner.caching = 50),
> >>>>>> execution time did not improve at all.
> >>>>>>
> >>>>>> I have ~9 million small records.
> >>>>>> I have to do a full scan, so it brings all 9 million records to the client.
> >>>>>> My assumption was that this change would bring a significant improvement,
> >>>>>> but it did not.
> >>>>>>
> >>>>>> Additional information:
> >>>>>> I scan a table which has 100 regions
> >>>>>> 5 servers
> >>>>>> 20 maps
> >>>>>> 4 concurrent maps
> >>>>>> The scan process takes 5.5 - 6 hours, which seems like too much time to me.
> >>>>>> Am I right? And how can I improve it?
> >>>>>>
> >>>>>>
> >>>>>> I changed the value in all hbase-site.xml files and restarted HBase.
> >>>>>>
> >>>>>> Any suggestions?
> >>>>>
> >>>>>
> >>>
> >>>
> >
> >
>
