I'd be careful about adjusting the HFile block size; we settled on 64 KB after benchmarking a bunch of things, and it seemed to be a good performance point.
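The caching value can also be set per scanner from the Java client, which overrides the cluster-wide hbase.client.scanner.caching default. A minimal sketch, assuming the 0.90-era client API and the 'URLs_sanity' table from the thread below (this is an illustration only; it needs a running cluster and the hbase-client library):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachedScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "URLs_sanity");

        Scan scan = new Scan();
        // Fetch 1000 rows per next() RPC instead of the default of 1;
        // this overrides hbase.client.scanner.caching for this scanner only.
        scan.setCaching(1000);

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                // process each row here
            }
        } finally {
            scanner.close();  // release the server-side scanner lease
            table.close();
        }
    }
}
```

Setting it on the Scan object is usually preferable to editing hbase-site.xml everywhere, since a good value depends on the row size of the particular scan.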
As for scanning small rows, I'd go with a caching size of 1000-3000. When I set my scanners to that, I can pull 50k+ rows/sec from one client.

On Thu, Nov 11, 2010 at 7:36 AM, Friso van Vollenhoven <[email protected]> wrote:
>> Great, thank you for the explanation.
>>
>> My table schema is:
>>
>> {NAME => 'URLs_sanity', FAMILIES => [
>>   {NAME => 'gs', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647',
>>    BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>>   {NAME => 'meta-data', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647',
>>    BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
>>   {NAME => 'snt', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647',
>>    BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>>
>> A couple of questions:
>> 1) How can I know the optimal BLOCKSIZE? What is the best practice regarding this?
>
> Check the link I sent. There is an explanation of this setting in there.
>
>> 2) Assuming I have a 4 KB record and changed the caching to 50 --> 4 * 50 = 200 KB,
>> which is ~3 blocks, so performance should have improved, but execution time was
>> the same.
>
> There is of course more involved than just this. Also, you may already be getting
> the most out of what your hardware can give you. You should try to find out what
> your bottleneck is (IO, CPU, or network). Hadoop and HBase have many settings;
> there is no single magic knob that makes things fast or slow.
>
>> Oleg.
>>
>> On Thu, Nov 11, 2010 at 3:08 PM, Friso van Vollenhoven <[email protected]> wrote:
>>
>>> Not that block size (that's the HDFS one), but the HBase block size. You set it
>>> at table creation, or it uses the default of 64K.
>>>
>>> The description of hbase.client.scanner.caching says:
>>> Number of rows that will be fetched when calling next on a scanner if it is not
>>> served from memory.
>>> Higher caching values will enable faster scanners, but will eat up more memory,
>>> and some calls of next may take longer when the cache is empty.
>>>
>>> That means it will pre-fetch that number of rows if the next row does not come
>>> from memory. So if your rows are small enough to fit 100 of them in one block,
>>> it doesn't matter whether you pre-fetch 1, 50, or 99, because it will only go
>>> to disk when it exhausts the whole block, which sticks in the block cache. So
>>> it will still fetch the same amount of data from disk every time. If you
>>> increase the number to a value that is certain to load multiple blocks at a
>>> time from disk, it will increase performance.
>>>
>>> On 11 nov 2010, at 12:55, Oleg Ruchovets wrote:
>>>
>>>> Yes, I thought about a large number, and as you said, it depends on the block
>>>> size. Good point.
>>>>
>>>> One record is ~4 KB, and the block size is:
>>>>
>>>> <property>
>>>>   <name>dfs.block.size</name>
>>>>   <value>268435456</value>
>>>>   <description>HDFS blocksize of 256MB for large file-systems.</description>
>>>> </property>
>>>>
>>>> What number should I choose? I am afraid that using a number that equals one
>>>> block leads to a SocketTimeoutException. Am I right?
>>>>
>>>> Thanks, Oleg.
>>>>
>>>> On Thu, Nov 11, 2010 at 1:30 PM, Friso van Vollenhoven <[email protected]> wrote:
>>>>
>>>>> How small is small? If it is bytes, then setting the value to 50 is not so
>>>>> much different from 1, I suppose. If 50 rows fit in one block, it will just
>>>>> fetch one block whether the setting is 1 or 50. You might want to try a
>>>>> larger value. It should be fine if the records are small and you need them
>>>>> all on the client side anyway.
>>>>>
>>>>> It also depends on the block size, of course. When you only ever do full
>>>>> scans on a table and little random access, you might want to increase that.
>>>>> Friso
>>>>>
>>>>> On 11 nov 2010, at 12:15, Oleg Ruchovets wrote:
>>>>>
>>>>>> Hi,
>>>>>> To improve client performance I changed hbase.client.scanner.caching from
>>>>>> 1 to 50. After running the client with the new value
>>>>>> (hbase.client.scanner.caching = 50), it didn't improve execution time at all.
>>>>>>
>>>>>> I have ~9 million small records.
>>>>>> I have to do a full scan, so it brings all 9 million records to the client.
>>>>>> My assumption was that this change would bring a significant improvement,
>>>>>> but it did not.
>>>>>>
>>>>>> Additional information:
>>>>>> I scan a table which has 100 regions
>>>>>> 5 servers
>>>>>> 20 maps
>>>>>> 4 concurrent maps
>>>>>> The scan process takes 5.5-6 hours. That seems like too much time to me.
>>>>>> Am I right, and how can I improve it?
>>>>>>
>>>>>> I changed the value in all hbase-site.xml files and restarted HBase.
>>>>>>
>>>>>> Any suggestions?
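Oleg mentions changing the value in hbase-site.xml; for reference, the cluster-wide property looks like the fragment below. The value 1000 is just an illustration, in line with the 1000-3000 advice at the top of the thread:

```xml
<property>
  <name>hbase.client.scanner.caching</name>
  <!-- Rows fetched per next() call when not served from memory; the default is 1. -->
  <value>1000</value>
</property>
```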

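To make the block arithmetic in the thread concrete, here is a self-contained sketch. The 4 KB row size and 64 KB HFile block size are taken from the messages above; the helper names are my own:

```java
public class ScanBatchMath {

    // How many rows of the given size fit in one HFile block.
    static int rowsPerBlock(int blockSizeBytes, int rowSizeBytes) {
        return blockSizeBytes / rowSizeBytes;
    }

    // How many blocks one scanner-caching batch spans, rounding up.
    static int blocksPerBatch(int cachingRows, int rowSizeBytes, int blockSizeBytes) {
        return (int) Math.ceil((double) cachingRows * rowSizeBytes / blockSizeBytes);
    }

    public static void main(String[] args) {
        int blockSize = 64 * 1024;  // 64 KB HFile block, as in the schema above
        int rowSize = 4 * 1024;     // ~4 KB per record

        // 16 rows fit in one block, so caching values up to 16 hit the same block.
        System.out.println(rowsPerBlock(blockSize, rowSize));         // 16

        // caching = 50 -> ~200 KB per fetch, spanning 4 blocks.
        System.out.println(blocksPerBatch(50, rowSize, blockSize));   // 4

        // caching = 1000 -> ~4 MB per fetch, spanning 63 blocks.
        System.out.println(blocksPerBatch(1000, rowSize, blockSize)); // 63
    }
}
```

This is why moving caching from 1 to 50 barely helped: each next() call already pulls a whole 64 KB block into the block cache, so per-row disk I/O stays roughly the same until the batch spans many blocks.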