When thinking about your region size (max file size), you need to consider 
how many regions you will have in total and, from that, how many regions 
will end up on each node.  That per-node region count has a significant 
impact on the _actual_ size of the files that get flushed to disk.

You can increase your flush size to 256MB, but with the number of 
regions/node you currently have, this won't make any difference.  You will 
always flush because of global heap pressure rather than because you reached 
this threshold.

Global MemStore size / # regions per node = average size of memstores.  You 
will then flush files that are somewhere between that average and 2X that 
average, because of heap pressure (assuming random insertions).

If you have 3GB for MemStore and 1000 regions/node, you will have approx 3MB 
for each region.  With random distribution, your biggest MemStores could be up 
to 6MB, so you will be flushing files that are somewhere between 3MB and 6MB in 
size, more than an order of magnitude smaller than the flush size.
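
To make the arithmetic concrete, here is a back-of-envelope sketch assuming 
the 8GB RS heap mentioned below and the 0.20 default of 0.4 for 
hbase.regionserver.global.memstore.upperLimit (illustrative numbers, not 
measurements from your cluster):

    8192MB heap * 0.4 upperLimit   ~= 3.2GB global MemStore ceiling
    3.2GB / 1000 regions per node  ~= 3MB average per MemStore
    2X that average at flush time  ~= 6MB for the biggest MemStores

So the files you actually write stay in the 3MB-6MB range regardless of 
whether the per-region flush size is set to 64MB or 256MB.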

JG

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-
> Daniel Cryans
> Sent: Tuesday, June 08, 2010 10:12 AM
> To: [email protected]
> Subject: Re: Suggested config changes to be made
> 
> On Mon, Jun 7, 2010 at 5:52 PM, Daniel Einspanjer
> <[email protected]> wrote:
> >  For Socorro, we currently have a 15 node HBase 0.20.3 cluster.
> > The hardware is dual hyperthreaded quads with 24GB of RAM (RS JVM is
> > allocated 8GB).
> > HDFS Health reports that we are currently using 20TB out of 60TB.
> > (Storage is only HBase related at the moment.)
> > hadoop dfs -dus /hbase reports about 7TB of usage.
> >
> > In production at the moment, we have a single HBase table, crash_reports.
> > The table has a poorly chosen rowkey format that starts with the current
> > date, so all inserts currently go into a single region.  In our next
> > release, the rowkey will be salted to prevent this problem.
> >
> > We are currently inserting 10 to 20 new records per second.  In our next
> > Socorro release, that number will be multiplied by 5 due to inserts into
> > different index tables.
> >
> > At the moment, we have 40k regions on our 15 servers.  There were some
> > questions raised on the #hbase IRC channel about different settings.  I'm
> > posting this e-mail to collect the suggestions for changes we should make
> > during our scheduled upgrade to 0.20.5 in less than two weeks.
> >
> > Currently, our region.max.size is the default 256.  It was suggested that
> > this should be at least 1GB.  What are the steps to ensure that we have the
> > right size for the new tables we'll create during our upgrade, and what we
> > should do about our existing table?
> 
> I think MAX_FILESIZE should be 1GB and MEMSTORE_FLUSHSIZE should be 256MB,
> because if you leave it at 64MB you will end up rewriting your data a
> lot during minor compactions. Also, since your insert pattern isn't
> totally random (even if salted), you likely won't hit the global
> memstore limit. But this will really only be helpful once you use the
> durability release, since IIRC your hlog size is lower than the
> default (which makes sense with sync; you don't want to lose too much
> data).
> 
> To change it, after restarting on 0.20.5 simply disable the table, do
> an alter table, then re-enable it.
> 
> For the new tables, it depends... will they have as much data as this
> one? For example, if you don't think that the index tables can grow
> larger than 20GB, then I wouldn't set their split size higher since
> that would lower their distribution too much.
> >
> > This output indicates that block cache is disabled on -ROOT-.  It sounds
> > like it was recommended to enable this.  Is it just an alter table or is
> > there anything else that needs to be done?
> >
> > $ hbase shell
> > HBase Shell; enter 'help<RETURN>' for list of supported commands.
> > Version: 0.20.3, rUnknown, Tue Feb  2 08:32:37 PST 2010
> > hbase(main):001:0> scan '-ROOT-'
> > ROW         COLUMN+CELL
> >  .META.,,1  column=info:regioninfo, timestamp=1259618213386,
> >             value=REGION => {NAME => '.META.,,1', STARTKEY => '',
> >             ENDKEY => '', ENCODED => 1028785192, TABLE => {{NAME =>
> >             '.META.', IS_META => 'true', MEMSTORE_FLUSHSIZE => '16384',
> >             FAMILIES => [{NAME => 'historian', VERSIONS => '2147483647',
> >             COMPRESSION => 'NONE', TTL => '604800', BLOCKSIZE => '8192',
> >             IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'info',
> >             VERSIONS => '10', COMPRESSION => 'NONE', TTL => '2147483647',
> >             BLOCKSIZE => '8192', IN_MEMORY => 'false',
> >             BLOCKCACHE => 'false'}]}}
> >  .META.,,1  column=info:server, timestamp=1275952945634,
> >             value=10.2.72.80:60020
> >  .META.,,1  column=info:serverstartcode, timestamp=1275952945634,
> >             value=1275952942699
> > 1 row(s) in 0.0780 seconds
> 
> 
> See bin/set_meta_block_caching.rb
> 
> >
> >
> > {NAME => 'crash_reports', FAMILIES => [
> >   {NAME => 'meta_data', COMPRESSION => 'LZO', VERSIONS => '3',
> >    TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> >    BLOCKCACHE => 'true'},
> >   {NAME => 'processed_data', VERSIONS => '3', COMPRESSION => 'LZO',
> >    TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> >    BLOCKCACHE => 'true'},
> >   {NAME => 'raw_data', COMPRESSION => 'LZO', VERSIONS => '3',
> >    TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> >    BLOCKCACHE => 'true'}]}
> >
> >
> > Is there any other information I should provide that could lead to other
> > important config changes we should make on this upgrade?
> 
> If you are still planning to let HBase manage the ZK ensemble, do
> change its dataDir ;)
> 
> J-D
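
For the disable / alter / re-enable step J-D describes above, a rough sketch 
of what it could look like in the 0.20 shell is below.  The attribute names 
and byte values are illustrative (1GB = 1073741824, 256MB = 268435456); the 
syntax here is from memory rather than verified against 0.20.5, so check 
"help 'alter'" in your own shell before running it:

    hbase> disable 'crash_reports'
    hbase> alter 'crash_reports', METHOD => 'table_att', MAX_FILESIZE => '1073741824'
    hbase> alter 'crash_reports', METHOD => 'table_att', MEMSTORE_FLUSHSIZE => '268435456'
    hbase> enable 'crash_reports'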
