The LZO did not seem to work with the 1GB region size. It was causing several minute pauses followed by 5 seconds of requests being processed and then again 30+ second pauses (is this GC, compaction, or splits??...all regions go to 0 requests but the Meta region which has a few hundred requests). Once going back to the default region size the pauses seemed to go away. I still see all nodes going to zero requests except the Meta table region, but it only lasts for a few seconds at the most. We also had the max open files problem (which has now been fixed), so don't know if that could have caused that.
I have been loading for several days and since fixing the max file problem everything has been smooth. We are up to 2+ TB compressed and almost 5,000 regions on 10 nodes. Lzop sure helps pack a lot of data into a small space. The writes seem to hover in the 7.5-8k/node/sec range. This week we will test the reads. So far so good. Thanks. On Sun, Jan 2, 2011 at 8:11 PM, Sean Bigdatafun <[email protected]>wrote: > Has this cured the GC pause at all? I do not see why turning on LZO is > relavent at all (I read your email, and it sounds that you saw pause after > the LZO is turned on). > > BTW, are you using CMS on a 8GB Heapsize JVM and experiencing a 4 mins > pause? That sounds a lot. > > On Thu, Dec 30, 2010 at 1:51 PM, Wayne <[email protected]> wrote: > > > Lesson learned...restart thrift servers *after* restarting hadoop+hbase. > > > > On Thu, Dec 30, 2010 at 3:39 PM, Wayne <[email protected]> wrote: > > > > > We have restarted with lzop compression, and now I am seeing some > really > > > long and frequent stop the world pauses of the entire cluster. The load > > > requests for all regions all go to zero except for the meta table > region. > > No > > > data batches are getting in (no loads are occurring) and everything > seems > > > frozen. It seems to last for 5+ seconds. Is this GC on the master or GC > > in > > > the meta region? What could cause everything to stop for several > seconds? > > It > > > appears to happen on a recurring basis as well. I think we saw it > before > > > switching to lzo but it seems much worse now (lasts longer and occurs > > more > > > frequently). > > > > > > Thanks. > > > > > > > > > > > > On Thu, Dec 30, 2010 at 12:20 PM, Wayne <[email protected]> wrote: > > > > > >> HBase Version 0.89.20100924, r1001068 w/ 8GB heap > > >> > > >> I plan to run for 1 week straight maxed out. I am worried about GC > > pauses, > > >> especially concurrent mode failures (does hbase/hadoop suffer these > > under > > >> extended load?). What should I be looking for in the gc log in terms > of > > >> problem signs? The ParNews are quick but the CMS concurrent marks are > > taking > > >> as much as 4 mins with an average of 20-30 secs. > > >> > > >> Thanks. > > >> > > >> > > >> > > >> On Thu, Dec 30, 2010 at 12:00 PM, Stack <[email protected]> wrote: > > >> > > >>> Oh, what versions are you using? > > >>> St.Ack > > >>> > > >>> On Thu, Dec 30, 2010 at 9:00 AM, Stack <[email protected]> wrote: > > >>> > Keep going. Let it run longer. Get the servers as loaded as you > > think > > >>> > they'll be in production. Make sure the perf numbers are not > because > > >>> > cluster is 'fresh'. > > >>> > St.Ack > > >>> > > > >>> > On Thu, Dec 30, 2010 at 5:51 AM, Wayne <[email protected]> wrote: > > >>> >> We finally got our cluster up and running and write performance > > looks > > >>> very > > >>> >> good. We are getting sustained 8-10k writes/sec/node on a 10 node > > >>> cluster > > >>> >> from Python through thrift. These are values written to 3 CFs so > > >>> actual > > >>> >> hbase performance is 25-30k writes/sec/node. The nodes are > currently > > >>> disk > > >>> >> i/o bound (40-50% utilization) but hopefully once we get lzop > > working > > >>> this > > >>> >> will go down. We have been running for 12 hours without a problem. > > We > > >>> hope > > >>> >> to get lzop going today and then load all through the long > weekend. > > >>> >> > > >>> >> We plan to then test reads next week after we get some data in > > there. > > >>> Looks > > >>> >> good so far! Below are our settings in case there are some > > >>> >> suggestions/concerns. > > >>> >> > > >>> >> Thanks for everyone's help. It is pretty exciting to get > performance > > >>> like > > >>> >> this from the start. > > >>> >> > > >>> >> > > >>> >> *Global* > > >>> >> > > >>> >> client.write.buffer = 10485760 (10MB = 5x default) > > >>> >> > > >>> >> optionalLogFlushInterval = 10000 (10 secs = 10x default) > > >>> >> > > >>> >> memstore.flush.size = 268435456 (256MB = 4x default) > > >>> >> > > >>> >> hregion.max.filesize = 1073741824 (1GB = 4x default) > > >>> >> > > >>> >> *Table* > > >>> >> > > >>> >> alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> > > >>> >> On Wed, Dec 29, 2010 at 12:55 AM, Stack <[email protected]> wrote: > > >>> >> > > >>> >>> On Mon, Dec 27, 2010 at 11:47 AM, Wayne <[email protected]> > wrote: > > >>> >>> > All data is written to 3 CFs. Basically 2 of the CFs are > > secondary > > >>> >>> indexes > > >>> >>> > (manually managed as normal CFs). It sounds like we should try > > hard > > >>> to > > >>> >>> get > > >>> >>> > as much out of thrift as we can before going to a lower level. > > >>> >>> > > >>> >>> Yes. > > >>> >>> > > >>> >>> Writes need > > >>> >>> > to be "fast enough", but reads are more important in the end > (and > > >>> are the > > >>> >>> > reason we are switching from a different solution). The numbers > > you > > >>> >>> quoted > > >>> >>> > below sound like they are in the ballpark of what we are > looking > > to > > >>> do. > > >>> >>> > > > >>> >>> > > >>> >>> Even the tens per second that I threw in there to CMA? > > >>> >>> > > >>> >>> > Much of our data is cold, and we expect reads to be disk i/o > > based. > > >>> >>> > > >>> >>> OK. FYI, we're not the best at this -- cache-miss cold reads -- > > what > > >>> >>> w/ a network hop in the way and currently we'll open a socket per > > >>> >>> access. > > >>> >>> > > >>> >>> > Given > > >>> >>> > this is 8GB heap a good place to start on the data nodes (24GB > > >>> ram)? Is > > >>> >>> the > > >>> >>> > block cache managed on its own (being it won't blow up causing > > >>> OOM), > > >>> >>> > > >>> >>> It won't. Its constrained. Does our home-brewed sizeof. > Default, > > >>> >>> its 0.2 of total heap. If you think cache will help, you could > go > > up > > >>> >>> from there. 0.4 or 0.5 of heap. > > >>> >>> > > >>> >>> > and if > > >>> >>> > we do not use it (block cache) should we go even lower for the > > heap > > >>> (we > > >>> >>> want > > >>> >>> > to avoid CMF and long GC pauses)? > > >>> >>> > > >>> >>> If you are going to be doing cache-miss most of the time and cold > > >>> >>> reads, then yes, you can do away with cache. > > >>> >>> > > >>> >>> In testing of 0.90.x I've been running w/ 1MB heaps with 1k > regions > > >>> >>> but this is my trying to break stuff. > > >>> >>> > > >>> >>> > Are there any timeouts we need to tweak to > > >>> >>> > make the cluster more "accepting" of long GC pauses while under > > >>> sustained > > >>> >>> > load (7+ days of 10k/inserts/sec/node)? > > >>> >>> > > > >>> >>> > > >>> >>> If zookeeper client timesout, the regionserver will shut itself > > down. > > >>> >>> In 0.90.0RC2, the client sessionout is set high -- 3 minutes. If > > you > > >>> >>> timeout that, then thats pretty extreme... something badly wrong > > I'd > > >>> >>> say. Heres' a few notes on the config and others that you might > > want > > >>> >>> to twiddle (see previous section on required configs... make sure > > >>> >>> you've got those too): > > >>> >>> > > >>> >>> > > >>> > > > http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations<http://people.apache.org/%7Estack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations> > > < > > > http://people.apache.org/%7Estack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations > < > http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations<http://people.apache.org/%7Estack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations> > > > > > > > >>> < > > >>> > > > http://people.apache.org/%7Estack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations > < > http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations<http://people.apache.org/%7Estack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations> > > > > >>> > > > >>> >>> > > >>> >>> > > >>> >>> > Does LZO compression speed up reads/writes where there is > excess > > >>> CPU to > > >>> >>> do > > >>> >>> > the compression? I assume it would lower disk i/o but increase > > CPU > > >>> a lot. > > >>> >>> Is > > >>> >>> > data compressed on the initial write or only after compaction? > > >>> >>> > > > >>> >>> > > >>> >>> LZO is pretty frictionless -- i.e. little CPU cost -- and yes, > > >>> usually > > >>> >>> helps speed things up (grab more in the one go). What size are > > your > > >>> >>> records? You might want to mess w/ hfile block sizes though the > > 64k > > >>> >>> default is usually good enough for all but very small cell sizes. > > >>> >>> > > >>> >>> > > >>> >>> > With the replication in the HDFS layer how are reads managed in > > >>> terms of > > >>> >>> > load balancing across region servers? Does HDFS know to spread > > >>> multiple > > >>> >>> > requests across the 3 region servers that contain the same > data? > > >>> >>> > > >>> >>> You only read from one of the replicas, always the 'closest'. If > > the > > >>> >>> DFSClient has trouble getting the first of the replicas, it moves > > on > > >>> >>> to the second, etc. > > >>> >>> > > >>> >>> > > >>> >>> > For example > > >>> >>> > with 10 data nodes if we have 50 concurrent readers with very > > >>> "random" > > >>> >>> key > > >>> >>> > requests we would expect to have 5 reads occurring on each data > > >>> node at > > >>> >>> the > > >>> >>> > same time. We plan to have a thrift server on each data node, > so > > 5 > > >>> >>> > concurrent readers will be connected to each thrift server at > any > > >>> given > > >>> >>> time > > >>> >>> > (50 in aggregate across 10 nodes). We want to be sure > everything > > is > > >>> >>> designed > > >>> >>> > to evenly spread this load to avoid any possible hot-spots. > > >>> >>> > > > >>> >>> > > >>> >>> This is different. This is key design. A thrift server will be > > >>> doing > > >>> >>> some subset of the key space. If the requests are evenly > > distributed > > >>> >>> over all of the key space, then you should be fine; all thrift > > >>> servers > > >>> >>> will be evenly loaded. If not, then there could be hot spots. > > >>> >>> > > >>> >>> We have a balancer that currently only counts regions per server, > > not > > >>> >>> regions per server plus hits per region so it could be the case > > that > > >>> a > > >>> >>> server by chance ends up carrying all of the hot regions. HBase > > >>> >>> itself is not too smart dealing with this. In 0.90.0, there is > > >>> >>> facility for manually moving regions -- i.e. closing in current > > >>> >>> location and moving the region off to another server w/ some > outage > > >>> >>> while the move is happening (usually seconds) -- or you could > split > > >>> >>> the hot region manually and then the daughters could be moved off > > to > > >>> >>> other servers... Primitive for now but should be better in next > > HBase > > >>> >>> versions. > > >>> >>> > > >>> >>> Have you been able to test w/ your data and your query pattern? > > >>> >>> That'll tell you way more than I ever could. > > >>> >>> > > >>> >>> Good luck, > > >>> >>> St.Ack > > >>> >>> > > >>> >>> > > >>> >>> > > > >>> >>> > > > >>> >>> > On Mon, Dec 27, 2010 at 1:49 PM, Stack <[email protected]> > wrote: > > >>> >>> > > > >>> >>> >> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> > > wrote: > > >>> >>> >> > We are in the process of evaluating hbase in an effort to > > switch > > >>> from > > >>> >>> a > > >>> >>> >> > different nosql solution. Performance is of course an > > important > > >>> part > > >>> >>> of > > >>> >>> >> our > > >>> >>> >> > evaluation. We are a python shop and we are very worried > that > > we > > >>> can > > >>> >>> not > > >>> >>> >> get > > >>> >>> >> > any real performance out of hbase using thrift (and must > drop > > >>> down to > > >>> >>> >> java). > > >>> >>> >> > We are aware of the various lower level options for bulk > > insert > > >>> or > > >>> >>> java > > >>> >>> >> > based inserts with turning off WAL etc. but none of these > are > > >>> >>> available > > >>> >>> >> to > > >>> >>> >> > us in python so are not part of our evaluation. > > >>> >>> >> > > >>> >>> >> I can understand python for continuous updates from your > > frontend > > >>> or > > >>> >>> >> whatever but you might consider hacking up a bit of java to > make > > >>> us of > > >>> >>> >> the bulk updater; you'll get upload rates orders of magnitude > > >>> beyond > > >>> >>> >> what you'd achieve going via the API via python (or java for > > that > > >>> >>> >> matter). You can also do incremental updates using the bulk > > >>> loader. > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> We have a 10 node cluster > > >>> >>> >> > (24gb, 6 x 1TB, 16 core) that we setting up as data/region > > >>> nodes, and > > >>> >>> we > > >>> >>> >> are > > >>> >>> >> > looking for suggestions on configuration as well as > benchmarks > > >>> in > > >>> >>> terms > > >>> >>> >> of > > >>> >>> >> > expectations of performance. Below are some specific > > questions. > > >>> I > > >>> >>> realize > > >>> >>> >> > there are a million factors that help determine specific > > >>> performance > > >>> >>> >> > numbers, so any examples of performance from running > clusters > > >>> would be > > >>> >>> >> great > > >>> >>> >> > as examples of what can be done. > > >>> >>> >> > > >>> >>> >> Yeah, you have been around the block obviously. Its hard to > give > > >>> out > > >>> >>> >> 'numbers' since so many different factors involved. > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> Again thrift seems to be our "problem" so > > >>> >>> >> > non java based solutions are preferred (do any non java > based > > >>> shops > > >>> >>> run > > >>> >>> >> > large scale hbase clusters?). Our total production cluster > > size > > >>> is > > >>> >>> >> estimated > > >>> >>> >> > to be 50TB. > > >>> >>> >> > > > >>> >>> >> > > >>> >>> >> There are some substantial shops running non-java; e.g. the > > yfrog > > >>> >>> >> folks go via REST, the mozilla fellas are python over thrift, > > >>> >>> >> Stumbleupon is php over thrift. > > >>> >>> >> > > >>> >>> >> > Our data model is 3 CFs, one primary and 2 secondary > indexes. > > >>> All > > >>> >>> writes > > >>> >>> >> go > > >>> >>> >> > to all 3 CFs and are grouped as a batch of row mutations > which > > >>> should > > >>> >>> >> avoid > > >>> >>> >> > row locking issues. > > >>> >>> >> > > > >>> >>> >> > > >>> >>> >> A write updates 3CFs and secondary indices? Thats an > expensive > > >>> Put > > >>> >>> >> relatively. You have to run w/ 3CFs? It facilitates fast > > >>> querying? > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> > What heap size is recommended for master, and for region > > servers > > >>> (24gb > > >>> >>> >> ram)? > > >>> >>> >> > > >>> >>> >> Master doesn't take much heap, at least not in the coming > 0.90.0 > > >>> HBase > > >>> >>> >> (Is that what you intend to run)? > > >>> >>> >> > > >>> >>> >> The more RAM you give the regionservers, the more cache your > > >>> cluster > > >>> >>> will > > >>> >>> >> have. > > >>> >>> >> > > >>> >>> >> Whats important to you read or write times? > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> > What other settings can/should be tweaked in hbase to > optimize > > >>> >>> >> performance > > >>> >>> >> > (we have looked at the wiki page)? > > >>> >>> >> > > >>> >>> >> Thats a good place to start. Take a look through this mailing > > >>> list > > >>> >>> >> for others (Its time for a trawl of mailing list and then > > >>> distilling > > >>> >>> >> the findings into a reedit of our perf page). > > >>> >>> >> > > >>> >>> >> > What is a good batch size for writes? We will start with 10k > > >>> >>> >> values/batch. > > >>> >>> >> > > >>> >>> >> Start small with defaults. Make sure its all running smooth > > >>> first. > > >>> >>> >> Then rachet it up. > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> > How many concurrent writers/readers can a single data node > > >>> handle with > > >>> >>> >> > evenly distributed load? Are there settings specific to > this? > > >>> >>> >> > > >>> >>> >> How many clients you going to have writing HBase? > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> > What is "very good" read/write latency for a single put/get > in > > >>> hbase > > >>> >>> >> using > > >>> >>> >> > thrift? > > >>> >>> >> > > >>> >>> >> "Very Good" would be < a few milliseconds. > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> > What is "very good" read/write throughput per node in hbase > > >>> using > > >>> >>> thrift? > > >>> >>> >> > > > >>> >>> >> > > >>> >>> >> Thousands of ops per second per regionserver (Sorry, can't be > > more > > >>> >>> >> specific than that). If the Puts are multi-family + updates > on > > >>> >>> >> secondary indices, hundreds -- maybe even tens... I'm not sure > > -- > > >>> >>> >> rather than thousands. > > >>> >>> >> > > >>> >>> >> > We are looking to get performance numbers in the range of > 10k > > >>> >>> aggregate > > >>> >>> >> > inserts/sec/node and read latency < 30ms/read with 3-4 > > >>> concurrent > > >>> >>> >> > readers/node. Can our expectations be met with hbase through > > >>> thrift? > > >>> >>> Can > > >>> >>> >> > they be met with hbase through java? > > >>> >>> >> > > > >>> >>> >> > > >>> >>> >> > > >>> >>> >> I wouldn't fixate on the thrift hop. At SU we can do > thousands > > of > > >>> ops > > >>> >>> >> a second per node np from PHP frontend over thrift. > > >>> >>> >> > > >>> >>> >> 10k inserts a second per node into single CF might be doable. > > If > > >>> into > > >>> >>> >> 3CFs, then you need to recalibrate your expectations (I'd > say). > > >>> >>> >> > > >>> >>> >> > Thanks in advance for any help, examples, or recommendations > > >>> that you > > >>> >>> can > > >>> >>> >> > provide! > > >>> >>> >> > > > >>> >>> >> Sorry, the above is light on recommendations (for reasons > cited > > by > > >>> >>> >> Ryan above -- smile). > > >>> >>> >> St.Ack > > >>> >>> >> > > >>> >>> > > > >>> >>> > > >>> >> > > >>> > > > >>> > > >> > > >> > > > > > > > > > -- > --Sean >
