Context: we're still on 0.89, so we can't take advantage of the MemStore allocation buffers yet. One of the most important metrics for us was the number of GC-stuck region servers; adding nodes, adding memory, and scheduling periodic cluster restarts all helped in our situation. I wholeheartedly agree with the goal of constant uptime, but the restarts were an operational approach we took during some rocky times, and they helped keep things "un-interesting" with the cluster (in a good way).
Because the GC pauses would flare up in write-heavy environments (per Todd's analysis), they seemed to hit us at the worst possible times (e.g., during an index rebuild and during a split, which would lead to inconsistent metadata, etc.). We are in a happy place now, and we're always looking to make it better, but those are some "obvious but not so obvious" points on how we got here.

And don't have too many column families.

-----Original Message-----
From: Andrew Purtell [mailto:[email protected]]
Sent: Wednesday, April 13, 2011 1:51 PM
To: [email protected]
Cc: Robert Gonzalez
Subject: RE: HBase is not ready for Primetime

Hi Doug,

> 3) Cluster restart
>
> We schedule a full shutdown and restart of our cluster each week.
> It's pretty quick, and HBase just seems happier when we do this.

Can you say a bit more about how HBase is happier versus not? I can speculate on a number of reasons why this may be the case, but in general we should take the view that if the OS has 1000 days of uptime etc. so should HBase, and work toward that goal. (Unless the JVM just gets in our way... but so far we have not clearly identified an intractable case.)

Best regards,

- Andy
