Hey Matthew,

Thanks for having the patience to stick with it and work through the issues.
If you ran into issues and found solutions, it's always helpful to make sure
they're documented somewhere on the wiki.

Robustness is, of course, a primary goal for HBase.  I agree with many of your 
points and we do have a ways to go, but things are getting much better.

The RS used to try to do an internal restart like the one you are describing.
It was problematic.  We may put it back in, but until we can get to a point of
stability and robustness where we can trust it, I believe strongly in the
suicide route, as it ensures the overall correctness of the cluster.  Rogue
regionservers have caused issues.

In addition, RSs most often "go rogue" when they start having long GC pauses
and eventually hit a ZK session expiration.  In this instance, it is very
desirable to actually restart the JVM.  No internal restart can replace a
complete restart of the JVM process.
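
For context, session expiration is delivered to the client as a watcher
event.  Here is a minimal sketch using the ZooKeeper client API -- this is
illustrative only, not the actual HBase code:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    public class ExpirationWatcher implements Watcher {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
          // The session is gone and our ephemeral znodes (including
          // the RS's registration) are deleted server-side.  HBase
          // treats this as unrecoverable and aborts the process.
          System.exit(1);
        }
      }
    }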

It is very common practice to have a monitoring process that restarts a given
process if it dies.  At some point in the near future, I would expect HBase to
ship with scripts to do this.  In a production environment that aims to be
HA/FT, you likely have some kind of config management and/or deployment
software that can do this type of process monitoring.  If not, such tools are
readily available.
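
The idea is nothing more than this (a toy sketch in plain Java rather than
the scripts I mentioned; the 10s back-off is arbitrary, and a real watchdog
would add logging and a give-up threshold):

    import java.io.IOException;

    public class Watchdog {
      public static void main(String[] args)
          throws IOException, InterruptedException {
        while (true) {
          // Launch the supervised command (e.g. an RS start script),
          // wire its stdio to ours, and wait for it to exit.
          Process p = new ProcessBuilder(args).inheritIO().start();
          int code = p.waitFor();
          System.err.println("Exited with " + code + "; restarting in 10s");
          Thread.sleep(10000);
        }
      }
    }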

Also, we've bumped into some kernel/network issues that have caused problems
with ZK as well.  We have found and fixed a few bugs related to them and will
continue working on making HBase more robust.  We could really use your help
testing 0.90 once we get to a release candidate.

JG

> -----Original Message-----
> From: Matthew LeMieux [mailto:[email protected]]
> Sent: Tuesday, September 14, 2010 1:40 PM
> To: [email protected]
> Subject: Re: Region server shutdown after putting millions of rows
>
> Hello Jonathan,
>
>     Please don't take my post the wrong way.  I am very happy with
> HBase.  Thank you for being so detailed in your response. I will try to
> do the same.
>
>     You are right that I originally started posting on this list
> because of problems I encountered with an EC2 based cluster, but I've
> run HBase both in EC2 and on local hardware.
>
>     My responses are below.
>
> On Sep 14, 2010, at 10:59 AM, Jonathan Gray wrote:
>
> >> * The most common reason I've had for Region Server Suicide is
> >> zookeeper.  The region server thinks zookeeper is down.  I thought
> this
> >> had to do with heavy load, but this also happens for me even when
> there
> >> is nothing running.  I haven't been able to find a quantifiable
> cause.
> >> This is just a weakness that exists in the hbase-zookeeper
> dependency.
> >> Higher loads exacerbate the problem, but are not required for a
> Region
> >> Server Suicide event to occur.
> >
> > Can you go into more detail here?  You seem to be saying that for
> absolutely no reason you can lose communication between the RS and
> ZooKeeper quorum for extended periods of time?
>
> The key phrase here is, "I haven't been able to find a quantifiable
> cause."  I believe there exists a root cause.  This root cause may or
> may not be observable in the log files.  I have not found it, but I
> will admit that my primary goal in dealing with a Region Server Suicide
> is getting back up and running rather than characterizing the fault. In
> every situation, many times thanks to the generous help of posters on
> this list,  I've found some configuration component that can be changed
> to hopefully make things better.  Further investigation into old log
> files is a moot exercise, given that any evidence found may not be
> valid in the face of the new configuration.
>
> > The "suicide" only happens if the RS loses its ZK session which can
> take upwards of a minute.
> > I've been running clusters of all sizes and have not seen this
> happen.  Over the past few weeks I've been running kill tests on ZK,
> ensuring we properly ride over the failure of a ZK node.  Things
> generally work but it depends what version you are on.  In any case
> there should not be random session expirations w/o reason.
>
> I've seen the RS lose its session when zookeeper is up and running,
> other RS's don't lose their sessions, and the cluster is not under
> heavy load.
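>
> (For reference, the window in question is governed by the RS's ZK
> session timeout.  A hedged example in hbase-site.xml -- the property
> name is real, but the value is purely illustrative, not a
> recommendation:)
>
>     <property>
>       <name>zookeeper.session.timeout</name>
>       <value>60000</value>
>     </property>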
>
> Some of the suicides seemed to be related to problems running Ubuntu
> Lucid on EC2.  That distro also has unexplained moments where the
> machine seems to "pause"
> (https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910 - some people
> there report not just monitoring glitches but real responsiveness
> problems).  The unresponsiveness does not last for the session timeout
> period, but I experienced a Region Server Suicide during such times
> nonetheless.
>
> Others are entirely unexplained.  The point is that something happens
> that makes the environment unresponsive (e.g., network equipment is
> rebooted, making the network temporarily unavailable, a runaway
> process is running, etc.).  As a result, an RS kills itself.
>
> Another time, the cluster _was_ under heavy load.  I had a user start
> up a Hadoop job on the cluster that brought everything to a grinding
> halt.  Once the job was killed, most things came back to normal, but
> again some RS's decided to kill themselves.
>
> I'm not sure if the loss of the session is caused by the RS or by ZK,
> but the end result is that of an RS that stops running in an
> environment where it seems that it should be able to recover.  i.e., ZK
> is still up, the Master is still up, other RS's are still up.  I'm not
> criticizing the current behavior, but I am hoping for increased
> robustness in the future.
>
> > Do you have networking issues?  Do you see dropped packets on your
> device stats?
>
> On EC2, there might be networking issues; I am not sure.
>
> >> * Another reason is the HDFS dependency... if a file is perhaps
> >> temporarily unavailable for any reason, HBase handles this situation
> >> with Region Server Suicide.
> >
> > Saying "perhaps temporarily unavailable" is not exactly right.  There
> are retries and timeouts.  Files do not just disappear for no reason.
> Yes, given enough time or issues, the RS will eventually kill itself.
> We can do better here but in the end any database that loses its file
> system is going to have no choice but to shut down.
>
> Here is one such scenario...   I had an HDFS node die.  "hadoop fsck"
> reported a healthy filesystem even with this node missing (as expected
> due to replication).  But, for some reason one particular file could
> not be read.   This caused a Region Server Suicide event.  I don't know
> why the one file could not be read.  When I deleted the file and
> restarted, everything was up and running again.  I assume that if I
> had brought that DFS node back, the file would have been readable
> again.  Perhaps it had just been written and was not replicated
> yet?  But, you would think that "hadoop fsck" would not report a
> healthy filesystem in that circumstance...
>
> It is possible for any database server to have a bad block on a local
> filesystem.  It is possible to make a table or a slice of stored data
> unreadable without shutting the entire server down.  This is the
> behavior of many other databases I've used.
>
> > I certainly agree that we have work to do on robustness but I
> disagree with the picture you're painting here that HBase is this
> unstable and flaky.  Many of us have been in production for quite some
> time.
>
> I think I am almost getting to a point where my configuration is
> somewhat stable.  There were quite a few land mines along the way,
> though.  Based on my experience, it is easy to get caught by the
> land mines.  The important thing to take away from this is that many
> people, especially those who are new, will feel the pain of those
> land mines.  I've learned to simply accept that they are there and find
> ways to navigate around them.  Initially, I assumed that there were no
> land mines, and wondered what I had done wrong that things kept blowing
> up.  But, as you even point out below, for many people Region Server
> Suicides are part of a normal operating mode, even for a production
> system.
>
> >> Perhaps if there were a setting, whether or not a region server is
> >> allowed to commit suicide, some of us would feel more comfortable
> with
> >> the idea.
> >
> > What you are calling suicide is something that we are using more not
> less these days.  We do it when we get into some kind of irrecoverable
> state (generally ZK session expiration or losing the FS).  The best way
> to handle this from an operations pov is to have a monitoring process
> on the RS that will start the process again if it ever dies.  Hopefully
> we can open source some scripts to do this soon.
>
> Why not build recovery into the Region Server itself?  It sounds like
> you have also experienced Region Server Suicides when there exists a
> viable cluster environment, so why not have the Region Server continue
> to be part of that viable cluster environment, performing whatever
> recovery is needed.   I agree that if a viable cluster environment did
> NOT exist, then a graceful shutdown would be better.  The difference is
> that I am talking about situations where a viable cluster environment
> DOES exist.
>
> We all agree some recovery is needed.  That recovery can be handled by
> the server experiencing the problem, or externally.  My personal
> preference would be to have less external setup, and to have the
> recovery handled by the RS.  I prefer to have a server stop running
> only when I tell it to, and not have it decide for itself when to do
> so.   We spend a lot of time thinking about fault tolerance at a
> cluster level, but it can also be important on a machine level.
>
> > So this can't be an option to turn on/off.
>
> It could be an option.  If on, then run some code that performs a
> recovery similar to your possibly soon-to-be-open-sourced scripts.  If
> off, then don't run that code, and die instead.
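>
> Something like the following (heavily hedged: the property name and
> the recovery method are invented for illustration; nothing like this
> exists in HBase today):
>
>     if (conf.getBoolean("hbase.regionserver.restart.on.zk.expire",
>         false)) {
>       // Hypothetical recovery path: rebuild the ZK session and
>       // re-register this RS with the master.
>       reconnectAndReregister();
>     } else {
>       abort("ZK session expired");  // today's behavior: die
>     }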
>
> > Are you using Ganglia or other cluster monitoring software?  These
> kinds of ZK and HDFS issues usually come from over-extended clusters
> with swapping and/or io starvation.  Sorry if you've had previous
> threads and I just forget.
>
> This is also my experience.  A common idea is to run Hadoop and HBase
> on the same cluster.  Some Hadoop jobs can eat up a lot of resources,
> which can in some circumstances push HBase over.  This is in no way a
> criticism of HBase, but it would be nice if HBase were more resilient
> to high-load situations.  HBase is extremely robust for its maturity,
> which is a credit to those contributing code.  One day it would be
> nice if, instead of dying and requiring external intervention (i.e.,
> scripts or manual restarts), HBase were able to handle the fact that
> the cluster is overloaded and automatically come back once the
> offending job is killed or the high-load condition no longer exists.
>
> >> In the mean time, you can try to work around any of these issues by
> >> using bigger hardware than you would otherwise think is needed and
> not
> >> letting the load get very high.  For example, I tend to have these
> >> kinds of problems much less often when the load on any individual
> >> machine never goes above the number of cores.
> >
> > What kind of hardware do you think should be needed?  And are you
> talking about machine specs or cluster size?
>
> The kind of machine and size of cluster depends on the application,
> which is why I gave the metric below.  The metric below can be used
> with ANY application and ANY cluster as a general rule of thumb.
>
> > I've run HBase on small, slow clusters and big, fast ones.  Of
> course, you need to configure properly and have the right expectations.
> I'm not even sure what you mean by "load goes above number of cores".
>
> Load is a numeric value reported by the OS (usually as an average
> over time).  It is a measure of the queue of threads and processes
> that are running or waiting to run on the processors.  On a 2-core
> machine, a load of 2 would mean that 0 processes are waiting to run
> but that both cores are fully utilized (remember: an average over
> time, not an instantaneous value).  A load of 4 would mean that on
> average 2 processes are waiting for a CPU.  This is a quantifiable
> measure that can help drive expectations.
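>
> (A tiny, self-contained way to script that check -- this uses the
> standard java.lang.management API, nothing HBase-specific, and the
> threshold is just the rule of thumb described here:)
>
>     import java.lang.management.ManagementFactory;
>
>     public class LoadCheck {
>       public static void main(String[] args) {
>         // 1-minute load average, or a negative value if unavailable.
>         double load = ManagementFactory.getOperatingSystemMXBean()
>             .getSystemLoadAverage();
>         int cores = Runtime.getRuntime().availableProcessors();
>         System.out.println("load " + load + " on " + cores + " cores"
>             + (load > cores ? " -- above the rule of thumb" : ""));
>       }
>     }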
>
> By "load not going above the number of cores" I mean a specific set
> of conditions under which I've seen HBase be very stable, independent
> of cluster size, machine size, and application.  I've
> seen it continue to work well at much higher loads, for example 2 or 3
> times the number of cores.  But, if the load goes higher or stays this
> high for a long period of time, there is an increased risk of a Region
> Server Suicide.   The high load may not be the cause, but could
> exacerbate some other event that is the root cause.
>
> I'm a fan of HBase, but I've also had some huge frustrations with it.
> Anybody contemplating an HBase cluster should be aware of this
> limitation.  A common decision to make is whether to go with a SQL
> database (possibly even a distributed one, e.g., Greenplum) or with a
> tool like HBase.  If you run MySQL on an overloaded machine, MySQL may
> become unresponsive, but it will probably not kill itself (i.e., it
> recovers from any problems without external intervention).  The
> current incarnation of HBase requires external intervention.  My
> preference would be for HBase NOT to require it.
>
> > Looking back at your previous threads, I see you're running on EC2.
> I guess that explains much of what you've experienced, though I know
> some people running in production and happy on EC2 w/ HBase.  In the
> future, please preface your post with that, as I think it has a lot to
> do with the problems you've been seeing.
>
> It isn't just about EC2; it is about any underlying computing
> environment that may not match an "ideal" environment.  Whatever
> flakiness in the EC2 environment it is that you refer to above, I'm
> looking forward to a time when HBase is resilient enough to take that
> flakiness in stride and stay up and running WITHOUT external
> intervention.
>
> If this is not the direction that HBase is going in, then it is
> something we should all be aware of when making technology decisions.
>
> -Matthew
>
