> Saying "perhaps temporarily unavailable" is not exactly right. There are retries and timeouts. Files do not just disappear for no reason. Yes, given enough time or issues, the RS will eventually kill itself. We can do better here but in the end any database that loses its file system is going to have no choice but to shutdown.
> Here is one such scenario... I had an HDFS node die. "hadoop fsck" reported a healthy filesystem even with this node missing (as expected, due to replication). But, for some reason, one particular file could not be read. This caused a Region Server Suicide event. I don't know why the one file could not be read. When I deleted the file and restarted, everything was up and running again. I assume that if I were to bring that DFS node back, the file would not have been unavailable. Perhaps it had just been written and was not replicated yet? But you would think that "hadoop fsck" would not report a healthy filesystem in that circumstance...

I may encounter this problem. The HDFS namenode logs showed: "Not able to place enough replicas, still in need of 1."

Shuaifeng Zhou

-----Original Message-----
From: Matthew LeMieux [mailto:[email protected]]
Sent: Tuesday, September 14, 2010 1:40 PM
To: [email protected]
Subject: Re: Region server shutdown after putting millions of rows

Hello Jonathan,

Please don't take my post the wrong way. I am very happy with HBase. Thank you for being so detailed in your response. I will try to do the same. You are right that I originally started posting on this list because of problems I encountered with an EC2-based cluster, but I've run HBase both in EC2 and on local hardware. My responses are below.

On Sep 14, 2010, at 10:59 AM, Jonathan Gray wrote:

>> * The most common reason I've had for Region Server Suicide is zookeeper. The region server thinks zookeeper is down. I thought this had to do with heavy load, but this also happens for me even when there is nothing running. I haven't been able to find a quantifiable cause. This is just a weakness that exists in the hbase-zookeeper dependency. Higher loads exacerbate the problem, but are not required for a Region Server Suicide event to occur.
>
> Can you go into more detail here? You seem to be saying that for absolutely no reason you can lose communication between the RS and the ZooKeeper quorum for extended periods of time?

The key phrase here is, "I haven't been able to find a quantifiable cause." I believe there exists a root cause. This root cause may or may not be observable in the log files. I have not found it, but I will admit that my primary goal in dealing with a Region Server Suicide is getting back up and running rather than characterizing the fault. In every situation, many times thanks to the generous help of posters on this list, I've found some configuration component that can be changed to hopefully make things better. Further investigation into old log files is a moot exercise, given that any evidence found may not be valid in the face of the new configuration.

> The "suicide" only happens if the RS loses its ZK session, which can take upwards of a minute. I've been running clusters of all sizes and have not seen this happen. Over the past few weeks I've been running kill tests on ZK, ensuring we properly ride over the failure of a ZK node. Things generally work, but it depends what version you are on. In any case, there should not be random session expirations w/o reason.

I've seen the RS lose its session when ZooKeeper is up and running, other RSs don't lose their session, and the cluster is not under heavy load.
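For what it's worth, one way I could try to put a number on these unexplained expirations is to run a tiny pause detector next to the region server and see whether the machine or the JVM ever stalls for anywhere near the session timeout. This is just a sketch of the idea, nothing that ships with HBase, and the interval and threshold are numbers I made up:

public class PauseDetector {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100;   // how often we expect to wake up
        final long reportMs = 1000;    // only log stalls worse than this
        long last = System.currentTimeMillis();
        while (true) {
            Thread.sleep(intervalMs);
            long now = System.currentTimeMillis();
            long stall = now - last - intervalMs;  // time beyond the requested sleep
            if (stall > reportMs) {
                System.out.println(new java.util.Date(now)
                        + ": woke up ~" + stall + " ms late");
            }
            last = now;
        }
    }
}

If something like this logs multi-second stalls at the same times an RS loses its session, that would at least point at the environment rather than at ZooKeeper itself.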
Some of the suicides seemed to be related to problems running Ubuntu Lucid on EC2. That distro also has unexplained moments where the machine seems to "pause" (https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910 - some people there report not just reporting issues but also real responsiveness problems). The unresponsiveness does not last for the session timeout period, but I experienced a Region Server Suicide during such times nonetheless. Others are entirely unexplained. The point is that something happens that makes the environment unresponsive (i.e., network equipment is rebooted, making the network temporarily unavailable, a runaway process is running, etc.). As a result, an RS kills itself.

Another time, the cluster _was_ under heavy load. I had a user start up a hadoop job on the cluster that slowed everything down to a grinding halt. Once the job was killed, most things came back to normal, but again some RSs decided to kill themselves. I'm not sure if the loss of the session is caused by the RS or by ZK, but the end result is an RS that stops running in an environment where it seems it should be able to recover, i.e., ZK is still up, the Master is still up, other RSs are still up. I'm not criticizing the current behavior, but I am hoping for increased robustness in the future.

> Do you have networking issues? Do you see dropped packets on your device stats?

On EC2, there might be networking issues; I am not sure.

>> * Another reason is the HDFS dependency... if a file is perhaps temporarily unavailable for any reason, HBase handles this situation with Region Server Suicide.
>
> Saying "perhaps temporarily unavailable" is not exactly right. There are retries and timeouts. Files do not just disappear for no reason. Yes, given enough time or issues, the RS will eventually kill itself. We can do better here, but in the end any database that loses its file system is going to have no choice but to shut down.

Here is one such scenario... I had an HDFS node die. "hadoop fsck" reported a healthy filesystem even with this node missing (as expected, due to replication). But, for some reason, one particular file could not be read. This caused a Region Server Suicide event. I don't know why the one file could not be read. When I deleted the file and restarted, everything was up and running again. I assume that if I were to bring that DFS node back, the file would not have been unavailable. Perhaps it had just been written and was not replicated yet? But you would think that "hadoop fsck" would not report a healthy filesystem in that circumstance...

It is possible for any database server to have a bad block on a local filesystem. It is possible to make a table or a slice of stored data unreadable without shutting the entire server down. This is the behavior of many other databases I've used.
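If it happens again, one thing worth checking before deleting anything is whether the NameNode actually has a live location for every block of the suspect file. Here is a rough sketch against the HDFS Java API; the path is made up, and it assumes the Hadoop configuration on the classpath points at the right cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckBlockLocations {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the store file the region server complained about.
        Path file = new Path(args.length > 0 ? args[0] : "/hbase/mytable/region/family/hfile");
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            String[] hosts = block.getHosts();
            System.out.println("offset " + block.getOffset()
                    + " len " + block.getLength()
                    + " -> " + hosts.length + " replica(s) on "
                    + java.util.Arrays.toString(hosts));
        }
    }
}

A block that comes back with no hosts would explain why that one file could not be read, even though the filesystem looked healthy overall.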
> I certainly agree that we have work to do on robustness, but I disagree with the picture you're painting here that HBase is this unstable and flaky. Many of us have been in production for quite some time.

I think I am almost getting to a point where my configuration is somewhat stable. There were quite a few land mines along the way, though. Based on my experience, it is easy to get caught up in the land mines. The important thing to take away from this is that many people, especially those who are new, will be feeling the pain of those land mines. I've learned to simply accept that they are there and find ways to navigate around them. Initially, I assumed that there were no land mines, and wondered what I had done wrong that things kept blowing up. But, as you even point out below, for many people Region Server Suicides are part of a normal operating mode, even for a production system.

>> Perhaps if there were a setting, whether or not a region server is allowed to commit suicide, some of us would feel more comfortable with the idea.
>
> What you are calling suicide is something that we are using more, not less, these days. We do it when we get into some kind of irrecoverable state (generally ZK session expiration or losing the FS). The best way to handle this from an operations pov is to have a monitoring process on the RS that will start the process again if it ever dies. Hopefully we can open source some scripts to do this soon.

Why not build recovery into the Region Server itself? It sounds like you have also experienced Region Server Suicides when there exists a viable cluster environment, so why not have the Region Server continue to be part of that viable cluster environment, performing whatever recovery is needed? I agree that if a viable cluster environment did NOT exist, then a graceful shutdown would be better. The difference is that I am talking about situations where a viable cluster environment DOES exist. We all agree some recovery is needed. That recovery can be handled by the server experiencing the problem, or externally. My personal preference would be to have less external setup, and to have the recovery handled by the RS. I prefer to have a server stop running only when I tell it to, and not have it decide for itself when to do so. We spend a lot of time thinking about fault tolerance at a cluster level, but it can also be important at a machine level.

> So this can't be an option to turn on/off.

It could be an option. If on, then run some code that performs a similar recovery to your possibly soon-to-be open sourced scripts. If off, then don't run that code, and die instead.
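To make that concrete, something along these lines is what I have in mind: on session expiration, open a new session and re-register, instead of aborting the process. This is only a rough sketch against the plain ZooKeeper client API, not HBase code; the class name, path, and timeout are made up, and a real region server would obviously also have to re-verify its state with the Master before serving again:

import org.apache.zookeeper.*;

public class ReconnectingMember implements Watcher {
    private final String quorum;        // e.g. "zk1:2181,zk2:2181,zk3:2181"
    private final String ephemeralPath; // e.g. "/example/rs/host1" (parent must exist)
    private volatile ZooKeeper zk;

    public ReconnectingMember(String quorum, String ephemeralPath) throws Exception {
        this.quorum = quorum;
        this.ephemeralPath = ephemeralPath;
        connect();
    }

    private void connect() throws Exception {
        // 30s session timeout; the server may negotiate a different value.
        zk = new ZooKeeper(quorum, 30000, this);
    }

    private void register() throws KeeperException, InterruptedException {
        zk.create(ephemeralPath, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
            try {
                register();
            } catch (KeeperException.NodeExistsException e) {
                // already registered under this session; nothing to do
            } catch (Exception e) {
                // log and retry later rather than exiting
            }
        } else if (event.getState() == Event.KeeperState.Expired) {
            // The old session is gone, but the cluster may still be healthy:
            // open a fresh session and re-register instead of killing the process.
            try {
                connect();
            } catch (Exception e) {
                // give up (and shut down) only after repeated failures
            }
        }
    }
}

The "option" I was describing could simply switch between this kind of reconnect path and the current abort.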
> Are you using Ganglia or other cluster monitoring software? These kinds of ZK and HDFS issues usually come from over-extended clusters with swapping and/or io starvation. Sorry if you've had previous threads and I just forget.

This is also my experience. A common idea is to run hadoop and hbase on the same cluster. Some hadoop jobs can eat up a lot of resources, which can in some circumstances push HBase over. This is in no way a criticism of HBase, but it would be nice if HBase were more resilient to high-load situations. HBase is extremely robust for its maturity, which is a credit to those contributing code. One day it would be nice if, instead of dying and requiring external intervention (i.e., scripts or manual restarts), HBase were able to handle the fact that the cluster is overloaded and automatically come back once the offending job is killed or the high-load condition no longer exists (i.e., without external intervention).

>> In the mean time, you can try to work around any of these issues by using bigger hardware than you would otherwise think is needed and not letting the load get very high. For example, I tend to have these kinds of problems much less often when the load on any individual machine never goes above the number of cores.
>
> What kind of hardware do you think should be needed? And are you talking about machine specs or cluster size?

The kind of machine and size of cluster depend on the application, which is why I gave the metric below. The metric below can be used with ANY application and ANY cluster as a general rule of thumb.

> I've run HBase on small, slow clusters and big, fast ones. Of course, you need to configure properly and have the right expectations. I'm not even sure what you mean by "load goes above number of cores".

Load is a numeric value reported by the OS (usually as an average over time). It is actually a measure of the queue of threads and processes waiting to run on the processor. On a 2-core machine, a load of 2 would mean that 0 processes are waiting to run, but that the processors are fully utilized (remember, on average over time, not instantaneously). A load of 4 would mean that, on average, 2 processes are waiting to run on the CPU. This is a quantifiable measure that can help drive expectations (see the P.S. below for a quick way to check it). "Load not going above the number of cores" is a specific way to describe the conditions under which I've seen HBase be very stable, independent of cluster size, machine size, and application. I've seen it continue to work well at much higher loads, for example 2 or 3 times the number of cores. But if the load goes higher, or stays that high for a long period of time, there is an increased risk of a Region Server Suicide. The high load may not be the cause, but it could exacerbate some other event that is the root cause.

I'm a fan of HBase, but I've also had some huge frustrations with it. Anybody contemplating an HBase cluster should be aware of this limitation. A common decision to make is whether to go with a SQL database (possibly even distributed, i.e., Greenplum) or with a tool like HBase. If you run mysql on an overloaded machine, mysql may become unresponsive, but it will probably not kill itself (i.e., it recovers from any problems without external intervention). The current incarnation of HBase requires external intervention. My preference would be for HBase to NOT require external intervention.

> Looking back at your previous threads, I see you're running on EC2. I guess that explains much of your experiences, though I know some people running in production and happy on EC2 w/ HBase. In the future, preface your post with that as I think it has a lot to do with the problems you've been seeing.

It isn't just about EC2; it is about any underlying computing environment that may not match an "ideal" environment. Whatever flakiness it is in the EC2 environment that you refer to above, I'm looking forward to a time when HBase is resilient enough to take that flakiness in stride and continue to stay up and running WITHOUT external intervention. If this is not the direction that HBase is going in, then it is something we should all be aware of when making technology decisions.

-Matthew
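P.S. For anyone who wants to see where their own machines stand against that rule of thumb, here is a small throwaway snippet (nothing HBase-specific) that compares the one-minute load average, as reported by the OS, to the number of cores:

import java.lang.management.ManagementFactory;

public class LoadCheck {
    public static void main(String[] args) {
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        int cores = Runtime.getRuntime().availableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
        } else if (load <= cores) {
            System.out.printf("load %.2f <= %d cores: CPUs are not oversubscribed%n", load, cores);
        } else {
            System.out.printf("load %.2f > %d cores: work is queuing for the CPU%n", load, cores);
        }
    }
}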
