Hello Jonathan,
Please don't take my post the wrong way. I am very happy with HBase.
Thank you for being so detailed in your response. I will try to do the same.
You are right that I originally started posting on this list because of
problems I encountered with an EC2 based cluster, but I've run HBase both in
EC2 and on local hardware.
My responses are below.
On Sep 14, 2010, at 10:59 AM, Jonathan Gray wrote:
>> * The most common reason I've had for Region Server Suicide is
>> zookeeper. The region server thinks zookeeper is down. I thought this
>> had to do with heavy load, but this also happens for me even when there
>> is nothing running. I haven't been able to find a quantifiable cause.
>> This is just a weakness that exists in the hbase-zookeeper dependency.
>> Higher loads exacerbate the problem, but are not required for a Region
>> Server Suicide event to occur.
>
> Can you go into more detail here? You seem to be saying that for absolutely
> no reason you can lose communication between the RS and ZooKeeper quorum for
> extended periods of time?
The key phrase here is, "I haven't been able to find a quantifiable cause." I
believe there exists a root cause. This root cause may or may not be
observable in the log files. I have not found it, but I will admit that my
primary goal in dealing with a Region Server Suicide is getting back up and
running rather than characterizing the fault. In every case, often thanks to
the generous help of posters on this list, I've found some configuration
setting that could be changed to hopefully make things better. Further
investigation into old log files is a moot exercise, given that any evidence
found may not be valid in the face of the new configuration.
> The "suicide" only happens if the RS loses its ZK session which can take
> upwards of a minute.
> I've been running clusters of all sizes and have not seen this happen. Over
> the past few weeks I've been running kill tests on ZK, ensuring we properly
> ride over the failure of a ZK node. Things generally work but it depends
> what version you are on. In any case there should not be random session
> expirations w/o reason.
I've seen an RS lose its session while ZooKeeper is up and running, other RS's
keep their sessions, and the cluster is not under heavy load.
Some of the suicides seemed to be related to problems running Ubuntu Lucid on
EC2. That distro also has unexplained moments where the machine seems to
"pause" (https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910 - some people
there report not just reporting issues but also real responsiveness problems).
The unresponsiveness does not last for the session timeout period, but I
experienced a Region Server Suicide during such times nonetheless.
Others are entirely unexplained. The point is that something happens that makes
the environment unresponsive (e.g., network equipment is rebooted and the
network is temporarily unavailable, a runaway process is hogging the machine,
etc.). As a result, an RS kills itself.
Another time, the cluster _was_ under heavy load. I had a user start up a
hadoop job on the cluster that brought everything to a grinding halt. Once
the job was killed, most things came back to normal, but again some RS's
decided to kill themselves.
I'm not sure if the loss of the session is caused by the RS or by ZK, but the
end result is an RS that stops running in an environment where it seems it
should be able to recover: ZK is still up, the Master is still up, other RS's
are still up. I'm not criticizing the current behavior, but I am hoping for
increased robustness in the future.
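(Aside: the main knob here seems to be the client-side
zookeeper.session.timeout. Below is a minimal sketch of checking what a node
will actually request, using plain Hadoop APIs; it assumes hbase-site.xml is
on the classpath, and the 60000 ms fallback is only my guess at the default:)

    import org.apache.hadoop.conf.Configuration;

    public class SessionTimeoutCheck {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hbase-site.xml");   // picked up from the classpath
        int timeoutMs = conf.getInt("zookeeper.session.timeout", 60000);
        System.out.println("Requested ZK session timeout: " + timeoutMs + " ms");
        // Note: the ZK server caps this with its own maxSessionTimeout
        // (tied to tickTime), so raising it only here may not be enough.
      }
    }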
> Do you have networking issues? Do you see dropped packets on your device
> stats?
On EC2 there might be networking issues; I am not sure.
>> * Another reason is the HDFS dependency... if a file is perhaps
>> temporarily unavailable for any reason, HBase handles this situation
>> with Region Server Suicide.
>
> Saying "perhaps temporarily unavailable" is not exactly right. There are
> retries and timeouts. Files do not just disappear for no reason. Yes, given
> enough time or issues, the RS will eventually kill itself. We can do better
> here but in the end any database that loses its file system is going to have
> no choice but to shutdown.
Here is one such scenario... I had an HDFS node die. "hadoop fsck" reported
a healthy filesystem even with this node missing (as expected due to
replication). But, for some reason one particular file could not be read.
This caused a Region Server Suicide event. I don't know why the one file could
not be read. When I deleted the file and restarted, everything was up and
running again. I assume that if I had brought that DFS node back, the file
would have become readable again. Perhaps it had just been written and was not
fully replicated yet? But you would think that "hadoop fsck" would not report a
healthy filesystem in that circumstance...
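(For what it's worth, a quick way to test whether one particular HDFS file is
readable end-to-end, independent of what "hadoop fsck" says, is just to stream
it. A rough sketch; the path below is made up, and the Configuration needs
your cluster's config files on the classpath:)

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadCheck {
      public static void main(String[] args) throws Exception {
        // hypothetical path; pass the real file as the first argument
        Path p = new Path(args.length > 0 ? args[0]
            : "/hbase/some-table/some-region/some-file");
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        InputStream in = fs.open(p);
        try {
          int n;
          while ((n = in.read(buf)) != -1) {  // throws if a block is unreadable
            total += n;
          }
        } finally {
          in.close();
        }
        System.out.println("Read " + total + " bytes from " + p + " without error");
      }
    }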
Any database server can have a bad block on a local filesystem. It is possible
to mark a table or a slice of stored data as unreadable without shutting the
entire server down; that is how many other databases I've used behave.
> I certainly agree that we have work to do on robustness but I disagree with
> the picture you're painting here that HBase is this unstable and flaky. Many
> of us have been in production for quite some time.
I think I am almost getting to a point where my configuration is somewhat
stable. There were quite a few land mines along the way, though. Based on my
experience, it is easy to get caught up in the land mines. The important thing
to take away from this is that many people, especially those who are new,
will be feeling the pain of those land mines. I've learned to simply accept
that they are there and find ways to navigate around them. Initially, I
assumed that there were no land mines, and wondered what I had done wrong that
things kept blowing up. But, as you even point out below, for many people
Region Server Suicides are part of a normal operating mode, even for a
production system.
>> Perhaps if there were a setting, whether or not a region server is
>> allowed to commit suicide, some of us would feel more comfortable with
>> the idea.
>
> What you are calling suicide is something that we are using more not less
> these days. We do it when we get into some kind of irrecoverable state
> (generally ZK session expiration or losing the FS). The best way to handle
> this from an operations pov is to have a monitoring process on the RS that
> will start the process again if it ever dies. Hopefully we can open source
> some scripts to do this soon.
Why not build recovery into the Region Server itself? It sounds like you have
also experienced Region Server Suicides when there exists a viable cluster
environment, so why not have the Region Server continue to be part of that
viable cluster environment, performing whatever recovery is needed. I agree
that if a viable cluster environment did NOT exist, then a graceful shutdown
would be better. The difference is that I am talking about situations where a
viable cluster environment DOES exist.
We all agree some recovery is needed. That recovery can be handled by the
server experiencing the problem, or externally. My personal preference would
be to have less external setup, and to have the recovery handled by the RS. I
prefer to have a server stop running only when I tell it to, and not have it
decide for itself when to do so. We spend a lot of time thinking about fault
tolerance at the cluster level, but it can also be important at the machine level.
> So this can't be an option to turn on/off.
It could be an option. If on, run some code that performs a recovery similar
to your possibly soon-to-be-open-sourced scripts. If off, don't run that code,
and die instead.
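(To be concrete about what I mean, here is a rough conceptual sketch of
"recover instead of die", written against the plain ZooKeeper client API. It
is not HBase code; everything except the ZooKeeper classes is made up for
illustration:)

    import java.io.IOException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ReconnectingZk implements Watcher {
      private final String quorum;
      private final int sessionTimeoutMs;
      private volatile ZooKeeper zk;

      public ReconnectingZk(String quorum, int sessionTimeoutMs) throws IOException {
        this.quorum = quorum;
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.zk = new ZooKeeper(quorum, sessionTimeoutMs, this);
      }

      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
          // Instead of killing the process, come back with a fresh session.
          try {
            zk = new ZooKeeper(quorum, sessionTimeoutMs, this);
            reRegisterEphemeralNodes();  // hypothetical: recreate the RS znodes
          } catch (IOException e) {
            // Only give up (i.e., "suicide") if reconnecting itself keeps failing.
            e.printStackTrace();
          }
        }
      }

      private void reRegisterEphemeralNodes() {
        // placeholder for re-creating ephemeral znodes and any other recovery
      }
    }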
> Are you using Ganglia or other cluster monitoring software? These kinds of
> ZK and HDFS issues usually come from over-extended clusters with swapping
> and/or io starvation. Sorry if you've had previous threads and I just forget.
This is also my experience. A common idea is to run Hadoop and HBase on the
same cluster. Some Hadoop jobs can eat up a lot of resources, which can in
some circumstances push HBase over. This is in no way a criticism of HBase,
but it would be nice if HBase were more resilient to high-load situations.
HBase is extremely robust for its maturity, which is a credit to those
contributing code. One day it would be nice if, instead of dying and requiring
external intervention (e.g., scripts or manual restarts), HBase were able to
handle the fact that the cluster is overloaded and automatically come back
once the offending job is killed or the high-load condition no longer exists.
>> In the mean time, you can try to work around any of these issues by
>> using bigger hardware than you would otherwise think is needed and not
>> letting the load get very high. For example, I tend to have these
>> kinds of problems much less often when the load on any individual
>> machine never goes above the number of cores.
>
> What kind of hardware do you think should be needed? And are you talking
> about machine specs or cluster size?
The kind of machine and the size of the cluster depend on the application,
which is why I gave the metric below; it can be used with ANY application
and ANY cluster as a general rule of thumb.
> I've run HBase on small, slow clusters and big, fast ones. Of course, you
> need to configure properly and have the right expectations. I'm not even
> sure what you mean by "load goes above number of cores".
Load is a numeric value reported by the OS (usually as an average over time).
It is a measure of the number of threads and processes that are running or
waiting to run on the processors. On a 2-core machine, a load of 2 means that
0 processes are waiting to run but the processors are fully utilized (on
average over time, not instantaneously). A load of 4 means that on average 2
processes are waiting for a CPU. This is a quantifiable measure that can help
drive expectations.
By "load not going above the number of cores" I mean a specific way to
describe conditions where I've seen HBase be very stable, independent of
cluster size, machine size, and application. I've seen it continue to work
well at much higher loads, for example 2 or 3 times the number of cores. But
if the load goes higher, or stays that high for a long period of time, there
is an increased risk of a Region Server Suicide. The high load may not be the
cause, but it could exacerbate some other event that is the root cause.
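(If anyone wants to watch this from inside the JVM rather than from
top/uptime, here is a minimal sketch using only standard Java APIs; the
warning threshold is just the rule of thumb above:)

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class LoadCheck {
      public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // 1-minute average, -1 if unsupported
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("load=" + load + " cores=" + cores);
        if (load > cores) {
          // Rule of thumb from above: sustained load above the core count
          // means processes are queuing and the risk of timeouts goes up.
          System.out.println("WARNING: load exceeds core count");
        }
      }
    }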
I'm a fan of HBase, but I've also had some huge frustrations with it. Anybody
contemplating an HBase cluster should be aware of this limitation. A common
decision is whether to go with a SQL database (possibly even a distributed
one, e.g., Greenplum) or with a tool like HBase. If you run MySQL on an
overloaded machine, it may become unresponsive, but it will probably not kill
itself (i.e., it recovers from any problems without external intervention).
The current incarnation of HBase requires external intervention. My preference
would be for HBase to NOT require external intervention.
> Looking back at your previous threads, I see you're running on EC2. I guess
> that explains much of your experiences, though I know some people running in
> production and happy on EC2 w/ HBase. In the future, preface your post with
> that as I think it has a lot to do with the problems you've been seeing.
It isn't just about EC2; it is about any underlying computing environment that
may not match an "ideal" environment. Whatever the EC2 flakiness you refer to
above actually is, I'm looking forward to a time when HBase is resilient
enough to take it in stride and stay up and running WITHOUT external
intervention.
If this is not the direction that HBase is going in, then it is something we
should all be aware of when making technology decisions.
-Matthew