> Saying "perhaps temporarily unavailable" is not exactly right. There are retries and timeouts. Files do not just disappear for no reason. Yes, given enough time or issues, the RS will eventually kill itself. We can do better here but in the end any database that loses its file system is going to have no choice but to shutdown.
> Here is one such scenario... I had an HDFS node die. "hadoop fsck" reported a healthy filesystem even with this node missing (as expected, due to replication). But, for some reason, one particular file could not be read. This caused a Region Server Suicide event. I don't know why the one file could not be read. When I deleted the file and restarted, everything was up and running again. I assume that if I were to bring that DFS node back, the file would not have been unavailable. Perhaps it had just been written and was not replicated yet? But you would think that "hadoop fsck" would not report a healthy filesystem in that circumstance...

I may encounter this problem. The HDFS namenode logs showed: "Not able to place enough replicas, still in need of 1."

Shuaifeng Zhou

-----Original Message-----
From: Matthew LeMieux [mailto:[email protected]]
Sent: Tuesday, September 14, 2010 1:40 PM
To: [email protected]
Subject: Re: Region server shutdown after putting millions of rows

Hello Jonathan,

Please don't take my post the wrong way. I am very happy with HBase. Thank you for being so detailed in your response. I will try to do the same. You are right that I originally started posting on this list because of problems I encountered with an EC2-based cluster, but I've run HBase both in EC2 and on local hardware. My responses are below.

On Sep 14, 2010, at 10:59 AM, Jonathan Gray wrote:

>> * The most common reason I've had for Region Server Suicide is zookeeper. The region server thinks zookeeper is down. I thought this had to do with heavy load, but this also happens for me even when there is nothing running. I haven't been able to find a quantifiable cause. This is just a weakness that exists in the hbase-zookeeper dependency. Higher loads exacerbate the problem, but are not required for a Region Server Suicide event to occur.
>
> Can you go into more detail here? You seem to be saying that for absolutely no reason you can lose communication between the RS and the ZooKeeper quorum for extended periods of time?

The key phrase here is, "I haven't been able to find a quantifiable cause." I believe there exists a root cause. This root cause may or may not be observable in the log files. I have not found it, but I will admit that my primary goal in dealing with a Region Server Suicide is getting back up and running rather than characterizing the fault. In every situation, many times thanks to the generous help of posters on this list, I've found some configuration component that can be changed to hopefully make things better. Further investigation into old log files is a moot exercise, given that any evidence found may not be valid in the face of the new configuration.

> The "suicide" only happens if the RS loses its ZK session, which can take upwards of a minute. I've been running clusters of all sizes and have not seen this happen. Over the past few weeks I've been running kill tests on ZK, ensuring we properly ride over the failure of a ZK node. Things generally work, but it depends what version you are on. In any case, there should not be random session expirations w/o reason.

I've seen the RS lose its session when ZooKeeper is up and running, other RSs don't lose their session, and the cluster is not under heavy load.
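For what it's worth, one way I could try to put a number on these unexplained expirations is to run a tiny pause detector next to the region server and see whether the machine or the JVM ever stalls for anywhere near the session timeout. This is just a sketch of the idea, nothing that ships with HBase, and the interval and threshold are numbers I made up:

public class PauseDetector {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100;   // how often we expect to wake up
        final long reportMs = 1000;    // only log stalls worse than this
        long last = System.currentTimeMillis();
        while (true) {
            Thread.sleep(intervalMs);
            long now = System.currentTimeMillis();
            long stall = now - last - intervalMs;  // time beyond the requested sleep
            if (stall > reportMs) {
                System.out.println(new java.util.Date(now)
                        + ": woke up ~" + stall + " ms late");
            }
            last = now;
        }
    }
}

If something like this logs multi-second stalls at the same times an RS loses its session, that would at least point at the environment rather than at ZooKeeper itself.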
Some of the suicides seemed to be related to problems running Ubuntu Lucid on EC2. That distro also has unexplained moments where the machine seems to "pause" (https://bugs.launchpad.net/ubuntu-on-ec2/+bug/574910 - some people there report not just reporting issues but also real responsiveness problems). The unresponsiveness does not last for the session timeout period, but I experienced a Region Server Suicide during such times nonetheless. Others are entirely unexplained. The point is that something happens that makes the environment unresponsive (i.e., network equipment is rebooted, making the network temporarily unavailable, a runaway process is running, etc.). As a result, an RS kills itself.

Another time, the cluster _was_ under heavy load. I had a user start up a hadoop job on the cluster that slowed everything down to a grinding halt. Once the job was killed, most things came back to normal, but again some RSs decided to kill themselves. I'm not sure if the loss of the session is caused by the RS or by ZK, but the end result is an RS that stops running in an environment where it seems it should be able to recover, i.e., ZK is still up, the Master is still up, other RSs are still up. I'm not criticizing the current behavior, but I am hoping for increased robustness in the future.

> Do you have networking issues? Do you see dropped packets on your device stats?

On EC2, there might be networking issues; I am not sure.

>> * Another reason is the HDFS dependency... if a file is perhaps temporarily unavailable for any reason, HBase handles this situation with Region Server Suicide.
>
> Saying "perhaps temporarily unavailable" is not exactly right. There are retries and timeouts. Files do not just disappear for no reason. Yes, given enough time or issues, the RS will eventually kill itself. We can do better here, but in the end any database that loses its file system is going to have no choice but to shut down.

Here is one such scenario... I had an HDFS node die. "hadoop fsck" reported a healthy filesystem even with this node missing (as expected, due to replication). But, for some reason, one particular file could not be read. This caused a Region Server Suicide event. I don't know why the one file could not be read. When I deleted the file and restarted, everything was up and running again. I assume that if I were to bring that DFS node back, the file would not have been unavailable. Perhaps it had just been written and was not replicated yet? But you would think that "hadoop fsck" would not report a healthy filesystem in that circumstance...

It is possible for any database server to have a bad block on a local filesystem. It is possible to make a table or a slice of stored data unreadable without shutting the entire server down. This is the behavior of many other databases I've used.
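If it happens again, one thing worth checking before deleting anything is whether the NameNode actually has a live location for every block of the suspect file. Here is a rough sketch against the HDFS Java API; the path is made up, and it assumes the Hadoop configuration on the classpath points at the right cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckBlockLocations {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the store file the region server complained about.
        Path file = new Path(args.length > 0 ? args[0] : "/hbase/mytable/region/family/hfile");
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            String[] hosts = block.getHosts();
            System.out.println("offset " + block.getOffset()
                    + " len " + block.getLength()
                    + " -> " + hosts.length + " replica(s) on "
                    + java.util.Arrays.toString(hosts));
        }
    }
}

A block that comes back with no hosts would explain why that one file could not be read, even though the filesystem looked healthy overall.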
> I certainly agree that we have work to do on robustness, but I disagree with the picture you're painting here that HBase is this unstable and flaky. Many of us have been in production for quite some time.

I think I am almost getting to a point where my configuration is somewhat stable. There were quite a few land mines along the way, though. Based on my experience, it is easy to get caught up in the land mines. The important thing to take away from this is that many people, especially those who are new, will be feeling the pain of those land mines. I've learned to simply accept that they are there and find ways to navigate around them. Initially, I assumed that there were no land mines, and wondered what I had done wrong that things kept blowing up. But, as you even point out below, for many people Region Server Suicides are part of a normal operating mode, even for a production system.

>> Perhaps if there were a setting, whether or not a region server is allowed to commit suicide, some of us would feel more comfortable with the idea.
>
> What you are calling suicide is something that we are using more, not less, these days. We do it when we get into some kind of irrecoverable state (generally ZK session expiration or losing the FS). The best way to handle this from an operations pov is to have a monitoring process on the RS that will start the process again if it ever dies. Hopefully we can open source some scripts to do this soon.

Why not build recovery into the Region Server itself? It sounds like you have also experienced Region Server Suicides when there exists a viable cluster environment, so why not have the Region Server continue to be part of that viable cluster environment, performing whatever recovery is needed? I agree that if a viable cluster environment did NOT exist, then a graceful shutdown would be better. The difference is that I am talking about situations where a viable cluster environment DOES exist. We all agree some recovery is needed. That recovery can be handled by the server experiencing the problem, or externally. My personal preference would be to have less external setup, and to have the recovery handled by the RS. I prefer to have a server stop running only when I tell it to, and not have it decide for itself when to do so. We spend a lot of time thinking about fault tolerance at a cluster level, but it can also be important at a machine level.

> So this can't be an option to turn on/off.

It could be an option. If on, then run some code that performs a similar recovery to your possibly soon-to-be open sourced scripts. If off, then don't run that code, and die instead.
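To make that concrete, something along these lines is what I have in mind: on session expiration, open a new session and re-register, instead of aborting the process. This is only a rough sketch against the plain ZooKeeper client API, not HBase code; the class name, path, and timeout are made up, and a real region server would obviously also have to re-verify its state with the Master before serving again:

import org.apache.zookeeper.*;

public class ReconnectingMember implements Watcher {
    private final String quorum;        // e.g. "zk1:2181,zk2:2181,zk3:2181"
    private final String ephemeralPath; // e.g. "/example/rs/host1" (parent must exist)
    private volatile ZooKeeper zk;

    public ReconnectingMember(String quorum, String ephemeralPath) throws Exception {
        this.quorum = quorum;
        this.ephemeralPath = ephemeralPath;
        connect();
    }

    private void connect() throws Exception {
        // 30s session timeout; the server may negotiate a different value.
        zk = new ZooKeeper(quorum, 30000, this);
    }

    private void register() throws KeeperException, InterruptedException {
        zk.create(ephemeralPath, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
            try {
                register();
            } catch (KeeperException.NodeExistsException e) {
                // already registered under this session; nothing to do
            } catch (Exception e) {
                // log and retry later rather than exiting
            }
        } else if (event.getState() == Event.KeeperState.Expired) {
            // The old session is gone, but the cluster may still be healthy:
            // open a fresh session and re-register instead of killing the process.
            try {
                connect();
            } catch (Exception e) {
                // give up (and shut down) only after repeated failures
            }
        }
    }
}

The "option" I was describing could simply switch between this kind of reconnect path and the current abort.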
> Are you using Ganglia or other cluster monitoring software? These kinds of ZK and HDFS issues usually come from over-extended clusters with swapping and/or io starvation. Sorry if you've had previous threads and I just forget.

This is also my experience. A common idea is to run hadoop and hbase on the same cluster. Some hadoop jobs can eat up a lot of resources, which can in some circumstances push HBase over. This is in no way a criticism of HBase, but it would be nice if HBase were more resilient to high-load situations. HBase is extremely robust for its maturity, which is a credit to those contributing code. One day it would be nice if, instead of dying and requiring external intervention (i.e., scripts or manual restarts), HBase were able to handle the fact that the cluster is overloaded and automatically come back once the offending job is killed or the high-load condition no longer exists (i.e., without external intervention).

>> In the mean time, you can try to work around any of these issues by using bigger hardware than you would otherwise think is needed and not letting the load get very high. For example, I tend to have these kinds of problems much less often when the load on any individual machine never goes above the number of cores.
>
> What kind of hardware do you think should be needed? And are you talking about machine specs or cluster size?

The kind of machine and size of cluster depend on the application, which is why I gave the metric below. The metric below can be used with ANY application and ANY cluster as a general rule of thumb.

> I've run HBase on small, slow clusters and big, fast ones. Of course, you need to configure properly and have the right expectations. I'm not even sure what you mean by "load goes above number of cores".

Load is a numeric value reported by the OS (usually as an average over time). It is actually a measure of the queue of threads and processes waiting to run on the processor. On a 2-core machine, a load of 2 would mean that 0 processes are waiting to run, but that the processors are fully utilized (remember, on average over time, not instantaneously). A load of 4 would mean that, on average, 2 processes are waiting to run on the CPU. This is a quantifiable measure that can help drive expectations (see the P.S. below for a quick way to check it). "Load not going above the number of cores" is a specific way to describe the conditions under which I've seen HBase be very stable, independent of cluster size, machine size, and application. I've seen it continue to work well at much higher loads, for example 2 or 3 times the number of cores. But if the load goes higher, or stays that high for a long period of time, there is an increased risk of a Region Server Suicide. The high load may not be the cause, but it could exacerbate some other event that is the root cause.

I'm a fan of HBase, but I've also had some huge frustrations with it. Anybody contemplating an HBase cluster should be aware of this limitation. A common decision to make is whether to go with a SQL database (possibly even distributed, i.e., Greenplum) or with a tool like HBase. If you run mysql on an overloaded machine, mysql may become unresponsive, but it will probably not kill itself (i.e., it recovers from any problems without external intervention). The current incarnation of HBase requires external intervention. My preference would be for HBase to NOT require external intervention.

> Looking back at your previous threads, I see you're running on EC2. I guess that explains much of your experiences, though I know some people running in production and happy on EC2 w/ HBase. In the future, preface your post with that as I think it has a lot to do with the problems you've been seeing.

It isn't just about EC2; it is about any underlying computing environment that may not match an "ideal" environment. Whatever flakiness it is in the EC2 environment that you refer to above, I'm looking forward to a time when HBase is resilient enough to take that flakiness in stride and continue to stay up and running WITHOUT external intervention. If this is not the direction that HBase is going in, then it is something we should all be aware of when making technology decisions.

-Matthew
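P.S. For anyone who wants to see where their own machines stand against that rule of thumb, here is a small throwaway snippet (nothing HBase-specific) that compares the one-minute load average, as reported by the OS, to the number of cores:

import java.lang.management.ManagementFactory;

public class LoadCheck {
    public static void main(String[] args) {
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        int cores = Runtime.getRuntime().availableProcessors();
        if (load < 0) {
            System.out.println("Load average is not available on this platform");
        } else if (load <= cores) {
            System.out.printf("load %.2f <= %d cores: CPUs are not oversubscribed%n", load, cores);
        } else {
            System.out.printf("load %.2f > %d cores: work is queuing for the CPU%n", load, cores);
        }
    }
}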
