After running stably for quite a while (with long configured timeouts), we recently noticed that regionservers were starting to behave badly again. During compaction, regionservers complained that blocks were unavailable. Every couple of days, a regionserver terminated itself because it could not recover from the DFS errors.

So, after looking into it again, we might have found the actual cause of this problem. Prior to a regionserver termination, the logs of the corresponding datanode told us that the "df" command could not be run because it could not allocate memory. Indeed, we had fine-tuned our nodes to use nearly all RAM for the Hadoop/HBase and child task processes, and we had swap disabled. But we had assumed that a simple "df" check should not be that expensive, right..?

Well, it seems we had to learn a bit about "Linux memory overcommit". Without going into much detail, spawning a process in Linux requires the new process to have, more or less, the same memory footprint as the original process. Therefore, a datanode with a 1.6GB heap (in our case) needs about the same amount of memory free when spawning a new process, even though the spawned process will do little to nothing. To accommodate this, you should either have enough free memory available (physical / swap) or tweak the 'overcommit' configuration of the operating system. We decided to increase the amount of memory by enabling swap files.
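
For anyone who wants to check their own nodes, here is a rough sketch of the arithmetic described above. It reads the overcommit mode from /proc/sys/vm/overcommit_memory (0 = heuristic, 1 = always allow, 2 = strict accounting) and compares the parent's footprint against what /proc/meminfo reports. The helper names and the "MemFree + Cached + SwapFree" estimate are my own simplification for illustration, not how the kernel actually decides, and the 1.6GB figure is just our datanode heap.

```python
# Rough check: could a process of a given size be forked without the
# kernel refusing the allocation? This is a crude approximation only.

OVERCOMMIT_MODES = {0: "heuristic", 1: "always allow", 2: "strict accounting"}


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {name: value}, values as reported (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def fork_headroom_kb(meminfo, parent_kb):
    """Memory left (kB) if a full copy of the parent had to be accounted for.

    Assumption: free memory + page cache + free swap is a fair stand-in
    for what is available. A negative result suggests the fork may fail.
    """
    available = sum(meminfo.get(k, 0) for k in ("MemFree", "Cached", "SwapFree"))
    return available - parent_kb


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        with open("/proc/sys/vm/overcommit_memory") as f:
            mode = int(f.read().strip())
    except OSError:
        print("no Linux /proc filesystem here; nothing to check")
    else:
        parent_kb = 1600 * 1024  # 1.6GB datanode heap, as in our case
        print("overcommit mode:", OVERCOMMIT_MODES.get(mode, "unknown"))
        print("estimated headroom (kB):", fork_headroom_kb(info, parent_kb))
```

With swap disabled, SwapFree is 0, so the headroom depends entirely on free RAM and page cache; enabling swap raises it without taking memory away from the Hadoop processes.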

We're still running Hadoop 0.20.1 and HBase 0.20.3; presumably the newest releases have better handling of errors in the DFSClient/InputStreams. Nevertheless, we believe that we have found the root cause of our regionserver problems.

Ferdy.

Stack wrote:
The culprit might be the fragmentation calculation.  See
https://issues.apache.org/jira/browse/HBASE-2165.
St.Ack

On Wed, Mar 10, 2010 at 9:33 AM, Andrew Purtell <[email protected]> wrote:
However, every once in a while our Nagios (our service monitor) detects
that requesting the HBase master page takes a long time. Sometimes > 10
sec, rarely around 30 secs but most of the time < 10 secs. In the cases
where the page loads slowly, there is a fair amount of load on HBase.

I've noticed this also. With 0.20.4-dev. I think others have mentioned it
on the list from time to time. However, I can never seem to jump on to a
console fast enough to grab a stack dump before the UI becomes responsive
again. :-( It is not consistent behavior. It concerns me that perhaps
whatever lock is holding up the UI is also holding up any client
attempting to (re)locate a region. If I manage to capture it I will file
a jira.



  - Andy




