Re: Regionserver problems because of datanode timeouts

Ferdy Mon, 14 Jun 2010 03:10:20 -0700

After running stably for quite a while (using long configured timeouts), we recently noticed that regionservers were starting to behave badly again. During compaction, regionservers complained that blocks were unavailable. Every couple of days, a regionserver terminated itself because it could not recover from the DFS errors.

So, after looking into it again, we might have found the actual cause of this problem. Prior to a regionserver termination, the logs of the corresponding datanode told us that the "df" command could not be run because it could not allocate memory. Indeed, we had fine-tuned our nodes to use nearly all RAM for the Hadoop/Hbase and child task processes, and we had swap disabled. But we had assumed that a simple "df" check should not be that expensive, right..?

Well, it seems we had to learn a bit about "Linux memory overcommit". Without going into much detail, spawning a process in Linux requires the new process to have the same memory footprint as the original process, more or less. Therefore, a datanode with a 1.6GB heap (in our case) should have about the same amount of memory free when spawning a new process, even though the spawned process will do little to nothing. To accommodate this, you should either have enough free memory available (physical / swap) or tweak the 'overcommit' configuration of the operating system. We decided to increase the amount of memory by enabling swap files.
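For anyone who wants to check their own nodes, here is a rough sketch of the arithmetic described above. It is not what the kernel actually computes when it decides whether a fork may proceed; it only compares free RAM plus free swap against the footprint of the parent process. The function names and the 1.6GB figure in the example are illustrative assumptions; the file paths are the standard Linux /proc locations.

```python
import os

# Meaning of the values in /proc/sys/vm/overcommit_memory on Linux.
OVERCOMMIT_MODES = {
    0: "heuristic (obvious overcommits are refused)",
    1: "always overcommit",
    2: "strict (never overcommit beyond the commit limit)",
}


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def fork_headroom_kb(meminfo, parent_kb):
    """Free RAM plus free swap, minus the parent's footprint (all in kB).

    A negative result suggests that spawning a child (such as "df") from a
    parent of this size could fail to allocate memory. This is a simplified
    estimate, not the kernel's real accounting.
    """
    available = meminfo.get("MemFree", 0) + meminfo.get("SwapFree", 0)
    return available - parent_kb


if __name__ == "__main__":
    meminfo_path = "/proc/meminfo"
    mode_path = "/proc/sys/vm/overcommit_memory"
    if os.path.exists(meminfo_path):
        with open(meminfo_path) as handle:
            info = parse_meminfo(handle.read())
        # Assumed parent footprint: a 1.6 GB datanode heap, expressed in kB.
        parent_kb = int(1.6 * 1024 * 1024)
        print("estimated headroom (kB):", fork_headroom_kb(info, parent_kb))
    if os.path.exists(mode_path):
        with open(mode_path) as handle:
            mode = int(handle.read().strip())
        print("overcommit mode:", mode, OVERCOMMIT_MODES.get(mode, "unknown"))
```

With swap disabled and nearly all RAM handed to the Java processes, the estimate comes out negative, which matches what we saw. Enabling swap raises SwapFree and pushes the number back above zero.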

We're still running Hadoop 0.20.1 and Hbase 0.20.3; presumably the newest releases have better handling of errors in the DFSClient/InputStreams. Nevertheless, we believe that we have found the root cause of our regionserver problems.


Ferdy.

Stack wrote:

The culprit might be the fragmentation calculation.  See
https://issues.apache.org/jira/browse/HBASE-2165.
St.Ack

On Wed, Mar 10, 2010 at 9:33 AM, Andrew Purtell <[email protected]> wrote:

However, once and every while our Nagios (our service monitor) detects that requesting the Hbase master page takes a long time. Sometimes > 10 sec, rarely around 30 secs but most of the time < 10 secs. In the cases the page loads slowly, there is a fair amount of load on Hbase.

I've noticed this also. With 0.20.4-dev. I think others have mentioned it
on the list from time to time. However, I can never seem to jump on to a
console fast enough to grab a stack dump before the UI becomes responsive
again. :-( It is not consistent behavior. It concerns me that perhaps
whatever lock is holding up the UI is also holding up any client
attempting to (re)locate a region. If I manage to capture it I will file
a jira.



  - Andy
