Thanks for updating the list, Ferdy. St.Ack
On Mon, Jun 14, 2010 at 3:09 AM, Ferdy <[email protected]> wrote:
> After running stable for quite a while (using long configured timeouts), we
> recently noticed regionservers were starting to behave badly again. During
> compaction, regionservers complained that blocks were unavailable. Every
> couple of days, a regionserver decided to terminate itself because it could
> not recover from the DFS errors.
>
> So, after looking into it again, we might have found the actual cause of
> this problem. Prior to a regionserver terminating, the logs of the
> corresponding datanode told us that the "df" command could not be run
> because it could not allocate memory. Indeed, we had fine-tuned our nodes
> to use nearly all RAM for the Hadoop/HBase and child task processes, and we
> had swap disabled. But we assumed that a simple "df" check should not be
> that expensive, right..?
>
> Well, it seems we had to learn a bit about "Linux memory overcommit".
> Without going into much detail, spawning a process on Linux requires the
> new process to have, more or less, the same memory footprint as the
> original process. Therefore, a datanode with a 1.6GB heap (in our case)
> needs about that same amount of memory free when spawning a new process,
> even though the spawned process will do little to nothing. To accommodate
> this, you should either have enough free memory available (physical /
> swap) or you can tweak the 'overcommit' configuration of the operating
> system. We decided to increase the amount of available memory by enabling
> swap files.
>
> We're still running Hadoop 0.20.1 and HBase 0.20.3; presumably the newest
> releases have better handling of errors in the DFSClient/InputStreams.
> Nevertheless, we believe that we have found the root cause of our
> regionserver problems.
>
> Ferdy.
>
> Stack wrote:
>>
>> The culprit might be the fragmentation calculation. See
>> https://issues.apache.org/jira/browse/HBASE-2165.
>> St.Ack
>>
>> On Wed, Mar 10, 2010 at 9:33 AM, Andrew Purtell <[email protected]>
>> wrote:
>>
>>>> However, every once in a while our Nagios (our service monitor) detects
>>>> that requesting the HBase master page takes a long time. Sometimes > 10
>>>> sec, rarely around 30 secs, but most of the time < 10 secs. In the cases
>>>> where the page loads slowly, there is a fair amount of load on HBase.
>>>
>>> I've noticed this also, with 0.20.4-dev. I think others have mentioned it
>>> on the list from time to time. However, I can never seem to jump on to a
>>> console fast enough to grab a stack dump before the UI becomes responsive
>>> again. :-( It is not consistent behavior. It concerns me that perhaps
>>> whatever lock is holding up the UI is also holding up any client
>>> attempting to (re)locate a region. If I manage to capture it I will file
>>> a jira.
>>>
>>> - Andy
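
For anyone hitting the same failure Ferdy describes above, here is a minimal,
hypothetical Java sketch (not the actual Hadoop DU/DF code) of the kind of
subprocess call a datanode makes. The point is that Runtime.exec() has to fork
the JVM first, so with swap disabled and strict overcommit accounting the
kernel can refuse to fork a 1.6GB-heap process, even though the child will
immediately exec a tiny "df" binary:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Illustrative sketch only: shells out to "df" the way a large-heap
    // daemon might. On an overcommit-starved node the exec() call itself can
    // fail with "java.io.IOException: ... Cannot allocate memory" (ENOMEM
    // returned by fork), which is the error seen in the datanode logs.
    public class DfCheckSketch {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Hypothetical mount point; a real datanode checks its configured data dirs.
            String dir = args.length > 0 ? args[0] : "/";
            Process p = Runtime.getRuntime().exec(new String[] {"df", "-k", dir});
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = out.readLine()) != null) {
                    System.out.println(line); // header plus the usage line for the mount
                }
            }
            int rc = p.waitFor();
            if (rc != 0) {
                System.err.println("df exited with " + rc);
            }
        }
    }

The kernel side of this is /proc/sys/vm/overcommit_memory (0 = heuristic
overcommit, 1 = always allow, 2 = strict accounting against swap plus a
fraction of RAM). Adding swap, as Ferdy did, or relaxing that setting gives
the short-lived fork the headroom it needs.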
