Re: Region servers exiting, not recovering

Ryan Rawson Tue, 21 Sep 2010 17:39:27 -0700

No JVM limitations as such, but some code is just not really meant to be
restarted within the same JVM, and things just didn't work out well.
Specifically the DFSClient code, and I think we had to hack a bunch to
make the ZK sessions reconnect, because you have to re-init the entire
stack.


When you have a bunch of code that assumes a static gets initialized
once and never again, that doesn't make for an easy reinitialize.
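
To illustrate the static problem, here is a minimal sketch in plain Java.
None of this is HBase or HDFS code: FakeClient, ClientHolder and FakeServer
are hypothetical stand-ins, and the real DFSClient/ZK wiring is more involved.
It only shows why a "restart" inside the same JVM can pick up stale state.

```java
public class StaticInitSketch {

    /** Hypothetical stand-in for a client that owns a connection (not the real DFSClient). */
    static final class FakeClient {
        private boolean closed = false;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    /** Holder whose static field is initialized exactly once, when the class is first used. */
    static final class ClientHolder {
        static final FakeClient CLIENT = new FakeClient();
    }

    /** Hypothetical server that takes its client from the static holder. */
    static final class FakeServer {
        private final FakeClient client = ClientHolder.CLIENT;
        void shutdown() { client.close(); }
        boolean isHealthy() { return !client.isClosed(); }
    }

    /** Simulates an in-JVM restart and reports whether the second server instance is healthy. */
    static boolean restartInSameJvm() {
        FakeServer first = new FakeServer();
        first.shutdown();                      // e.g. after a session expiry
        FakeServer second = new FakeServer();  // "restart" without leaving the JVM
        return second.isHealthy();             // the static still holds the closed client
    }

    public static void main(String[] args) {
        System.out.println("healthy after in-JVM restart: " + restartInSameJvm());
    }
}
```

The second FakeServer is unhealthy because the static initializer never runs
again. In a fresh JVM it would run again, which is why a clean process restart
sidesteps the whole issue.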

On Tue, Sep 21, 2010 at 5:36 PM, Matthew LeMieux <[email protected]> wrote:
> What are the JVM limitations that you were running into?
>
> -Matthew
>
> On Sep 21, 2010, at 5:31 PM, Ryan Rawson wrote:
>
>> We tried that before, but some things are difficult to reset in the same JVM.
>>
>> A clean restart just works better :-)
>>
>> On Tue, Sep 21, 2010 at 5:29 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> Thanks a lot, Ryan.
>>>
>>> That's what I thought. I knew the explanation that the regions are split,
>>> although I guess one might argue there's no reason why we can't try to
>>> start a new life by rejoining the cluster as a new region server (but in
>>> the same process). Or at least have such an option. Just wanted to
>>> double-check before wrapping it into some sort of a kicker.
>>> -Dmitriy
>>>
>>>
>>> On Tue, Sep 21, 2010 at 5:24 PM, Ryan Rawson <[email protected]> wrote:
>>>
>>>> You could wrap the regionserver in a script that auto-reboots them?
>>>>
>>>> We can't really recover from this scenario, because the master notices
>>>> we are dead, then splits our logs and reassigns the regions to other
>>>> nodes.  This is the basis of how HBase stays reliable in the face of
>>>> machine failure.
>>>>
>>>> -ryan
>>>>
>>>> On Tue, Sep 21, 2010 at 5:20 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> so in our production, we see temporary networking failures (we are not
>>>>> quite 100% sure what they are) but now and then region server's
>>>>> zookeeper session would get expired and in addition some ipc channels
>>>>> would throw 'channel closed'.
>>>>>
>>>>> This causes region server to exit. Which is not a very big deal, our
>>>>> monitoring system would send a text message so somebody would restart the
>>>>> region server.
>>>>>
>>>>> However, this happens a little more often than we would like to
>>>>> handle manually.
>>>>>
>>>>> Why is the server not recovering/reconnecting automatically? Is there a
>>>>> facility to enable server restarts, so that region server nodes rejoin
>>>>> the cluster automatically?
>>>>>
>>>>> Thanks.
>>>>> -Dmitriy
>>>>>
>>>>
>>>
>
>
