No JVM limitations as such, but some code is just not meant to be restarted within the same JVM, and things didn't work out well. Specifically the DFSClient code, and I think we had to hack a bunch to make the ZK sessions reconnect, because you have to re-init the entire stack.
When you have a bunch of code that assumes a static gets initialized once and never again, that doesn't make for an easy reinitialize.

On Tue, Sep 21, 2010 at 5:36 PM, Matthew LeMieux <[email protected]> wrote:
> What are the JVM limitations that you were running into?
>
> -Matthew
>
> On Sep 21, 2010, at 5:31 PM, Ryan Rawson wrote:
>
>> We tried that before, but some things are difficult to reset in the same JVM.
>>
>> A clean restart just works better :-)
>>
>> On Tue, Sep 21, 2010 at 5:29 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> Thanks a lot, Ryan.
>>>
>>> That's what i thought, I knew this explanation that the regions are split;
>>> although I guess one might reason there's no reason why we can't try to
>>> start a new life by rejoining cluster again as a new region server (but the
>>> same process). Or at least have such an option. Just wanted to double-check
>>> before wrapping it into some sort of a kicker.
>>> -Dmitriy
>>>
>>>
>>> On Tue, Sep 21, 2010 at 5:24 PM, Ryan Rawson <[email protected]> wrote:
>>>
>>>> You could wrap the regionserver in a script that auto-reboots them?
>>>>
>>>> We cant really recover from this scenario, because the master notices
>>>> we are dead, then splits our logs and reassigns the regions to other
>>>> nodes. This is the basis of how reliable hbase works in the face of
>>>> machine failure.
>>>>
>>>> -ryan
>>>>
>>>> On Tue, Sep 21, 2010 at 5:20 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> so in our production, we see temporary networking failures (we are not quite
>>>>> 100% sure what they are) but now and then region server's zookeeper session
>>>>> would get expired and in addition some ipc channels would throw 'channel
>>>>> closed'.
>>>>>
>>>>> This causes region server to exit. Which is not a very big deal, our
>>>>> monitoring system would send a text message so somebody would restart the
>>>>> region server.
>>>>>
>>>>> however, this does happen a little more often than we probably would have
>>>>> liked to do it manually.
>>>>>
>>>>> Why is server not recovering/reconnecting automatically? is there a facility
>>>>> to enable server restarts and region server nodes to rejoin the cluster
>>>>> automatically?
>>>>>
>>>>> Thanks.
>>>>> -Dmitriy
>>>>>
>>>>
>>>
>
>
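For what it's worth, the wrapper-script idea quoted above (restart the regionserver process whenever it exits, so every start gets a fresh JVM) could be sketched roughly like this in Python. This is only a sketch under assumptions: the function name `supervise`, its parameters, and the restart delay are made up for illustration and are not something that ships with HBase.

```python
import subprocess
import time


def supervise(cmd, max_restarts=None, delay_seconds=5.0):
    """Run cmd in the foreground and start it again every time it exits.

    Hypothetical helper for illustration only. Returns the number of times
    the command was started. max_restarts=None means restart forever.
    """
    starts = 0
    while True:
        starts += 1
        # Blocks until the child process exits, for whatever reason.
        exit_code = subprocess.call(cmd)
        print(f"process exited with code {exit_code} (start #{starts})")
        if max_restarts is not None and starts > max_restarts:
            return starts
        # Short pause so a process that dies immediately does not spin the CPU.
        time.sleep(delay_seconds)
```

You would call it with whatever starts a region server in the foreground on your install, e.g. `supervise(["bin/hbase", "regionserver", "start"])` (that command line is an assumption about your setup). Each restart is a brand-new process, so none of the static state has to be reset; by then the master has already split the logs and reassigned the regions, so the server simply rejoins as a fresh one.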
