Thanks a lot, Ryan. That's what I thought. I knew the explanation that the logs are split and the regions reassigned; still, I guess one might argue there's no reason why we can't try to start a new life by rejoining the cluster as a new region server (in the same process). Or at least have that as an option. I just wanted to double-check before wrapping it in some sort of a kicker.

-Dmitriy
On Tue, Sep 21, 2010 at 5:24 PM, Ryan Rawson <[email protected]> wrote:

> You could wrap the regionserver in a script that auto-reboots them?
>
> We cant really recover from this scenario, because the master notices
> we are dead, then splits our logs and reassigns the regions to other
> nodes. This is the basis of how reliable hbase works in the face of
> machine failure.
>
> -ryan
>
> On Tue, Sep 21, 2010 at 5:20 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > Hi,
> >
> > so in our production, we see temporary networking failures (we are not quite
> > 100% sure what they are) but now and then region server's zookeeper session
> > would get expired and in addition some ipc channels would throw 'channel
> > closed'.
> >
> > This causes region server to exit. Which is not a very big deal, our
> > monitoring system would send a text message so somebody would restart the
> > region server.
> >
> > however, this does happen a little more often than we probably would have
> > liked to do it manually.
> >
> > Why is server not recovering/reconnecting automatically? is there a facility
> > to enable server restarts and region server nodes to rejoin the cluster
> > automatically?
> >
> > Thanks.
> > -Dmitriy
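For reference, the "kicker" idea discussed in this thread (a wrapper that restarts the region server process after it exits) could look roughly like the sketch below. This is only an illustration, not anything shipped with HBase: the function name `supervise` and its parameters are hypothetical, and it assumes that exit code 0 means a deliberate shutdown that should not be followed by a restart. Whether the region server really exits non-zero after a ZooKeeper session expiry is something to verify on your own installation.

```python
import subprocess
import time


def supervise(command, max_restarts=5, base_delay=1.0, max_delay=60.0,
              sleep=time.sleep):
    """Run `command` in the foreground and restart it whenever it fails.

    Hypothetical helper, not part of HBase. Returns the list of exit
    codes observed, one entry per run of the command.
    """
    exit_codes = []
    delay = base_delay
    for attempt in range(max_restarts + 1):
        # Blocks until the child process exits.
        code = subprocess.call(command)
        exit_codes.append(code)
        if code == 0:
            # Assumption: a clean exit is an operator-requested stop.
            break
        if attempt == max_restarts:
            # Give up and leave it to monitoring to alert a human.
            break
        # Back off so a persistent fault does not cause a tight restart loop.
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return exit_codes
```

It would be called with whatever command starts the region server in the foreground on your installation, for example `supervise(["bin/hbase", "regionserver", "start"])` (that command line is an assumption about the local setup). The restarted process joins the cluster as a fresh region server; the regions it held before were already reassigned by the master, so the wrapper only saves the manual restart. The restart cap and the doubling delay are there so that a real outage still ends up with a human.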
