You could wrap the region server in a script that restarts it automatically. We can't really recover in place from this scenario, because once the master notices we are dead, it splits our logs and reassigns our regions to other nodes. This is the basis of how HBase stays reliable in the face of machine failure.
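
A minimal sketch of such a wrapper, in Python. It assumes the region server can be run as a foreground process; the command line in the closing comment (`hbase regionserver start`) and all of the delay values are assumptions to adapt to your install, not settings HBase prescribes. The function name `supervise` and its parameters are hypothetical. Since the old process's regions have already been reassigned, each restart is simply a fresh region server rejoining the cluster.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=None, base_delay=5.0, max_delay=300.0,
              healthy_after=600.0, sleep=time.sleep):
    """Run cmd in the foreground and restart it every time it exits.

    Loops forever when max_restarts is None; otherwise returns the number
    of restarts performed once that limit has been reached.
    """
    restarts = 0
    delay = base_delay
    while True:
        started = time.monotonic()
        exit_code = subprocess.call(cmd)  # blocks until the process exits
        uptime = time.monotonic() - started
        print(f"{cmd[0]} exited with code {exit_code} after {uptime:.0f}s",
              file=sys.stderr)

        if max_restarts is not None and restarts >= max_restarts:
            return restarts

        # A long run suggests the earlier failure was transient,
        # so start the backoff over from the base delay.
        if uptime >= healthy_after:
            delay = base_delay

        # Wait before restarting, doubling the wait after each quick
        # failure so a persistent outage does not cause a tight loop.
        sleep(delay)
        delay = min(delay * 2, max_delay)
        restarts += 1


# Example call (assumed command line, adjust to your installation):
# supervise(["hbase", "regionserver", "start"])
```

The backoff is there so that a longer network outage does not turn into a rapid crash loop. It is probably worth keeping the text-message alert as well, so that frequent restarts still get noticed.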

-ryan

On Tue, Sep 21, 2010 at 5:20 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Hi,
>
> so in our production, we see temporary networking failures (we are not quite
> 100% sure what they are) but now and then region server's zookeeper session
> would get expired and in addition some ipc channels would throw 'channel
> closed'.
>
> This causes region server to exit. Which is not a very big deal, our
> monitoring system would send a text message so somebody would restart the
> region server.
>
> however, this does happen a little more often than we probably would have
> liked to do it manually.
>
> Why is server not recovering/reconnecting automatically? is there a facility
> to enable server restarts and region server nodes to rejoin the cluster
> automatically?
>
> Thanks.
> -Dmitriy
>
