If the machine were completely partitioned, as far as I know it would lose
its lease, so the only thing we have to take care of is clearing the region
server's state by doing a "restart" so that it's ready to come back into
the cluster. If ZK is down but the rest is up, closing the files in HDFS
should ensure that we lose as little data as possible, if any.

I think that in a multi-rack setup it is possible to be unable to talk to
ZK while still being able to talk to the Namenode, since machines can be
anywhere. Especially in HBase 0.20, the master can fail over to any node
that has a backup Master ready. So in that case, the region server should
consider itself gone from the cluster, close any connections it has, and
restart.
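
To make the two cases concrete, here is a rough sketch in Python of what I
have in mind. This is not HBase code: the function name and the step names
are made up for illustration, and it only covers the moment when the region
server notices it can no longer talk to ZK.

```python
from typing import List


def on_zk_unreachable(hdfs_reachable: bool) -> List[str]:
    """Return the ordered steps a region server would take once it can no
    longer reach ZooKeeper. Hypothetical helper, not an HBase API."""
    steps: List[str] = []
    if hdfs_reachable:
        # ZK is unreachable but the Namenode still answers: close the open
        # files first so that as little data as possible is lost, then drop
        # every remaining connection.
        steps.append("close_hdfs_files")
        steps.append("close_connections")
    # In both cases the server considers itself gone from the cluster and
    # restarts with clean state. When fully partitioned there is nothing to
    # close, since the lease is expected to be lost anyway.
    steps.append("restart")
    return steps


if __name__ == "__main__":
    print(on_zk_unreachable(hdfs_reachable=True))
    print(on_zk_unreachable(hdfs_reachable=False))
```

The only point is the ordering: when HDFS is still reachable, the files get
closed before the restart; when it isn't, the restart is all that is left.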

Those are very legitimate questions, Gustavo, thanks for asking.


On Wed, Jun 24, 2009 at 3:38 PM, Gustavo Niemeyer <gust...@niemeyer.net> wrote:

> > Ben's opinion is that it should not belong in the default API but in the
> > common client that another recent thread was about. My opinion is just that
> > I need such a functionality, wherever it is.
> Understood, sorry.  I just meant that it feels like something that
> would likely be useful to other people too, so might have a role in
> the default API to ensure it gets done properly considering the
> details that Ben brought up.
> > If the node gets the exception (or has its own timer), as I wrote, it will
> > shut itself down to release HDFS leases as fast as possible. If ZK is really
> > down and it's not a network partition, then HBase is down and this is fine
> > because it won't be able to work anyway.
> Right, that's mostly what I was wondering.  I was pondering about
> under which circumstances the node would be unable to talk to the
> ZooKeeper server but would still be holding the HDFS lease in a way
> that prevented the rest of the system from going on.  If I understand
> what you mean, if ZooKeeper is down entirely, HBase would be down for
> good. If the machine was partitioned off entirely, the HDFS side of
> things will also be disconnected, so shutting the node down won't help
> the rest of the system recover.
> --
> Gustavo Niemeyer
> http://niemeyer.net
