This looks a bit like a small bobble we had during an upgrade a while ago.
I THINK that the answer here is to mind-wipe the misbehaving node and have
it resync from scratch from the other nodes.
Wait for confirmation from somebody real.
On Wed, Jun 2, 2010 at 11:11 AM, Charity Majors wrote:
I knew Patrick would remember to add an important detail.
On Wed, Jun 2, 2010 at 11:49 AM, Patrick Hunt ph...@apache.org wrote:
As Ted suggested, you can remove the datadir -- *only on the affected
server* -- and then restart it.
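For anyone who wants the concrete steps, here is a minimal sketch of that
wipe-and-resync procedure, assuming a stock install; the dataDir path and
the location of the standard zkServer.sh control script are assumptions,
so adjust them for your deployment:

    # Sketch of the wipe-and-resync recovery described above. DATA_DIR and
    # ZK_SCRIPT are assumed paths; adjust for your install.
    import shutil
    import subprocess

    DATA_DIR = "/var/zookeeper/data"              # dataDir from zoo.cfg (assumed)
    ZK_SCRIPT = "/opt/zookeeper/bin/zkServer.sh"  # assumed install location

    # 1. Stop only the misbehaving server.
    subprocess.check_call([ZK_SCRIPT, "stop"])

    # 2. Remove its on-disk state (snapshots and transaction logs).
    #    Take care to preserve the myid file if it lives under dataDir.
    shutil.rmtree(DATA_DIR)

    # 3. Restart; the server rejoins the ensemble and pulls a fresh
    #    snapshot from the current leader.
    subprocess.check_call([ZK_SCRIPT, "start"])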
Thanks. That worked for me. I'm a little confused about why it threw the
entire cluster into an unusable state, though.
I said before that we restarted all three nodes, but tracing back, we actually
didn't. The zookeeper cluster was refusing all connections until we restarted
node one. But
Hi Charity,
This is certainly not expected. It would be very useful if
you could provide us with as much information about your issue as
possible. I would suggest that either you create a new jira and link
it to ZOOKEEPER-335, or that you add to 335 directly.
We'll be looking further into
Charity, do you mind going through your scenario again to give a
timeline for the failure? I'm a bit confused as to what happened.
ben
On 06/02/2010 01:32 PM, Charity Majors wrote:
Thanks. That worked for me. I'm a little confused about why it threw the
entire cluster into an unusable state, though.
Sure thing.
We got paged this morning because backend services were not able to write to
the database. Each server discovers the DB master using zookeeper, so when
zookeeper goes down, they assume they no longer know who the DB master is and
stop working.
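A rough sketch of that discovery pattern, using the kazoo Python client
purely for illustration -- this is not the code the poster runs, and the
/db/master znode path and set_db_master() hook are made-up names:

    # Illustrative sketch of DB-master discovery via ZooKeeper using the
    # kazoo client. /db/master and set_db_master() are assumptions.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    def on_master_change(event):
        # Watches fire once; re-read the znode and re-arm the watch.
        refresh_master()

    def refresh_master():
        data, _stat = zk.get("/db/master", watch=on_master_change)
        set_db_master(data.decode("utf-8"))  # hypothetical application hook

    refresh_master()

When the whole ensemble refuses connections, that read fails, which matches
the behavior described above: the services no longer know the master.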
When we realized there were no
It does look like a special case of that JIRA item. I read back through the
Chubby paper, and it sounds like they solve this problem using a similar
mechanism. They just block the client until either they manage to
re-establish a session or the session timeout expires (at which point they
report the session as expired).
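A minimal sketch of that block-until-resolved behavior, again using kazoo
for illustration; the listener wiring and the /db/master path are
assumptions, not anything from the thread:

    # Chubby-style behavior sketched with kazoo: block callers while the
    # session is in doubt, and only fail once the session is actually lost.
    import threading
    from kazoo.client import KazooClient, KazooState

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    connected = threading.Event()
    expired = False

    def listener(state):
        global expired
        if state == KazooState.CONNECTED:
            connected.set()        # session (re-)established; unblock callers
        elif state == KazooState.SUSPENDED:
            connected.clear()      # connectivity in doubt: block, don't fail
        elif state == KazooState.LOST:
            expired = True         # session timeout expired; give up
            connected.set()

    zk.add_listener(listener)
    zk.start()

    def who_is_master():
        connected.wait()           # blocks while the session is in doubt
        if expired:
            raise RuntimeError("session expired; DB master unknown")
        data, _stat = zk.get("/db/master")
        return data.decode("utf-8")

That way callers stall through a brief outage instead of immediately
deciding the master is unknown and stopping work.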