Re: zookeeper crash

2010-06-02 Thread Ted Dunning
This looks a bit like a small bobble we had when upgrading a bit ago. I THINK that the answer here is to mind-wipe the misbehaving node and have it resynch from scratch from the other nodes. Wait for confirmation from somebody real. On Wed, Jun 2, 2010 at 11:11 AM, Charity Majors

Re: zookeeper crash

2010-06-02 Thread Ted Dunning
I knew Patrick would remember to add an important detail. On Wed, Jun 2, 2010 at 11:49 AM, Patrick Hunt ph...@apache.org wrote: As Ted suggested you can remove the datadir -- *only on the affected server* -- and then restart it.
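
A minimal sketch of that recovery step in Python, to be run only on the misbehaving server while its ZooKeeper process is stopped. It assumes the usual layout in which the data directory holds a `myid` file next to a `version-2` directory of snapshots and transaction logs, and that no separate transaction log directory is configured. It moves `version-2` aside rather than deleting it, so the step can be undone. The function name `set_aside_zk_data` and all paths are hypothetical; stopping and restarting the server is not shown.

```python
import shutil
import time
from pathlib import Path


def set_aside_zk_data(data_dir, backup_root):
    """Move <data_dir>/version-2 into backup_root, leaving myid in place.

    Returns the backup path, or None if there was nothing to move.
    Assumption: the server is already stopped when this runs.
    """
    data_dir = Path(data_dir)
    backup_root = Path(backup_root)

    version_dir = data_dir / "version-2"
    if not version_dir.is_dir():
        # Nothing to set aside; the node will sync from its peers anyway.
        return None

    backup_root.mkdir(parents=True, exist_ok=True)
    # Nanosecond timestamp keeps repeated runs from colliding.
    target = backup_root / f"version-2.{time.time_ns()}"
    shutil.move(str(version_dir), str(target))
    return target
```

After the restart, the node should come up with an empty data directory, keep its identity from `myid`, and pull the current state from the rest of the ensemble.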

Re: zookeeper crash

2010-06-02 Thread Charity Majors
Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though. I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But

Re: zookeeper crash

2010-06-02 Thread Flavio Junqueira
Hi Charity, This is certainly not expected. It would be very useful if you could provide us with as much information about your issue as possible. I would suggest that either you create a new jira and link it to ZOOKEEPER-335, or that you add to 335 directly. We'll be looking further into

Re: zookeeper crash

2010-06-02 Thread Benjamin Reed
charity, do you mind going through your scenario again to give a timeline for the failure? i'm a bit confused as to what happened. ben On 06/02/2010 01:32 PM, Charity Majors wrote: Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable

Re: zookeeper crash

2010-06-02 Thread Charity Majors
Sure thing. We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working. When we realized there were no

Re: Locking and Partial Failure

2010-06-02 Thread Charles Gordon
It does look like a special case of that JIRA item. I read back through the Chubby paper and it sounds like they solve this problem using a similar mechanism. They just block the client until either they manage to re-establish a session or until the session timeout expires (at which point they
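
A rough sketch of that blocking behaviour in Python, not taken from Chubby or ZooKeeper. The caller is held until either a reconnect attempt succeeds or the session timeout runs out. The names `wait_for_session` and `try_reconnect` are hypothetical, and what the client does once the timeout expires is left to the caller.

```python
import time


def wait_for_session(try_reconnect, session_timeout, poll_interval=0.05):
    """Block until try_reconnect() returns True or session_timeout elapses.

    try_reconnect is a hypothetical callable that makes one reconnect
    attempt and returns True on success. Returns True if the session was
    re-established, False if the timeout expired first.
    """
    deadline = time.monotonic() + session_timeout
    while True:
        if try_reconnect():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The session is considered lost; the caller decides what next.
            return False
        time.sleep(min(poll_interval, remaining))
```

For example, `wait_for_session(lambda: client_is_connected(), 10.0)` would hold the caller for up to ten seconds, where `client_is_connected` stands in for whatever connectivity check the client library offers.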