zookeeper crash

2010-06-02 Thread Charity Majors
I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to get away from a client bug that was crashing my backend services. Unfortunately, this morning I had a server crash, and it brought down my entire cluster. I don't have the logs leading up to the crash, because -- a

Re: zookeeper crash

2010-06-02 Thread Ted Dunning
This looks a bit like a small bobble we had when upgrading a bit ago. I THINK that the answer here is to mind-wipe the misbehaving node and have it resynch from scratch from the other nodes. Wait for confirmation from somebody real.
On Wed, Jun 2, 2010 at 11:11 AM, Charity Majors wrote:
> I upg

Re: zookeeper crash

2010-06-02 Thread Patrick Hunt
Hi Charity, unfortunately this is a known issue not specific to 3.3 that we are working to address. See this thread for some background: http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html I've raised the JIRA level to "blocker" to ensure we address this asap. As Ted su

Re: zookeeper crash

2010-06-02 Thread Ted Dunning
I knew Patrick would remember to add an important detail.
On Wed, Jun 2, 2010 at 11:49 AM, Patrick Hunt wrote:
> As Ted suggested you can remove the datadir -- *only on the affected
> server* -- and then restart it.
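The recovery step quoted above (clear the data directory only on the affected server, then restart it so it resyncs from its peers) might be scripted roughly as follows. This is a minimal sketch, not a procedure confirmed in the thread. It assumes the usual on-disk layout, where snapshots and transaction logs live in a `version-2` subdirectory of `dataDir` alongside the `myid` file, and it assumes no separate `dataLogDir` is configured. The function name and the backup naming are hypothetical. It moves the state aside instead of deleting it, so the files stay available for the JIRA, and it leaves `myid` in place because the server needs it to rejoin the ensemble.

```python
import shutil
import time
from pathlib import Path


def set_aside_server_state(data_dir, suffix=None):
    """Move dataDir/version-2 out of the way and keep myid untouched.

    Run this only on the affected server, and only while that server
    is stopped. Returns the backup path, or None if there was no
    version-2 directory to move.
    """
    data_dir = Path(data_dir)
    state = data_dir / "version-2"  # assumed layout: snapshots and txn logs
    if not state.is_dir():
        return None
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    backup = data_dir / f"version-2.bad-{suffix}"  # hypothetical naming
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.move(str(state), str(backup))
    return backup
```

After the move, restarting that one server should let it pull a fresh snapshot from the current leader; the other servers are not touched.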

Re: zookeeper crash

2010-06-02 Thread Charity Majors
Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state, though. I said before that we restarted all three nodes, but tracing back, we actually didn't. The zookeeper cluster was refusing all connections until we restarted node one. But

Re: zookeeper crash

2010-06-02 Thread Flavio Junqueira
Hi Charity, This is certainly not expected. It would be very useful if you could provide us with as much information about your issue as possible. I would suggest that either you create a new jira and link it to ZOOKEEPER-335, or that you add to 335 directly. We'll be looking further into w

Re: zookeeper crash

2010-06-02 Thread Benjamin Reed
charity, do you mind going through your scenario again to give a timeline for the failure? i'm a bit confused as to what happened.
ben
On 06/02/2010 01:32 PM, Charity Majors wrote:
Thanks. That worked for me. I'm a little confused about why it threw the entire cluster into an unusable state

Re: zookeeper crash

2010-06-02 Thread Charity Majors
Sure thing. We got paged this morning because backend services were not able to write to the database. Each server discovers the DB master using zookeeper, so when zookeeper goes down, they assume they no longer know who the DB master is and stop working. When we realized there were no proble

Re: Locking and Partial Failure

2010-06-02 Thread Charles Gordon
It does look like a special case of that JIRA item. I read back through the Chubby paper and it sounds like they solve this problem using a similar mechanism. They just block the client until either they manage to re-establish a session or until the session timeout expires (at which point they retur
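The blocking behaviour described for Chubby could be sketched along these lines. This is an illustration only, not the Chubby or ZooKeeper client API: `try_reconnect` is a hypothetical callable that returns True once a session has been re-established, and the clock and sleep functions are injected so the loop can be exercised without real waiting. What the caller sees on expiry is cut off in the message above, so raising an exception here is an assumption.

```python
import time


class SessionExpired(Exception):
    """Raised when the session timeout elapses before reconnecting."""


def block_until_reconnected(try_reconnect, session_timeout,
                            retry_interval=0.5,
                            clock=time.monotonic, sleep=time.sleep):
    """Block the caller until the session is back or the timeout expires.

    try_reconnect: hypothetical callable, True when the session is restored.
    session_timeout: seconds the caller is willing to stay blocked.
    """
    deadline = clock() + session_timeout
    while True:
        if try_reconnect():
            return
        remaining = deadline - clock()
        if remaining <= 0:
            raise SessionExpired("session timeout elapsed while disconnected")
        # Never sleep past the deadline.
        sleep(min(retry_interval, remaining))
```

The point of the design is that the application never acts on a lock it might have lost: it either resumes with the same session or gets an explicit failure.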