It does look like a special case of that JIRA item. I read back through the
Chubby paper and it sounds like they solve this problem using a similar
mechanism. They just block the client until either they manage to
re-establish a session or until the session timeout expires (at which point
they retur
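A minimal sketch of that Chubby-style blocking behaviour, under invented names (this is not the Chubby or ZooKeeper API, just a model of the mechanism described above): callers block while the session is disconnected, resume if it is re-established, and fail once the session timeout elapses.

```python
import threading


class SessionExpired(Exception):
    """Raised when the session timeout elapses before reconnecting."""


class BlockingSession:
    """Toy model: block callers while disconnected, unblock on
    reconnect, give up after the session timeout."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self._connected = threading.Event()
        self._connected.set()  # start out connected

    def on_disconnect(self):
        self._connected.clear()

    def on_reconnect(self):
        self._connected.set()

    def call(self, op):
        # Block until reconnected, or fail once the timeout elapses.
        if not self._connected.wait(timeout=self.session_timeout):
            raise SessionExpired("session timeout elapsed while disconnected")
        return op()
```

The point of the design is that a transient ZooKeeper blip looks like a slow call to the client, not an error, and only a genuine session expiry surfaces as a failure.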
Sure thing.
We got paged this morning because backend services were not able to write to
the database. Each server discovers the DB master using zookeeper, so when
zookeeper goes down, they assume they no longer know who the DB master is and
stop working.
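The failure mode here is that the backends treat "ZooKeeper unreachable" as "master unknown." A hedged sketch of the alternative (all names invented, not the actual backend code): cache the last-known master and keep serving it through a ZooKeeper outage, discarding it only when ZooKeeper delivers a newer value.

```python
class MasterDiscovery:
    """Illustrative sketch: cache the last-known DB master so a
    ZooKeeper outage alone does not wipe the answer -- the stale
    value remains the best guess until a newer one arrives."""

    def __init__(self):
        self._master = None

    def on_master_update(self, addr):
        # Invoked when the master znode changes (e.g. from a watch callback).
        self._master = addr

    def on_zk_disconnected(self):
        # Deliberately keep the cached value; clearing it here is exactly
        # what made every backend stop writing during the outage.
        pass

    def current_master(self):
        if self._master is None:
            raise RuntimeError("never learned a master")
        return self._master
```

The trade-off is staleness: during an outage a failover could go unnoticed until ZooKeeper is back, but writes to an unchanged master keep flowing.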
When we realized there were no proble
charity, do you mind going through your scenario again to give a
timeline for the failure? i'm a bit confused as to what happened.
ben
On 06/02/2010 01:32 PM, Charity Majors wrote:
Thanks. That worked for me. I'm a little confused about why it threw the
entire cluster into an unusable state
Hi Charity, This is certainly not expected. It would be very useful if
you could provide us with as much information about your issue as
possible. I would suggest that either you create a new jira and link
it to ZOOKEEPER-335, or that you add to 335 directly.
We'll be looking further into w
Thanks. That worked for me. I'm a little confused about why it threw the
entire cluster into an unusable state, though.
I said before that we restarted all three nodes, but tracing back, we actually
didn't. The zookeeper cluster was refusing all connections until we restarted
node one. But
I knew Patrick would remember to add an important detail.
On Wed, Jun 2, 2010 at 11:49 AM, Patrick Hunt wrote:
> As Ted suggested you can remove the datadir -- *only on the affected
> server* -- and then restart it.
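That "remove the datadir and restart" step amounts to moving the node's snapshot/log directory aside so the server resyncs fresh from the quorum on restart. A sketch against a scratch directory (a real install would use the dataDir from zoo.cfg and stop/start the server around this; the names below are illustrative):

```python
import pathlib
import tempfile


def wipe_datadir(datadir: pathlib.Path) -> pathlib.Path:
    """Move the version-2 directory (where ZooKeeper keeps snapshots
    and transaction logs) aside rather than deleting it, so the node
    resyncs from the leader on restart and the old data can still be
    inspected. Run only on the affected server, with the server stopped."""
    version_dir = datadir / "version-2"
    backup = datadir / "version-2.bak"
    version_dir.rename(backup)
    return backup


# Demo against a scratch directory, not a live install:
demo = pathlib.Path(tempfile.mkdtemp())
(demo / "version-2").mkdir()
(demo / "version-2" / "snapshot.100").touch()
backup = wipe_datadir(demo)
```

Moving rather than deleting keeps the corrupt snapshot around for post-mortem, which matters in a thread like this where the root cause isn't yet known.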
Hi Charity, unfortunately this is a known issue not specific to 3.3 that
we are working to address. See this thread for some background:
http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
I've raised the JIRA level to "blocker" to ensure we address this asap.
As Ted su
This looks a bit like a small bobble we had when upgrading a bit ago.
I THINK that the answer here is to mind-wipe the misbehaving node and have
it resync from scratch from the other nodes.
Wait for confirmation from somebody real.
On Wed, Jun 2, 2010 at 11:11 AM, Charity Majors wrote:
> I upg
I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to
get away from a client bug that was crashing my backend services.
Unfortunately, this morning I had a server crash, and it brought down my entire
cluster. I don't have the logs leading up to the crash, because --
a