Re: zookeeper crash

2010-07-06 Thread Travis Crawford
Hey all -

I believe we just suffered an outage from this issue. Short version is
that while restarting quorum members with the GC flags recommended in the
Troubleshooting wiki page, a follower logged messages similar to those in
the following jiras:

2010-07-06 23:14:01,438 - FATAL
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 20 is
less than our epoch 21
2010-07-06 23:14:01,438 - WARN
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when
following the leader
java.io.IOException: Error: Epoch of leader is lower

https://issues.apache.org/jira/browse/ZOOKEEPER-335
https://issues.apache.org/jira/browse/ZOOKEEPER-790

Reading through the jiras, it's unclear whether the issue is well understood
at this point (since there's a patch available) or still being
investigated.

If it's still being investigated, let me know and I can attach the
relevant log lines to the appropriate jira.

Or, if the patch looks good, I can make a new release and help test.
Let me know :)

--travis





On Wed, Jun 16, 2010 at 3:25 PM, Flavio Junqueira  wrote:
> I would recommend opening a separate jira issue. I'm not convinced the
> issues are the same, so I'd rather keep them separate and link the issues if
> that turns out to be the case.
>
> -Flavio
>
> On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote:
>
>> We are unable to reproduce this issue. If you can provide the server
>> logs (all servers) and attach them to the jira it would be very helpful.
>> Some detail on the approx time of the issue so we can correlate to the
>> logs would help too (summary of what you did/do to cause it, etc...
>> anything that might help us nail this one down).
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-335
>>
>> Some detail on ZK version, OS, Java version, HW info, etc... would also
>> be of use to us.
>>
>> Patrick
>>
>> On 06/16/2010 02:49 PM, Vishal K wrote:
>>>
>>> Hi,
>>>
>>> We are running into this bug very often (almost 60-75% hit rate) while
>>> testing our newly developed application over ZK. This is almost a blocker
>>> for us. Will the fix be simplified if backward compatibility was not an
>>> issue?
>>>
>>> Considering that this bug is rarely reported, I am wondering why we are
>>> running into this problem so often. Also, on a side note, I am curious
>>> why
>>> the systest that comes with ZooKeeper did not detect this bug. Can anyone
>>> please give an overview of the problem?
>>>
>>> Thanks.
>>> -Vishal
>>>
>>>
>>> On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors
>>>  wrote:
>>>
 Sure thing.

 We got paged this morning because backend services were not able to
 write
 to the database.  Each server discovers the DB master using zookeeper,
 so
 when zookeeper goes down, they assume they no longer know who the DB
 master
 is and stop working.

 When we realized there were no problems with the database, we logged in
 to
 the zookeeper nodes.  We weren't able to connect to zookeeper using
 zkCli.sh
 from any of the three nodes, so we decided to restart them all, starting
 with node one.  However, after restarting node one, the cluster started
 responding normally again.

 (The timestamps on the zookeeper processes on nodes two and three *are*
 dated today, but none of us restarted them.  We checked shell histories
 and
 sudo logs, and they seem to back us up.)

 We tried getting node one to come back up and join the cluster, but
 that's
 when we realized we weren't getting any logs, because log4j.properties
 was
 in the wrong location.  Sorry -- I REALLY wish I had those logs for you.
  We
 put log4j back in place, and that's when we saw the spew I pasted in my
 first message.

 I'll tack this on to ZK-335.



 On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:

> charity, do you mind going through your scenario again to give a
> timeline for the failure? i'm a bit confused as to what happened.
>
> ben
>
> On 06/02/2010 01:32 PM, Charity Majors wrote:
>>
>> Thanks.  That worked for me.  I'm a little confused about why it threw

 the entire cluster into an unusable state, though.
>>
>> I said before that we restarted all three nodes, but tracing back, we

 actually didn't.  The zookeeper cluster was refusing all connections
 until
 we restarted node one.  But once node one had been dropped from the
 cluster,
 the other two nodes formed a quorum and started responding to queries on
 their own.
>>
>> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so
>> thought

 I'd mention it.
>>
>>
>>
>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>>
>>
>>> Hi Charity, unfortunately this is a known issue not specific to 3.3

 that
>>>
>>> we are working to address. See this thread for some background:
>>>
>>>

 http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

Re: zookeeper crash

2010-06-16 Thread Flavio Junqueira
I would recommend opening a separate jira issue. I'm not convinced the
issues are the same, so I'd rather keep them separate and link the
issues if that turns out to be the case.


-Flavio

On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote:


We are unable to reproduce this issue. If you can provide the server
logs (all servers) and attach them to the jira it would be very  
helpful.

Some detail on the approx time of the issue so we can correlate to the
logs would help too (summary of what you did/do to cause it, etc...
anything that might help us nail this one down).

https://issues.apache.org/jira/browse/ZOOKEEPER-335

Some detail on ZK version, OS, Java version, HW info, etc... would  
also

be of use to us.

Patrick

On 06/16/2010 02:49 PM, Vishal K wrote:

Hi,

We are running into this bug very often (almost 60-75% hit rate)  
while
testing our newly developed application over ZK. This is almost a  
blocker
for us. Will the fix be simplified if backward compatibility was  
not an

issue?

Considering that this bug is rarely reported, I am wondering why we  
are
running into this problem so often. Also, on a side note, I am  
curious why
the systest that comes with ZooKeeper did not detect this bug. Can  
anyone

please give an overview of the problem?

Thanks.
-Vishal


On Wed, Jun 2, 2010 at 8:17 PM, Charity  
Majors  wrote:



Sure thing.

We got paged this morning because backend services were not able  
to write
to the database.  Each server discovers the DB master using  
zookeeper, so
when zookeeper goes down, they assume they no longer know who the  
DB master

is and stop working.

When we realized there were no problems with the database, we  
logged in to
the zookeeper nodes.  We weren't able to connect to zookeeper  
using zkCli.sh
from any of the three nodes, so we decided to restart them all,  
starting
with node one.  However, after restarting node one, the cluster  
started

responding normally again.

(The timestamps on the zookeeper processes on nodes two and three  
*are*
dated today, but none of us restarted them.  We checked shell  
histories and

sudo logs, and they seem to back us up.)

We tried getting node one to come back up and join the cluster,  
but that's
when we realized we weren't getting any logs, because  
log4j.properties was
in the wrong location.  Sorry -- I REALLY wish I had those logs  
for you.  We
put log4j back in place, and that's when we saw the spew I pasted  
in my

first message.

I'll tack this on to ZK-335.



On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:


charity, do you mind going through your scenario again to give a
timeline for the failure? i'm a bit confused as to what happened.

ben

On 06/02/2010 01:32 PM, Charity Majors wrote:
Thanks.  That worked for me.  I'm a little confused about why it  
threw

the entire cluster into an unusable state, though.


I said before that we restarted all three nodes, but tracing  
back, we
actually didn't.  The zookeeper cluster was refusing all  
connections until
we restarted node one.  But once node one had been dropped from  
the cluster,
the other two nodes formed a quorum and started responding to  
queries on

their own.


Is that expected as well?  I didn't see it in ZOOKEEPER-335, so  
thought

I'd mention it.




On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:


Hi Charity, unfortunately this is a known issue not specific to  
3.3

that

we are working to address. See this thread for some background:



http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html


I've raised the JIRA level to "blocker" to ensure we address  
this asap.


As Ted suggested you can remove the datadir -- only on the  
affected

server -- and then restart it. That should resolve the issue (the

server

will d/l a snapshot of the current db from the leader).

Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:

I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1,  
in an
attempt to get away from a client bug that was crashing my backend  
services.


Unfortunately, this morning I had a server crash, and it  
brought down
my entire cluster.  I don't have the logs leading up to the crash,  
because
-- argghffbuggle -- log4j wasn't set up correctly.  But I  
restarted all
three nodes, and nodes two and three came back up and formed a  
quorum.


Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO

 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING

2010-06-02 17:04:56,446 - INFO

 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot
/services/zookeeper/data/zookeeper/version-2/snapshot.a0045

2010-06-02 17:04:56,476 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New  
election.

My id =  1, Proposed zxid = 47244640287

2010-06-02 17:04:56,486 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] -  
Notification:

1, 47244640287, 4, 1, LOOKING, LOOKING, 1

2010-06-02 17:04:56,486 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelec

Re: zookeeper crash

2010-06-16 Thread Patrick Hunt
We are unable to reproduce this issue. If you can provide the server 
logs (all servers) and attach them to the jira it would be very helpful. 
Some detail on the approx time of the issue so we can correlate to the 
logs would help too (summary of what you did/do to cause it, etc... 
anything that might help us nail this one down).


https://issues.apache.org/jira/browse/ZOOKEEPER-335

Some detail on ZK version, OS, Java version, HW info, etc... would also 
be of use to us.


Patrick

On 06/16/2010 02:49 PM, Vishal K wrote:

Hi,

We are running into this bug very often (almost 60-75% hit rate) while
testing our newly developed application over ZK. This is almost a blocker
for us. Will the fix be simplified if backward compatibility was not an
issue?

Considering that this bug is rarely reported, I am wondering why we are
running into this problem so often. Also, on a side note, I am curious why
the systest that comes with ZooKeeper did not detect this bug. Can anyone
please give an overview of the problem?

Thanks.
-Vishal


On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors  wrote:


Sure thing.

We got paged this morning because backend services were not able to write
to the database.  Each server discovers the DB master using zookeeper, so
when zookeeper goes down, they assume they no longer know who the DB master
is and stop working.

When we realized there were no problems with the database, we logged in to
the zookeeper nodes.  We weren't able to connect to zookeeper using zkCli.sh
from any of the three nodes, so we decided to restart them all, starting
with node one.  However, after restarting node one, the cluster started
responding normally again.

(The timestamps on the zookeeper processes on nodes two and three *are*
dated today, but none of us restarted them.  We checked shell histories and
sudo logs, and they seem to back us up.)

We tried getting node one to come back up and join the cluster, but that's
when we realized we weren't getting any logs, because log4j.properties was
in the wrong location.  Sorry -- I REALLY wish I had those logs for you.  We
put log4j back in place, and that's when we saw the spew I pasted in my
first message.

I'll tack this on to ZK-335.



On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:


charity, do you mind going through your scenario again to give a
timeline for the failure? i'm a bit confused as to what happened.

ben

On 06/02/2010 01:32 PM, Charity Majors wrote:

Thanks.  That worked for me.  I'm a little confused about why it threw

the entire cluster into an unusable state, though.


I said before that we restarted all three nodes, but tracing back, we

actually didn't.  The zookeeper cluster was refusing all connections until
we restarted node one.  But once node one had been dropped from the cluster,
the other two nodes formed a quorum and started responding to queries on
their own.


Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought

I'd mention it.




On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:



Hi Charity, unfortunately this is a known issue not specific to 3.3

that

we are working to address. See this thread for some background:



http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html


I've raised the JIRA level to "blocker" to ensure we address this asap.

As Ted suggested you can remove the datadir -- only on the affected
server -- and then restart it. That should resolve the issue (the

server

will d/l a snapshot of the current db from the leader).

Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:


I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an

attempt to get away from a client bug that was crashing my backend services.


Unfortunately, this morning I had a server crash, and it brought down

my entire cluster.  I don't have the logs leading up to the crash, because
-- argghffbuggle -- log4j wasn't set up correctly.  But I restarted all
three nodes, and nodes two and three came back up and formed a quorum.


Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO

  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING

2010-06-02 17:04:56,446 - INFO

  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot
/services/zookeeper/data/zookeeper/version-2/snapshot.a0045

2010-06-02 17:04:56,476 - INFO

  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election.
My id =  1, Proposed zxid = 47244640287

2010-06-02 17:04:56,486 - INFO

  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification:
1, 47244640287, 4, 1, LOOKING, LOOKING, 1

2010-06-02 17:04:56,486 - INFO

  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
3, 38654707048, 3, 1, LOOKING, LEADING, 3

2010-06-02 17:04:56,486 - INFO

  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2

2010-06-02 17:04:56,486 - INFO

  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@6

Re: zookeeper crash

2010-06-16 Thread Vishal K
Hi,

We are running into this bug very often (almost 60-75% hit rate) while
testing our newly developed application over ZK. This is almost a blocker
for us. Will the fix be simplified if backward compatibility was not an
issue?

Considering that this bug is rarely reported, I am wondering why we are
running into this problem so often. Also, on a side note, I am curious why
the systest that comes with ZooKeeper did not detect this bug. Can anyone
please give an overview of the problem?

Thanks.
-Vishal


On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors  wrote:

> Sure thing.
>
> We got paged this morning because backend services were not able to write
> to the database.  Each server discovers the DB master using zookeeper, so
> when zookeeper goes down, they assume they no longer know who the DB master
> is and stop working.
>
> When we realized there were no problems with the database, we logged in to
> the zookeeper nodes.  We weren't able to connect to zookeeper using zkCli.sh
> from any of the three nodes, so we decided to restart them all, starting
> with node one.  However, after restarting node one, the cluster started
> responding normally again.
>
> (The timestamps on the zookeeper processes on nodes two and three *are*
> dated today, but none of us restarted them.  We checked shell histories and
> sudo logs, and they seem to back us up.)
>
> We tried getting node one to come back up and join the cluster, but that's
> when we realized we weren't getting any logs, because log4j.properties was
> in the wrong location.  Sorry -- I REALLY wish I had those logs for you.  We
> put log4j back in place, and that's when we saw the spew I pasted in my
> first message.
>
> I'll tack this on to ZK-335.
>
>
>
> On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:
>
> > charity, do you mind going through your scenario again to give a
> > timeline for the failure? i'm a bit confused as to what happened.
> >
> > ben
> >
> > On 06/02/2010 01:32 PM, Charity Majors wrote:
> >> Thanks.  That worked for me.  I'm a little confused about why it threw
> the entire cluster into an unusable state, though.
> >>
> >> I said before that we restarted all three nodes, but tracing back, we
> actually didn't.  The zookeeper cluster was refusing all connections until
> we restarted node one.  But once node one had been dropped from the cluster,
> the other two nodes formed a quorum and started responding to queries on
> their own.
> >>
> >> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought
> I'd mention it.
> >>
> >>
> >>
> >> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
> >>
> >>
> >>> Hi Charity, unfortunately this is a known issue not specific to 3.3
> that
> >>> we are working to address. See this thread for some background:
> >>>
> >>>
> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
> >>>
> >>> I've raised the JIRA level to "blocker" to ensure we address this asap.
> >>>
> >>> As Ted suggested you can remove the datadir -- only on the affected
> >>> server -- and then restart it. That should resolve the issue (the
> server
> >>> will d/l a snapshot of the current db from the leader).
> >>>
> >>> Patrick
> >>>
> >>> On 06/02/2010 11:11 AM, Charity Majors wrote:
> >>>
>  I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an
> attempt to get away from a client bug that was crashing my backend services.
> 
>  Unfortunately, this morning I had a server crash, and it brought down
> my entire cluster.  I don't have the logs leading up to the crash, because
> -- argghffbuggle -- log4j wasn't set up correctly.  But I restarted all
> three nodes, and nodes two and three came back up and formed a quorum.
> 
>  Node one, meanwhile, does this:
> 
>  2010-06-02 17:04:56,446 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
>  2010-06-02 17:04:56,446 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot
> /services/zookeeper/data/zookeeper/version-2/snapshot.a0045
>  2010-06-02 17:04:56,476 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election.
> My id =  1, Proposed zxid = 47244640287
>  2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification:
> 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
>  2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
> 3, 38654707048, 3, 1, LOOKING, LEADING, 3
>  2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
> 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>  2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
>  2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server
> with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir
> /services/zookeeper/data/

Re: zookeeper crash

2010-06-02 Thread Charity Majors
Sure thing.

We got paged this morning because backend services were not able to write to 
the database.  Each server discovers the DB master using zookeeper, so when 
zookeeper goes down, they assume they no longer know who the DB master is and 
stop working.
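
(For context, the discovery side looks roughly like the sketch below. This is
simplified and not our actual code -- the znode path "/db/master", the connect
string, and the timeout are invented for illustration -- but it shows why a
zookeeper outage stops the writers cold.)

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Simplified sketch of DB-master discovery via zookeeper (illustrative only).
public class DbMasterLookup implements Watcher {
    private final ZooKeeper zk;

    public DbMasterLookup(String connectString) throws Exception {
        // e.g. "zk1:2181,zk2:2181,zk3:2181" -- hypothetical connect string
        this.zk = new ZooKeeper(connectString, 30000, this);
    }

    // Read the current DB master from the znode, leaving a watch behind.
    public String currentMaster() throws Exception {
        byte[] data = zk.getData("/db/master", true, null);
        return new String(data, "UTF-8");
    }

    @Override
    public void process(WatchedEvent event) {
        // On session loss or a change to the znode, the backend stops writing
        // until it can re-read the master, so a zookeeper outage takes the
        // writers down with it.
    }
}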

When we realized there were no problems with the database, we logged in to the 
zookeeper nodes.  We weren't able to connect to zookeeper using zkCli.sh from 
any of the three nodes, so we decided to restart them all, starting with node 
one.  However, after restarting node one, the cluster started responding 
normally again. 

(The timestamps on the zookeeper processes on nodes two and three *are* dated 
today, but none of us restarted them.  We checked shell histories and sudo 
logs, and they seem to back us up.)

We tried getting node one to come back up and join the cluster, but that's when 
we realized we weren't getting any logs, because log4j.properties was in the 
wrong location.  Sorry -- I REALLY wish I had those logs for you.  We put log4j 
back in place, and that's when we saw the spew I pasted in my first message.

I'll tack this on to ZK-335.



On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:

> charity, do you mind going through your scenario again to give a 
> timeline for the failure? i'm a bit confused as to what happened.
> 
> ben
> 
> On 06/02/2010 01:32 PM, Charity Majors wrote:
>> Thanks.  That worked for me.  I'm a little confused about why it threw the 
>> entire cluster into an unusable state, though.
>> 
>> I said before that we restarted all three nodes, but tracing back, we 
>> actually didn't.  The zookeeper cluster was refusing all connections until 
>> we restarted node one.  But once node one had been dropped from the cluster, 
>> the other two nodes formed a quorum and started responding to queries on 
>> their own.
>> 
>> Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought I'd 
>> mention it.
>> 
>> 
>> 
>> On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:
>> 
>> 
>>> Hi Charity, unfortunately this is a known issue not specific to 3.3 that
>>> we are working to address. See this thread for some background:
>>> 
>>> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
>>> 
>>> I've raised the JIRA level to "blocker" to ensure we address this asap.
>>> 
>>> As Ted suggested you can remove the datadir -- only on the affected
>>> server -- and then restart it. That should resolve the issue (the server
>>> will d/l a snapshot of the current db from the leader).
>>> 
>>> Patrick
>>> 
>>> On 06/02/2010 11:11 AM, Charity Majors wrote:
>>> 
 I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an 
 attempt to get away from a client bug that was crashing my backend 
 services.
 
 Unfortunately, this morning I had a server crash, and it brought down my 
 entire cluster.  I don't have the logs leading up to the crash, because -- 
 argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three 
nodes, and nodes two and three came back up and formed a quorum.
 
 Node one, meanwhile, does this:
 
 2010-06-02 17:04:56,446 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
 2010-06-02 17:04:56,446 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot 
 /services/zookeeper/data/zookeeper/version-2/snapshot.a0045
 2010-06-02 17:04:56,476 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. 
 My id =  1, Proposed zxid = 47244640287
 2010-06-02 17:04:56,486 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 
 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
 2010-06-02 17:04:56,486 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 
 3, 38654707048, 3, 1, LOOKING, LEADING, 3
 2010-06-02 17:04:56,486 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 
 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
 2010-06-02 17:04:56,486 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
 2010-06-02 17:04:56,486 - INFO  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server 
 with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
 /services/zookeeper/data/zookeeper/version-2 snapdir 
 /services/zookeeper/data/zookeeper/version-2
 2010-06-02 17:04:56,486 - FATAL 
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less 
 than our epoch b
 2010-06-02 17:04:56,486 - WARN  
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following 
 the leader
 java.io.IOException: Error: Epoch of leader is lower
at 
 org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at 
 org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.

Re: zookeeper crash

2010-06-02 Thread Benjamin Reed
charity, do you mind going through your scenario again to give a 
timeline for the failure? i'm a bit confused as to what happened.


ben

On 06/02/2010 01:32 PM, Charity Majors wrote:

Thanks.  That worked for me.  I'm a little confused about why it threw the 
entire cluster into an unusable state, though.

I said before that we restarted all three nodes, but tracing back, we actually 
didn't.  The zookeeper cluster was refusing all connections until we restarted 
node one.  But once node one had been dropped from the cluster, the other two 
nodes formed a quorum and started responding to queries on their own.

Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought I'd 
mention it.



On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:

   

Hi Charity, unfortunately this is a known issue not specific to 3.3 that
we are working to address. See this thread for some background:

http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

I've raised the JIRA level to "blocker" to ensure we address this asap.

As Ted suggested you can remove the datadir -- only on the affected
server -- and then restart it. That should resolve the issue (the server
will d/l a snapshot of the current db from the leader).

Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:
 

I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to 
get away from a client bug that was crashing my backend services.

Unfortunately, this morning I had a server crash, and it brought down my entire 
cluster.  I don't have the logs leading up to the crash, because -- 
argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three 
nodes, and nodes two and three came back up and formed a quorum.

Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] 
- Reading snapshot 
/services/zookeeper/data/zookeeper/version-2/snapshot.a0045
2010-06-02 17:04:56,476 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id 
=  1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/services/zookeeper/data/zookeeper/version-2 snapdir 
/services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] 
- Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] 
- shutdown called
java.lang.Exception: shutdown Follower
at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)



All I can find is this, 
http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, 
which implies that this state should never happen.

Any suggestions?  If it happens again, I'll just have to roll everything back 
to 3.2.1 and live with the client crashes.




   
   




Re: zookeeper crash

2010-06-02 Thread Flavio Junqueira
Hi Charity, This is certainly not expected. It would be very useful if  
you could provide us with as much information about your issue as  
possible. I would suggest that either you create a new jira and link  
it to ZOOKEEPER-335, or that you add to 335 directly.


We'll be looking further into why you have seen this problem and  
working on a fix.


Cheers,
-Flavio

On Jun 2, 2010, at 10:32 PM, Charity Majors wrote:

Thanks.  That worked for me.  I'm a little confused about why it  
threw the entire cluster into an unusable state, though.


I said before that we restarted all three nodes, but tracing back,  
we actually didn't.  The zookeeper cluster was refusing all  
connections until we restarted node one.  But once node one had been  
dropped from the cluster, the other two nodes formed a quorum and  
started responding to queries on their own.


Is that expected as well?  I didn't see it in ZOOKEEPER-335, so  
thought I'd mention it.




On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:

Hi Charity, unfortunately this is a known issue not specific to 3.3  
that

we are working to address. See this thread for some background:

http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

I've raised the JIRA level to "blocker" to ensure we address this  
asap.


As Ted suggested you can remove the datadir -- only on the affected
server -- and then restart it. That should resolve the issue (the  
server

will d/l a snapshot of the current db from the leader).

Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:
I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in  
an attempt to get away from a client bug that was crashing my  
backend services.


Unfortunately, this morning I had a server crash, and it brought  
down my entire cluster.  I don't have the logs leading up to the  
crash, because -- argghffbuggle -- log4j wasn't set up correctly.   
But I restarted all three nodes, and nodes two and three came back  
up and formed a quorum.


Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /services/ 
zookeeper/data/zookeeper/version-2/snapshot.a0045
2010-06-02 17:04:56,476 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id  
=  1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1,  
47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3,  
38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3,  
38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with  
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4  
datadir /services/zookeeper/data/zookeeper/version-2 snapdir / 
services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than  
our epoch b
2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the  
leader

java.io.IOException: Error: Epoch of leader is lower
  at  
org 
.apache 
.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
  at  
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java: 
644)
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/ 
0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called

java.lang.Exception: shutdown Follower
  at  
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java: 
166)
  at  
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java: 
648)




All I can find is this, http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html 
, which implies that this state should never happen.


Any suggestions?  If it happens again, I'll just have to roll  
everything back to 3.2.1 and live with the client crashes.











Re: zookeeper crash

2010-06-02 Thread Charity Majors
Thanks.  That worked for me.  I'm a little confused about why it threw the 
entire cluster into an unusable state, though.

I said before that we restarted all three nodes, but tracing back, we actually 
didn't.  The zookeeper cluster was refusing all connections until we restarted 
node one.  But once node one had been dropped from the cluster, the other two 
nodes formed a quorum and started responding to queries on their own.

Is that expected as well?  I didn't see it in ZOOKEEPER-335, so thought I'd 
mention it.



On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:

> Hi Charity, unfortunately this is a known issue not specific to 3.3 that 
> we are working to address. See this thread for some background:
> 
> http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html
> 
> I've raised the JIRA level to "blocker" to ensure we address this asap.
> 
> As Ted suggested you can remove the datadir -- only on the affected 
> server -- and then restart it. That should resolve the issue (the server 
> will d/l a snapshot of the current db from the leader).
> 
> Patrick
> 
> On 06/02/2010 11:11 AM, Charity Majors wrote:
>> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt 
>> to get away from a client bug that was crashing my backend services.
>> 
>> Unfortunately, this morning I had a server crash, and it brought down my 
>> entire cluster.  I don't have the logs leading up to the crash, because -- 
>> argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three 
>> nodes, and nodes two and three came back up and formed a quorum.
>> 
>> Node one, meanwhile, does this:
>> 
>> 2010-06-02 17:04:56,446 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
>> 2010-06-02 17:04:56,446 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot 
>> /services/zookeeper/data/zookeeper/version-2/snapshot.a0045
>> 2010-06-02 17:04:56,476 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My 
>> id =  1, Proposed zxid = 47244640287
>> 2010-06-02 17:04:56,486 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
>> 47244640287, 4, 1, LOOKING, LOOKING, 1
>> 2010-06-02 17:04:56,486 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
>> 38654707048, 3, 1, LOOKING, LEADING, 3
>> 2010-06-02 17:04:56,486 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
>> 38654707048, 3, 1, LOOKING, FOLLOWING, 2
>> 2010-06-02 17:04:56,486 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
>> 2010-06-02 17:04:56,486 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
>> tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
>> /services/zookeeper/data/zookeeper/version-2 snapdir 
>> /services/zookeeper/data/zookeeper/version-2
>> 2010-06-02 17:04:56,486 - FATAL 
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less than 
>> our epoch b
>> 2010-06-02 17:04:56,486 - WARN  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following 
>> the leader
>> java.io.IOException: Error: Epoch of leader is lower
>>at 
>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>>at 
>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
>> 2010-06-02 17:04:56,486 - INFO  
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
>> java.lang.Exception: shutdown Follower
>>at 
>> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>at 
>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>> 
>> 
>> 
>> All I can find is this, 
>> http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html,
>>  which implies that this state should never happen.
>> 
>> Any suggestions?  If it happens again, I'll just have to roll everything 
>> back to 3.2.1 and live with the client crashes.
>> 
>> 
>> 
>> 



Re: zookeeper crash

2010-06-02 Thread Ted Dunning
I knew Patrick would remember to add an important detail.

On Wed, Jun 2, 2010 at 11:49 AM, Patrick Hunt  wrote:

> As Ted suggested you can remove the datadir -- *only on the affected
> server* -- and then restart it.


Re: zookeeper crash

2010-06-02 Thread Patrick Hunt
Hi Charity, unfortunately this is a known issue not specific to 3.3 that 
we are working to address. See this thread for some background:


http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

I've raised the JIRA level to "blocker" to ensure we address this asap.

As Ted suggested you can remove the datadir -- only on the affected 
server -- and then restart it. That should resolve the issue (the server 
will d/l a snapshot of the current db from the leader).
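
(If it helps, something along the lines of the sketch below -- stop the
affected server first, run it on that server only, then restart. This is just
an illustration, not an official tool; the path is the dataDir shown in the
log output in this thread, so adjust it for your layout.)

import java.io.File;

// Sketch: clear the snapshot/txn-log directory of the *affected* server so it
// pulls a fresh snapshot from the leader on restart. Do NOT run this on the
// healthy quorum members.
public class WipeDataDir {
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete();
    }

    public static void main(String[] args) {
        File dataDir = new File("/services/zookeeper/data/zookeeper/version-2");
        File[] contents = dataDir.listFiles();
        if (contents != null) {
            for (File f : contents) {
                deleteRecursively(f);   // remove snapshots and logs, keep the dir
            }
        }
    }
}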


Patrick

On 06/02/2010 11:11 AM, Charity Majors wrote:

I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to 
get away from a client bug that was crashing my backend services.

Unfortunately, this morning I had a server crash, and it brought down my entire 
cluster.  I don't have the logs leading up to the crash, because -- 
argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three 
nodes, and nodes two and three came back up and formed a quorum.

Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] 
- Reading snapshot 
/services/zookeeper/data/zookeeper/version-2/snapshot.a0045
2010-06-02 17:04:56,476 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id 
=  1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/services/zookeeper/data/zookeeper/version-2 snapdir 
/services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] 
- Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] 
- shutdown called
java.lang.Exception: shutdown Follower
at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)



All I can find is this, 
http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, 
which implies that this state should never happen.

Any suggestions?  If it happens again, I'll just have to roll everything back 
to 3.2.1 and live with the client crashes.






Re: zookeeper crash

2010-06-02 Thread Ted Dunning
This looks a bit like a small bobble we had when upgrading a bit ago.

I THINK that the answer here is to mind-wipe the misbehaving node and have
it resynch from scratch from the other nodes.

Wait for confirmation from somebody real.

On Wed, Jun 2, 2010 at 11:11 AM, Charity Majors wrote:

> I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an
> attempt to get away from a client bug that was crashing my backend services.
>
> Unfortunately, this morning I had a server crash, and it brought down my
> entire cluster.  I don't have the logs leading up to the crash, because --
> argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three
> nodes, and nodes two and three came back up and formed a quorum.
>
> Node one, meanwhile, does this:
>
> 2010-06-02 17:04:56,446 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
> 2010-06-02 17:04:56,446 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot
> /services/zookeeper/data/zookeeper/version-2/snapshot.a0045
> 2010-06-02 17:04:56,476 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election.
> My id =  1, Proposed zxid = 47244640287
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification:
> 1, 47244640287, 4, 1, LOOKING, LOOKING, 1
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
> 3, 38654707048, 3, 1, LOOKING, LEADING, 3
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification:
> 3, 38654707048, 3, 1, LOOKING, FOLLOWING, 2
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server
> with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir
> /services/zookeeper/data/zookeeper/version-2 snapdir
> /services/zookeeper/data/zookeeper/version-2
> 2010-06-02 17:04:56,486 - FATAL
> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch a is less
> than our epoch b
> 2010-06-02 17:04:56,486 - WARN
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following
> the leader
> java.io.IOException: Error: Epoch of leader is lower
>   at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
>   at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
> 2010-06-02 17:04:56,486 - INFO
>  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called
> java.lang.Exception: shutdown Follower
>   at
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>   at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
>
>
>
> All I can find is this,
> http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html,
> which implies that this state should never happen.
>
> Any suggestions?  If it happens again, I'll just have to roll everything
> back to 3.2.1 and live with the client crashes.
>
>
>
>
>


zookeeper crash

2010-06-02 Thread Charity Majors
I upgraded my zookeeper cluster last week from 3.2.1 to 3.3.1, in an attempt to 
get away from a client bug that was crashing my backend services.

Unfortunately, this morning I had a server crash, and it brought down my entire 
cluster.  I don't have the logs leading up to the crash, because -- 
argghffbuggle -- log4j wasn't set up correctly.  But I restarted all three 
nodes, and nodes two and three came back up and formed a quorum.  

Node one, meanwhile, does this:

2010-06-02 17:04:56,446 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-06-02 17:04:56,446 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] 
- Reading snapshot 
/services/zookeeper/data/zookeeper/version-2/snapshot.a0045
2010-06-02 17:04:56,476 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id 
=  1, Proposed zxid = 47244640287
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
47244640287, 4, 1, LOOKING, LOOKING, 1
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, LEADING, 3
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 
38654707048, 3, 1, LOOKING, FOLLOWING, 2
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-06-02 17:04:56,486 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/services/zookeeper/data/zookeeper/version-2 snapdir 
/services/zookeeper/data/zookeeper/version-2
2010-06-02 17:04:56,486 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch a is less than our epoch b
2010-06-02 17:04:56,486 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] 
- Exception when following the leader
java.io.IOException: Error: Epoch of leader is lower
   at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
   at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-06-02 17:04:56,486 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] 
- shutdown called
java.lang.Exception: shutdown Follower
   at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
   at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
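
(Side note for anyone reading along: the epochs in that FATAL line are just the
high 32 bits of the zxids from the election output above -- the arithmetic
below is my own sanity check, not taken from the zookeeper source.)

// The zxids logged during the election imply node one is at epoch 0xb while
// the node that won the election is at epoch 0x9; when it moves to 0xa for
// its new term, node one refuses to follow ("Leader epoch a is less than our
// epoch b").
public class ZxidEpoch {
    static long epochOf(long zxid) {
        return zxid >>> 32;   // the epoch lives in the high 32 bits of a zxid
    }

    public static void main(String[] args) {
        long nodeOneZxid = 47244640287L;  // "Proposed zxid" from node one
        long leaderZxid  = 38654707048L;  // zxid in the LEADING/FOLLOWING notifications
        System.out.printf("node one epoch = %x%n", epochOf(nodeOneZxid));  // prints b
        System.out.printf("leader epoch   = %x%n", epochOf(leaderZxid));   // prints 9
    }
}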



All I can find is this, 
http://www.mail-archive.com/zookeeper-comm...@hadoop.apache.org/msg00449.html, 
which implies that this state should never happen.

Any suggestions?  If it happens again, I'll just have to roll everything back 
to 3.2.1 and live with the client crashes.