Re: Recovery issue - how to debug?

Patrick Hunt Mon, 19 Apr 2010 12:11:48 -0700


On 04/19/2010 11:55 AM, Travis Crawford wrote:

To double-check, is the best way to tell a ZK instance is up-to-date
by looking at its ``LastZxid`` value? For example:


$ java -jar /home/travis/cmdline-jmxclient-0.10.5.jar - localhost:8081
org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Follower,name3=InMemoryDataTree
LastZxid
04/19/2010 18:42:45 +0000 org.archive.jmx.Client LastZxid: 0xf000420ad

I believe the ``LastZxid`` for each ZK instance needs to be compared
to the leader to see how far behind it is.

Well the server will only be "active" once it joins the quorum (usually as a follower) so if it's having trouble joining that data might not be available. But yes, once the server is active then you could examine the lastzxid to determine if/how much it's lagging the leader (quorum).
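
As a rough illustration of that comparison, here is a minimal Python sketch. It assumes the usual zxid layout (a 64-bit value whose high 32 bits are the leader epoch and whose low 32 bits are a per-epoch counter). The function names are made up for this example, and the leader value is hypothetical; only 0xf000420ad comes from the JMX output quoted above.

```python
def split_zxid(zxid_hex):
    """Split a hex zxid string such as '0xf000420ad' into (epoch, counter)."""
    zxid = int(zxid_hex, 16)
    # Assumption: high 32 bits = epoch, low 32 bits = counter within the epoch.
    return zxid >> 32, zxid & 0xFFFFFFFF


def lag_behind(leader_hex, follower_hex):
    """Return how many transactions the follower trails the leader by,
    or None when the epochs differ and the counters are not comparable."""
    leader_epoch, leader_counter = split_zxid(leader_hex)
    follower_epoch, follower_counter = split_zxid(follower_hex)
    if leader_epoch != follower_epoch:
        return None
    return leader_counter - follower_counter


follower = "0xf000420ad"  # LastZxid reported by the follower above
leader = "0xf000420b0"    # hypothetical leader value, for illustration only

print(split_zxid(follower))
print(lag_behind(leader, follower))
```

A result of None here only means the two servers report different epochs; deciding what that implies for a given deployment still requires looking at the server logs.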



It would be a lot easier from the operations perspective if the leader
explicitly published some health stats:

(a) Count of instances in the ensemble.
(b) Count of up-to-date instances in the ensemble.

This would greatly simplify monitoring & alerting - when an instance
falls behind one could configure their monitoring system to let
someone know and take a look at the logs.

That's a great idea. Please enter a JIRA for this - a new 4 letter word and JMX support. It would also be a great starter project for someone interested in becoming more familiar with the server code.


Patrick


--travis




On Mon, Apr 19, 2010 at 10:14 AM, Patrick Hunt <ph...@apache.org> wrote:

Usually the server logs will shed light on such issues. If we had access to
them it might be easier to speculate.

Patrick

On 04/19/2010 09:22 AM, Mahadev Konar wrote:


Hi Hao,
   As Vishal already asked, how are you determining if the writes are being
received?
  Also, what was the status of C2 when you checked for these writes? Do you
have the output of echo "stat" | nc localhost port?

How long did you wait when you say that C2 did not receive the writes? What
was the status of C2 (again echo "stat" | nc localhost port) when you saw
that C2 had received the writes?
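
For anyone who would rather script that check than use nc, a minimal Python sketch follows. The helper names are hypothetical, and the sample text is an assumed, abbreviated shape of the "stat" reply (a "Mode:" line and a "Zxid:" line), not captured output; verify it against what your own server prints.

```python
import socket


def four_letter_word(host, port, cmd="stat", timeout=5.0):
    """Send a four-letter command to a ZooKeeper client port and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_fields(text):
    """Collect 'Key: value' lines from the reply into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Assumed, abbreviated reply shape; not real output from this thread.
sample = "Mode: follower\nZxid: 0xf000420ad\n"
fields = parse_fields(sample)
print(fields["Mode"], fields["Zxid"])
```

Against a live server this might be called as parse_fields(four_letter_word("localhost", port)), with port set to the server's client port, which the thread does not state.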

Thanks
mahadev


On 4/18/10 10:54 PM, "Dr Hao He" <h...@softtouchit.com> wrote:

I have zookeeper cluster E1 with 3 nodes A,B, and C.

I stopped C and did some writes on E1.  Both A and B received the writes.  I
then started C and after a short while, C also received the writes.

All seemed to go well so I replicated the setup to another cluster E2 with
exactly 3 nodes: A2, B2, and C2.

I stopped C2 and did some writes on E2.  A2 received the writes.  I then
started C2.  However, no matter how long I wait, C2 never received the
writes.

I then did more writes on E2.  After that, C2 received all the writes,
including the old writes made while it was down.

How do I find out what was wrong with the E2 setup?

I am running 3.2.2 on all nodes.

Regards,

Dr Hao He

XPE - the truly SOA platform

h...@softtouchit.com
http://softtouchit.com
