Re: How to handle Node does not exist error?

2010-08-16 Thread Patrick Hunt
Try using the logs, the stat command, or JMX to verify that each ZK server is 
indeed a leader or follower as expected. You should have one leader and n-1 
followers. Verify that you don't have any standalone servers. This is 
the most frequent error I see: a server is misconfigured such that 
it thinks it's a standalone server. I often see a user with 3 
standalone servers which they think form a single quorum; the 
servers will therefore all be inconsistent with each other.


Patrick
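
The check described above can be sketched in code. This is only an illustration, assuming each server's stat reply contains a line of the form "Mode: leader", "Mode: follower" or "Mode: standalone"; the class and method names (EnsembleModeCheck, parseMode, looksLikeOneQuorum) are hypothetical and not part of ZooKeeper.

```java
import java.util.List;

// Sketch: interpret the "Mode:" line of each server's stat reply and check
// that the ensemble has exactly one leader and n-1 followers.
class EnsembleModeCheck {

    // Returns the value of the "Mode:" line, or "unknown" if no such line exists.
    static String parseMode(String statOutput) {
        for (String line : statOutput.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Mode:")) {
                return trimmed.substring("Mode:".length()).trim();
            }
        }
        return "unknown";
    }

    // True only when one server reports "leader" and all others report "follower".
    // Any "standalone" or "unknown" entry makes the check fail.
    static boolean looksLikeOneQuorum(List<String> modes) {
        int leaders = 0;
        int followers = 0;
        for (String mode : modes) {
            if (mode.equals("leader")) {
                leaders++;
            } else if (mode.equals("follower")) {
                followers++;
            }
        }
        return leaders == 1 && followers == modes.size() - 1;
    }
}
```

Three servers that each report "standalone" fail this check, which is the misconfiguration described above.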

On 08/12/2010 05:42 PM, Ted Dunning wrote:

On Thu, Aug 12, 2010 at 4:57 PM, Dr Hao He h...@softtouchit.com wrote:


hi, Ted,

I am a little bit confused here.  So, is the node inconsistency problem
that Vishal and I have seen here most likely caused by configurations or
embedding?

If it is the former, I'd appreciate it if you could point out where those silly
mistakes have been made and the correct way to embed ZK.



I think it is likely due to misconfiguration, but I don't know what the
issue is exactly.  I think that another poster suggested that you ape the
normal ZK startup process more closely.  That sounds good but it may be
incompatible with your goals of integrating all configuration into a single
XML file and not using the normal ZK configuration process.

Your thought about forking ZK is a good one since there are calls to
System.exit() that could wreak havoc.
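
One way to picture the forking idea is to start the server in a child JVM, so that a System.exit() inside ZooKeeper ends only the child process. The sketch below is an illustration under assumptions: the classpath and config file path are supplied by the caller, and the ForkedZooKeeper class is hypothetical. The only ZooKeeper name it relies on is the standard server entry point, org.apache.zookeeper.server.quorum.QuorumPeerMain, which takes the path to a configuration file.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Sketch: run a ZooKeeper server in a child JVM so that a System.exit()
// inside the server cannot terminate the host application.
class ForkedZooKeeper {

    // Builds the command line. Both arguments are caller-supplied assumptions:
    // a classpath containing the ZooKeeper jars and the path to a config file.
    static List<String> buildCommand(String classpath, String configPath) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        return Arrays.asList(javaBin, "-cp", classpath,
                "org.apache.zookeeper.server.quorum.QuorumPeerMain", configPath);
    }

    // Starts the child process and forwards its output to the parent's stdout/stderr.
    static Process start(String classpath, String configPath) throws IOException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(classpath, configPath));
        builder.inheritIO();
        return builder.start();
    }
}
```

The parent can call waitFor() on the returned Process to notice when the server exits and decide whether to restart it.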




Although I agree with your comments about the architectural issues that
embedding may lead to and we are aware of those,  I do not agree that
embedding will always lead to those issues.



I agree that embedding won't always lead to those issues and your
application is a reasonable counter-example.  As is common, I think that the
exception proves the rule since your system is really just another way to
launch an independent ZK cluster rather than an example of ZK being embedded
into an application.



Re: How to handle Node does not exist error?

2010-08-16 Thread Vishal K
In my case, I am pretty sure that the configuration was right. I will
reproduce it and post more info later. Thanks.





Re: How to handle Node does not exist error?

2010-08-12 Thread Ted Dunning
It doesn't.

But running a ZK cluster that is incorrectly configured can cause this
problem and configuring ZK using setters is likely to be subject to changes
in what configuration is needed.  Thus, your style of code is more subject
to decay over time than is nice.

The rest of my comments detail *other* reasons why embedding a coordination
layer in the code being coordinated is a bad idea.

On Thu, Aug 12, 2010 at 6:33 AM, Vishal K vishalm...@gmail.com wrote:

 Hi Ted,

 Can you explain why running ZK in embedded mode can cause znode
 inconsistencies?
 Thanks.

 -Vishal


Re: How to handle Node does not exist error?

2010-08-12 Thread Vishal K
Hi,

I don't intend to hijack Dr. Hao's email thread here, but I would like to
point out two things:

1. I use an embedded server as well, but I don't use any setters. We extend
QuorumPeerMain and call its initializeAndRun() function, so we are doing pretty
much the same thing that QuorumPeerMain does. However, note that I am
seeing the same problem (in ZK 3.3.0) as Dr Hao. I haven't
debugged the cause yet. I assumed that this was my implementation error (and
it could still be). Nevertheless, this could turn out to be a bug as well.

2. With respect to Ted's point about backward compatibility, I would suggest
providing an API to support embedded ZK instead of
asking users not to embed ZK.

-Vishal


Re: How to handle Node does not exist error?

2010-08-12 Thread Ted Dunning
I am not saying that the API shouldn't support embedded ZK.

I am just saying that it is almost always a bad idea.  It isn't that I am
asking you to not do it, it is just that I am describing the experience I
have had and that I have seen others have.  In a nutshell, embedding leads
to problems and it isn't hard to see why.

On Thu, Aug 12, 2010 at 3:02 PM, Vishal K vishalm...@gmail.com wrote:

 2. With respect to Ted's point about backward compatibility, I would
 suggest
 to take an approach of having an API to support embedded ZK instead of
 asking users to not embed ZK.



Re: How to handle Node does not exist error?

2010-08-12 Thread Benjamin Reed
i thought there was a jira about supporting embedded zookeeper. (i 
remember rejecting a patch to fix it. one of the problems is that we 
have a couple of places that do System.exit().) i can't seem to find it 
though.


one case that would be great for embedding is writing test cases, so i 
think it would be useful for that.


ben





How to handle Node does not exist error?

2010-08-11 Thread Dr Hao He
hi, All,

I have a 3-host cluster running ZooKeeper 3.2.2.  On one of the hosts, there 
are a number of nodes that I can get and ls using zkCli.sh.  However, when 
I try to delete any of them, I get a "Node does not exist" error.  Those 
nodes do not exist on the other two hosts. 

Any idea how we should handle this type of error and what might have caused 
this problem?

Dr Hao He

XPE - the truly SOA platform

h...@softtouchit.com
http://softtouchit.com
http://itunes.com/apps/Scanmobile



Re: How to handle Node does not exist error?

2010-08-11 Thread Dr Hao He
hi, Ted,

Thanks for the reply.  Here is what I did:

[zk: localhost:2181(CONNECTED) 0] ls /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002948
[]
[zk: localhost:2181(CONNECTED) 1] ls /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs
[msg002807, msg002700, msg002701, msg002804, msg002704, 
msg002706, msg002601, msg001849, msg001847, msg002508, 
msg002609, msg001841, msg002607, msg002606, msg002604, 
msg002809, msg002817, msg001633, msg002812, msg002814, 
msg002711, msg002815, msg002713, msg002716, msg001772, 
msg002811, msg001635, msg001774, msg002515, msg002610, 
msg001838, msg002517, msg002612, msg002519, msg001973, 
msg001835, msg001974, msg002619, msg001831, msg002510, 
msg002512, msg002615, msg002614, msg002617, msg002104, 
msg002106, msg001769, msg001768, msg002828, msg002822, 
msg001760, msg002820, msg001963, msg001961, msg002110, 
msg002118, msg002900, msg002836, msg001757, msg002907, 
msg001753, msg001752, msg001755, msg001952, msg001958, 
msg001852, msg001956, msg001854, msg002749, msg001608, 
msg001609, msg002747, msg002882, msg001743, msg002888, 
msg001605, msg002885, msg001487, msg001746, msg002330, 
msg001749, msg001488, msg001489, msg001881, msg001491, 
msg002890, msg001889, msg002758, msg002241, msg002892, 
msg002852, msg002759, msg002898, msg002850, msg001733, 
msg002751, msg001739, msg002753, msg002756, msg002332, 
msg001872, msg002233, msg001721, msg001627, msg001720, 
msg001625, msg001628, msg001629, msg001729, msg002350, 
msg001727, msg002352, msg001622, msg001726, msg001623, 
msg001723, msg001724, msg001621, msg002736, msg002738, 
msg002363, msg001717, msg002878, msg002362, msg002361, 
msg001611, msg001894, msg002357, msg002218, msg002358, 
msg002355, msg001895, msg002356, msg001898, msg002354, 
msg001996, msg001990, msg002093, msg002880, msg002576, 
msg002579, msg002267, msg002266, msg002366, msg001901, 
msg002365, msg001903, msg001799, msg001906, msg002368, 
msg001597, msg002679, msg002166, msg001595, msg002481, 
msg002482, msg002373, msg002374, msg002371, msg001599, 
msg002773, msg002274, msg002275, msg002270, msg002583, 
msg002271, msg002580, msg002067, msg002277, msg002278, 
msg002376, msg002180, msg002467, msg002378, msg002182, 
msg002377, msg002184, msg002379, msg002187, msg002186, 
msg002665, msg002666, msg002381, msg002382, msg002661, 
msg002662, msg002663, msg002385, msg002284, msg002766, 
msg002282, msg002190, msg002599, msg002054, msg002596, 
msg002453, msg002459, msg002457, msg002456, msg002191, 
msg002652, msg002395, msg002650, msg002656, msg002655, 
msg002189, msg002047, msg002658, msg002659, msg002796, 
msg002250, msg002255, msg002589, msg002257, msg002061, 
msg002064, msg002585, msg002258, msg002587, msg002444, 
msg002446, msg002447, msg002450, msg002646, msg001501, 
msg002591, msg002592, msg001503, msg001506, msg002260, 
msg002594, msg002262, msg002263, msg002264, msg002590, 
msg002132, msg002130, msg002530, msg002931, msg001559, 
msg001808, msg002024, msg001553, msg002939, msg002937, 
msg001556, msg002935, msg002933, msg002140, msg001937, 
msg002143, msg002520, msg002522, msg002429, msg002524, 
msg002920, msg002035, msg001561, msg002134, msg002138, 
msg002925, msg002151, msg002287, msg002555, msg002010, 
msg002002, msg002290, msg001537, msg002005, msg002147, 
msg002145, msg002698, msg001592, msg001810, msg002690, 
msg002691, msg001911, msg001910, msg002693, msg001812, 
msg001817, msg001547, msg002012, msg002015, msg002941, 
msg001688, msg002018, msg002684, msg002944, msg001540, 
msg002686, msg001541, msg002946, msg002688, msg001584, 
msg002948]

[zk: localhost:2181(CONNECTED) 7] delete /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002948
Node does not exist: /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002948

When I 

Re: How to handle Node does not exist error?

2010-08-11 Thread Ted Dunning
What do your nodes have in their logs during startup?  Are you sure
you have them configured correctly?  Are the files ephemeral?  Could
they have disappeared on their own?




Re: How to handle Node does not exist error?

2010-08-11 Thread Mahadev Konar
Hi Dr Hao,
  Can you please post the configuration of all 3 ZooKeeper servers? I
suspect the cluster might be misconfigured and the servers might not belong
to the same ensemble.

Just to be clear:
/xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg002807

and other such nodes exist on one of the ZooKeeper servers, and the same nodes
do not exist on the other servers?

Also, as Ted pointed out, can you please post the output of echo "stat" | nc
localhost 2181 (on all 3 servers) to the list?

Thanks
mahadev
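
Where nc is not available, the same request can be made from Java. The following is a minimal sketch, assuming the server answers a four-letter command such as stat on its client port and then closes the connection; the FourLetterWord class, its method names and the timeout parameter are hypothetical helpers, not ZooKeeper API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch: send a four-letter command (for example "stat") to a ZooKeeper
// client port and return the reply, similar to: echo "stat" | nc localhost 2181
class FourLetterWord {

    // Validates and encodes the command; rejects anything that is not four lowercase letters.
    static byte[] encodeCommand(String command) {
        if (command == null || !command.matches("[a-z]{4}")) {
            throw new IllegalArgumentException("expected a four-letter command such as \"stat\"");
        }
        return command.getBytes(StandardCharsets.US_ASCII);
    }

    // Reads until the peer closes the connection.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Connects, sends the command, and returns whatever the server writes back.
    static String send(String host, int port, String command, int timeoutMillis)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            OutputStream out = socket.getOutputStream();
            out.write(encodeCommand(command));
            out.flush();
            return readAll(socket.getInputStream());
        }
    }
}
```

Calling FourLetterWord.send("localhost", 2181, "stat", 3000) on each of the 3 hosts would collect the replies requested above.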




Re: How to handle Node does not exist error?

2010-08-11 Thread Ted Dunning
Try running the server in non-embedded mode.

Also, you are assuming that you know everything about how to configure the
quorumPeer.  That is going to change and your code will break at that time.
If you use a non-embedded cluster, this won't be a problem and you will be
able to upgrade the ZK version without having to restart your service.

My own opinion is that running an embedded ZK is a serious architectural
error.  Since I don't know your particular situation, it might be different,
but there is an inherent contradiction involved in running a coordination
layer as part of the thing being coordinated.  Whatever your software does,
it isn't what ZK does.  As such, it is better to factor out the ZK
functionality and make it completely stable.  That gives you a much simpler
world and will make it easier for you to troubleshoot your system.  The
simple fact that you can't take down your service without affecting the
reliability of your ZK layer makes this a very bad idea.

The problems you are having now are only a preview of what this
architectural error leads to.  There will be more problems and many of them
are likely to be more subtle and lead to service interruptions and lots of
wasted time.

On Wed, Aug 11, 2010 at 8:49 PM, Dr Hao He h...@softtouchit.com wrote:

 hi, Ted and Mahadev,


 Here are some more details about my setup:

 I run zookeeper in the embedded mode with the following code:

    quorumPeer = new QuorumPeer();
    quorumPeer.setClientPort(getClientPort());
    quorumPeer.setTxnFactory(new FileTxnSnapLog(
            new File(getDataLogDir()), new File(getDataDir())));
    quorumPeer.setQuorumPeers(getServers());
    quorumPeer.setElectionType(getElectionAlg());
    quorumPeer.setMyid(getServerId());
    quorumPeer.setTickTime(getTickTime());
    quorumPeer.setInitLimit(getInitLimit());
    quorumPeer.setSyncLimit(getSyncLimit());
    quorumPeer.setQuorumVerifier(getQuorumVerifier());
    quorumPeer.setCnxnFactory(cnxnFactory);
    quorumPeer.start();


 The configuration values are read from the following XML document for
 server 1:

 <cluster tickTime="1000" initLimit="10" syncLimit="5" clientPort="2181"
          serverId="1">
   <member id="1" host="192.168.2.6:2888:3888"/>
   <member id="2" host="192.168.2.3:2888:3888"/>
   <member id="3" host="192.168.2.4:2888:3888"/>
 </cluster>


 The other servers have the same configurations except their ids being
 changed to 2 and 3.
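
For comparison with the normal ZK configuration process, the same values can be written out in the standard properties format that QuorumPeerMain reads. The sketch below is an illustration only: the ZooCfgRenderer class is hypothetical, and the dataDir value is a placeholder that does not come from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: render the cluster settings as a standard ZooKeeper properties file.
class ZooCfgRenderer {

    // members maps server id -> "host:quorumPort:electionPort".
    static String render(int tickTime, int initLimit, int syncLimit, int clientPort,
                         String dataDir, Map<Integer, String> members) {
        StringBuilder cfg = new StringBuilder();
        cfg.append("tickTime=").append(tickTime).append('\n');
        cfg.append("initLimit=").append(initLimit).append('\n');
        cfg.append("syncLimit=").append(syncLimit).append('\n');
        cfg.append("clientPort=").append(clientPort).append('\n');
        cfg.append("dataDir=").append(dataDir).append('\n');
        for (Map.Entry<Integer, String> member : members.entrySet()) {
            cfg.append("server.").append(member.getKey())
               .append('=').append(member.getValue()).append('\n');
        }
        return cfg.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> members = new LinkedHashMap<>();
        members.put(1, "192.168.2.6:2888:3888");
        members.put(2, "192.168.2.3:2888:3888");
        members.put(3, "192.168.2.4:2888:3888");
        // "/var/zookeeper" is a placeholder path, not a value from the thread.
        System.out.print(render(1000, 10, 5, 2181, "/var/zookeeper", members));
    }
}
```

With the standard startup, the server id is not part of this file. Each host instead has a file named myid in its dataDir containing its own id (1, 2 or 3), which plays the role of serverId in the XML above.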

 The error occurred on server 3 when I batch loaded some messages to server
 1.  However, this error does not always happen.  I am not sure exactly what
 triggered this error yet.

 I also performed the stat operation on one of the "Node does not exist"
 nodes and got:

 stat /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg001583
 Exception in thread "main" java.lang.NullPointerException
     at org.apache.zookeeper.ZooKeeperMain.printStat(ZooKeeperMain.java:129)
     at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:715)
     at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:579)
     at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:351)
     at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:309)
     at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:268)
 [...@t43 zookeeper-3.2.2]$ bin/zkCli.sh


 Those message nodes are created as CreateMode.PERSISTENT_SEQUENTIAL and are
 deleted by the last server that has read them.

 If I remove the troubled server's zookeeper log directory and restart the
 server, then everything is ok.

 I will try to get the nc result next time I see this problem.


 Dr Hao He

 XPE - the truly SOA platform

 h...@softtouchit.com
 http://softtouchit.com
 http://itunes.com/apps/Scanmobile
