RE: How do we find the Server the client is connected to?
Failover testing.

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Thursday, October 01, 2009 3:44 PM
To: zookeeper-user@hadoop.apache.org; Rob Baccus
Subject: Re: How do we find the Server the client is connected to?

That detail is purposefully not exposed through the client API; however, it is output to the log on connection establishment. Why would your client code need to know which server in the ensemble it is connected to?

Patrick

Rob Baccus wrote:
How do I determine the server the client is connected to? It is not exposed, as far as I can see, in either the ZooKeeper object or the ClientCnxn object. I did find, on line 790 in the ClientCnxn.StartConnect() method, the place where the actual server connection happens, but that is not exposed.

Rob Baccus 425-201-3812
RE: ACL question w/ Zookeeper 3.1.1
= false eventOfDeath = {java.lang.obj...@1392} lastZxid = 1 xid = 3 response = {org.apache.zookeeper.proto.createrespo...@1365}\n r = {org.apache.zookeeper.proto.replyhea...@1445}0,0,-112\n request = {org.apache.zookeeper.proto.createrequ...@1360}'/ACLTest,,v{s{31,s{'aut h,'}}},0\n path = {java.lang.str...@1314}/ACLTest data = {byte...@1339} acl = {java.util.arrayl...@1242} size = 1 flags = 0 path = {java.lang.str...@1314}/ACLTest h = {org.apache.zookeeper.proto.requesthea...@1352}2,1\n cnxn = {org.apache.zookeeper.clientc...@1381}sessionId: 0x123de5b3b1b\nlastZxid: 1\nxid: 3\nnextAddrToTry: 0\nserverAddrs: /127.0.0.1:2181\n -- v5 NOTE: If I use Ids.OPEN_ACL_UNSAFE, then everything works fine. Here's an example of the debug state after a create()... -- this = {org.apache.zookeeper.zookee...@1266} watchManager = {org.apache.zookeeper.zookeeper$zkwatchmana...@1397} state = {org.apache.zookeeper.zookeeper$sta...@1398}CONNECTED cnxn = {org.apache.zookeeper.clientc...@1374}sessionId: 0x123de6ba8de\nlastZxid: 2\nxid: 3\nnextAddrToTry: 0\nserverAddrs: /127.0.0.1:2181\n serverAddrs = {java.util.arrayl...@1403} size = 1 authInfo = {java.util.arrayl...@1404} size = 1 [0] = {org.apache.zookeeper.clientcnxn$authd...@1415} scheme = {java.lang.str...@1244}digest data = {byte[...@1416} pendingQueue = {java.util.linkedl...@1405} size = 0 outgoingQueue = {java.util.linkedl...@1406} size = 0 nextAddrToTry = 0 connectTimeout = 4 readTimeout = 2 sessionTimeout = 5 zooKeeper = {org.apache.zookeeper.zookee...@1266} watcher = {org.apache.zookeeper.zookeeper$zkwatchmana...@1397} sessionId = 82153772198789120 sessionPasswd = {byte[...@1407} sendThread = {org.apache.zookeeper.clientcnxn$sendthr...@1259}Thread[main-SendThread ,5,main] eventThread = {org.apache.zookeeper.clientcnxn$eventthr...@1265}Thread[main-EventThre ad,5,main] selector = {sun.nio.ch.epollselectori...@1408} closing = false eventOfDeath = {java.lang.obj...@1409} lastZxid = 2 xid = 3 response = 
{org.apache.zookeeper.proto.createrespo...@1360}'/ACLTest\n r = {org.apache.zookeeper.proto.replyhea...@1389}2,2,0\n xid = 2 zxid = 2 err = 0 request = {org.apache.zookeeper.proto.createrequ...@1355}'/ACLTest,,v{s{15,s{'wor ld,'anyone}}},0\n path = {java.lang.str...@1314}/ACLTest h = {org.apache.zookeeper.proto.requesthea...@1347}2,1\n cnxn = {org.apache.zookeeper.clientc...@1374}sessionId: 0x123de6ba8de\nlastZxid: 2\nxid: 3\nnextAddrToTry: 0\nserverAddrs: /127.0.0.1:2181\n -Original Message- From: Todd Greenwood [mailto:to...@audiencescience.com] Sent: Friday, September 18, 2009 11:27 AM To: Patrick Hunt; zookeeper-...@hadoop.apache.org; zookeeper- u...@hadoop.apache.org Subject: RE: ACL question w/ Zookeeper 3.1.1 Patrick / Mahadev, Thanks for the heads-up! Apparently I *am* receiving email from zookeeper-user but it is being filtered out as spam. This just started happening, but I'll rectify on my end. I'm working thru Mahadev's response and will respond shortly (and search for other postings, as well). Appologies for the cross post. -Todd -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Friday, September 18, 2009 11:19 AM To: zookeeper-...@hadoop.apache.org; zookeeper-user@hadoop.apache.org Cc: Todd Greenwood Subject: Re: ACL question w/ Zookeeper 3.1.1 Todd, there were other responses as well. Are you seeing other traffic from the lists? (perhaps a spam filtering issue?) Patrick Mahadev Konar wrote: HI todd, We did respond on zookeeper-user. Here is my response in case you didn't see it... HI todd, From what I understand, you are sayin that a creator_all_acl does not work with auth? I tried the following with CREATOR_ALL_ACL and it seemed to work for me... 

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.ACL;
    import org.apache.zookeeper.ZooDefs.Ids;

    import java.util.ArrayList;
    import java.util.List;

    public class TestACl implements Watcher {
        public static void main(String[] argv) throws Exception {
            // Copy the predefined "creator gets all permissions" ACL.
            List<ACL> acls = new ArrayList<ACL>(1);
            String authentication_type = "digest";
            String authentication = "mahadev:some";
            for (ACL ids_acl : Ids.CREATOR_ALL_ACL) {
                acls.add(ids_acl);
            }

            TestACl tacl = new TestACl();
            ZooKeeper zoo = new ZooKeeper("localhost:2181", 3000, tacl);
            // Credentials are added before the create, so the node is owned by this digest id.
            zoo.addAuthInfo(authentication_type, authentication.getBytes());
            zoo.create("/some", new byte[0], acls, CreateMode.PERSISTENT);
            zoo.setData("/some", new byte[0], -1);
        }

        @Override
        public void process(WatchedEvent event) {
        }
    }

And it worked
RE: ACL question w/ Zookeeper 3.1.1
Patrick,

In v3/4, I am using Ids.CREATOR_ALL_ACL. In v5, Ids.OPEN_ACL_UNSAFE. In all cases, ACLs are specified and authentication credentials have been added to the zookeeper instance.

-- CODE ---

    // v3/4
    //for ( ACL ids_acl : Ids.CREATOR_ALL_ACL )
    //{
    //    acl.add( ids_acl );
    //}

    // v5
    for ( ACL ids_acl : Ids.OPEN_ACL_UNSAFE ) {
        acl.add( ids_acl );
    }

    // all cases (v3,4,5) have authentication credentials set
    zoo = new ZooKeeper( connection_string, connectiontimeout, this );
    zoo.addAuthInfo( authentication_type, authentication.getBytes() );

    // all cases (v3,4,5) use the acl defined above
    zoo.create( normPath(path), new byte[0], acl, mode );

I'll investigate further and log a bug if I can isolate this.

-Todd

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Monday, September 21, 2009 4:32 PM
To: zookeeper-user@hadoop.apache.org; Todd Greenwood
Cc: Patrick Hunt
Subject: Re: ACL question w/ Zookeeper 3.1.1

Todd Greenwood wrote:
Patrick, Thanks, I'll spend some more time trying to create a more concise repro, and log a bug once I do. The only reason I posted this mash was to see if the replyHeader error, 0,0,-112, made sense of the ACL exception. The rest is just context...and clearly too much of that :o). I don't see a difference between v3 and v4...The only differences that I can see are between v4 and v5 (v4 fails and v5 succeeds):

I did see this diff btw 3/4, 3 has this:

    request = {org.apache.zookeeper.proto.createrequ...@1360}'/ACLTest,,v{},0\n

you don't have any acl specified for the node create, or is this supposed to be a working example w/o auth? (like I said, I'm confused...)

v4:
    response = {org.apache.zookeeper.proto.createrespo...@1365}\n
    r = {org.apache.zookeeper.proto.replyhea...@1445}0,0,-112\n

-112 return code is session expired, not auth failure. According to this your client's session expired, but w/o more info (code/log or idea of what your test is doing) I can't really speculate why you are getting this (old client session that was not shut down correctly and finally expired while running a different/new test?)

Patrick

v5:
    response = {org.apache.zookeeper.proto.createrespo...@1360}'/ACLTest\n
    r = {org.apache.zookeeper.proto.replyhea...@1389}2,2,0\n

-Todd

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Monday, September 21, 2009 4:14 PM
To: zookeeper-user@hadoop.apache.org; Todd Greenwood
Subject: Re: ACL question w/ Zookeeper 3.1.1

Todd,

I spent some time looking at your output and honestly I'm having trouble making sense of what you are saying. What's the diff btw v3 and v4? I'm afraid there are too many variables, can you help nail things down?

1) create a jira for this https://issues.apache.org/jira/browse/ZOOKEEPER
2) if at all possible attach the code you are running that has problems, seems like you've boiled it down to a case where it is deterministic, this would be the best for us to debug. If you can't attach the code then include snippets - in particular the addAuthInfo call (w/parameter details) for your clients, and the individual create calls, including the acl specifics - and describe what your client(s) are doing in detail so that we can attempt to reproduce.
3) attach a trace level log from both the server and client during your test run, point out the time index when you see the auth failure.

btw, you might try doing a getACL(path...) just before the operation that's failing - it will give you some insight into what the acl is set to for that node.

Patrick

Todd Greenwood wrote:
Patrick / Mahadev,

I've spent the last couple of days attempting to isolate this issue, and this is what I've come up with...

Mahadev's simple use case works fine, as posted. However, my more involved use cases are consistently failing w/ InvalidACL exceptions when I use digest authentication with Ids.CREATOR_ALL_ACL:

    java.lang.Exception: com.audiencescience.util.zookeeper.wrapper.ZooWrapperException: org.apache.zookeeper.KeeperException$InvalidACLException: KeeperErrorCode = InvalidACL for /ACLTest

Prior to throwing this exception, the response is (Zookeeper.java:create()):

    r = {org.apache.zookeeper.proto.replyhea...@1445}0,0,-112\n

More debug data below.

So, while I can get Mahadev's simple example to work, I cannot get a more involved use case to work correctly. However, if I change my code to use Ids.OPEN_ACL_UNSAFE, then everything works fine. Example debug output below at v5.

Could someone point me at non-trivial test cases for ACLs, and perhaps give me some insight into how to debug this issue further?

-Todd

--- Code Snippet
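
Editor's sketch: the thread above turns on reading the ReplyHeader triple shown in the debugger (for example 0,0,-112 versus 2,2,0). The fields are xid, zxid and err, as the v5 dump earlier in the thread spells out. The small helper below decodes such a triple. The class and method names are hypothetical, and the table holds only a handful of codes as I understand them from KeeperException.Code; check it against the ZooKeeper version in use.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of ZooKeeper): decodes an "xid,zxid,err" ReplyHeader dump. */
final class ReplyHeaderDecoder {
    // Partial table of error codes, as I understand KeeperException.Code; verify before relying on it.
    private static final Map<Integer, String> CODES = new HashMap<Integer, String>();
    static {
        CODES.put(0, "OK");
        CODES.put(-4, "CONNECTIONLOSS");
        CODES.put(-101, "NONODE");
        CODES.put(-102, "NOAUTH");
        CODES.put(-110, "NODEEXISTS");
        CODES.put(-112, "SESSIONEXPIRED");
        CODES.put(-114, "INVALIDACL");
        CODES.put(-115, "AUTHFAILED");
    }

    /** Turns a triple such as "0,0,-112" into a readable description. */
    static String describe(String header) {
        String[] parts = header.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected xid,zxid,err but got: " + header);
        }
        int err = Integer.parseInt(parts[2].trim());
        String name = CODES.get(err);
        return "xid=" + parts[0].trim() + " zxid=" + parts[1].trim()
                + " err=" + err + " (" + (name == null ? "UNKNOWN" : name) + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("0,0,-112")); // the failing create (v4)
        System.out.println(describe("2,2,0"));    // the successful create (v5)
    }
}
```

Read this way, the v4 header reports a session expiry rather than an ACL or auth error, which matches Patrick's comment.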
ACL question w/ Zookeeper 3.1.1
I'm attempting to secure a zookeeper installation using zookeeper ACLs. However, I'm finding that while Ids.OPEN_ACL_UNSAFE works great, my attempts at using Ids.CREATOR_ALL_ACL are failing. Here's a code snippet:

    public class ZooWrapper {

        /* 1. Here I'm setting up my authentication. I've got an ACL list, and my authentication strings. */
        private final List<ACL> acl = new ArrayList<ACL>( 1 );
        private static final String authentication_type = "digest";
        private static final String authentication = "audiencescience:gravy";

        public ZooWrapper( final String connection_string, final String path, final int connectiontimeout ) throws ZooWrapperException {
            ...

            /* 2. Here I'm adding the acls */

            // This works (creates nodes, sets data on nodes)
            for ( ACL ids_acl : Ids.OPEN_ACL_UNSAFE ) {
                acl.add( ids_acl );
            }

            /* NOTE: This does not work (nodes are not created, cannot set data on nodes b/c nodes do not exist) */
            //for ( ACL ids_acl : Ids.CREATOR_ALL_ACL )
            //{
            //    acl.add( ids_acl );
            //}

            /* 3. Finally, I create a new zookeeper instance and add my authorization info to it. */
            zoo = new ZooKeeper( connection_string, connectiontimeout, this );
            zoo.addAuthInfo( authentication_type, authentication.getBytes() );

            /* 4. Later, I try to write some data into zookeeper by first creating the node, and then calling setdata... */
            zoo.create( path, new byte[0], acl, CreateMode.PERSISTENT );
            zoo.setData( path, bytes, -1 );

As I mentioned above, when I add Ids.OPEN_ACL_UNSAFE to acl, then both the create and setData succeed. However, when I use Ids.CREATOR_ALL_ACL, then the nodes are not created. Am I missing something obvious w/ respect to configuring ACLs?
I've used the following references:

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperProgrammers.html
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-commits/200807.mbox/%3c20080731201025.c62092388...@eris.apache.org%3e
http://books.google.com/books?id=bKPEwR-Pt6EC&pg=PT404&lpg=PT404&dq=zookeeper+ACL+digest+%22new+Id%22&source=bl&ots=kObz0y8eFk&sig=VFCAsNW0mBJyZswoweJDI31iNlo&hl=en&ei=Z82ySojRFsqRlAeqxsyIDw&sa=X&oi=book_result&ct=result&resnum=6#v=onepage&q=zookeeper%20ACL%20digest%20%22new%20Id%22&f=false

-Todd
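
Editor's sketch: when debugging digest ACLs like the ones in this message, it can help to know what identity the server is expected to store for a "user:password" credential. As far as I understand the digest scheme, the id is the user name followed by the Base64-encoded SHA-1 of the whole "user:password" string, and Ids.CREATOR_ALL_ACL is filled in from the ids already authenticated on the session when the create arrives. The class below reproduces that calculation with the JDK only (Java 8 or later). Its names are hypothetical; I believe ZooKeeper ships an equivalent in DigestAuthenticationProvider.generateDigest, so treat this as an illustration rather than the library's code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Hypothetical helper: computes the id string assumed to be used by the "digest" ACL scheme. */
final class DigestIdSketch {

    /** Returns user + ":" + base64(sha1("user:password")) for a "user:password" credential. */
    static String digestId(String userPassword) {
        String[] parts = userPassword.split(":", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected user:password");
        }
        try {
            byte[] sha1 = MessageDigest.getInstance("SHA-1")
                    .digest(userPassword.getBytes(StandardCharsets.UTF_8));
            return parts[0] + ":" + Base64.getEncoder().encodeToString(sha1);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required JDK algorithm, so this should not happen in practice.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same credential string as in the snippet above.
        System.out.println(digestId("audiencescience:gravy"));
    }
}
```

Comparing this value with the id returned by getACL on a node created with CREATOR_ALL_ACL is one way to confirm that addAuthInfo took effect before the create was sent.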
RE: Unending Leader Elections in WAN deploy
Great news. Thank you Mahadev. I'll report our findings later today. -Todd -Original Message- From: Mahadev Konar [mailto:maha...@yahoo-inc.com] Sent: Tuesday, August 04, 2009 11:20 AM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy Hi Todd, I just committed 480 and 491. You can checkout the 3.2 branch now. Thanks mahadev On 8/3/09 4:29 PM, Todd Greenwood to...@audiencescience.com wrote: That'd be perfect. Thanks! -Original Message- From: Mahadev Konar [mailto:maha...@yahoo-inc.com] Sent: Monday, August 03, 2009 4:24 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy Hi Todd, Most of the patches that you mention should be in the branch 3.2 by tomm or so. 481, 479 are already in. 480 and 491 should be in by tomm. Would that suffice for you? Thanks mahadev On 8/3/09 4:21 PM, Todd Greenwood to...@audiencescience.com wrote: Another problem...I've reverted to the latest versions of the patches that are not specific to branch-3.2, and I'm getting two compilation errors: build-generated: [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/build/classes compile-main: [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/build/classes [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure [javac] public String[] getQuorumPeers(); [javac] ^ [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru mStats.java:31: name clash: getServerState() and getServerState() have the same erasure [javac] public String getServerState(); [javac] ^ [javac] 2 errors My build process is 
pretty simple: 1. copy the branch-3.2 source to a temp directory (src/patched/branch-3.2) 2. apply the ZOOKEEPER patches in my patches directory 3. build zookeeper in the temp directory -Todd -Original Message- From: Todd Greenwood [mailto:to...@audiencescience.com] Sent: Monday, August 03, 2009 4:09 PM To: zookeeper-user@hadoop.apache.org Subject: RE: Unending Leader Elections in WAN deploy Flavio, I notice that you've updated the patches referenced for the WAN deployment. There appears to be an order dependency w/ respect to these four patches... ZOOKEEPER-473.patch ZOOKEEPER-479-branch3.2.patch ZOOKEEPER-481-branch3.2.patch ZOOKEEPER-491.patch 473 - 479 (479 fails) to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper /src/patched/branch-3.2$ patch -p0 ../patches/ZOOKEEPER-479-branch3.2.patch patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch ical.java patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier .java patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java Hunk #1 FAILED at 93. Hunk #2 FAILED at 145. 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper /src/patched/branch-3.2$ h ../patches/ Could you advise as to which patches I need to apply, and in what order? -Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 9:51 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy Perfect! Thanks for the update, Todd. -Flavio On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote: Thanks. You were right, I had a stale version of 479. Compilation succeeds and all tests pass on branch-3.2 with the latest patches 473, 479, 481, and 491. 
-Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 7:48 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy It should be in 479. Perhaps you have a stale version of the patch. -Flavio On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote: Flavio, I'm getting a compilation error for patch 491: compile-main: [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ src/p
Zookeeper Client
Is org.apache.zookeeper.ZooKeeper thread safe? I've started walking through the code to check for mutability, and although the first level children are protected, I haven't fully walked the graph. Perhaps I should ask, is it supposed to be thread safe? -Todd
RE: Unending Leader Elections in WAN deploy
Flavio, I notice that you've updated the patches referenced for the WAN deployment. There appears to be an order dependency w/ respect to these four patches... ZOOKEEPER-473.patch ZOOKEEPER-479-branch3.2.patch ZOOKEEPER-481-branch3.2.patch ZOOKEEPER-491.patch 473 - 479 (479 fails) to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper /src/patched/branch-3.2$ patch -p0 ../patches/ZOOKEEPER-479-branch3.2.patch patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch ical.java patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier .java patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java Hunk #1 FAILED at 93. Hunk #2 FAILED at 145. 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper /src/patched/branch-3.2$ h ../patches/ Could you advise as to which patches I need to apply, and in what order? -Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 9:51 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy Perfect! Thanks for the update, Todd. -Flavio On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote: Thanks. You were right, I had a stale version of 479. Compilation succeeds and all tests pass on branch-3.2 with the latest patches 473, 479, 481, and 491. -Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 7:48 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy It should be in 479. Perhaps you have a stale version of the patch. 
-Flavio On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote: Flavio, I'm getting a compilation error for patch 491: compile-main: [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ src/p atched/branch-3.2/build/classes [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/ FastL eaderElection.java:601: cannot find symbol [javac] symbol : method getWeight(long) [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0) [javac]^ [javac] 1 error I see a reference to getWeight in both FastLeaderElection.java in patch 491: patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0) src/java/main/org/apache/zookeeper/server/quorum/ FastLeaderElection.java : if(self.getQuorumVerifier().getWeight(n.sid) != 0) However, I don't see a reference to this method in patches 473, 479, or 481. I also don't see a reference to this method in the trunk... -Todd -Original Message- From: Todd Greenwood [mailto:to...@audiencescience.com] Sent: Friday, July 31, 2009 7:30 PM To: zookeeper-user@hadoop.apache.org Subject: RE: Unending Leader Elections in WAN deploy Ok, I'll apply that patch and report back. -Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 7:18 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy You're missing 491 from your set of patches. -Flavio On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote: This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481). Basically, it seems like the nodes are electing pd4-zook02 to be the leader. However, pd4-zook02 seems to realize it's not supposed to be and then disconnects everyone. Then they re-elect it again, and it loops over and over. 
- Server config -

server.1=dc1-zook01.dc01.revsci.net:2888:3888
server.2=dc1-zook02.dc01.revsci.net:2888:3888
server.3=dc1-zook03.dc01.revsci.net:2888:3888
server.4=dc1-zook04.dc01.revsci.net:2888:3888
server.5=dc1-zook05.dc01.revsci.net:2888:3888
server.6=pd1-zook01.pd01.revsci.net:2888:3888
server.7=pd1-zook02.pd01.revsci.net:2888:3888
server.8=pd4-zook01.iad1.audsci.net:2888:3888
server.9=pd4-zook02.iad1.audsci.net:2888:3888

group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1

group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0

Note that we have 2 groups, composed of machines in 3 different locations (dc1, pd1, and pd4). The idea is that only machines in dc1 have voting rights, and the ability to become a leader. The machines
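
Editor's sketch: the intent of the group and weight settings above can be checked with a few lines of code. The class below is a simplified model of the hierarchical quorum rule as I read it in the ZooKeeper documentation: a group counts when the acknowledging servers hold more than half of its weight, zero-weight groups are ignored, and a quorum needs a majority of the remaining groups. It is not the QuorumHierarchical implementation that the patches touch, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of a weighted, grouped quorum check. */
final class HierarchicalQuorumSketch {
    private final Map<Long, Long> serverGroup = new HashMap<Long, Long>();
    private final Map<Long, Long> serverWeight = new HashMap<Long, Long>();

    void addServer(long sid, long group, long weight) {
        serverGroup.put(sid, group);
        serverWeight.put(sid, weight);
    }

    /** True if the acknowledging servers form a quorum under the assumed rule. */
    boolean containsQuorum(Set<Long> acks) {
        Map<Long, Long> groupTotal = new HashMap<Long, Long>();
        Map<Long, Long> groupAcked = new HashMap<Long, Long>();
        for (Map.Entry<Long, Long> e : serverGroup.entrySet()) {
            long sid = e.getKey();
            long gid = e.getValue();
            long weight = serverWeight.get(sid);
            groupTotal.merge(gid, weight, Long::sum);
            if (acks.contains(sid)) {
                groupAcked.merge(gid, weight, Long::sum);
            }
        }
        int votingGroups = 0;
        int majorityGroups = 0;
        for (Map.Entry<Long, Long> e : groupTotal.entrySet()) {
            long total = e.getValue();
            if (total == 0) {
                continue; // zero-weight groups do not take part in the vote
            }
            votingGroups++;
            long acked = groupAcked.getOrDefault(e.getKey(), 0L);
            if (2 * acked > total) {
                majorityGroups++;
            }
        }
        return 2 * majorityGroups > votingGroups;
    }

    /** Builds the nine-server layout from the config above. */
    static HierarchicalQuorumSketch fromThreadConfig() {
        HierarchicalQuorumSketch q = new HierarchicalQuorumSketch();
        for (long sid = 1; sid <= 5; sid++) q.addServer(sid, 1, 1); // dc1: voting
        for (long sid = 6; sid <= 9; sid++) q.addServer(sid, 2, 0); // pd1/pd4: weight 0
        return q;
    }

    public static void main(String[] args) {
        HierarchicalQuorumSketch q = fromThreadConfig();
        System.out.println(q.containsQuorum(new HashSet<Long>(Arrays.asList(1L, 2L, 3L))));
        System.out.println(q.containsQuorum(new HashSet<Long>(Arrays.asList(6L, 7L, 8L, 9L))));
    }
}
```

Under this model, any three of the dc1 servers form a quorum and no combination of the weight-0 servers does, which is the behavior the config is aiming for.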
RE: Unending Leader Elections in WAN deploy
Another problem...I've reverted to the latest versions of the patches that are not specific to branch-3.2, and I'm getting two compilation errors: build-generated: [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/build/classes compile-main: [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/build/classes [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure [javac] public String[] getQuorumPeers(); [javac] ^ [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru mStats.java:31: name clash: getServerState() and getServerState() have the same erasure [javac] public String getServerState(); [javac] ^ [javac] 2 errors My build process is pretty simple: 1. copy the branch-3.2 source to a temp directory (src/patched/branch-3.2) 2. apply the ZOOKEEPER patches in my patches directory 3. build zookeeper in the temp directory -Todd -Original Message- From: Todd Greenwood [mailto:to...@audiencescience.com] Sent: Monday, August 03, 2009 4:09 PM To: zookeeper-user@hadoop.apache.org Subject: RE: Unending Leader Elections in WAN deploy Flavio, I notice that you've updated the patches referenced for the WAN deployment. There appears to be an order dependency w/ respect to these four patches... 
RE: Unending Leader Elections in WAN deploy
That'd be perfect. Thanks! -Original Message- From: Mahadev Konar [mailto:maha...@yahoo-inc.com] Sent: Monday, August 03, 2009 4:24 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy Hi Todd, Most of the patches that you mention should be in the branch 3.2 by tomm or so. 481, 479 are already in. 480 and 491 should be in by tomm. Would that suffice for you? Thanks mahadev On 8/3/09 4:21 PM, Todd Greenwood to...@audiencescience.com wrote: Another problem...I've reverted to the latest versions of the patches that are not specific to branch-3.2, and I'm getting two compilation errors: build-generated: [javac] Compiling 44 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/build/classes compile-main: [javac] Compiling 2 source files to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/build/classes [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru mStats.java:30: name clash: getQuorumPeers() and getQuorumPeers() have the same erasure [javac] public String[] getQuorumPeers(); [javac] ^ [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/Quoru mStats.java:31: name clash: getServerState() and getServerState() have the same erasure [javac] public String getServerState(); [javac] ^ [javac] 2 errors My build process is pretty simple: 1. copy the branch-3.2 source to a temp directory (src/patched/branch-3.2) 2. apply the ZOOKEEPER patches in my patches directory 3. 
build zookeeper in the temp directory -Todd -Original Message- From: Todd Greenwood [mailto:to...@audiencescience.com] Sent: Monday, August 03, 2009 4:09 PM To: zookeeper-user@hadoop.apache.org Subject: RE: Unending Leader Elections in WAN deploy Flavio, I notice that you've updated the patches referenced for the WAN deployment. There appears to be an order dependency w/ respect to these four patches... ZOOKEEPER-473.patch ZOOKEEPER-479-branch3.2.patch ZOOKEEPER-481-branch3.2.patch ZOOKEEPER-491.patch 473 - 479 (479 fails) to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper /src/patched/branch-3.2$ patch -p0 ../patches/ZOOKEEPER-479-branch3.2.patch patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumHierarch ical.java patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumMaj.java patching file src/java/main/org/apache/zookeeper/server/quorum/flexible/QuorumVerifier .java patching file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java Hunk #1 FAILED at 93. Hunk #2 FAILED at 145. 2 out of 2 hunks FAILED -- saving rejects to file src/java/test/org/apache/zookeeper/test/HierarchicalQuorumTest.java.rej to...@toddg01lt:~/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper /src/patched/branch-3.2$ h ../patches/ Could you advise as to which patches I need to apply, and in what order? -Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 9:51 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy Perfect! Thanks for the update, Todd. -Flavio On Jul 31, 2009, at 8:17 PM, Todd Greenwood wrote: Thanks. You were right, I had a stale version of 479. Compilation succeeds and all tests pass on branch-3.2 with the latest patches 473, 479, 481, and 491. 
-Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 7:48 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy It should be in 479. Perhaps you have a stale version of the patch. -Flavio On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote: Flavio, I'm getting a compilation error for patch 491: compile-main: [javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ src/p atched/branch-3.2/build/classes [javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/ src/p atched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/ FastL eaderElection.java:601: cannot find symbol [javac] symbol : method getWeight(long) [javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier [javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0) [javac
RE: test failures in branch-3.2
Patrick,

Thank you for the background (and I hope you and Mahadev recover quickly). On a plus note, I'm finding that this morning, @work rather than @home, the tests continue to completion. However, there are other issues that I'll bring up on the dev list, such as a requirement to have autoconf installed, and problems in the create-cppunit-configure task that can't exec libtoolize, fun stuff like that.

I need to proceed with the manual patches to branch-3.2, as I am under some time constraints to get our infrastructure deployed such that QA can start playing with it. However, I'll switch to 3.2.1 as soon as I can.

-Todd

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Friday, July 31, 2009 11:38 AM
To: zookeeper-user@hadoop.apache.org; Todd Greenwood
Subject: Re: test failures in branch-3.2

Hi Todd,

Sorry for the clutter/confusion. Usually things aren't this cumbersome ;-) In particular:

1 committer is on vacation
Mahadev's been out sick for multiple days
I'm sick but trying to hang in there, but def not 100%
Hudson (CI) has been offline for effectively the past 3 weeks (that gates all our commits) and is just now back but flaky.

3.2 had some bugs that we are trying to address, but the aforementioned issues are slowing us down. Otherwise we'd have all this straightened out by now.

At this point you should move this discussion to the dev list - Apache doesn't really like us to discuss code changes/futures here (user list). On that list you'll also see the plan for upcoming releases - I mention all this because we are actively working toward 3.2.1, which will include the JIRAs slated for that release (I'm sure you've seen). If you can wait a bit you might be able to avoid some pain by using the upcoming 3.2.1 release. Once the patches land in that branch your issues will be resolved w/o you needing to manually apply patches, etc...

I did look at the files you attached - it looks fine so I'm not sure what the issue is.
The form of this test makes it harder - we are verifying that the log contains sufficient information when a particular error occurs. We fiddle with log4j in order to do this, which means that the log you are including doesn't specify the problem. Try instrumenting this test with a try/catch around the content of the test method (all the code in the failing method inside a big try/catch is what I mean). Then print the error to std out as part of the catch. That should shed some light. If you could debug it a bit that would help - because we aren't seeing this in our environment.

Again, sort of a moot point if you can wait a week or so...

Regards,
Patrick

Todd Greenwood wrote:
Inline.

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Thursday, July 30, 2009 10:57 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

Todd Greenwood wrote:
Starting w/ branch-3.2 (no changes) I applied patches in this order:

1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails.

2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file - PortAssignment.java. PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch, which is a pretty hefty patch ( 2k lines) and touches a large number of files.

Hrm, those patches were probably created against the trunk. We'll have to have separate patches for trunk and 3.2 branch on 481. If you could update the jira with this detail (481 needs two patches, one for each branch) that would be great!

Done.

3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm crashes).

473 is special (unique) in the sense that it changes log4j while the vm is running. In general though it's a pretty boring test and shouldn't be failing. Are you sure you have the right patch file? There are 2 patch files on the JIRA for 473, make sure that you have the one from 7/16, NOT the one from 7/15.
Check the patch file: the correct one should NOT contain changes to build.xml or conf/log4j* files. If this still happens send me your build.xml, conf/log4j* and QuorumPeerMainTest.java files in email for review. I'll take a look.

I've annotated the files w/ their date while downloading:

  112700 2009-07-31 11:02 ZOOKEEPER-473-7-15.patch
  110607 2009-07-31 11:01 ZOOKEEPER-473-7-16.patch

It appears I applied the 7-16 patch, as that is the matching file size of the patch file I applied. If there are to be multiple patch files for multiple branches (3.2, trunk, etc.) would it make sense to label the patch files accordingly?

Requested files in attached tar.

-Todd

Patrick

[junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
[junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest
[junit] Tests run: 1, Failures: 0, Errors: 1, Time
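For anyone following along, Patrick's suggestion above (wrap the whole body of the failing test in one try/catch and print the error to standard out, because the test reconfigures log4j and the log therefore hides the real problem) could look roughly like the sketch below. LoudTest and its Body interface are hypothetical names made up for this illustration; they are not part of ZooKeeper or JUnit. The lambda in the usage note needs Java 8 or later; on an older JVM an anonymous class would do the same job.

```java
import java.io.PrintStream;

/** Hypothetical helper for illustration only; not part of ZooKeeper or JUnit. */
final class LoudTest {

    /** The body of a test method; it may throw anything. */
    interface Body {
        void run() throws Throwable;
    }

    /**
     * Runs the test body. If it throws, the full stack trace is written to
     * the given stream (System.out in a real test run), and the error is
     * rethrown so that the test is still reported as failed.
     */
    static void run(String testName, Body body, PrintStream out) throws Throwable {
        try {
            body.run();
        } catch (Throwable t) {
            out.println("Test " + testName + " failed with: " + t);
            t.printStackTrace(out);
            out.flush();
            throw t;
        }
    }
}
```

Inside the failing method this would be used as LoudTest.run("testBadPeerAddressInQuorum", () -> { /* original test body */ }, System.out); the point is only that the stack trace reaches the console regardless of what the test has done to the log4j configuration.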
Unending Leader Elections in WAN deploy
This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481). Basically, it seems like the nodes are electing pd4-zook02 to be the leader. However, pd4-zook02 seems to realize it's not supposed to be and then disconnects everyone. Then they re-elect it again, and it loops over and over.

- Server config -

server.1=dc1-zook01.dc01.revsci.net:2888:3888
server.2=dc1-zook02.dc01.revsci.net:2888:3888
server.3=dc1-zook03.dc01.revsci.net:2888:3888
server.4=dc1-zook04.dc01.revsci.net:2888:3888
server.5=dc1-zook05.dc01.revsci.net:2888:3888
server.6=pd1-zook01.pd01.revsci.net:2888:3888
server.7=pd1-zook02.pd01.revsci.net:2888:3888
server.8=pd4-zook01.iad1.audsci.net:2888:3888
server.9=pd4-zook02.iad1.audsci.net:2888:3888

group.1:1:2:3:4:5
weight.1=1
weight.2=1
weight.3=1
weight.4=1
weight.5=1

group.2:6:7:8:9
weight.6=0
weight.7=0
weight.8=0
weight.9=0

Note that we have 2 groups, composed of machines in 3 different locations (dc1, pd1, and pd4). The idea is that only machines in dc1 have voting rights, and the ability to become a leader. The machines in the pods all have a weight of zero, and are not expected to become leaders, or to vote on transactions.

Let me know what I can do to help resolve this issue.

-Todd
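To make the intent of the group/weight settings above concrete, here is a small self-contained sketch of how a weighted, grouped quorum check can work. It is a simplified model written for this thread, not ZooKeeper's own quorum verifier: the class WeightedQuorumSketch, its method names, and the rule that a group whose total weight is zero is left out of the group count are assumptions chosen to match the behaviour Todd describes (only dc1 machines vote).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Simplified model for illustration; not ZooKeeper's actual implementation. */
final class WeightedQuorumSketch {
    private final Map<Long, Long> groupOf;    // server id -> group id
    private final Map<Long, Long> weightOf;   // server id -> weight
    private final Map<Long, Long> groupTotal = new HashMap<>(); // group id -> total weight

    WeightedQuorumSketch(Map<Long, Long> groupOf, Map<Long, Long> weightOf) {
        this.groupOf = groupOf;
        this.weightOf = weightOf;
        for (Map.Entry<Long, Long> e : groupOf.entrySet()) {
            groupTotal.merge(e.getValue(), weight(e.getKey()), Long::sum);
        }
    }

    /** Weight of a server; assumed to default to 1 when none is configured. */
    long weight(long sid) {
        return weightOf.getOrDefault(sid, 1L);
    }

    /**
     * True if the acknowledging servers hold more than half of the weight in
     * more than half of the groups. Groups with zero total weight are skipped.
     */
    boolean containsQuorum(Set<Long> acks) {
        Map<Long, Long> acked = new HashMap<>();
        for (long sid : acks) {
            Long gid = groupOf.get(sid);
            if (gid != null) {
                acked.merge(gid, weight(sid), Long::sum);
            }
        }
        int votingGroups = 0;
        int wonGroups = 0;
        for (Map.Entry<Long, Long> e : groupTotal.entrySet()) {
            if (e.getValue() == 0) {
                continue; // a group of zero-weight servers has no say
            }
            votingGroups++;
            long got = acked.getOrDefault(e.getKey(), 0L);
            if (2 * got > e.getValue()) {
                wonGroups++;
            }
        }
        return 2 * wonGroups > votingGroups;
    }

    /** Builds the two-group layout from the config in this message. */
    static WeightedQuorumSketch fromThreadConfig() {
        Map<Long, Long> groupOf = new HashMap<>();
        Map<Long, Long> weightOf = new HashMap<>();
        for (long sid = 1; sid <= 5; sid++) { groupOf.put(sid, 1L); weightOf.put(sid, 1L); }
        for (long sid = 6; sid <= 9; sid++) { groupOf.put(sid, 2L); weightOf.put(sid, 0L); }
        return new WeightedQuorumSketch(groupOf, weightOf);
    }
}
```

Under this model, acknowledgements from servers 1, 2 and 3 form a quorum, while acknowledgements from all four pod servers (6 to 9) do not, and neither do two dc1 servers plus all four pod servers. That matches the stated goal: the zero-weight machines should never be able to decide anything on their own.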
RE: Unending Leader Elections in WAN deploy
Ok, I'll apply that patch and report back. -Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 7:18 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy You're missing 491 from your set of patches. -Flavio On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote: This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481). Basically, it seems like the nodes are electing pd4-zook02 to be the leader. However, pd4-zook02 seems to realize it's not supposed to be and then disconnects everyone. Then they re-elect it again, and it loops over and over. - Server config - server.1=dc1-zook01.dc01.revsci.net:2888:3888 server.2=dc1-zook02.dc01.revsci.net:2888:3888 server.3=dc1-zook03.dc01.revsci.net:2888:3888 server.4=dc1-zook04.dc01.revsci.net:2888:3888 server.5=dc1-zook05.dc01.revsci.net:2888:3888 server.6=pd1-zook01.pd01.revsci.net:2888:3888 server.7=pd1-zook02.pd01.revsci.net:2888:3888 server.8=pd4-zook01.iad1.audsci.net:2888:3888 server.9=pd4-zook02.iad1.audsci.net:2888:3888 group.1:1:2:3:4:5 weight.1=1 weight.2=1 weight.3=1 weight.4=1 weight.5=1 group.2:6:7:8:9 weight.6=0 weight.7=0 weight.8=0 weight.9=0 Note that we have 2 groups, composed of machines in 3 different locations (dc1, pd1, and pd4). The idea is that only machines in dc1 have voting rights, and the ability to become a leader. The machines in the pods all have a weight of zero, and are not expected to become leaders, or to vote on transactions. Let me know what I can do to help resolve this issue. -Todd
RE: Unending Leader Elections in WAN deploy
Thanks. You were right, I had a stale version of 479. Compilation succeeds and all tests pass on branch-3.2 with the latest patches 473, 479, 481, and 491.

-Todd

-Original Message-
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 31, 2009 7:48 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Unending Leader Elections in WAN deploy

It should be in 479. Perhaps you have a stale version of the patch.

-Flavio

On Jul 31, 2009, at 7:46 PM, Todd Greenwood wrote:

Flavio,

I'm getting a compilation error for patch 491:

compile-main:
[javac] Compiling 1 source file to /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/build/classes
[javac] /home/toddg/asi/workspaces/main/Main/RSI/etc/holmes/main/zookeeper/src/patched/branch-3.2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java:601: cannot find symbol
[javac] symbol : method getWeight(long)
[javac] location: interface org.apache.zookeeper.server.quorum.flexible.QuorumVerifier
[javac] if(self.getQuorumVerifier().getWeight(n.sid) != 0)
[javac] ^
[javac] 1 error

I see a reference to getWeight in both patch 491 and FastLeaderElection.java:

patches/ZOOKEEPER-491.patch:+ if(self.getQuorumVerifier().getWeight(n.sid) != 0)
src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java: if(self.getQuorumVerifier().getWeight(n.sid) != 0)

However, I don't see a reference to this method in patches 473, 479, or 481. I also don't see a reference to this method in the trunk...

-Todd

-Original Message-
From: Todd Greenwood [mailto:to...@audiencescience.com]
Sent: Friday, July 31, 2009 7:30 PM
To: zookeeper-user@hadoop.apache.org
Subject: RE: Unending Leader Elections in WAN deploy

Ok, I'll apply that patch and report back.
-Todd -Original Message- From: Flavio Junqueira [mailto:f...@yahoo-inc.com] Sent: Friday, July 31, 2009 7:18 PM To: zookeeper-user@hadoop.apache.org Subject: Re: Unending Leader Elections in WAN deploy You're missing 491 from your set of patches. -Flavio On Jul 31, 2009, at 7:15 PM, Todd Greenwood wrote: This repro's in both branch-3.2, and branch-3.2+patches(473, 479, 481). Basically, it seems like the nodes are electing pd4-zook02 to be the leader. However, pd4-zook02 seems to realize it's not supposed to be and then disconnects everyone. Then they re-elect it again, and it loops over and over. - Server config - server.1=dc1-zook01.dc01.revsci.net:2888:3888 server.2=dc1-zook02.dc01.revsci.net:2888:3888 server.3=dc1-zook03.dc01.revsci.net:2888:3888 server.4=dc1-zook04.dc01.revsci.net:2888:3888 server.5=dc1-zook05.dc01.revsci.net:2888:3888 server.6=pd1-zook01.pd01.revsci.net:2888:3888 server.7=pd1-zook02.pd01.revsci.net:2888:3888 server.8=pd4-zook01.iad1.audsci.net:2888:3888 server.9=pd4-zook02.iad1.audsci.net:2888:3888 group.1:1:2:3:4:5 weight.1=1 weight.2=1 weight.3=1 weight.4=1 weight.5=1 group.2:6:7:8:9 weight.6=0 weight.7=0 weight.8=0 weight.9=0 Note that we have 2 groups, composed of machines in 3 different locations (dc1, pd1, and pd4). The idea is that only machines in dc1 have voting rights, and the ability to become a leader. The machines in the pods all have a weight of zero, and are not expected to become leaders, or to vote on transactions. Let me know what I can do to help resolve this issue. -Todd
RE: Zookeeper WAN Configuration
Patrick - Thank you, I'll proceed accordingly.

-Todd

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Wednesday, July 29, 2009 10:30 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

[Todd] What is the recommended policy regarding patching zookeeper locally? As an external user, should I patch and compile in the trunk or in the branch (branch-3.2)? I've looked at:

http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
http://wiki.apache.org/hadoop/HowToRelease

And both of these seem well thought out but aimed at committers committing to the trunk.

In your context (want 3.2 features) you probably want to build based on the 3.2 tag, that way you are working off a known quantity. I'd suggest strongly that as part of your build you document the source base and which patches/changes you have applied. Having this information will be critical for you (or someone using your build) in case bugs have to be filed, or further changes/patches have to be applied, etc...

Patrick
RE: bad svn url : test-patch
Thanks Mahadev.

-Original Message-
From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
Sent: Thursday, July 30, 2009 3:00 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: bad svn url : test-patch

Hi Todd,

Yes, this happens with the branch 3.2. The test-patch link is broken because of the hadoop split. This file is used for the hudson test environment. It isn't used anywhere else, so the svn co otherwise should be fine. We should fix it anyway.

Thanks
mahadev

On 7/30/09 2:57 PM, Todd Greenwood to...@audiencescience.com wrote:

FYI - looks like there is a bad url in svn...

$ svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.2 branch-3.2
...
A    branch-3.2/build.xml
Fetching external item into 'branch-3.2/src/java/test/bin'
svn: URL 'http://svn.apache.org/repos/asf/hadoop/common/nightly/test-patch' doesn't exist

This does not repro w/ 3.1:

$ svn co http://svn.apache.org/repos/asf/hadoop/zookeeper/branches/branch-3.1 branch-3.1

-Todd
RE: test failures in branch-3.2
No edits to conf/log4j.properties.

-Original Message-
From: Patrick Hunt [mailto:ph...@apache.org]
Sent: Thursday, July 30, 2009 9:25 PM
To: Patrick Hunt
Cc: zookeeper-user@hadoop.apache.org
Subject: Re: test failures in branch-3.2

btw QuorumPeerMainTest uses the CONSOLE appender which is set up in conf/log4j.properties, now that I think of it perhaps not such a good idea :-) If you edited conf/log4j.properties it may be causing the test to fail, did you do this? (if you run the test by itself using -Dtestcase does it always fail?)

I've entered a jira to address this: https://issues.apache.org/jira/browse/ZOOKEEPER-492

Patrick

Patrick Hunt wrote:
Todd Greenwood wrote:

The build succeeds, but not all of the tests. In previous test runs, I noticed an error in org.apache.zookeeper.test.FLETest. It was not able to bind to a port or something. Now, after a machine reboot, I'm getting different failures.

address in use? That's a problem in the test framework pre-3.3. In 3.3 (current svn trunk) I fixed it but it's not in 3.2.x. This is a problem with the test framework though and not a real problem, it shows up occasionally (depends on timing).

branch-3.2 $ ant test
[junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest FAILED (crashed)
[junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED

Test logs for these two tests attached.

This is unusual though - looking at the log it seems that the JVM itself crashed for the QPMainTest! For HQT we are seeing:

junit.framework.AssertionFailedError: Threads didn't join

which Flavio mentioned to me once is possible to happen but not a real problem (he can elaborate).

What version of java are you using? OS, other environment that might be interesting? (vm? etc...) You might try looking at the jvm crash dump file (I think it's in /tmp)

If you run each of these two tests individually do they run?
example: ant -Dtestcase=FLENewEpochTest test-core-java My goal here is to get to a known state (all tests succeeding or have workarounds for the failures). Following that, I plan to apply the patches Flavio recommended for a WAN deploy (479 and 481). After I verify that the tests continue to run, I'll package this up and deploy it to our WAN for testing. Sounds like a good plan. So, are these known issues? Do the tests normally run en masse, or do some of the tests hold on to resources and prevent other tests from passing? Typically they do run to completion, but occasionally on my machine (java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some random failure due to address in use, or the same didn't join that you saw. Usually I see this if I'm multitasking (vs just letting the tests run w/o using the box). As I said this is addressed in 3.3 (address reuse at the very least, and I haven't see the other issues). Patrick
RE: test failures in branch-3.2
Patrick, inline. -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Thursday, July 30, 2009 9:13 PM To: zookeeper-user@hadoop.apache.org Subject: Re: test failures in branch-3.2 Todd Greenwood wrote: The build succeeds, but not the all of the tests. In previous test runs, I noticed an error in org.apache.zookeeper.test.FLETest. It was not able to bind to a port or something. Now, after a machine reboot, I'm getting different failures. address in use? That's a problem in the test framework pre-3.3. In 3.3 (current svn trunk) I fixed it but it's not in 3.2.x. This is a problem with the test framework though and not a real problem, it shows up occasionally (depends on timing). [Todd] Yes, I believe address in use was the problem w/ FLETest. I assumed it was a timing issue w/ respect to test A not fully releasing resources before test B started. branch-3.2 $ ant test [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest FAILED (crashed) [junit] Test org.apache.zookeeper.test.HierarchicalQuorumTest FAILED Test logs for these two tests attached. This is unusual though - looking at the log it seems that the JVM itself crashed for the QPMainTest! for HQT we are seeing: junit.framework.AssertionFailedError: Threads didn't join which Flavio mentioned to me once is possible to happen but not a real problem (he can elaborate). What version of java are you using? OS, other environment that might be interesting? (vm? etc...) You might try looking at the jvm crash dump file (I think it's in /tmp) [Todd] --- $ uname -a Linux TODDG01LT 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC 2009 x86_64 GNU/Linux $ which java /home/toddg/bin/x64/java/jdk1.6.0_13/bin/java $ java -version java version 1.6.0_13 Java(TM) SE Runtime Environment (build 1.6.0_13-b03) Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) Memory = 4GB [Todd] --- If you run each of these two tests individually do they run? 
example: ant -Dtestcase=FLENewEpochTest test-core-java [Todd] Will try this once my local build is working and report back. I'll open a separate mail thread on applying patches. My goal here is to get to a known state (all tests succeeding or have workarounds for the failures). Following that, I plan to apply the patches Flavio recommended for a WAN deploy (479 and 481). After I verify that the tests continue to run, I'll package this up and deploy it to our WAN for testing. Sounds like a good plan. So, are these known issues? Do the tests normally run en masse, or do some of the tests hold on to resources and prevent other tests from passing? Typically they do run to completion, but occasionally on my machine (java 1.6, linux32bit, 1.6g single core cpu, 1gigmem) I'll get some random failure due to address in use, or the same didn't join that you saw. Usually I see this if I'm multitasking (vs just letting the tests run w/o using the box). As I said this is addressed in 3.3 (address reuse at the very least, and I haven't see the other issues). Patrick
RE: test failures in branch-3.2
Patrick/Flavio - Starting w/ branch-3.2 (no changes) I applied patches in this order: 1. Apply ZOOKEEPER-479.patch. Builds, but HierarchicalQuorumTest fails. 2. Apply ZOOKEEPER-481.patch. Fails to build, b/c of missing file - PortAssignment.java. PortAssignment.java was added by Patrick as part of ZOOKEEPER-473.patch, which is a pretty hefty patch ( 2k lines) and touches a large number of files. 3. Apply ZOOKEEPER-473.patch. Builds, but QuorumPeerMainTest fails (jvm crashes). [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest [junit] Running org.apache.zookeeper.server.quorum.QuorumPeerMainTest [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec [junit] Test org.apache.zookeeper.server.quorum.QuorumPeerMainTest FAILED (crashed) Test Log Testsuite: org.apache.zookeeper.server.quorum.QuorumPeerMainTest Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec Testcase: testBadPeerAddressInQuorum took 0.004 sec Caused an ERROR Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit. junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit. -Todd -Original Message- From: Patrick Hunt [mailto:ph...@apache.org] Sent: Thursday, July 30, 2009 10:13 PM To: zookeeper-user@hadoop.apache.org Subject: Re: test failures in branch-3.2 Todd Greenwood wrote: [Todd] Yes, I believe address in use was the problem w/ FLETest. I assumed it was a timing issue w/ respect to test A not fully releasing resources before test B started. Might be, but actually I think it's related to this: http://hea-www.harvard.edu/~fine/Tech/addrinuse.html Patrick
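As a footnote to the "address in use" discussion: one common cause of that error between back-to-back tests is a listening port that the previous test has only just released. Two usual mitigations in Java are to let the OS pick a free port by binding to port 0, and to set SO_REUSEADDR on a server socket before binding it. The sketch below shows both using only java.net. PortSketch and its method names are made up for this example, and this is not how the ZooKeeper test framework (or the PortAssignment class mentioned above) is actually implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Illustration only; class and method names are hypothetical. */
final class PortSketch {

    /** Asks the OS for a currently unused port by binding to port 0. */
    static int findFreePort() {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens a loopback listening socket with SO_REUSEADDR set before bind. */
    static ServerSocket listenReusable(int port) {
        ServerSocket server = null;
        try {
            server = new ServerSocket();      // created unbound
            server.setReuseAddress(true);     // must be set before bind()
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return server;
        } catch (IOException e) {
            if (server != null) {
                try {
                    server.close();
                } catch (IOException ignored) {
                    // nothing useful to do if close fails here
                }
            }
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that a port returned by findFreePort() can still be taken by another process before it is used, so this narrows the window for collisions rather than closing it; giving each test its own distinct port avoids the race altogether.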
Zookeeper WAN Configuration
Like most folks, our WAN is composed of various zones, some central processing, some edge, some corp, and some in between (DMZs). In this model, a given Zookeeper server will not have direct connectivity to all of its peers in the ensemble due to various security constraints. Is this a problem? Are there special configurations for this model?

Given 3 zones:

  A -- B
  B -- C

A cannot see C, and vice versa. B can see A and C.

1. Will zookeeper servers function properly even if a given set of servers can only see some of the servers in the ensemble? For example, the shared config lists all zk servers in A, B, and C, but A can only see B, C can only see B, and B can see both A and C.

2. Will zookeeper servers flood the log with error messages if only a subset of the ensemble members are visible?

3. Will the zk ensemble function properly if the config used by each server only lists the servers in the ensemble that are visible? Suppose that A has a config that only lists servers in A and B, C a config for C and B, and B has a config that lists servers in A, B, and C. Is this the recommended approach?

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html
RE: Zookeeper WAN Configuration
Flavio, Ted, thank you for your comments.

So it sounds like the only way to currently deploy to the WAN is to deploy ZK Servers to the central DC and open up client connections to these ZK servers from the edge nodes. True?

In the future, once the Observers feature is implemented, then we should be able to deploy zk servers to both the DC and to the pods...with all the goodness that Flavio mentions below.

Flavio - do you have a doc that describes exactly what happens in the transaction of a write operation? For instance, I'd like to know at exactly what stage a write has been committed to the ensemble, and not just the zk server the client is connected to. I figure it must be something like:

clientA.write(path, value)
- serverA writes to memory
- serverA writes to transacted disk every n/seconds or m/bytes
- serverA sends write to Leader
- Leader stamps with transaction id
- Leader responds to ensemble with update + transaction id

-Todd

-Original Message-
From: Flavio Junqueira [mailto:f...@yahoo-inc.com]
Sent: Friday, July 24, 2009 4:50 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Zookeeper WAN Configuration

Just a few quick observations:

On Jul 24, 2009, at 4:40 PM, Ted Dunning wrote:

On Fri, Jul 24, 2009 at 4:23 PM, Todd Greenwood to...@audiencescience.com wrote:

Could you explain the idea behind the Observers feature, what this concept is supposed to address, and how it applies to the WAN configuration problem in particular?

Not really. I am just echoing comments on observers from them that know.

Without observers, increasing the number of servers in an ensemble enables higher read throughput, but causes write throughput to drop because the number of votes to order each write operation increases. Essentially, observers are zookeeper servers that don't vote when ordering updates to the zookeeper state.
Adding observers enables higher read throughput with minimal effect on write throughput (the leader still has to send commits to everyone, at least in the version we have been working on).

The ideas for federating ZK or allowing observers would likely do what you want. I can imagine that an observer would only care that it can see its local peers and one of the observers would be elected to get updates (and thus would care about the central service).

This certainly sounds like exactly what I want...Was this introduced in 3.2 in full, or only partially?

I don't think it is even in trunk yet. Look on Jira or at the recent logs of this mailing list.

It is not on trunk yet.

-Flavio
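Flavio's point about votes can be put into numbers. Assuming the plain majority rule (more than half of the voting servers must acknowledge each write), the number of acknowledgements needed grows with the number of voters but not with the number of observers. The tiny sketch below only does that arithmetic; QuorumSizeSketch is a made-up name, and it ignores weights and groups entirely.

```java
/** Arithmetic illustration only; assumes a plain majority quorum. */
final class QuorumSizeSketch {

    /** Smallest number of voting servers that forms a strict majority. */
    static int majority(int voters) {
        if (voters < 1) {
            throw new IllegalArgumentException("need at least one voter");
        }
        return voters / 2 + 1;
    }

    public static void main(String[] args) {
        // 9 servers, all voting: each write needs 5 acknowledgements.
        System.out.println("9 voters            -> " + majority(9)); // prints 5
        // 5 voters plus 4 observers: observers do not vote, so still 3.
        System.out.println("5 voters + 4 observ -> " + majority(5)); // prints 3
    }
}
```

So an ensemble of five voters with four observers orders writes with three acknowledgements, the same as a plain five-server ensemble, while serving reads from nine machines. As noted above, the leader still has to send commits to every server, so the write path is not entirely unaffected.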
RE: Leader Elections
Flavio, Ted, Henry, Scott, this would work perfectly well for my use case provided:

SINGLE ENSEMBLE:
GROUP A : ZK Servers w/ read/write AND Leader Elections
GROUP B : ZK Servers w/ read/write W/O Leader Elections

So, we can craft this via Observers and Hierarchical Quorum groups? Great. Problem solved. When will this be production ready? :o)

Scott brought up a multi-feature that is very interesting for me. Namely:

1. Offline ZK servers that sync and merge on reconnect

The offline servers idea seems conceptually simple, it's kind of like a messaging system. However, the merge and resolve step when two servers reconnect might be challenging. Cool idea though.

2. Partial memory graph subscriptions

The second idea is partial memory graph subscriptions. This would enable virtual ensembles to interact on the same physical ensemble. For my use case, this would prevent unnecessary cross talk between nodes on a WAN, allowing me to define the subsets of the memory graph that need to be replicated, and to whom. This would be a huge scalability win for WAN use cases.

-Todd

-Original Message-
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Monday, July 20, 2009 11:00 AM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Leader Elections

Observers would be awesome especially with a couple enhancements / extensions:

An option for the observers to enter a special state if the WAN link goes down to the master cluster. A read-only option would be great. However, allowing certain types of writes to continue on a limited basis would be highly valuable as well. An observer could own a special node and its subnodes. Only these subnodes would be writable by the observer when there was a session break to the master cluster, and the master cluster would take all the changes when the link is reestablished. Essentially, it is a portion of the hierarchy that is writable only by a specific observer, and read-only for others.
The purpose of this would be for when the WAN link goes down to the master ZKs for certain types of use cases - status updates or other changes local to the observer that are strictly read-only outside the Observer's 'realm'. On 7/19/09 12:16 PM, Henry Robinson he...@cloudera.com wrote: You can. See ZOOKEEPER-368 - at first glance it sounds like observers will be a good fit for your requirements. Do bear in mind that the patch on the jira is only for discussion purposes; I would not consider it currently fit for production use. I hope to put up a much better patch this week. Henry On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you submit updates via an observer? On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira f...@yahoo-inc.com wrote: 2- Observers: you could have one computing center containing an ensemble and observers around the edge just learning committed values. -- Ted Dunning, CTO DeepDyve
RE: Leader Elections
Henry, cool. When your patch is ready for testing, I'll devote some time to take a test pass on it.

-Original Message-
From: Henry Robinson [mailto:he...@cloudera.com]
Sent: Monday, July 20, 2009 2:54 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: Leader Elections

On Mon, Jul 20, 2009 at 7:50 PM, Todd Greenwood to...@audiencescience.com wrote:

Flavio, Ted, Henry, Scott, this would work perfectly well for my use case provided:

SINGLE ENSEMBLE:
GROUP A : ZK Servers w/ read/write AND Leader Elections
GROUP B : ZK Servers w/ read/write W/O Leader Elections

So, we can craft this via Observers and Hierarchical Quorum groups? Great. Problem solved. When will this be production ready? :o)

Looks to me like you don't even need hierarchical quorums for this - make everyone in group B an Observer and you're done.

I've been working on this feature. Recently we've been discussing a proof-of-concept patch on the JIRA. I have nearly finished a less rough patch which I will submit for discussion and potentially commit this week. At that point it would be extremely helpful if you could help test the patch, and you can start considering it for production. To get into trunk I will have to write a comprehensive test suite and update the documentation, and then making sure all the boxes are ticked and no regressions are thrown up can take a little while.

Henry

Scott brought up a multi-feature that is very interesting for me. Namely:

1. Offline ZK servers that sync and merge on reconnect

The offline servers idea seems conceptually simple, it's kind of like a messaging system. However, the merge and resolve step when two servers reconnect might be challenging. Cool idea though.

2. Partial memory graph subscriptions

The second idea is partial memory graph subscriptions. This would enable virtual ensembles to interact on the same physical ensemble.
For my use case, this would prevent unnecessary cross talk between nodes on a WAN, allowing me to define the subsets of the memory graph that need to be replicated, and to whom. This would be a huge scalability win for WAN use cases. -Todd -Original Message- From: Scott Carey [mailto:sc...@richrelevance.com] Sent: Monday, July 20, 2009 11:00 AM To: zookeeper-user@hadoop.apache.org Subject: Re: Leader Elections Observers would be awesome especially with a couple enhancements / extensions: An option for the observers to enter a special state if the WAN link goes down to the master cluster. A read-only option would be great. However, allowing certain types of writes to continue on a limited basis would be highly valuable as well. An observer could own a special node and its subnodes. Only these subnodes would be writable by the observer when there was a session break to the master cluster, and the master cluster would take all the changes when the link is reestablished. Essentially, it is a portion of the hierarchy that is writable only by a specitfic observer, and read-only for others. The purpose of this would be for when the WAN link goes down to the master ZKs for certain types of use cases - status updates or other changes local to the observer that are strictly read-only outside the Observer's 'realm'. On 7/19/09 12:16 PM, Henry Robinson he...@cloudera.com wrote: You can. See ZOOKEEPER-368 - at first glance it sounds like observers will be a good fit for your requirements. Do bear in mind that the patch on the jira is only for discussion purposes; I would not consider it currently fit for production use. I hope to put up a much better patch this week. Henry On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you submit updates via an observer? 
On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira f...@yahoo-inc.com wrote: 2- Observers: you could have one computing center containing an ensemble and observers around the edge just learning committed values. -- Ted Dunning, CTO DeepDyve