[jira] Commented: (ZOOKEEPER-822) Leader election taking a long time to complete
[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890199#action_12890199 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-822:

Hi Vishal, Do you think you can upload all three log files for a problematic run? We would like to put them on loggraph to visualize what's going on there. It sounds like it is somehow related to the VM reboots, but I don't know why yet.

Leader election taking a long time to complete

Key: ZOOKEEPER-822
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
Project: Zookeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.3.0
Reporter: Vishal K
Priority: Blocker
Attachments: test_zookeeper_1.log, test_zookeeper_2.log

Created a 3 node cluster.
1. Fail the ZK leader.
2. Let leader election finish. Restart the leader and let it join the
3. Repeat.

After a few rounds leader election takes anywhere from 25 to 60 seconds to finish. Note: we didn't have any ZK clients and no new znodes were created.

zoo.cfg is shown below:

#Mon Jul 19 12:15:10 UTC 2010
server.1=192.168.4.12\:2888\:3888
server.0=192.168.4.11\:2888\:3888
clientPort=2181
dataDir=/var/zookeeper
syncLimit=2
server.2=192.168.4.13\:2888\:3888
initLimit=5
tickTime=2000

I have attached logs from two nodes that took a long time to form the cluster after failing the leader. The leader was down anyways, so logs from that node shouldn't matter. Look for START HERE. Logs after that point should be of interest.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
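For reference when reading the 25 to 60 second election times in the ZOOKEEPER-822 report, the timing values in the quoted zoo.cfg can be converted to wall-clock limits. tickTime is in milliseconds, and initLimit and syncLimit are expressed in ticks. The sketch below is only an illustration: the helper names parse_properties and quorum_timeouts_ms are made up for this example and are not part of ZooKeeper, and only the three timing keys from the report are included.

```python
# Timing values copied from the zoo.cfg quoted in the ZOOKEEPER-822 report.
ZOO_CFG = """\
tickTime=2000
initLimit=5
syncLimit=2
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments.

    Hypothetical helper for illustration; not a full Java properties parser.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def quorum_timeouts_ms(props):
    """Convert tick-based limits to milliseconds (limit * tickTime)."""
    tick = int(props["tickTime"])
    return {
        "init_ms": tick * int(props["initLimit"]),
        "sync_ms": tick * int(props["syncLimit"]),
    }


if __name__ == "__main__":
    print(quorum_timeouts_ms(parse_properties(ZOO_CFG)))
```

With the reported values, a follower gets 10 seconds to connect and sync with a new leader and 4 seconds of allowed lag afterwards, both well below the election times described in the report.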
[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-790:

Attachment: ZOOKEEPER-790.patch

This patch is very simple: it moves two function calls so that a leader only starts up and sets the last processed zxid after it has a quorum of supporters. It also includes a unit test. I had to make a few modifications here and there to be able to write this test. I tried to minimize changes as much as possible.

Last processed zxid set prematurely while establishing leadership

Key: ZOOKEEPER-790
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
Project: Zookeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Priority: Blocker
Fix For: 3.3.2, 3.4.0
Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.patch, ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2

The leader code is setting the last processed zxid to the first of the new epoch even before connecting to a quorum of followers. Because the leader code sets this value before connecting to a quorum of followers (Leader.java:281) and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, the result is that when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself.
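The epoch comparison described in ZOOKEEPER-790 can be made concrete with a little arithmetic. A zxid is a 64-bit value whose high 32 bits hold the epoch and whose low 32 bits hold a counter within that epoch. The sketch below decodes the value 158915472832, which is the "Proposed zxid" printed in a log excerpt later in this thread. The function follower_accepts is a hypothetical stand-in for the kind of check the description attributes to Follower.java; it is not the actual ZooKeeper code, and the other function names are likewise invented for illustration.

```python
EPOCH_SHIFT = 32           # the epoch lives in the high 32 bits of a zxid
COUNTER_MASK = 0xFFFFFFFF  # the per-epoch counter lives in the low 32 bits


def split_zxid(zxid):
    """Return (epoch, counter) for a 64-bit zxid."""
    return zxid >> EPOCH_SHIFT, zxid & COUNTER_MASK


def first_zxid_of_epoch(epoch):
    """First zxid of an epoch: the epoch in the high bits, counter zero."""
    return epoch << EPOCH_SHIFT


def follower_accepts(leader_zxid, follower_last_zxid):
    """Hypothetical check: reject a leader whose epoch is behind our own."""
    return split_zxid(leader_zxid)[0] >= split_zxid(follower_last_zxid)[0]


if __name__ == "__main__":
    proposed = 158915472832  # "Proposed zxid" from the log later in the thread
    epoch, counter = split_zxid(proposed)
    print(hex(proposed), epoch, counter)

    # A would-be leader that bumps its last processed zxid to the first zxid
    # of epoch + 1 before it has a quorum, then falls back to following a
    # leader that is still in the old epoch, fails this check.
    premature = first_zxid_of_epoch(epoch + 1)
    print(follower_accepts(proposed, premature))
```

Under these assumptions the proposed zxid decodes to epoch 37 (0x25) and counter 1682880, and the final check prints False, which corresponds to the case where the false leader "finds a smaller epoch and kills itself".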
[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-790:

Status: Patch Available (was: Open)
[jira] Commented: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper
[ https://issues.apache.org/jira/browse/ZOOKEEPER-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890436#action_12890436 ]

Patrick Hunt commented on ZOOKEEPER-816:

You should also consider using Avro for the marshalling/unmarshalling of the records. http://avro.apache.org/ Lots of benefits; in particular, it's cross-language.

Re: writing to disk, perhaps just re-use the ZK WAL code and write to a disk that's not storing the transactional log.

Detecting and diagnosing elusive bugs and faults in Zookeeper

Key: ZOOKEEPER-816
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-816
Project: Zookeeper
Issue Type: New Feature
Reporter: Miguel Correia
Priority: Minor

Complex distributed systems like Zookeeper tend to fail in strange ways that are hard to diagnose. The objective is to build a tool that helps understand when and where these problems occurred based on Zookeeper's traces (i.e., logs in TRACE level). Minor changes to the server code will be needed.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890451#action_12890451 ]

Hadoop QA commented on ZOOKEEPER-790:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12449972/ZOOKEEPER-790.patch
against trunk revision 963957.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 6 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    +1 findbugs. The patch does not introduce any new Findbugs warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/150/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/150/console

This message is automatically generated.
[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flavio Paiva Junqueira updated ZOOKEEPER-790:

Attachment: ZOOKEEPER-790-3.3.patch

Uploading patch for the 3.3 branch. I have also checked the results of Hudson, and I couldn't find any Java test failure. The -1 on core tests seems to be unrelated to this patch.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890512#action_12890512 ]

Travis Crawford commented on ZOOKEEPER-790:

I tested this patch on a build with the following, applied in the listed order:

3.3.1 release + ZOOKEEPER-744.patch + ZOOKEEPER-790-3.3.patch

Looks good!

{code}
2010-07-20 23:43:34,229 - INFO [Thread-2545:nioserverc...@1516] - Closed socket connection for client /10.209.21.181:53743 (no session established for client)
2010-07-20 23:43:34,659 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@639] - EndOfStreamException: Unable to read additional data from client sessionid 0x129d3fcb5a6f60d, likely client has closed socket
2010-07-20 23:43:34,660 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@1516] - Closed socket connection for client /10.209.21.204:59727 which had sessionid 0x129d3fcb5a6f60d
2010-07-20 23:43:34,684 - INFO [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x329d3fcb6594e53
2010-07-20 23:52:14,522 - INFO [main:quorumpeercon...@90] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2010-07-20 23:52:14,529 - INFO [main:quorumpeercon...@287] - Defaulting to majority quorums
2010-07-20 23:52:14,540 - INFO [main:quorumpeerm...@119] - Starting quorum peer
2010-07-20 23:52:14,562 - INFO [main:nioservercnxn$fact...@149] - binding to port 0.0.0.0/0.0.0.0:2181
2010-07-20 23:52:14,578 - INFO [main:quorump...@818] - tickTime set to 2000
2010-07-20 23:52:14,579 - INFO [main:quorump...@829] - minSessionTimeout set to -1
2010-07-20 23:52:14,579 - INFO [main:quorump...@840] - maxSessionTimeout set to -1
2010-07-20 23:52:14,579 - INFO [main:quorump...@855] - initLimit set to 10
2010-07-20 23:52:14,798 - INFO [main:files...@82] - Reading snapshot /data/zookeeper/version-2/snapshot.2500197ee5
2010-07-20 23:52:15,660 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@256] - Accepted socket connection from /10.209.45.76:57030
2010-07-20 23:52:15,661 - INFO [Thread-1:quorumcnxmanager$liste...@436] - My election bind port: 3888
2010-07-20 23:52:15,663 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@644] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2010-07-20 23:52:15,664 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@1516] - Closed socket connection for client /10.209.45.76:57030 (no session established for client)
2010-07-20 23:52:15,670 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-07-20 23:52:15,672 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 158915472832
2010-07-20 23:52:15,674 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 158915472832, 1, 1, LOOKING, LOOKING, 1
2010-07-20 23:52:15,674 - INFO [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (2, 1)
2010-07-20 23:52:15,675 - INFO [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (3, 1)
2010-07-20 23:52:15,676 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 2, 158915472832, 5, 1, LOOKING, LOOKING, 2
2010-07-20 23:52:15,676 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 3, 158915472832, 5, 1, LOOKING, LOOKING, 2
2010-07-20 23:52:15,676 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@711] - Updating proposal
2010-07-20 23:52:15,677 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 158915472832, 5, 1, LOOKING, FOLLOWING, 2
2010-07-20 23:52:15,677 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 3, 158915472832, 5, 1, LOOKING, LOOKING, 3
2010-07-20 23:52:15,879 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-07-20 23:52:15,885 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:lear...@72] - TCP NoDelay set to: true
2010-07-20 23:52:15,893 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT
2010-07-20 23:52:15,893 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:host.name=sjc1k029.twitter.com
2010-07-20 23:52:15,894 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.version=1.6.0_16
2010-07-20 23:52:15,894 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.vendor=Sun Microsystems Inc.
2010-07-20 23:52:15,894 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.home=/usr/java/jdk1.6.0_16/jre
2010-07-20 23:52:15,894 - INFO
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890513#action_12890513 ]

Travis Crawford commented on ZOOKEEPER-790:

Just to double-check, I'm really really sure the running jar was freshly built. For example, lsof says:

java 4473 root mem REG 104,1 1012397 10092999 /usr/local/zookeeper/zookeeper-3.3.2-dev.jar

Yes, the version number is 3.3.2, but this was built from the 3.3.1 release. Looking in the log we see:

2010-07-20 23:52:15,893 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT

Should that say built this afternoon? I've double- and triple-checked and believe this is a newly built jar. Looking in the tarball, we don't see zookeeper-3.3.2-dev.jar until after building.
Hudson build is back to normal : ZooKeeper-trunk #881
See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/881/
[jira] Commented: (ZOOKEEPER-719) Add throttling to BookKeeper client
[ https://issues.apache.org/jira/browse/ZOOKEEPER-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890523#action_12890523 ]

Hudson commented on ZOOKEEPER-719:

Integrated in ZooKeeper-trunk #881 (See [http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/881/])

Add throttling to BookKeeper client

Key: ZOOKEEPER-719
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-719
Project: Zookeeper
Issue Type: Bug
Components: contrib-bookkeeper
Affects Versions: 3.3.0
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Fix For: 3.4.0
Attachments: ZOOKEEPER-719.patch, ZOOKEEPER-719.patch, ZOOKEEPER-719.patch, ZOOKEEPER-719.patch

Add throttling to client to control the rate of operations to bookies.
[jira] Commented: (ZOOKEEPER-712) Bookie recovery
[ https://issues.apache.org/jira/browse/ZOOKEEPER-712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890524#action_12890524 ]

Hudson commented on ZOOKEEPER-712:

Integrated in ZooKeeper-trunk #881 (See [http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/881/])

Bookie recovery

Key: ZOOKEEPER-712
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-712
Project: Zookeeper
Issue Type: New Feature
Components: contrib-bookkeeper
Reporter: Flavio Paiva Junqueira
Assignee: Erwin Tam
Fix For: 3.4.0
Attachments: ZOOKEEPER-712.patch

Recover the ledger fragments of a bookie once it crashes.
[jira] Commented: (ZOOKEEPER-799) Add tools and recipes for monitoring as a contrib
[ https://issues.apache.org/jira/browse/ZOOKEEPER-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890525#action_12890525 ]

Hudson commented on ZOOKEEPER-799:

Integrated in ZooKeeper-trunk #881 (See [http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/881/])

Add tools and recipes for monitoring as a contrib

Key: ZOOKEEPER-799
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-799
Project: Zookeeper
Issue Type: New Feature
Components: contrib
Reporter: Andrei Savu
Assignee: Andrei Savu
Fix For: 3.4.0
Attachments: monitoring.tar.gz, ZOOKEEPER-799.patch

Tools and Recipes for Monitoring ZooKeeper using Cacti, Nagios or Ganglia.