[jira] Commented: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889426#action_12889426 ] Hadoop QA commented on ZOOKEEPER-821: - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12449737/ZOOKEEPER-821.patch against trunk revision 963957. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/148/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/148/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/148/console This message is automatically generated. > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Assignee: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rich Schumacher updated ZOOKEEPER-821: -- Status: Patch Available (was: Open) Added a revised patch that includes a corresponding test case. > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Assignee: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rich Schumacher updated ZOOKEEPER-821: -- Attachment: (was: ZOOKEEPER-821.patch) > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Assignee: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rich Schumacher updated ZOOKEEPER-821: -- Attachment: ZOOKEEPER-821.patch Updated with a corresponding test case. > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Assignee: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt updated ZOOKEEPER-821: --- Status: Open (was: Patch Available) I'd suggest adding a check for the version in src/contrib/zkpython/src/test/connection_test.py you probably just want to check there's a properly formatted version string, don't need to check the particular version imo. > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Assignee: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
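A format-only check along the lines suggested above could look roughly like the following. This is a minimal sketch, not the test from the attached patch: the helper name is_well_formed_version and the assumption that the version is a dotted numeric string such as "3.4.0" are both illustrative.

```python
import re

# Assumed format: dotted numeric components such as "3.4.0". Adjust the
# pattern if the bindings report something else (for example a suffix).
VERSION_PATTERN = re.compile(r"^\d+(\.\d+)+$")


def is_well_formed_version(version):
    """Return True if `version` looks like a dotted numeric version string."""
    return isinstance(version, str) and VERSION_PATTERN.match(version) is not None


# Hypothetical use inside a test case, assuming the module exposes __version__:
#   import zookeeper
#   self.assertTrue(is_well_formed_version(zookeeper.__version__))
```

Checking only the shape of the string means the test does not have to be updated for every release.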
[jira] Assigned: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Hunt reassigned ZOOKEEPER-821: -- Assignee: Rich Schumacher > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Assignee: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889383#action_12889383 ] Rich Schumacher commented on ZOOKEEPER-821: --- I'm not sure that this warrants a corresponding test case, but let me know if it needs one. > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889372#action_12889372 ] Hadoop QA commented on ZOOKEEPER-821: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12449716/ZOOKEEPER-821.patch against trunk revision 963957. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/147/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/147/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/147/console This message is automatically generated. > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. 
It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rich Schumacher updated ZOOKEEPER-821: -- Status: Patch Available (was: Open) > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
[ https://issues.apache.org/jira/browse/ZOOKEEPER-821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rich Schumacher updated ZOOKEEPER-821: -- Attachment: ZOOKEEPER-821.patch Adds the ZooKeeper server version as the '__version__' attribute in the Python module. > Add ZooKeeper version information to zkpython > - > > Key: ZOOKEEPER-821 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 > Project: Zookeeper > Issue Type: Improvement > Components: contrib-bindings >Affects Versions: 3.3.1 >Reporter: Rich Schumacher >Priority: Trivial > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-821.patch > > > Since installing and using ZooKeeper I've built and installed no less than > four versions of the zkpython bindings. It would be really helpful if the > module had a '__version__' attribute to easily tell which version is > currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-821) Add ZooKeeper version information to zkpython
Add ZooKeeper version information to zkpython - Key: ZOOKEEPER-821 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-821 Project: Zookeeper Issue Type: Improvement Components: contrib-bindings Affects Versions: 3.3.1 Reporter: Rich Schumacher Priority: Trivial Fix For: 3.4.0 Since installing and using ZooKeeper I've built and installed no less than four versions of the zkpython bindings. It would be really helpful if the module had a '__version__' attribute to easily tell which version is currently in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model
[ https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889280#action_12889280 ] Abmar Barros commented on ZOOKEEPER-702: Hi Diogo! Thank you for pointing out this paper; I had not come across this type of failure detection so far, and it is very interesting. It proposes a simple way of estimating heartbeat arrival times based on application messages. However, it does require attaching the sending time to all application messages (or at least to the ones used for the estimation), which adds overhead to the message size. Anyway, with the separate failure detector module, it would be easy to implement a new failure detector that uses such data. So far, I have adapted the proposed failure detectors to compute the estimated next arrival time only when a heartbeat is received. > GSoC 2010: Failure Detector Model > - > > Key: ZOOKEEPER-702 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702 > Project: Zookeeper > Issue Type: Wish >Reporter: Henry Robinson >Assignee: Abmar Barros > Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, > chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, > ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, > ZOOKEEPER-702.patch > > > Failure Detector Module > Possible Mentor > Henry Robinson (henry at apache dot org) > Requirements > Java, some distributed systems knowledge, comfort implementing distributed > systems protocols > Description > ZooKeeper servers detect the failure of other servers and clients by > counting the number of 'ticks' for which they don't get a heartbeat from > other machines. This is the 'timeout' method of failure detection and works > very well; however it is possible that it is too aggressive and not easily > tuned for some more unusual ZooKeeper installations (such as in a wide-area > network, or even in a mobile ad-hoc network). 
> This project would abstract the notion of failure detection to a dedicated > Java module, and implement several failure detectors to compare and contrast > their appropriateness for ZooKeeper. For example, Apache Cassandra uses a > phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which > is much more tunable and has some very interesting properties. This is a > great project if you are interested in distributed algorithms, or want to > help re-factor some of ZooKeeper's internal code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
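To make the idea of recomputing the estimated next arrival time only when a heartbeat is received more concrete, here is a deliberately simplified sketch. It is not the Chen, Bertier or phi-accrual estimator and not code from the ZOOKEEPER-702 patches; the class name, the sliding-window mean and the safety_margin parameter are all assumptions made for illustration.

```python
from collections import deque


class ArrivalEstimator:
    """Estimates the next heartbeat arrival from recent arrival times."""

    def __init__(self, window=100, safety_margin=0.0):
        # Only the most recent `window` arrival times are kept.
        self._arrivals = deque(maxlen=window)
        self._safety_margin = safety_margin
        self.next_expected = None  # unknown until two heartbeats were seen

    def heartbeat(self, arrival_time):
        """Record a heartbeat and refresh the estimate (the only update point)."""
        self._arrivals.append(arrival_time)
        if len(self._arrivals) >= 2:
            intervals = len(self._arrivals) - 1
            mean_interval = (self._arrivals[-1] - self._arrivals[0]) / intervals
            self.next_expected = arrival_time + mean_interval + self._safety_margin

    def is_suspected(self, now):
        """A peer is suspected once `now` is past the estimated arrival."""
        return self.next_expected is not None and now > self.next_expected
```

Because the estimate is cached in next_expected, checking whether a peer is suspected is a single comparison and does no work between heartbeats.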
[jira] Commented: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper
[ https://issues.apache.org/jira/browse/ZOOKEEPER-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889264#action_12889264 ] Patrick Hunt commented on ZOOKEEPER-816: Typically log4j logs are expected to be consumed by users (e.g. for debugging); here you're really talking about output that will be machine processed. Is this really logging in the typical log4j sense? Perhaps a separate mechanism should be used. Another problem with log4j logging is that changes to the message format, say to make it more "readable", can cause problems for the downstream processor. Not to mention that a separate mechanism could be made much more efficient than the general log4j log mechanism. Perhaps ZK's own WAL could be used for this: reuse the WAL code or write the information directly to the ZK transaction log (might not be such a good idea, but should be considered as an option). Additionally, you should consider something like Aspects for this project. This is a cross-cutting feature, something that aspects are well suited for. A benefit of this approach is that it would allow those interested in the feature to enable it while those uninterested would incur zero performance penalty. > Detecting and diagnosing elusive bugs and faults in Zookeeper > - > > Key: ZOOKEEPER-816 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-816 > Project: Zookeeper > Issue Type: New Feature >Reporter: Miguel Correia >Priority: Minor > > Complex distributed systems like Zookeeper tend to fail in strange ways that > are hard to diagnose. The objective is to build a tool that helps understand > when and where these problems occurred based on Zookeeper's traces (i.e., > logs in TRACE level). Minor changes to the server code will be needed.
Re: [jira] Commented: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper
Hi Ivan, good stuff, please add this as a comment on the jira. Patrick On 07/16/2010 08:24 AM, Ivan Kelly wrote: Zookeeper's traces (i.e., logs in TRACE level) provide some information that can be helpful to understand what happened. For instance, they contain information about the clients that are connected, the operations issued, etc. However, in real deployments with many clients (say, hundreds), traces are typically turned off to avoid the high overhead that they cause. Furthermore, the data in the traces is probably not enough for our purposes because it does not include, e.g., the replies to operations or the data values. As far as I've seen, this overhead comes in two forms, CPU and disk. CPU overhead is mostly due to formatting. Disk overhead, obviously, is because tracing will fill your disk fairly quickly. Perhaps something could be done to combat both of these. To fix the formatting problem we could use a binary log format. I've seen this done in C++ but not in Java. The basic idea is that if you have TRACE("operation %x happened to %s %p", obj1, obj2, obj3); a preprocessor replaces this with TRACE(0x1234, obj1, obj2, obj3) where 0x1234 is an identifier for the trace. Then when the trace occurs a binary blob [0x1234, value of obj1, value of obj2, value of obj3] is logged. Then when the logs are pulled off the machine you run a post processor to do all the formatting and you get your full trace. Regarding the disk overhead, traces are usually only interesting in the run up to a failure. We could have a ring buffer in memory that is constantly traced to, old traces being overwritten when the ring buffer reaches its limit. These traces should only be dumped to the filesystem when an error or fatal level event occurs, thereby giving you a trace of what was happening before you fell over. -Ivan
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889225#action_12889225 ] Vishal K commented on ZOOKEEPER-790: great! I will give it a try. Thanks. > Last processed zxid set prematurely while establishing leadership > - > > Key: ZOOKEEPER-790 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 > Project: Zookeeper > Issue Type: Bug > Components: quorum >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira >Priority: Blocker > Fix For: 3.3.2, 3.4.0 > > Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.patch, > ZOOKEEPER-790.travis.log.bz2 > > > The leader code is setting the last processed zxid to the first of the new > epoch even before connecting to a quorum of followers. Because the leader > code sets this value before connecting to a quorum of followers > (Leader.java:281) and the follower code throws an IOException > (Follower.java:73) if the leader epoch is smaller, we have that when the > false leader drops leadership and becomes a follower, it finds a smaller > epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Flavio Paiva Junqueira updated ZOOKEEPER-790: - Attachment: ZOOKEEPER-790.patch Hi! I'm attaching a patch that works for my setup (and passes the unit tests as well), so I would appreciate if you could give it a try. > Last processed zxid set prematurely while establishing leadership > - > > Key: ZOOKEEPER-790 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 > Project: Zookeeper > Issue Type: Bug > Components: quorum >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira >Priority: Blocker > Fix For: 3.3.2, 3.4.0 > > Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.patch, > ZOOKEEPER-790.travis.log.bz2 > > > The leader code is setting the last processed zxid to the first of the new > epoch even before connecting to a quorum of followers. Because the leader > code sets this value before connecting to a quorum of followers > (Leader.java:281) and the follower code throws an IOException > (Follower.java:73) if the leader epoch is smaller, we have that when the > false leader drops leadership and becomes a follower, it finds a smaller > epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (ZOOKEEPER-820) update c unit tests to ensure "zombie" java server processes don't cause failure
update c unit tests to ensure "zombie" java server processes don't cause failure Key: ZOOKEEPER-820 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.1 Reporter: Patrick Hunt Priority: Critical Fix For: 3.3.2, 3.4.0 When the C unit tests are run, the server sometimes doesn't shut down at the end of the test, which causes subsequent tests (Hudson especially) to fail. 1) We should try harder to make the server shut down at the end of the test; I suspect this is related to test failing/cleanup. 2) Before the tests are run, we should check whether the old server is still running and try to shut it down.
[jira] Commented: (ZOOKEEPER-820) update c unit tests to ensure "zombie" java server processes don't cause failure
[ https://issues.apache.org/jira/browse/ZOOKEEPER-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889216#action_12889216 ] Patrick Hunt commented on ZOOKEEPER-820: Mahadev suggested this for item 2 from the description: > I think we could use some more standard tools like netstat to get the process > using that port and try killing it. This would be less error prone and more > reliable than other tools. > update c unit tests to ensure "zombie" java server processes don't cause > failure > > > Key: ZOOKEEPER-820 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-820 > Project: Zookeeper > Issue Type: Bug >Affects Versions: 3.3.1 >Reporter: Patrick Hunt >Priority: Critical > Fix For: 3.3.2, 3.4.0 > > > When the c unit tests are run sometimes the server doesn't shutdown at the > end of the test, this causes subsequent tests (hudson esp) to fail. > 1) we should try harder to make the server shut down at the end of the test, > I suspect this is related to test failing/cleanup > 2) before the tests are run we should see if the old server is still running > and try to shut it down
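As a rough illustration of the "is the old server still running" check from item 2, the sketch below only detects whether something accepts TCP connections on a port. It does not find or kill the owning process (that is where a tool like netstat would come in), and the function name port_in_use and its defaults are assumptions, not part of the test harness.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

A test setup step could call this with the port the test server is expected to use and attempt cleanup only when it returns True.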
Re: [jira] Commented: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper
Zookeeper's traces (i.e., logs in TRACE level) provide some information that can be helpful to understand what happened. For instance, they contain information about the clients that are connected, the operations issued, etc. However, in real deployments with many clients (say, hundreds), traces are typically turned off to avoid the high overhead that they cause. Furthermore, the data in the traces is probably not enough for our purposes because it does not include, e.g., the replies to operations or the data values. As far as I've seen, this overhead comes in two forms, CPU and disk. CPU overhead is mostly due to formatting. Disk overhead, obviously, is because tracing will fill your disk fairly quickly. Perhaps something could be done to combat both of these. To fix the formatting problem we could use a binary log format. I've seen this done in C++ but not in Java. The basic idea is that if you have TRACE("operation %x happened to %s %p", obj1, obj2, obj3); a preprocessor replaces this with TRACE(0x1234, obj1, obj2, obj3) where 0x1234 is an identifier for the trace. Then when the trace occurs a binary blob [0x1234, value of obj1, value of obj2, value of obj3] is logged. Then when the logs are pulled off the machine you run a post processor to do all the formatting and you get your full trace. Regarding the disk overhead, traces are usually only interesting in the run up to a failure. We could have a ring buffer in memory that is constantly traced to, old traces being overwritten when the ring buffer reaches its limit. These traces should only be dumped to the filesystem when an error or fatal level event occurs, thereby giving you a trace of what was happening before you fell over. -Ivan
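The two ideas above (deferring formatting by logging a trace identifier plus raw values, and keeping only recent traces in a ring buffer) can be sketched together as follows. This is an in-memory illustration under stated assumptions, not a proposal for the actual ZooKeeper code: the TRACE_FORMATS table stands in for what a preprocessor would generate, and TraceRing, trace and dump are hypothetical names.

```python
from collections import deque

# Hypothetical table mapping trace identifiers to format strings; in the
# scheme described above a preprocessor would generate it.
TRACE_FORMATS = {
    0x1234: "operation %x happened to %s %s",
}


class TraceRing:
    """Fixed-size in-memory buffer of unformatted trace records."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest record once it is full.
        self._records = deque(maxlen=capacity)

    def trace(self, trace_id, *args):
        # Hot path: store the identifier and raw arguments, no formatting.
        self._records.append((trace_id, args))

    def dump(self):
        """Format all buffered records, oldest first (the post-processing step)."""
        return [TRACE_FORMATS[trace_id] % args for trace_id, args in self._records]
```

In this sketch dump() would be called only when an error or fatal event occurs. Note that the buffer holds references to the arguments, so mutable objects would need to be copied or serialized at trace time to capture their values faithfully.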
[jira] Created: (ZOOKEEPER-818) improve the traces with additional information needed
improve the traces with additional information needed -- Key: ZOOKEEPER-818 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-818 Project: Zookeeper Issue Type: Sub-task Reporter: Miguel Correia Priority: Minor The current traces do not include all the information we need to do the checking. The main additions would be to log the replies and hashes of values read/written.
[jira] Created: (ZOOKEEPER-817) improve the efficiency of tracing
improve the efficiency of tracing - Key: ZOOKEEPER-817 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-817 Project: Zookeeper Issue Type: Sub-task Reporter: Miguel Correia Priority: Minor Zookeeper uses two kinds of logs: logs for information and debugging (the ones considered in this project) and transaction logs (needed for Zab/Paxos to be fault tolerant). The latter are very efficient, so the idea would be to make the former equally efficient using similar mechanisms.
[jira] Created: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper
Detecting and diagnosing elusive bugs and faults in Zookeeper - Key: ZOOKEEPER-816 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-816 Project: Zookeeper Issue Type: New Feature Reporter: Miguel Correia Priority: Minor Complex distributed systems like Zookeeper tend to fail in strange ways that are hard to diagnose. The objective is to build a tool that helps understand when and where these problems occurred based on Zookeeper's traces (i.e., logs in TRACE level). Minor changes to the server code will be needed.
[jira] Created: (ZOOKEEPER-819) build the checking tool
build the checking tool --- Key: ZOOKEEPER-819 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-819 Project: Zookeeper Issue Type: Sub-task Reporter: Miguel Correia Priority: Minor Building the checking tool is the hardest part of the project. It involves putting the traces together in a unified trace and checking if this unified trace shows that Zookeeper is satisfying a set of properties (e.g., a getData returns what was stored by the previous setData or create).
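The example property (a getData returns what was stored by the previous setData or create) could be checked over a unified trace roughly as sketched below. The (op, path, data) tuple layout and the function name check_reads are assumptions made for this illustration; real traces would first have to be merged and ordered, which is the hard part and is not shown here.

```python
def check_reads(trace):
    """Return the reads whose data differs from the last write to that path.

    `trace` is a list of (op, path, data) tuples in unified order; this
    tuple layout is an assumption made for the sketch.
    """
    last_written = {}
    violations = []
    for index, (op, path, data) in enumerate(trace):
        if op in ("create", "setData"):
            last_written[path] = data
        elif op == "getData":
            # Reads of paths with no recorded write are not judged here.
            if path in last_written and last_written[path] != data:
                violations.append((index, path, last_written[path], data))
    return violations
```

Each reported violation carries the position in the trace, the path, the expected value and the value actually read, which is the kind of "when and where" information the tool is meant to surface.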
[jira] Commented: (ZOOKEEPER-816) Detecting and diagnosing elusive bugs and faults in Zookeeper
[ https://issues.apache.org/jira/browse/ZOOKEEPER-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889202#action_12889202 ] Miguel Correia commented on ZOOKEEPER-816: -- Let me give a longer explanation of the project. Practical experience with Zookeeper has shown that sometimes there are failures whose causes are hard to understand. Some of these failures may be caused by elusive bugs in the code; others may be due to failures rarer than crashes, say corruptions of data somewhere in a server. Zookeeper's traces (i.e., logs in TRACE level) provide some information that can be helpful to understand what happened. For instance, they contain information about the clients that are connected, the operations issued, etc. However, in real deployments with many clients (say, hundreds), traces are typically turned off to avoid the high overhead that they cause. Furthermore, the data in the traces is probably not enough for our purposes because it does not include, e.g., the replies to operations or the data values. The project involves 3 subtasks: 1- improve the efficiency of logging 2- improve the traces with additional information needed 3- build the checking tool > Detecting and diagnosing elusive bugs and faults in Zookeeper > - > > Key: ZOOKEEPER-816 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-816 > Project: Zookeeper > Issue Type: New Feature >Reporter: Miguel Correia >Priority: Minor > > Complex distributed systems like Zookeeper tend to fail in strange ways that > are hard to diagnose. The objective is to build a tool that helps understand > when and where these problems occurred based on Zookeeper's traces (i.e., > logs in TRACE level). Minor changes to the server code will be needed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-702) GSoC 2010: Failure Detector Model
[ https://issues.apache.org/jira/browse/ZOOKEEPER-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889188#action_12889188 ] Diogo commented on ZOOKEEPER-702: - Hi! There is a paper by Satzger et al. that describes an easy method for using non-periodic application messages as periodic heartbeats. The paper is called "A Lazy Monitoring Approach for Heartbeat-Style Failure Detectors". I hope that is still relevant. Cheers, Diogo > GSoC 2010: Failure Detector Model > - > > Key: ZOOKEEPER-702 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-702 > Project: Zookeeper > Issue Type: Wish >Reporter: Henry Robinson >Assignee: Abmar Barros > Attachments: bertier-pseudo.txt, bertier-pseudo.txt, chen-pseudo.txt, > chen-pseudo.txt, phiaccrual-pseudo.txt, phiaccrual-pseudo.txt, > ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, ZOOKEEPER-702.patch, > ZOOKEEPER-702.patch > > > Failure Detector Module > Possible Mentor > Henry Robinson (henry at apache dot org) > Requirements > Java, some distributed systems knowledge, comfort implementing distributed > systems protocols > Description > ZooKeeper servers detect the failure of other servers and clients by > counting the number of 'ticks' for which they don't get a heartbeat from > other machines. This is the 'timeout' method of failure detection and works > very well; however it is possible that it is too aggressive and not easily > tuned for some more unusual ZooKeeper installations (such as in a wide-area > network, or even in a mobile ad-hoc network). > This project would abstract the notion of failure detection to a dedicated > Java module, and implement several failure detectors to compare and contrast > their appropriateness for ZooKeeper. For example, Apache Cassandra uses a > phi-accrual failure detector (http://ddsg.jaist.ac.jp/pub/HDY+04.pdf) which > is much more tunable and has some very interesting properties. 
This is a > great project if you are interested in distributed algorithms, or want to > help re-factor some of ZooKeeper's internal code. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889072#action_12889072 ] Flavio Paiva Junqueira commented on ZOOKEEPER-790: -- Some additional information. We create a new transaction log when we try to append a transaction and there is no log file open. Consequently, we only create it when there is something to write. For the bad run above, here is the content of the log: {noformat} ZooKeeper Transactional Log File with dbid 0 txnlog format version 2 7/15/10 5:27:45 PM CEST session 0x129d6b61b5b cxid 0x0 zxid 0x20001 closeSession EOF reached after 1 txns. {noformat} > Last processed zxid set prematurely while establishing leadership > - > > Key: ZOOKEEPER-790 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 > Project: Zookeeper > Issue Type: Bug > Components: quorum >Affects Versions: 3.3.1 >Reporter: Flavio Paiva Junqueira >Assignee: Flavio Paiva Junqueira >Priority: Blocker > Fix For: 3.3.2, 3.4.0 > > Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2 > > > The leader code is setting the last processed zxid to the first of the new > epoch even before connecting to a quorum of followers. Because the leader > code sets this value before connecting to a quorum of followers > (Leader.java:281) and the follower code throws an IOException > (Follower.java:73) if the leader epoch is smaller, we have that when the > false leader drops leadership and becomes a follower, it finds a smaller > epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.