[jira] Created: (ZOOKEEPER-759) Stop accepting connections when close to file descriptor limit
Stop accepting connections when close to file descriptor limit -- Key: ZOOKEEPER-759 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-759 Project: Zookeeper Issue Type: Improvement Components: server Reporter: Travis Crawford Zookeeper always tries to accept new connections, throwing an exception if out of file descriptors. An improvement would be denying new client connections when close to the limit. Additionally, file-descriptor limits+usage should be exported to the monitoring four-letter word, should that get implemented (see ZOOKEEPER-744). DETAILS A Zookeeper ensemble I administer recently suffered an outage when one node was restarted with the low system-default ulimit of 1024 file descriptors and later ran out. File descriptor usage+max are already being monitored by the following MBeans: - java.lang.OperatingSystem.MaxFileDescriptorCount - java.lang.OperatingSystem.OpenFileDescriptorCount They're described (rather tersely) at: http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html This feature request is for the following: (a) Stop accepting new connections when OpenFileDescriptorCount is close to MaxFileDescriptorCount, defaulting to 95% FD usage. New connections should be denied, logged to disk at debug level, and counted via a ``ConnectionDeniedCount`` MBean counter. (b) Begin accepting new connections when usage drops below some configurable threshold, defaulting to 90% FD usage, basically the high/low watermark model. (c) Update the administrator's guide with a comment about using an appropriate FD limit. (d) Extra credit: if ZOOKEEPER-744 is implemented, export statistics for: zookeeper_open_file_descriptor_count zookeeper_max_file_descriptor_count zookeeper_max_file_descriptor_mismatch - boolean, exported by leader, if not all zk's have the same max FD value -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
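The watermark behavior proposed in (a) and (b) can be sketched outside the server. A minimal illustration, assuming Linux /proc/self/fd for the open-FD count and the RLIMIT_NOFILE soft limit for the maximum; the helper names, class structure, and thresholds are placeholders for discussion, not the proposed server change:

{code}
import os
import resource

STOP_ACCEPT_PCT = 0.95    # (a) stop accepting at 95% FD usage (proposed default)
RESUME_ACCEPT_PCT = 0.90  # (b) resume accepting below 90% FD usage

def fd_usage():
    """Return (open_fds, max_fds) for this process on Linux."""
    soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return len(os.listdir("/proc/self/fd")), soft_limit

class WatermarkAcceptor:
    """Accept new connections only while FD usage stays under the high watermark."""

    def __init__(self, listener):
        self.listener = listener   # a listening socket
        self.accepting = True
        self.denied_count = 0      # analogous to the ConnectionDeniedCount MBean above

    def accept(self):
        open_fds, max_fds = fd_usage()
        if self.accepting and open_fds >= STOP_ACCEPT_PCT * max_fds:
            self.accepting = False
        elif not self.accepting and open_fds < RESUME_ACCEPT_PCT * max_fds:
            self.accepting = True
        conn, addr = self.listener.accept()
        if not self.accepting:
            self.denied_count += 1   # would also be logged at debug level, per (a)
            conn.close()
            return None
        return conn, addr
{code}

The two thresholds give the hysteresis described above: once the server stops accepting it does not flap back on at the first closed connection, only after usage drops below the lower watermark.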
[jira] Created: (ZOOKEEPER-765) Add python example script
Add python example script - Key: ZOOKEEPER-765 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-765 Project: Zookeeper Issue Type: Improvement Components: documentation Reporter: Travis Crawford Priority: Minor Attachments: zk.py When adding some zookeeper-based functionality to a python script I had to figure everything out without guidance, which while doable, would have been a lot easier with an example. I extracted a skeleton program structure out with hopes its useful to others (maybe add as an example in the source or wiki?). This script does an aget() and sets a watch, and hopefully illustrates what's going on, and where to plug in your application code that gets run when the znode changes. There are probably some bugs, which if we fix now and provide a well-reviewed example hopefully others will not run into the same mistakes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
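The attached zk.py is not reproduced in this message. Below is a minimal sketch of the pattern described, assuming the contrib zkpython bindings (the zookeeper extension module); the connect string, znode path, and handle_new_data hook are placeholders: an aget() whose watcher re-arms itself and hands new data to application code.

{code}
import threading
import zookeeper   # contrib zkpython bindings

ZK_HOSTS = "localhost:2181"   # placeholder connect string
ZNODE = "/example"            # placeholder znode to watch

connected = threading.Event()

def conn_watcher(handle, ev_type, state, path):
    # Global watcher: fires on session events (connected, disconnected, expired).
    if state == zookeeper.CONNECTED_STATE:
        connected.set()

def data_watcher(handle, ev_type, state, path):
    # Znode watches are one-shot, so re-arm by issuing another aget().
    if ev_type == zookeeper.CHANGED_EVENT:
        zookeeper.aget(handle, ZNODE, data_watcher, data_callback)

def data_callback(handle, rc, data, stat):
    # Completion callback for aget(): rc is the result code, data the znode contents.
    if rc == zookeeper.OK:
        handle_new_data(data)

def handle_new_data(data):
    # Plug application code in here.
    print("znode %s changed: %r" % (ZNODE, data))

if __name__ == "__main__":
    zh = zookeeper.init(ZK_HOSTS, conn_watcher)
    connected.wait(10)
    # The initial read also sets the watch that data_watcher keeps re-arming.
    zookeeper.aget(zh, ZNODE, data_watcher, data_callback)
    threading.Event().wait()   # keep the main thread alive; callbacks run on zk threads
{code}

Because watches are one-shot, the interesting part is that data_watcher must call aget() again each time it fires, otherwise only the first change is observed.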
[jira] Updated: (ZOOKEEPER-765) Add python example script
[ https://issues.apache.org/jira/browse/ZOOKEEPER-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford updated ZOOKEEPER-765: -- Attachment: zk.py Add python example script - Key: ZOOKEEPER-765 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-765 Project: Zookeeper Issue Type: Improvement Components: documentation Reporter: Travis Crawford Priority: Minor Attachments: zk.py When adding some zookeeper-based functionality to a python script I had to figure everything out without guidance, which, while doable, would have been a lot easier with an example. I extracted a skeleton program structure in the hope it's useful to others (maybe add it as an example in the source or wiki?). This script does an aget() and sets a watch, and hopefully illustrates what's going on and where to plug in the application code that runs when the znode changes. There are probably some bugs; if we fix them now and provide a well-reviewed example, hopefully others will not run into the same mistakes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-744) Add monitoring four-letter word
[ https://issues.apache.org/jira/browse/ZOOKEEPER-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871479#action_12871479 ] Travis Crawford commented on ZOOKEEPER-744: --- Looks awesome! One nit -- many monitoring systems do not interpret strings, so it may be appropriate to export everything as numbers. For example, consider a script that loops through these values poking them into Ganglia (or another timeseries database). The script would need to be special-cased to handle the leader. Later, as more values are added, the import script would need to be updated with the new strings. Doing everything as numbers ensures new values would just work without updating other systems. With that in mind, perhaps: zk_server_state 1 (instead of: leader) zk_build_timestamp unix_timestamp (instead of build string) Add monitoring four-letter word --- Key: ZOOKEEPER-744 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-744 Project: Zookeeper Issue Type: New Feature Components: server Reporter: Travis Crawford Attachments: zk-ganglia.png, ZOOKEEPER-744.patch Filing a feature request based on a zookeeper-user discussion. Zookeeper should have a new four-letter word that returns key-value pairs appropriate for importing to a monitoring system (such as Ganglia which has a large installed base) This command should initially export the following: (a) Count of instances in the ensemble. (b) Count of up-to-date instances in the ensemble. But be designed such that in the future additional data can be added. For example, the output could define the statistic in a comment, then print a key space character value line: # Total number of instances in the ensemble zk_ensemble_instances_total 5 # Number of instances currently participating in the quorum. zk_ensemble_instances_active 4 From the mailing list: Date: Mon, 19 Apr 2010 12:10:44 -0700 From: Patrick Hunt ph...@apache.org To: zookeeper-u...@hadoop.apache.org Subject: Re: Recovery issue - how to debug? On 04/19/2010 11:55 AM, Travis Crawford wrote: It would be a lot easier from the operations perspective if the leader explicitly published some health stats: (a) Count of instances in the ensemble. (b) Count of up-to-date instances in the ensemble. This would greatly simplify monitoring alerting - when an instance falls behind one could configure their monitoring system to let someone know and take a look at the logs. That's a great idea. Please enter a JIRA for this - a new 4 letter word and JMX support. It would also be a great starter project for someone interested in becoming more familiar with the server code. Patrick -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
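To make the numbers-only suggestion concrete, here is a rough sketch of the kind of import script described in the comment: send the monitoring four-letter word, parse the key/value lines, and push numeric values to Ganglia via gmetric. The command name ("mntr"), the gmetric flags, and the policy of silently skipping non-numeric values are assumptions, since the output format is still being designed in this issue.

{code}
import socket
import subprocess

def four_letter_word(host, port, cmd):
    """Send a four-letter word and return the raw response."""
    s = socket.create_connection((host, port), timeout=5)
    try:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8", "replace")
    finally:
        s.close()

def parse_stats(text):
    """Parse lines of the form 'key value', ignoring comments and non-numeric values."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        try:
            stats[key] = float(value)
        except ValueError:
            pass   # strings (e.g. a leader/follower state word) are skipped, per the comment above
    return stats

if __name__ == "__main__":
    stats = parse_stats(four_letter_word("localhost", 2181, b"mntr"))
    for key, value in stats.items():
        # gmetric is Ganglia's command-line injector; flags here are the typical ones.
        subprocess.call(["gmetric", "--name", key, "--value", str(value), "--type", "double"])
{code}

Note how the try/except around float() is the only special-casing the script needs when everything is exported as numbers; string values would otherwise require per-key handling.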
[jira] Commented: (ZOOKEEPER-744) Add monitoring four-letter word
[ https://issues.apache.org/jira/browse/ZOOKEEPER-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12872284#action_12872284 ] Travis Crawford commented on ZOOKEEPER-744: --- @savu - Sometimes the right thing is publishing everything, sometimes not. In this case I'll have some filtering/formatting in the zk2ganglia script. Add monitoring four-letter word --- Key: ZOOKEEPER-744 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-744 Project: Zookeeper Issue Type: New Feature Components: server Affects Versions: 3.4.0 Reporter: Travis Crawford Assignee: Savu Andrei Attachments: zk-ganglia.png, ZOOKEEPER-744.patch Filing a feature request based on a zookeeper-user discussion. Zookeeper should have a new four-letter word that returns key-value pairs appropriate for importing to a monitoring system (such as Ganglia which has a large installed base) This command should initially export the following: (a) Count of instances in the ensemble. (b) Count of up-to-date instances in the ensemble. But be designed such that in the future additional data can be added. For example, the output could define the statistic in a comment, then print a key space character value line: # Total number of instances in the ensemble zk_ensemble_instances_total 5 # Number of instances currently participating in the quorum. zk_ensemble_instances_active 4 From the mailing list: Date: Mon, 19 Apr 2010 12:10:44 -0700 From: Patrick Hunt ph...@apache.org To: zookeeper-u...@hadoop.apache.org Subject: Re: Recovery issue - how to debug? On 04/19/2010 11:55 AM, Travis Crawford wrote: It would be a lot easier from the operations perspective if the leader explicitly published some health stats: (a) Count of instances in the ensemble. (b) Count of up-to-date instances in the ensemble. This would greatly simplify monitoring alerting - when an instance falls behind one could configure their monitoring system to let someone know and take a look at the logs. That's a great idea. Please enter a JIRA for this - a new 4 letter word and JMX support. It would also be a great starter project for someone interested in becoming more familiar with the server code. Patrick -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford updated ZOOKEEPER-801: -- Attachment: client-errors.txt client-errors.txt contains the zookeeper error messages during an outage. The interesting part is how frequently they occur. This suggests having clients back off during issues would be an improvement, as this many connections in such a short time make things worse. Of course, there's a tradeoff between quickly connecting and backing off. For this reason, perhaps exponential backoff should be the default behavior, but clients that require fast reconnections could optionally use the current mode. Zookeeper outage post-mortem Key: ZOOKEEPER-801 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801 Project: Zookeeper Issue Type: Improvement Affects Versions: 3.3.0 Environment: - RHEL5 2.6.18 kernel - Zookeeper 3.3.0 - ulimit raised to 65k files - 3 cluster members - 4-5k connections in steady-state - Primarily C and python clients, plus some java Reporter: Travis Crawford Attachments: client-errors.txt [Moving a thread from the zookeeper-user] RECOVERY We eventually recovered from this situation by shutting down clients. Initially I tried restarting the Zookeepers, however, they were getting hammered and I believe sessions timing out. I shut down ~2k clients (lightweight python app; simply sets one data watch and takes an action when it changes) at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe it would have ever recovered without shedding load. QUORUM/ELECTIONS DURING EVENT Unfortunately I do not have logs from the event :( We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again. During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out. GC PAUSE/LOGGING This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable logging, and tune GC. CLIENT SESSION TIMEOUTS Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting. 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null context=(nil) flags=0 CLIENT BACKOFFS Looking in application logs, we see lots of the following: 2010-07-01 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms) Doing some simple aggregations we see 130 errors in a ~13 minute sample period. This behavior on thousands of clients appears to have been a DDoS attack against Zookeeper. Using exponential behavior as the default behavior seems appropriate looking at this data. Fulltext of the client errors is attached. I grepped errno from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same.
GOING FORWARD The long-GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball, just using the defaults; I don't think any changes should be made related to lack of tuning. Exponential backoffs does seem like a good idea, and generally useful for most people. There will always be service interruptions and backoffs would be a great preventive measure to get out of a dogpile situation. Patrick's message: Hi Travis, as Flavio suggested would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental? 4) what are the clients using for timeout on the sessions? 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a few weeks to fix a couple critical issues (which don't seem related to what you saw). If we can identify the
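A minimal sketch of the client-side backoff being discussed (nothing like this exists in the client library today); the base delay, cap, and jitter strategy are purely illustrative:

{code}
import random
import time

BASE_DELAY = 0.5   # seconds before the first retry (illustrative)
MAX_DELAY = 60.0   # cap so clients still reconnect reasonably quickly

def reconnect_with_backoff(connect):
    """Call connect() until it succeeds, backing off exponentially between attempts."""
    attempt = 0
    while True:
        try:
            return connect()
        except Exception:
            delay = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
            # Jitter keeps thousands of clients from retrying in lockstep,
            # which is exactly the dogpile described in this issue.
            time.sleep(random.uniform(0, delay))
            attempt += 1
{code}

The randomized delay matters as much as the exponential growth: after a shared event such as a long GC pause, it spreads the reconnect storm out instead of replaying it on every retry interval.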
[jira] Created: (ZOOKEEPER-801) Zookeeper outage post-mortem
Zookeeper outage post-mortem Key: ZOOKEEPER-801 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801 Project: Zookeeper Issue Type: Improvement Affects Versions: 3.3.0 Environment: - RHEL5 2.6.18 kernel - Zookeeper 3.3.0 - ulimit raised to 65k files - 3 cluster members - 4-5k connections in steady-state - Primarily C and python clients, plus some java Reporter: Travis Crawford [Moving a thread from the zookeeper-user] RECOVERY We eventually recovered from this situation by shutting down clients. Initially I tried restarting the Zookeepers, however, they were getting hammered and I believe sessions timing out. I shut down ~2k clients (lightweight python app; simply sets one data watch and takes an action when it changes) at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe it would have ever recovered without shedding load. QUORUM/ELECTIONS DURING EVENT Unfortunately I do not have logs from the event :( We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again. During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out. GC PAUSE/LOGGING This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable logging, and tune GC. CLIENT SESSION TIMEOUTS Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting. 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null context=(nil) flags=0 CLIENT BACKOFFS Looking in application logs, we see lots of the following: 2010-07-01 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms) Doing some simple aggregations we see 130 errors in a ~13 minute sample period. This behavior on thousands of clients appears to have been a DDoS attack against Zookeeper. Using exponential behavior as the default behavior seems appropriate looking at this data. Fulltext of the client errors is attached. I grepped errno from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same. GOING FORWARD The long-GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball, just using the defaults; I don't think any changes should be made related to lack of tuning. Exponential backoffs does seem like a good idea, and generally useful for most people. There will always be service interruptions and backoffs would be a great preventive measure to get out of a dogpile situation. Patrick's message: Hi Travis, as Flavio suggested would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 
3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental? 4) what are the clients using for timeout on the sessions? 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a few weeks to fix a couple critical issues (which don't seem related to what you saw). If we can identify the problem here we should be able to include it in any fix release we do. fixing something like 517 might help, but it's not clear how we got to this state in the first place. fixing 517 might not have any effect if the root cause is not addressed. 662 has only ever been reported once afaik, and we weren't able to identify the root cause for that one. One thing we might also consider is modifying the zk client lib to backoff connection attempts if they keep failing (timing out say). Today the clients are pretty aggressive on reconnection attempts. Having some sort of backoff (exponential?) would provide more breathing room to the server in this situation. Patrick Flavio's message: Hi Travis, Do you think it would be possible for you to open a jira and upload your logs? Thanks, -Flavio My initial message: Hey
[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford updated ZOOKEEPER-801: -- Attachment: zookeeper-out-of-files.netstat This attached netstat output shows a large number of: SYN_RECV CLOSE_WAIT. The fast connection churn appears to have consumed all available file handles. Zookeeper outage post-mortem Key: ZOOKEEPER-801 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801 Project: Zookeeper Issue Type: Improvement Affects Versions: 3.3.0 Environment: - RHEL5 2.6.18 kernel - Zookeeper 3.3.0 - ulimit raised to 65k files - 3 cluster members - 4-5k connections in steady-state - Primarily C and python clients, plus some java Reporter: Travis Crawford Attachments: client-errors.txt, zookeeper-out-of-files.netstat [Moving a thread from the zookeeper-user] RECOVERY We eventually recovered from this situation by shutting down clients. Initially I tried restarting the Zookeepers, however, they were getting hammered and I believe sessions timing out. I shut down ~2k clients (lightweight python app; simply sets one data watch and takes an action when it changes) at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe it would have ever recovered without shedding load. QUORUM/ELECTIONS DURING EVENT Unfortunately I do not have logs from the event :( We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again. During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out. GC PAUSE/LOGGING This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable logging, and tune GC. CLIENT SESSION TIMEOUTS Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting. 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null context=(nil) flags=0 CLIENT BACKOFFS Looking in application logs, we see lots of the following: 2010-07-01 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms) Doing some simple aggregations we see 130 errors in a ~13 minute sample period. This behavior on thousands of clients appears to have been a DDoS attack against Zookeeper. Using exponential behavior as the default behavior seems appropriate looking at this data. Fulltext of the client errors is attached. I grepped errno from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same. GOING FORWARD The long-GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball, just using the defaults; I don't think any changes should be made related to lack of tuning. 
Exponential backoffs does seem like a good idea, and generally useful for most people. There will always be service interruptions and backoffs would be a great preventive measure to get out of a dogpile situation. Patrick's message: Hi Travis, as Flavio suggested would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental? 4) what are the clients using for timeout on the sessions? 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a few weeks to fix a couple critical issues (which don't seem related to what you saw). If we can identify the problem here we should be able to include it in any fix release we do. fixing something like 517 might help, but it's not clear how we got to this state in the first place. fixing 517 might not have any effect if the root cause is not addressed. 662 has only ever been reported once afaik, and we
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884368#action_12884368 ] Travis Crawford commented on ZOOKEEPER-801: --- I believe clients were reconnecting so frequently because their connections timed out. Zookeeper outage post-mortem Key: ZOOKEEPER-801 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801 Project: Zookeeper Issue Type: Improvement Affects Versions: 3.3.0 Environment: - RHEL5 2.6.18 kernel - Zookeeper 3.3.0 - ulimit raised to 65k files - 3 cluster members - 4-5k connections in steady-state - Primarily C and python clients, plus some java Reporter: Travis Crawford Attachments: client-errors.txt, zookeeper-out-of-files.netstat [Moving a thread from the zookeeper-user] RECOVERY We eventually recovered from this situation by shutting down clients. Initially I tried restarting the Zookeepers, however, they were getting hammered and I believe sessions timing out. I shut down ~2k clients (lightweight python app; simply sets one data watch and takes an action when it changes) at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe it would have ever recovered without shedding load. QUORUM/ELECTIONS DURING EVENT Unfortunately I do not have logs from the event :( We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again. During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out. GC PAUSE/LOGGING This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable logging, and tune GC. CLIENT SESSION TIMEOUTS Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting. 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null context=(nil) flags=0 CLIENT BACKOFFS Looking in application logs, we see lots of the following: 2010-07-01 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms) Doing some simple aggregations we see 130 errors in a ~13 minute sample period. This behavior on thousands of clients appears to have been a DDoS attack against Zookeeper. Using exponential behavior as the default behavior seems appropriate looking at this data. Fulltext of the client errors is attached. I grepped errno from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same. GOING FORWARD The long-GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball, just using the defaults; I don't think any changes should be made related to lack of tuning. Exponential backoffs does seem like a good idea, and generally useful for most people. 
There will always be service interruptions and backoffs would be a great preventive measure to get out of a dogpile situation. Patrick's message: Hi Travis, as Flavio suggested would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental? 4) what are the clients using for timeout on the sessions? 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a few weeks to fix a couple critical issues (which don't seem related to what you saw). If we can identify the problem here we should be able to include it in any fix release we do. fixing something like 517 might help, but it's not clear how we got to this state in the first place. fixing 517 might not have any effect if the root cause is not addressed. 662 has only ever been reported once afaik, and we weren't able to identify the root cause for that one. One
[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884769#action_12884769 ] Travis Crawford commented on ZOOKEEPER-801: --- @Patrick - So this morning while investigating I discovered that yes, the client you pointed out was indeed misbehaving. It was a new application that had a bug. I talked with a couple teams and basically everyone is using a different Scala client. We've decided http://github.com/jcorwin/zookeeper-client is our best scala client and are forking from the main twitter account so everyone knows to use it. I looked around the docs and didn't see a place to list the scala client; please let me know if we should list this somewhere. Closing this issue as its basically a client bug on our side. Thanks for the comments and help tracking this down. Zookeeper outage post-mortem Key: ZOOKEEPER-801 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801 Project: Zookeeper Issue Type: Improvement Affects Versions: 3.3.0 Environment: - RHEL5 2.6.18 kernel - Zookeeper 3.3.0 - ulimit raised to 65k files - 3 cluster members - 4-5k connections in steady-state - Primarily C and python clients, plus some java Reporter: Travis Crawford Attachments: client-errors.txt, zookeeper-out-of-files.netstat [Moving a thread from the zookeeper-user] RECOVERY We eventually recovered from this situation by shutting down clients. Initially I tried restarting the Zookeepers, however, they were getting hammered and I believe sessions timing out. I shut down ~2k clients (lightweight python app; simply sets one data watch and takes an action when it changes) at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe it would have ever recovered without shedding load. QUORUM/ELECTIONS DURING EVENT Unfortunately I do not have logs from the event :( We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again. During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out. GC PAUSE/LOGGING This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable logging, and tune GC. CLIENT SESSION TIMEOUTS Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting. 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null context=(nil) flags=0 CLIENT BACKOFFS Looking in application logs, we see lots of the following: 2010-07-01 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms) Doing some simple aggregations we see 130 errors in a ~13 minute sample period. This behavior on thousands of clients appears to have been a DDoS attack against Zookeeper. Using exponential behavior as the default behavior seems appropriate looking at this data. Fulltext of the client errors is attached. 
I grepped errno from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same. GOING FORWARD The long-GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball, just using the defaults; I don't think any changes should be made related to lack of tuning. Exponential backoffs does seem like a good idea, and generally useful for most people. There will always be service interruptions and backoffs would be a great preventive measure to get out of a dogpile situation. Patrick's message: Hi Travis, as Flavio suggested would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental? 4) what are the clients using for timeout
[jira] Resolved: (ZOOKEEPER-801) Zookeeper outage post-mortem
[ https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford resolved ZOOKEEPER-801. --- Assignee: Travis Crawford Resolution: Not A Problem Zookeeper outage post-mortem Key: ZOOKEEPER-801 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801 Project: Zookeeper Issue Type: Improvement Affects Versions: 3.3.0 Environment: - RHEL5 2.6.18 kernel - Zookeeper 3.3.0 - ulimit raised to 65k files - 3 cluster members - 4-5k connections in steady-state - Primarily C and python clients, plus some java Reporter: Travis Crawford Assignee: Travis Crawford Attachments: client-errors.txt, zookeeper-out-of-files.netstat [Moving a thread from the zookeeper-user] RECOVERY We eventually recovered from this situation by shutting down clients. Initially I tried restarting the Zookeepers, however, they were getting hammered and I believe sessions timing out. I shut down ~2k clients (lightweight python app; simply sets one data watch and takes an action when it changes) at which point zktop could make a connection and a leader election was verified. After resetting latency stats the numbers were very good. I do not believe it would have ever recovered without shedding load. QUORUM/ELECTIONS DURING EVENT Unfortunately I do not have logs from the event :( We had debug logging on, and logrotate configured to keep 10 100MB files, and the interesting parts rotated away. I have already switched to info logging so we don't lose this data again. During the incident I was not able to view cluster status with zktop, and never observed a successful operation beyond connections, which quickly timed out. GC PAUSE/LOGGING This is a very good question. No, Zookeeper GC is not tuned and uses whatever the defaults are in the start scripts. GC logging is not enabled either. I filed an internal bug against myself to enable logging, and tune GC. CLIENT SESSION TIMEOUTS Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 seconds based on this log entry when initially connecting. 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: Initiating client connection, host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null context=(nil) flags=0 CLIENT BACKOFFS Looking in application logs, we see lots of the following: 2010-07-01 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 0ms) Doing some simple aggregations we see 130 errors in a ~13 minute sample period. This behavior on thousands of clients appears to have been a DDoS attack against Zookeeper. Using exponential behavior as the default behavior seems appropriate looking at this data. Fulltext of the client errors is attached. I grepped errno from a Python client log; I believe it uses the same underlying C library, so I did not include example output from a C program (though I can if needed). It looks basically the same. GOING FORWARD The long-GC pause causing clients to dogpile sounds like the most plausible explanation at this time. GC logging/tuning is clearly where I dropped the ball, just using the defaults; I don't think any changes should be made related to lack of tuning. Exponential backoffs does seem like a good idea, and generally useful for most people. 
There will always be service interruptions and backoffs would be a great preventive measure to get out of a dogpile situation. Patrick's message: Hi Travis, as Flavio suggested would be great to get the logs. A few questions: 1) how did you eventually recover, restart the zk servers? 2) was the cluster losing quorum during this time? leader re-election? 3) Any chance this could have been initially triggered by a long GC pause on one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has the GC been tuned on the servers, for example CMS and incremental? 4) what are the clients using for timeout on the sessions? 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a few weeks to fix a couple critical issues (which don't seem related to what you saw). If we can identify the problem here we should be able to include it in any fix release we do. fixing something like 517 might help, but it's not clear how we got to this state in the first place. fixing 517 might not have any effect if the root cause is not addressed. 662 has only ever been reported once afaik, and we weren't able to identify the root cause for that one. One thing we might also consider is
[jira] Created: (ZOOKEEPER-803) Improve defenses against misbehaving clients
Improve defenses against misbehaving clients Key: ZOOKEEPER-803 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-803 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.0 Reporter: Travis Crawford This issue is in response to ZOOKEEPER-801. Short version is a small number of buggy clients opened thousands of connections and caused Zookeeper to fail. The misbehaving client did not correctly handle expired sessions, creating a new connection each time. The huge number of connections exacerbated the issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-803) Improve defenses against misbehaving clients
[ https://issues.apache.org/jira/browse/ZOOKEEPER-803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford updated ZOOKEEPER-803: -- Attachment: connection-bugfix-diff.png This diff shows a bug where the client developer confused disconnections and expired sessions. In the zookeeper programming model, clients reconnect themselves automatically when disconnected. However, should the session expire, the application is responsible for reconnecting. In this case the developer attempted to throttle reconnects; however, due to a bug, the application created a new connection each time. A small number of clients running the buggy code took down a 3-node Zookeeper cluster by exhausting the 65k file descriptor limit. It only recovered after shutting down clients, restarting the Zookeepers, and then restarting the well-behaved clients. Improve defenses against misbehaving clients Key: ZOOKEEPER-803 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-803 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.0 Reporter: Travis Crawford Attachments: connection-bugfix-diff.png This issue is in response to ZOOKEEPER-801. Short version is a small number of buggy clients opened thousands of connections and caused Zookeeper to fail. The misbehaving client did not correctly handle expired sessions, creating a new connection each time. The huge number of connections exacerbated the issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
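The distinction the diff illustrates, sketched below with the contrib zkpython bindings; treat the structure and constants as an assumption-laden illustration rather than the actual fix: a plain disconnect needs no application action because the client library reconnects on its own, while session expiry requires building a new handle.

{code}
import time
import zookeeper   # contrib zkpython bindings

ZK_HOSTS = "localhost:2181"   # placeholder connect string

class Client:
    def __init__(self):
        self.handle = zookeeper.init(ZK_HOSTS, self.session_watcher)

    def session_watcher(self, handle, ev_type, state, path):
        if state == zookeeper.EXPIRED_SESSION_STATE:
            # Only an expired session needs a brand-new handle. Creating one on
            # every disconnect is the bug described above: each call opened
            # another connection and the old ones were never closed.
            zookeeper.close(self.handle)
            time.sleep(5)   # crude throttle; see the backoff sketch in ZOOKEEPER-801
            self.handle = zookeeper.init(ZK_HOSTS, self.session_watcher)
        # On connecting/connected state changes, do nothing: the client library
        # reconnects to another server in the ensemble by itself.
{code}

A production client would likely schedule the re-init outside the watcher callback, and with real backoff, rather than sleeping inside it; the point here is only which event should trigger a new handle.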
[jira] Commented: (ZOOKEEPER-803) Improve defenses against misbehaving clients
[ https://issues.apache.org/jira/browse/ZOOKEEPER-803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884829#action_12884829 ] Travis Crawford commented on ZOOKEEPER-803: --- Maybe 8-10 clients were running the buggy code. Not too many. Improve defenses against misbehaving clients Key: ZOOKEEPER-803 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-803 Project: Zookeeper Issue Type: Bug Affects Versions: 3.3.0 Reporter: Travis Crawford Attachments: connection-bugfix-diff.png This issue is in response to ZOOKEEPER-801. Short version is a small number of buggy clients opened thousands of connections and caused Zookeeper to fail. The misbehaving client did not correctly handle expired sessions, creating a new connection each time. The huge number of connections exacerbated the issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12885738#action_12885738 ] Travis Crawford commented on ZOOKEEPER-790: --- @vishal - How did your test go? I came across this issue while investigating an outage and believe it is the root cause. If your test went well I can also help test. Last processed zxid set prematurely while establishing leadership - Key: ZOOKEEPER-790 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.1 Reporter: Flavio Paiva Junqueira Assignee: Flavio Paiva Junqueira Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-790.patch The leader code is setting the last processed zxid to the first of the new epoch even before connecting to a quorum of followers. Because the leader code sets this value before connecting to a quorum of followers (Leader.java:281) and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, we have that when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888576#action_12888576 ] Travis Crawford commented on ZOOKEEPER-790: --- Sure, I'll help test this too. Last processed zxid set prematurely while establishing leadership - Key: ZOOKEEPER-790 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.1 Reporter: Flavio Paiva Junqueira Assignee: Flavio Paiva Junqueira Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-790.patch The leader code is setting the last processed zxid to the first of the new epoch even before connecting to a quorum of followers. Because the leader code sets this value before connecting to a quorum of followers (Leader.java:281) and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, we have that when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888684#action_12888684 ] Travis Crawford commented on ZOOKEEPER-335: --- Unfortunately I still observed the Leader epoch issue and needed to manually force a leader election for the cluster to recover. This test was performed with the following base+patches, applied in the order listed. Zookeeper 3.3.1 ZOOKEEPER-744 ZOOKEEPER-790 {code} 2010-07-15 02:43:57,181 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /data/zookeeper/version-2/snapshot.231ac2 2010-07-15 02:43:57,384 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 154618826848 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 154618826848, 4, 1, LOOKING, LOOKING, 1 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, LEADING, 2 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, FOLLOWING, 3 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir /data/zookeeper/txlog/version-2 snapdir /data/zookeeper/version-2 2010-07-15 02:43:57,387 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24 2010-07-15 02:43:57,387 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader java.io.IOException: Error: Epoch of leader is lower at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644) 2010-07-15 02:43:57,387 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648) {code} I followed the recipe @vishal provided for recreating. (a) Stop one follower in a three node cluster (b) Get some tea while it falls behind (c) Start the node stopped in (a). These timestamps show where the follower was stopped. It also shows when it was turned back on. {code} 2010-07-15 02:35:36,398 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1661] - Established session 0x229aa13cfc6276b with negotiated timeout 1 for client /10.209.45.114:34562 2010-07-15 02:39:18,907 - INFO [main:quorumpeercon...@90] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg {code} This timestamp is the first ``Leader epoch`` line. Everything between these two points will be the interesting bits. {code} 2010-07-15 02:39:43,339 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24 {code} zookeeper servers should commit the new leader txn to their logs. - Key: ZOOKEEPER-335 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.1.0 Reporter: Mahadev konar Assignee: Mahadev konar Priority: Blocker Fix For: 3.4.0 Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz currently the zookeeper followers do not commit the new leader election. 
This will cause problems in a failure scenarios with a follower acking to the same leader txn id twice, which might be two different intermittent leaders and allowing them to propose two different txn's of the same zxid. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford updated ZOOKEEPER-335: -- Attachment: ZOOKEEPER-790.travis.log.bz2 Please see my most recent comment for a summary of where the interesting lines are in this log file. zookeeper servers should commit the new leader txn to their logs. - Key: ZOOKEEPER-335 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335 Project: Zookeeper Issue Type: Bug Components: server Affects Versions: 3.1.0 Reporter: Mahadev konar Assignee: Mahadev konar Priority: Blocker Fix For: 3.4.0 Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz, ZOOKEEPER-790.travis.log.bz2 currently the zookeeper followers do not commit the new leader election. This will cause problems in a failure scenarios with a follower acking to the same leader txn id twice, which might be two different intermittent leaders and allowing them to propose two different txn's of the same zxid. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford updated ZOOKEEPER-790: -- Attachment: ZOOKEEPER-790.travis.log.bz2 Last processed zxid set prematurely while establishing leadership - Key: ZOOKEEPER-790 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.1 Reporter: Flavio Paiva Junqueira Assignee: Flavio Paiva Junqueira Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2 The leader code is setting the last processed zxid to the first of the new epoch even before connecting to a quorum of followers. Because the leader code sets this value before connecting to a quorum of followers (Leader.java:281) and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, we have that when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888688#action_12888688 ] Travis Crawford commented on ZOOKEEPER-790: --- I accidentally posted this in ZOOKEEPER-335 -- reposting here. Sorry for the posting mixup -- the content is correct though :) Unfortunately I still observed the Leader epoch issue and needed to manually force a leader election for the cluster to recover. This test was performed with the following base+patches, applied in the order listed. Zookeeper 3.3.1 ZOOKEEPER-744 ZOOKEEPER-790 {code} 2010-07-15 02:43:57,181 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] - Reading snapshot /data/zookeeper/version-2/snapshot.231ac2 2010-07-15 02:43:57,384 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 154618826848 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 154618826848, 4, 1, LOOKING, LOOKING, 1 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, LEADING, 2 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 146030952153, 3, 1, LOOKING, FOLLOWING, 3 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING 2010-07-15 02:43:57,385 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir /data/zookeeper/txlog/version-2 snapdir /data/zookeeper/version-2 2010-07-15 02:43:57,387 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24 2010-07-15 02:43:57,387 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when following the leader java.io.IOException: Error: Epoch of leader is lower at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644) 2010-07-15 02:43:57,387 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] - shutdown called java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648) {code} I followed the recipe @vishal provided for recreating. (a) Stop one follower in a three node cluster (b) Get some tea while it falls behind (c) Start the node stopped in (a). These timestamps show where the follower was stopped. It also shows when it was turned back on. {code} 2010-07-15 02:35:36,398 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1661] - Established session 0x229aa13cfc6276b with negotiated timeout 1 for client /10.209.45.114:34562 2010-07-15 02:39:18,907 - INFO [main:quorumpeercon...@90] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg {code} This timestamp is the first ``Leader epoch`` line. Everything between these two points will be the interesting bits. 
{code} 2010-07-15 02:39:43,339 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 23 is less than our epoch 24 {code} Last processed zxid set prematurely while establishing leadership - Key: ZOOKEEPER-790 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.1 Reporter: Flavio Paiva Junqueira Assignee: Flavio Paiva Junqueira Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2 The leader code is setting the last processed zxid to the first of the new epoch even before connecting to a quorum of followers. Because the leader code sets this value before connecting to a quorum of followers (Leader.java:281) and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, we have that when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1242#action_1242 ] Travis Crawford commented on ZOOKEEPER-790: --- Good find! I can certainly help test a patch, or collect more info if needed. Last processed zxid set prematurely while establishing leadership - Key: ZOOKEEPER-790 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.1 Reporter: Flavio Paiva Junqueira Assignee: Flavio Paiva Junqueira Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2 The leader code is setting the last processed zxid to the first of the new epoch even before connecting to a quorum of followers. Because the leader code sets this value before connecting to a quorum of followers (Leader.java:281) and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, we have that when the false leader drops leadership and becomes a follower, it finds a smaller epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890512#action_12890512 ] Travis Crawford commented on ZOOKEEPER-790: --- I tested this patch on a build with the following, applied in the listed order: 3.3.1 release + ZOOKEEPER-744.patch + ZOOKEEPER-790-3.3.patch Looks good!
{code}
2010-07-20 23:43:34,229 - INFO [Thread-2545:nioserverc...@1516] - Closed socket connection for client /10.209.21.181:53743 (no session established for client)
2010-07-20 23:43:34,659 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@639] - EndOfStreamException: Unable to read additional data from client sessionid 0x129d3fcb5a6f60d, likely client has closed socket
2010-07-20 23:43:34,660 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@1516] - Closed socket connection for client /10.209.21.204:59727 which had sessionid 0x129d3fcb5a6f60d
2010-07-20 23:43:34,684 - INFO [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x329d3fcb6594e53
2010-07-20 23:52:14,522 - INFO [main:quorumpeercon...@90] - Reading configuration from: /etc/zookeeper/conf/zoo.cfg
2010-07-20 23:52:14,529 - INFO [main:quorumpeercon...@287] - Defaulting to majority quorums
2010-07-20 23:52:14,540 - INFO [main:quorumpeerm...@119] - Starting quorum peer
2010-07-20 23:52:14,562 - INFO [main:nioservercnxn$fact...@149] - binding to port 0.0.0.0/0.0.0.0:2181
2010-07-20 23:52:14,578 - INFO [main:quorump...@818] - tickTime set to 2000
2010-07-20 23:52:14,579 - INFO [main:quorump...@829] - minSessionTimeout set to -1
2010-07-20 23:52:14,579 - INFO [main:quorump...@840] - maxSessionTimeout set to -1
2010-07-20 23:52:14,579 - INFO [main:quorump...@855] - initLimit set to 10
2010-07-20 23:52:14,798 - INFO [main:files...@82] - Reading snapshot /data/zookeeper/version-2/snapshot.2500197ee5
2010-07-20 23:52:15,660 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioservercnxn$fact...@256] - Accepted socket connection from /10.209.45.76:57030
2010-07-20 23:52:15,661 - INFO [Thread-1:quorumcnxmanager$liste...@436] - My election bind port: 3888
2010-07-20 23:52:15,663 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@644] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2010-07-20 23:52:15,664 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:nioserverc...@1516] - Closed socket connection for client /10.209.45.76:57030 (no session established for client)
2010-07-20 23:52:15,670 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@620] - LOOKING
2010-07-20 23:52:15,672 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id = 1, Proposed zxid = 158915472832
2010-07-20 23:52:15,674 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 158915472832, 1, 1, LOOKING, LOOKING, 1
2010-07-20 23:52:15,674 - INFO [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (2, 1)
2010-07-20 23:52:15,675 - INFO [WorkerSender Thread:quorumcnxmana...@162] - Have smaller server identifier, so dropping the connection: (3, 1)
2010-07-20 23:52:15,676 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 2, 158915472832, 5, 1, LOOKING, LOOKING, 2
2010-07-20 23:52:15,676 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 3, 158915472832, 5, 1, LOOKING, LOOKING, 2
2010-07-20 23:52:15,676 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@711] - Updating proposal
2010-07-20 23:52:15,677 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 3, 158915472832, 5, 1, LOOKING, FOLLOWING, 2
2010-07-20 23:52:15,677 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 3, 158915472832, 5, 1, LOOKING, LOOKING, 3
2010-07-20 23:52:15,879 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-07-20 23:52:15,885 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:lear...@72] - TCP NoDelay set to: true
2010-07-20 23:52:15,893 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT
2010-07-20 23:52:15,893 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:host.name=sjc1k029.twitter.com
2010-07-20 23:52:15,894 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.version=1.6.0_16
2010-07-20 23:52:15,894 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.vendor=Sun Microsystems Inc.
2010-07-20 23:52:15,894 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:java.home=/usr/java/jdk1.6.0_16/jre
2010-07-20 23:52:15,894 - INFO
{code}
[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership
[ https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890513#action_12890513 ] Travis Crawford commented on ZOOKEEPER-790: --- Just to double-check, I'm really, really sure the running jar was freshly built. For example, lsof says: java 4473 root mem REG 104,1 1012397 10092999 /usr/local/zookeeper/zookeeper-3.3.2-dev.jar Yes, the version number is 3.3.2, but this was built from the 3.3.1 release. Looking in the log we see: 2010-07-20 23:52:15,893 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:environm...@97] - Server environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT Should that say it was built this afternoon? I've double- and triple-checked and believe this is a newly built jar. Looking in the tarball, zookeeper-3.3.2-dev.jar does not appear until after building. Last processed zxid set prematurely while establishing leadership - Key: ZOOKEEPER-790 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790 Project: Zookeeper Issue Type: Bug Components: quorum Affects Versions: 3.3.1 Reporter: Flavio Paiva Junqueira Assignee: Flavio Paiva Junqueira Priority: Blocker Fix For: 3.3.2, 3.4.0 Attachments: ZOOKEEPER-790-3.3.patch, ZOOKEEPER-790.patch, ZOOKEEPER-790.patch, ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2 The leader code is setting the last processed zxid to the first zxid of the new epoch even before connecting to a quorum of followers. Because the leader code sets this value before connecting to a quorum of followers (Leader.java:281) and the follower code throws an IOException (Follower.java:73) if the leader epoch is smaller, the false leader, upon dropping leadership and becoming a follower, finds a smaller epoch and kills itself. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
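As an additional sanity check, the version string compiled into a jar can be printed directly rather than inferred from the file name. This is a rough sketch that assumes the org.apache.zookeeper.Version class and its getFullVersion() helper are present as in the 3.3.x source tree; compiling it against the freshly built jar and running it should print the same string the server logs as zookeeper.version.
{code}
// VersionCheck.java - print the version string baked into the ZooKeeper jar on
// the classpath. Assumes org.apache.zookeeper.Version as shipped in the 3.3.x
// source tree; the output should match the "Server environment:zookeeper.version=..."
// line in the server log.
import org.apache.zookeeper.Version;

public class VersionCheck {
    public static void main(String[] args) {
        System.out.println(Version.getFullVersion());
    }
}
{code}
If that prints a build date from this afternoon, the running process really is on the new jar regardless of what the jar file is named.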
[jira] Created: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is that connections are very imbalanced, leading to certain instances having many thousands of connections while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
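One way to see the imbalance described above without restarting anything is to poll each ensemble member with the cons four-letter word (available since 3.3.0) and compare per-server connection counts. The sketch below is illustrative only; the host names are placeholders and error handling is omitted.
{code}
// ConnectionCountPoller.java - rough sketch: ask each server for "cons" and count
// the returned connection lines to spot imbalance across the ensemble.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;

public class ConnectionCountPoller {
    // Counts client connections reported by the "cons" four-letter word.
    static int countConnections(String host, int port) throws Exception {
        Socket sock = new Socket(host, port);
        try {
            sock.getOutputStream().write("cons".getBytes("UTF-8"));
            sock.getOutputStream().flush();
            sock.shutdownOutput();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), "UTF-8"));
            int count = 0;
            String line;
            while ((line = in.readLine()) != null) {
                // each client shows up as a "/ip:port[...](queued=...,recved=...,sent=...)" line
                if (line.trim().startsWith("/")) {
                    count++;
                }
            }
            return count;
        } finally {
            sock.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] hosts = {"zk1.example.com", "zk2.example.com", "zk3.example.com"}; // placeholders
        for (String host : hosts) {
            System.out.println(host + ": " + countConnections(host, 2181) + " connections");
        }
    }
}
{code}
Graphing those counts per member over time is essentially what the attached gifs show via the file descriptor MBeans.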
[jira] Updated: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Travis Crawford updated ZOOKEEPER-856: -- Attachment: zk_open_file_descriptor_count_total.gif zk_open_file_descriptor_count_members.gif Attached are two graphs showing: - Total ZooKeeper connections to a 3 node cluster - Connections per member in the cluster In the totals graph, notice how it's largely unchanged over time. This period represents a steady-state period of usage. In the members graph, notice how the number of connections differs significantly between machines. This cluster allows the leader to service reads, so leader/follower role is not a factor when interpreting the connection counts. These graphs look very similar to an issue I had with another service (scribe), which we solved by disconnecting every N±K messages. We tried getting fancy by publishing load metrics and using a smart selection algorithm; in practice, though, the periodic disconnect/reconnect was easier to implement and worked better, so I'm putting that idea forward as a potential solution here. Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford Attachments: zk_open_file_descriptor_count_members.gif, zk_open_file_descriptor_count_total.gif We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is that connections are very imbalanced, leading to certain instances having many thousands of connections while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
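For reference, here is roughly what the N±K idea might look like if approximated on the client side. This is a sketch under stated assumptions, not a proposed implementation: the connect string and intervals are made up, and because the public client API has no way to drop only the TCP connection, this version closes and reopens the whole handle, which ends the session, unlike the server-side disconnect proposed in this issue (where the session, ephemeral nodes, and watches would survive).
{code}
// JitteredReconnector.java - sketch of "disconnect every N +/- K" on the client
// side: close and reopen the ZooKeeper handle at randomized intervals so client
// connections drift back toward an even spread across servers over time.
import java.util.Random;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class JitteredReconnector {
    private static final String CONNECT = "zk1:2181,zk2:2181,zk3:2181"; // placeholder hosts
    private static final long BASE_MS = 30L * 60 * 1000;   // N: roughly every 30 minutes
    private static final long JITTER_MS = 10L * 60 * 1000; // K: +/- 10 minutes of jitter

    public static void main(String[] args) throws Exception {
        Random rnd = new Random();
        while (true) {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(CONNECT, 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            // ... application reads/writes and watch registration would go here ...
            long sleepMs = BASE_MS - JITTER_MS + (long) (rnd.nextDouble() * 2 * JITTER_MS);
            Thread.sleep(sleepMs);
            zk.close(); // ends the session; the next iteration picks a server at random
        }
    }
}
{code}
The randomized interval is the important part: without jitter, a fleet of clients restarted together would all reconnect at the same moment and simply recreate the imbalance.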
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903065#action_12903065 ] Travis Crawford commented on ZOOKEEPER-856: --- @mahadev - I would love to help test a patch :) I'm currently using 3.3.1 + ZOOKEEPER-744 + ZOOKEEPER-790, applied in that order. If there's a knob for how frequently to disconnect/reconnect, I can try out different settings to see what a sensible default would be. Do you think this should be a client or server setting? I'm thinking a server setting, because otherwise it's not possible to enforce the policy. Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford Fix For: 3.4.0 Attachments: zk_open_file_descriptor_count_members.gif, zk_open_file_descriptor_count_total.gif We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is that connections are very imbalanced, leading to certain instances having many thousands of connections while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903170#action_12903170 ] Travis Crawford commented on ZOOKEEPER-856: --- @patrick - We're using these settings, which I believe are based on what's recommended in the troubleshooting guide. -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseConcMarkSweepGC Looking at the logs, I do see lots of GC activity. For example: Total time for which application threads were stopped: 0.5599050 seconds; Application time: 0.0056590 seconds. I only see this on the hosts that became unresponsive after acquiring lots of connections. Any suggestions for the GC flags? If there's something better I can experiment with it, and I'll update the wiki if we discover something interesting. http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford Fix For: 3.4.0 Attachments: zk_open_file_descriptor_count_members.gif, zk_open_file_descriptor_count_total.gif We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is that connections are very imbalanced, leading to certain instances having many thousands of connections while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
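As a complement to the -XX:+PrintGC* flags, cumulative GC counts and times can also be polled from inside the JVM (or over JMX) with the standard java.lang.management API, which makes it easy to correlate GC pressure with connection counts on the overloaded hosts. A minimal sketch, with an arbitrary 5-second poll interval:
{code}
// GcWatcher.java - watch cumulative GC counts and times via the standard
// management API; the numbers grow quickly on a host suffering heavy GC.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // getCollectionTime() is the approximate accumulated collection time in ms
                System.out.printf("%s: count=%d timeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(5000);
        }
    }
}
{code}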
[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances
[ https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903592#action_12903592 ] Travis Crawford commented on ZOOKEEPER-856: --- Thanks for the suggestions! I'm trying out the following options: -Xmx4000M -Xms4000M -Xloggc:/var/log/zookeeper/gc_20100827_201441.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing I still think the connection imbalance issue is worth addressing, though. Even with properly tuned ZK servers, older processes can keep acquiring connections until they become overloaded. Connection imbalance leads to overloaded ZK instances - Key: ZOOKEEPER-856 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856 Project: Zookeeper Issue Type: Bug Reporter: Travis Crawford Fix For: 3.4.0 Attachments: zk_open_file_descriptor_count_members.gif, zk_open_file_descriptor_count_total.gif We've experienced a number of issues lately where ruok requests would take upwards of 10 seconds to return, and ZooKeeper instances were extremely sluggish. The sluggish instance requires a restart to make it responsive again. I believe the issue is that connections are very imbalanced, leading to certain instances having many thousands of connections while other instances are largely idle. A potential solution is periodically disconnecting/reconnecting to balance connections over time; this seems fine because sessions should not be affected, and therefore ephemeral nodes and watches should not be affected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.