[jira] Created: (ZOOKEEPER-759) Stop accepting connections when close to file descriptor limit

2010-04-29 Thread Travis Crawford (JIRA)
Stop accepting connections when close to file descriptor limit
--

 Key: ZOOKEEPER-759
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-759
 Project: Zookeeper
  Issue Type: Improvement
  Components: server
Reporter: Travis Crawford


Zookeeper always tries to accept new connections, throwing an exception when it 
runs out of file descriptors. An improvement would be to deny new client 
connections when close to the limit.

Additionally, the file descriptor limit and current usage should be exported by 
the monitoring four-letter word, should that get implemented (see ZOOKEEPER-744).


DETAILS

A Zookeeper ensemble I administer recently suffered an outage when one node was 
restarted with the low system-default ulimit of 1024 file descriptors and later 
ran out. File descriptor usage and the maximum are already exposed by the 
following MBeans:

- java.lang.OperatingSystem.MaxFileDescriptorCount
- java.lang.OperatingSystem.OpenFileDescriptorCount

They're described (rather tersely) at:

http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html

This feature request is for the following:

(a) Stop accepting new connections when OpenFileDescriptorCount approaches 
MaxFileDescriptorCount, defaulting to 95% FD usage. New connections should be 
denied, logged at debug level, and counted in a ``ConnectionDeniedCount`` 
MBean.

(b) Resume accepting new connections when usage drops below some configurable 
threshold, defaulting to 90% FD usage; basically a high/low watermark model 
(see the sketch after this list).

(c) Update the administrators guide with a comment about using an appropriate 
FD limit.

(d) Extra credit: if ZOOKEEPER-744 is implemented export statistics for:

zookeeper_open_file_descriptor_count
zookeeper_max_file_descriptor_count
zookeeper_max_file_descriptor_mismatch - boolean, exported by the leader, set 
when not all zk's have the same max FD value
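
Here is a rough sketch of the high/low watermark behavior described in (a) and 
(b). Python is used purely for illustration (the real change would live in the 
Java server's accept path), and the class and field names are made up:

{code}
# Illustration only: hysteresis between a high (stop accepting) and low
# (resume accepting) file descriptor watermark. Names are hypothetical.
class FdWatermarkGate(object):
    def __init__(self, max_fds, high=0.95, low=0.90):
        self.max_fds = max_fds             # e.g. MaxFileDescriptorCount
        self.high = high                   # stop accepting at 95% usage
        self.low = low                     # resume accepting below 90% usage
        self.accepting = True
        self.connection_denied_count = 0   # would back a ConnectionDeniedCount MBean

    def should_accept(self, open_fds):
        usage = float(open_fds) / self.max_fds
        if self.accepting and usage >= self.high:
            self.accepting = False         # crossed the high watermark
        elif not self.accepting and usage <= self.low:
            self.accepting = True          # dropped below the low watermark
        if not self.accepting:
            self.connection_denied_count += 1
        return self.accepting

# With the 1024 FD default, accepting stops around 973 open FDs and only
# resumes once usage falls back below about 921.
gate = FdWatermarkGate(max_fds=1024)
print(gate.should_accept(980))   # False - above the high watermark
print(gate.should_accept(950))   # False - still above the low watermark
print(gate.should_accept(900))   # True  - back below the low watermark
{code}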

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-765) Add python example script

2010-05-04 Thread Travis Crawford (JIRA)
Add python example script
-

 Key: ZOOKEEPER-765
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-765
 Project: Zookeeper
  Issue Type: Improvement
  Components: documentation
Reporter: Travis Crawford
Priority: Minor
 Attachments: zk.py

When adding some zookeeper-based functionality to a python script I had to 
figure everything out without guidance, which, while doable, would have been a 
lot easier with an example. I extracted a skeleton program structure in the hope 
it's useful to others (maybe add it as an example in the source or wiki?).

This script does an aget() and sets a watch, and hopefully illustrates what's 
going on and where to plug in the application code that runs when the znode 
changes.

There are probably some bugs; if we fix them now and provide a well-reviewed 
example, hopefully others will not run into the same mistakes.
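
Until zk.py is reviewed, here is a minimal sketch of the aget()-plus-watch 
pattern the script follows, assuming the zkpython C bindings (the ``zookeeper`` 
module); the ensemble string, znode path, and on_change() body are placeholders:

{code}
# Minimal aget() + watch skeleton, assuming the zkpython bindings.
# ZK_HOSTS, ZNODE and on_change() are placeholders for real application code.
import threading
import time
import zookeeper

ZK_HOSTS = "localhost:2181"   # placeholder ensemble
ZNODE = "/example/config"     # placeholder znode

def on_change(data):
    # Plug your application code in here; runs whenever the znode changes.
    print("znode updated: %r" % data)

def aget_callback(handle, rc, data, stat):
    if rc == zookeeper.OK:
        on_change(data)

def watcher(handle, event_type, state, path):
    # Watches are one-shot, so re-arm by issuing another aget().
    if event_type == zookeeper.CHANGED_EVENT:
        zookeeper.aget(handle, ZNODE, watcher, aget_callback)

connected = threading.Event()

def conn_watcher(handle, event_type, state, path):
    if state == zookeeper.CONNECTED_STATE:
        connected.set()

handle = zookeeper.init(ZK_HOSTS, conn_watcher)
connected.wait(10)
# Initial read; the watcher re-arms itself on every change event.
zookeeper.aget(handle, ZNODE, watcher, aget_callback)

# Keep the process alive so watch callbacks can fire.
while True:
    time.sleep(60)
{code}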

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-765) Add python example script

2010-05-04 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-765:
--

Attachment: zk.py

 Add python example script
 -

 Key: ZOOKEEPER-765
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-765
 Project: Zookeeper
  Issue Type: Improvement
  Components: documentation
Reporter: Travis Crawford
Priority: Minor
 Attachments: zk.py


 When adding some zookeeper-based functionality to a python script I had to 
 figure everything out without guidance, which while doable, would have been a 
 lot easier with an example. I extracted a skeleton program structure out with 
 hopes its useful to others (maybe add as an example in the source or wiki?).
 This script does an aget() and sets a watch, and hopefully illustrates what's 
 going on, and where to plug in your application code that gets run when the 
 znode changes.
 There are probably some bugs, which if we fix now and provide a well-reviewed 
 example hopefully others will not run into the same mistakes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-744) Add monitoring four-letter word

2010-05-25 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871479#action_12871479
 ] 

Travis Crawford commented on ZOOKEEPER-744:
---

Looks awesome! One nit -- many monitoring systems do not interpret strings, so 
it may be appropriate to export everything as numbers. For example, consider a 
script that loops through these values poking them into Ganglia (or another 
timeseries database). The script would need to be special-cased to handle the 
leader. Later, as more values are added, the import script would need to be 
updated with the new strings. Doing everything as numbers ensures new values 
would just work without updating other systems.

With that in mind, perhaps:

zk_server_state 1 (instead of: leader)
zk_build_timestamp unix_timestamp (instead of build string)
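
To make the numbers-only point concrete, the import script I have in mind is 
roughly the sketch below. It assumes the new four-letter word answers on the 
client port with one "key value" line per metric; the command name, host, and 
gmetric flags are assumptions on my part:

{code}
# Rough sketch of a generic importer: fetch key/value stats from a
# four-letter word and push every numeric value to Ganglia via gmetric.
# The command name ("mntr"), host/port and gmetric flags are assumptions.
import socket
import subprocess

def fetch_stats(host="localhost", port=2181, cmd="mntr"):
    s = socket.create_connection((host, port), timeout=5)
    try:
        s.sendall(cmd.encode())
        chunks = []
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        s.close()
    stats = {}
    for line in b"".join(chunks).decode().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            stats[parts[0]] = parts[1]
    return stats

def push_to_ganglia(stats):
    for key, value in stats.items():
        try:
            float(value)          # numeric values "just work"...
        except ValueError:
            continue              # ...strings would need special-casing
        subprocess.call(["gmetric", "--name", key,
                         "--value", value, "--type", "float"])

push_to_ganglia(fetch_stats())
{code}

New keys added on the server side then show up in the monitoring system 
automatically, which is the point of keeping everything numeric.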

 Add monitoring four-letter word
 ---

 Key: ZOOKEEPER-744
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-744
 Project: Zookeeper
  Issue Type: New Feature
  Components: server
Reporter: Travis Crawford
 Attachments: zk-ganglia.png, ZOOKEEPER-744.patch


 Filing a feature request based on a zookeeper-user discussion.
 Zookeeper should have a new four-letter word that returns key-value pairs 
 appropriate for importing to a monitoring system (such as Ganglia which has a 
 large installed base)
 This command should initially export the following:
 (a) Count of instances in the ensemble.
 (b) Count of up-to-date instances in the ensemble.
 But be designed such that in the future additional data can be added. For 
 example, the output could define the statistic in a comment, then print a key 
 space character value line:
 
 # Total number of instances in the ensemble
 zk_ensemble_instances_total 5
 # Number of instances currently participating in the quorum.
 zk_ensemble_instances_active 4
 
 From the mailing list:
 
 Date: Mon, 19 Apr 2010 12:10:44 -0700
 From: Patrick Hunt ph...@apache.org
 To: zookeeper-u...@hadoop.apache.org
 Subject: Re: Recovery issue - how to debug?
 On 04/19/2010 11:55 AM, Travis Crawford wrote:
  It would be a lot easier from the operations perspective if the leader
  explicitly published some health stats:
 
  (a) Count of instances in the ensemble.
  (b) Count of up-to-date instances in the ensemble.
 
  This would greatly simplify monitoring & alerting - when an instance
  falls behind one could configure their monitoring system to let
  someone know and take a look at the logs.
 That's a great idea. Please enter a JIRA for this - a new 4 letter word 
 and JMX support. It would also be a great starter project for someone 
 interested in becoming more familiar with the server code.
 Patrick
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-744) Add monitoring four-letter word

2010-05-27 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872284#action_12872284
 ] 

Travis Crawford commented on ZOOKEEPER-744:
---

@savu - Sometimes the right thing is publishing everything, sometimes not. In 
this case I'll have some filtering/formatting in the zk2ganglia script.

 Add monitoring four-letter word
 ---

 Key: ZOOKEEPER-744
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-744
 Project: Zookeeper
  Issue Type: New Feature
  Components: server
Affects Versions: 3.4.0
Reporter: Travis Crawford
Assignee: Savu Andrei
 Attachments: zk-ganglia.png, ZOOKEEPER-744.patch


 Filing a feature request based on a zookeeper-user discussion.
 Zookeeper should have a new four-letter word that returns key-value pairs 
 appropriate for importing to a monitoring system (such as Ganglia which has a 
 large installed base)
 This command should initially export the following:
 (a) Count of instances in the ensemble.
 (b) Count of up-to-date instances in the ensemble.
 But be designed such that in the future additional data can be added. For 
 example, the output could define the statistic in a comment, then print a key 
 space character value line:
 
 # Total number of instances in the ensemble
 zk_ensemble_instances_total 5
 # Number of instances currently participating in the quorum.
 zk_ensemble_instances_active 4
 
 From the mailing list:
 
 Date: Mon, 19 Apr 2010 12:10:44 -0700
 From: Patrick Hunt ph...@apache.org
 To: zookeeper-u...@hadoop.apache.org
 Subject: Re: Recovery issue - how to debug?
 On 04/19/2010 11:55 AM, Travis Crawford wrote:
  It would be a lot easier from the operations perspective if the leader
  explicitly published some health stats:
 
  (a) Count of instances in the ensemble.
  (b) Count of up-to-date instances in the ensemble.
 
  This would greatly simplify monitoring & alerting - when an instance
  falls behind one could configure their monitoring system to let
  someone know and take a look at the logs.
 That's a great idea. Please enter a JIRA for this - a new 4 letter word 
 and JMX support. It would also be a great starter project for someone 
 interested in becoming more familiar with the server code.
 Patrick
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-801:
--

Attachment: client-errors.txt

client-errors.txt contains the zookeeper error messages during an outage. The 
interesting part is how frequently they occur. This suggests having clients back 
off during issues would be an improvement, as this many connections arriving so 
quickly makes things worse.

Of course, there's a tradeoff between reconnecting quickly and backing off. For 
this reason, perhaps exponential backoff should be the default behavior, while 
clients that require fast reconnections could optionally use the current mode.
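
For illustration, the kind of default I am thinking of is a capped exponential 
backoff with jitter around whatever the client does today; connect() here is a 
placeholder, and the base/cap values are arbitrary, not proposed defaults:

{code}
# Illustration only: capped exponential backoff with jitter for reconnects.
# connect() stands in for the client library's existing connection attempt.
import random
import time

def connect_with_backoff(connect, base=0.5, cap=30.0, max_attempts=None):
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        try:
            return connect()
        except IOError:
            attempt += 1
            # Exponential growth, capped, plus jitter so thousands of
            # clients do not retry in lockstep after an interruption.
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    raise IOError("could not connect after %d attempts" % attempt)
{code}

Clients that need fast reconnections would effectively set base/cap very low, 
which is the "current mode" escape hatch mentioned above.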

 Zookeeper outage post-mortem
 

 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java
Reporter: Travis Crawford
 Attachments: client-errors.txt


 [Moving a thread from the zookeeper-user]
 RECOVERY
 We eventually recovered from this situation by shutting down clients. 
 Initially I tried restarting the Zookeepers, however, they were getting 
 hammered and I believe sessions timing out. I shut down ~2k clients 
 (lightweight python app; simply sets one data watch and takes an action when 
 it changes) at which point zktop could make a connection and a leader 
 election was verified. After resetting latency stats the numbers were very 
 good. I do not believe it would have ever recovered without shedding load.
 QUORUM/ELECTIONS DURING EVENT
 Unfortunately I do not have logs from the event :( We had debug logging on, 
 and logrotate configured to keep 10 100MB files, and the interesting parts 
 rotated away. I have already switched to info logging so we don't lose this 
 data again.
 During the incident I was not able to view cluster status with zktop, and 
 never observed a successful operation beyond connections, which quickly timed 
 out.
 GC PAUSE/LOGGING
 This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
 the defaults are in the start scripts. GC logging is not enabled either. I 
 filed an internal bug against myself to enable logging, and tune GC.
 CLIENT SESSION TIMEOUTS
 Clients are not explicitly setting timeouts, and I believe sessionTimeout is 
 10 seconds based on this log entry when initially connecting.
 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: 
 Initiating client connection, 
 host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
 context=(nil) flags=0
 CLIENT BACKOFFS
 Looking in application logs, we see lots of the following:
 2010-07-01 
 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket 
 [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): 
 connection timed out (exceeded timeout by 0ms)
 Doing some simple aggregations we see 130 errors in a ~13 minute sample 
 period. This behavior on thousands of clients appears to have been a DDoS 
 attack against Zookeeper. Using exponential behavior as the default behavior 
 seems appropriate looking at this data.
 Fulltext of the client errors is attached. I grepped errno from a Python 
 client log; I believe it uses the same underlying C library, so I did not 
 include example output from a C program (though I can if needed). It looks 
 basically the same.
 GOING FORWARD
 The long-GC pause causing clients to dogpile sounds like the most plausible 
 explanation at this time. GC logging/tuning is clearly where I dropped the 
 ball, just using the defaults; I don't think any changes should be made 
 related to lack of tuning.
 Exponential backoffs does seem like a good idea, and generally useful for 
 most people. There will always be service interruptions and backoffs would be 
 a great preventive measure to get out of a dogpile situation.
 Patrick's message:
 
 Hi Travis, as Flavio suggested would be great to get the logs. A few 
 questions:
 1) how did you eventually recover, restart the zk servers?
 2) was the cluster losing quorum during this time? leader re-election?
 3) Any chance this could have been initially triggered by a long GC pause on 
 one of the servers? (is gc logging turned on, any sort of heap monitoring?) 
 Has the GC been tuned on the servers, for example CMS and incremental?
 4) what are the clients using for timeout on the sessions?
 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a 
 few weeks to fix a couple critical issues (which don't seem related to what 
 you saw). If we can identify the 

[jira] Created: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)
Zookeeper outage post-mortem


 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java
Reporter: Travis Crawford


[Moving a thread from the zookeeper-user]

RECOVERY
We eventually recovered from this situation by shutting down clients. Initially 
I tried restarting the Zookeepers; however, they were getting hammered and I 
believe sessions were timing out. I shut down ~2k clients (a lightweight python 
app that simply sets one data watch and takes an action when it changes), at 
which point zktop could make a connection and a leader election was verified. 
After resetting latency stats the numbers were very good. I do not believe it 
would have ever recovered without shedding load.


QUORUM/ELECTIONS DURING EVENT
Unfortunately I do not have logs from the event :( We had debug logging on and 
logrotate configured to keep ten 100MB files, and the interesting parts rotated 
away. I have already switched to info logging so we don't lose this data again.

During the incident I was not able to view cluster status with zktop, and never 
observed a successful operation beyond connections, which quickly timed out.


GC PAUSE/LOGGING
This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
defaults are in the start scripts. GC logging is not enabled either. I filed an 
internal bug against myself to enable GC logging and tune GC.


CLIENT SESSION TIMEOUTS
Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 
seconds based on this log entry when initially connecting.

2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: 
Initiating client connection, 
host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
context=(nil) flags=0


CLIENT BACKOFFS
Looking in application logs, we see lots of the following:

2010-07-01 
05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket 
[10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection 
timed out (exceeded timeout by 0ms)

Doing some simple aggregations we see 130 errors in a ~13 minute sample period. 
This behavior across thousands of clients effectively acted as a DDoS attack 
against Zookeeper. Using exponential backoff as the default behavior seems 
appropriate looking at this data.
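
For reference, the "simple aggregations" were nothing fancy; roughly the sketch 
below, assuming the C-client log format shown above (the filename and per-minute 
bucketing are placeholders):

{code}
# Count errno=110 (connection timed out) errors per minute from a client log
# in the format shown above. Filename and bucket size are placeholders.
import collections
import re

TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")   # bucket by minute

buckets = collections.defaultdict(int)
with open("client-errors.txt") as f:
    for line in f:
        if "errno=110" in line:
            m = TS.match(line)
            if m:
                buckets[m.group(1)] += 1

for minute in sorted(buckets):
    print("%s %d" % (minute, buckets[minute]))
print("total errors: %d" % sum(buckets.values()))
{code}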

Fulltext of the client errors is attached. I grepped errno from a Python 
client log; I believe it uses the same underlying C library, so I did not 
include example output from a C program (though I can if needed). It looks 
basically the same.


GOING FORWARD
A long GC pause causing clients to dogpile sounds like the most plausible 
explanation at this time. GC logging/tuning is clearly where I dropped the ball 
by just using the defaults; I don't think any changes should be made related to 
the lack of tuning.

Exponential backoff does seem like a good idea, and generally useful for most 
people. There will always be service interruptions, and backoff would be a great 
preventive measure to get out of a dogpile situation.



Patrick's message:

Hi Travis, as Flavio suggested would be great to get the logs. A few questions:

1) how did you eventually recover, restart the zk servers?

2) was the cluster losing quorum during this time? leader re-election?

3) Any chance this could have been initially triggered by a long GC pause on 
one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has 
the GC been tuned on the servers, for example CMS and incremental?

4) what are the clients using for timeout on the sessions?

3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a few 
weeks to fix a couple critical issues (which don't seem related to what you 
saw). If we can identify the problem here we should be able to include it in 
any fix release we do.

fixing something like 517 might help, but it's not clear how we got to this 
state in the first place. fixing 517 might not have any effect if the root 
cause is not addressed. 662 has only ever been reported once afaik, and we 
weren't able to identify the root cause for that one.

One thing we might also consider is modifying the zk client lib to backoff 
connection attempts if they keep failing (timing out say). Today the clients 
are pretty aggressive on reconnection attempts. Having some sort of backoff 
(exponential?) would provide more breathing room to the server in this 
situation.

Patrick


Flavio's message:

Hi Travis, Do you think it would be possible for you to open a jira and upload 
your logs?

Thanks,
-Flavio


My initial message:

Hey 

[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-801:
--

Attachment: zookeeper-out-of-files.netstat

The attached netstat output shows a large number of connections in SYN_RECV and 
CLOSE_WAIT states. The fast connection churn appears to have consumed all 
available file handles.

 Zookeeper outage post-mortem
 

 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java
Reporter: Travis Crawford
 Attachments: client-errors.txt, zookeeper-out-of-files.netstat


 [Moving a thread from the zookeeper-user]
 RECOVERY
 We eventually recovered from this situation by shutting down clients. 
 Initially I tried restarting the Zookeepers, however, they were getting 
 hammered and I believe sessions timing out. I shut down ~2k clients 
 (lightweight python app; simply sets one data watch and takes an action when 
 it changes) at which point zktop could make a connection and a leader 
 election was verified. After resetting latency stats the numbers were very 
 good. I do not believe it would have ever recovered without shedding load.
 QUORUM/ELECTIONS DURING EVENT
 Unfortunately I do not have logs from the event :( We had debug logging on, 
 and logrotate configured to keep 10 100MB files, and the interesting parts 
 rotated away. I have already switched to info logging so we don't lose this 
 data again.
 During the incident I was not able to view cluster status with zktop, and 
 never observed a successful operation beyond connections, which quickly timed 
 out.
 GC PAUSE/LOGGING
 This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
 the defaults are in the start scripts. GC logging is not enabled either. I 
 filed an internal bug against myself to enable logging, and tune GC.
 CLIENT SESSION TIMEOUTS
 Clients are not explicitly setting timeouts, and I believe sessionTimeout is 
 10 seconds based on this log entry when initially connecting.
 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: 
 Initiating client connection, 
 host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
 context=(nil) flags=0
 CLIENT BACKOFFS
 Looking in application logs, we see lots of the following:
 2010-07-01 
 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket 
 [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): 
 connection timed out (exceeded timeout by 0ms)
 Doing some simple aggregations we see 130 errors in a ~13 minute sample 
 period. This behavior on thousands of clients appears to have been a DDoS 
 attack against Zookeeper. Using exponential behavior as the default behavior 
 seems appropriate looking at this data.
 Fulltext of the client errors is attached. I grepped errno from a Python 
 client log; I believe it uses the same underlying C library, so I did not 
 include example output from a C program (though I can if needed). It looks 
 basically the same.
 GOING FORWARD
 The long-GC pause causing clients to dogpile sounds like the most plausible 
 explanation at this time. GC logging/tuning is clearly where I dropped the 
 ball, just using the defaults; I don't think any changes should be made 
 related to lack of tuning.
 Exponential backoffs does seem like a good idea, and generally useful for 
 most people. There will always be service interruptions and backoffs would be 
 a great preventive measure to get out of a dogpile situation.
 Patrick's message:
 
 Hi Travis, as Flavio suggested would be great to get the logs. A few 
 questions:
 1) how did you eventually recover, restart the zk servers?
 2) was the cluster losing quorum during this time? leader re-election?
 3) Any chance this could have been initially triggered by a long GC pause on 
 one of the servers? (is gc logging turned on, any sort of heap monitoring?) 
 Has the GC been tuned on the servers, for example CMS and incremental?
 4) what are the clients using for timeout on the sessions?
 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a 
 few weeks to fix a couple critical issues (which don't seem related to what 
 you saw). If we can identify the problem here we should be able to include it 
 in any fix release we do.
 fixing something like 517 might help, but it's not clear how we got to this 
 state in the first place. fixing 517 might not have any effect if the root 
 cause is not addressed. 662 has only ever been reported once afaik, and we 

[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884368#action_12884368
 ] 

Travis Crawford commented on ZOOKEEPER-801:
---

I believe clients were reconnecting so frequently because their
connections timed out.


 Zookeeper outage post-mortem
 

 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java
Reporter: Travis Crawford
 Attachments: client-errors.txt, zookeeper-out-of-files.netstat


 [Moving a thread from the zookeeper-user]
 RECOVERY
 We eventually recovered from this situation by shutting down clients. 
 Initially I tried restarting the Zookeepers, however, they were getting 
 hammered and I believe sessions timing out. I shut down ~2k clients 
 (lightweight python app; simply sets one data watch and takes an action when 
 it changes) at which point zktop could make a connection and a leader 
 election was verified. After resetting latency stats the numbers were very 
 good. I do not believe it would have ever recovered without shedding load.
 QUORUM/ELECTIONS DURING EVENT
 Unfortunately I do not have logs from the event :( We had debug logging on, 
 and logrotate configured to keep 10 100MB files, and the interesting parts 
 rotated away. I have already switched to info logging so we don't lose this 
 data again.
 During the incident I was not able to view cluster status with zktop, and 
 never observed a successful operation beyond connections, which quickly timed 
 out.
 GC PAUSE/LOGGING
 This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
 the defaults are in the start scripts. GC logging is not enabled either. I 
 filed an internal bug against myself to enable logging, and tune GC.
 CLIENT SESSION TIMEOUTS
 Clients are not explicitly setting timeouts, and I believe sessionTimeout is 
 10 seconds based on this log entry when initially connecting.
 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: 
 Initiating client connection, 
 host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
 context=(nil) flags=0
 CLIENT BACKOFFS
 Looking in application logs, we see lots of the following:
 2010-07-01 
 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket 
 [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): 
 connection timed out (exceeded timeout by 0ms)
 Doing some simple aggregations we see 130 errors in a ~13 minute sample 
 period. This behavior on thousands of clients appears to have been a DDoS 
 attack against Zookeeper. Using exponential behavior as the default behavior 
 seems appropriate looking at this data.
 Fulltext of the client errors is attached. I grepped errno from a Python 
 client log; I believe it uses the same underlying C library, so I did not 
 include example output from a C program (though I can if needed). It looks 
 basically the same.
 GOING FORWARD
 The long-GC pause causing clients to dogpile sounds like the most plausible 
 explanation at this time. GC logging/tuning is clearly where I dropped the 
 ball, just using the defaults; I don't think any changes should be made 
 related to lack of tuning.
 Exponential backoffs does seem like a good idea, and generally useful for 
 most people. There will always be service interruptions and backoffs would be 
 a great preventive measure to get out of a dogpile situation.
 Patrick's message:
 
 Hi Travis, as Flavio suggested would be great to get the logs. A few 
 questions:
 1) how did you eventually recover, restart the zk servers?
 2) was the cluster losing quorum during this time? leader re-election?
 3) Any chance this could have been initially triggered by a long GC pause on 
 one of the servers? (is gc logging turned on, any sort of heap monitoring?) 
 Has the GC been tuned on the servers, for example CMS and incremental?
 4) what are the clients using for timeout on the sessions?
 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a 
 few weeks to fix a couple critical issues (which don't seem related to what 
 you saw). If we can identify the problem here we should be able to include it 
 in any fix release we do.
 fixing something like 517 might help, but it's not clear how we got to this 
 state in the first place. fixing 517 might not have any effect if the root 
 cause is not addressed. 662 has only ever been reported once afaik, and we 
 weren't able to identify the root cause for that one.
 One 

[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884769#action_12884769
 ] 

Travis Crawford commented on ZOOKEEPER-801:
---

@Patrick - So this morning while investigating I discovered that yes, the 
client you pointed out was indeed misbehaving. It was a new application that 
had a bug.

I talked with a couple of teams and basically everyone is using a different 
Scala client. We've decided http://github.com/jcorwin/zookeeper-client is our 
best Scala client and are forking from the main twitter account so everyone 
knows to use it.

I looked around the docs and didn't see a place to list the Scala client; 
please let me know if we should list this somewhere.

Closing this issue as it's basically a client bug on our side. Thanks for the 
comments and help tracking this down.

 Zookeeper outage post-mortem
 

 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java
Reporter: Travis Crawford
 Attachments: client-errors.txt, zookeeper-out-of-files.netstat


 [Moving a thread from the zookeeper-user]
 RECOVERY
 We eventually recovered from this situation by shutting down clients. 
 Initially I tried restarting the Zookeepers, however, they were getting 
 hammered and I believe sessions timing out. I shut down ~2k clients 
 (lightweight python app; simply sets one data watch and takes an action when 
 it changes) at which point zktop could make a connection and a leader 
 election was verified. After resetting latency stats the numbers were very 
 good. I do not believe it would have ever recovered without shedding load.
 QUORUM/ELECTIONS DURING EVENT
 Unfortunately I do not have logs from the event :( We had debug logging on, 
 and logrotate configured to keep 10 100MB files, and the interesting parts 
 rotated away. I have already switched to info logging so we don't lose this 
 data again.
 During the incident I was not able to view cluster status with zktop, and 
 never observed a successful operation beyond connections, which quickly timed 
 out.
 GC PAUSE/LOGGING
 This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
 the defaults are in the start scripts. GC logging is not enabled either. I 
 filed an internal bug against myself to enable logging, and tune GC.
 CLIENT SESSION TIMEOUTS
 Clients are not explicitly setting timeouts, and I believe sessionTimeout is 
 10 seconds based on this log entry when initially connecting.
 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: 
 Initiating client connection, 
 host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
 context=(nil) flags=0
 CLIENT BACKOFFS
 Looking in application logs, we see lots of the following:
 2010-07-01 
 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket 
 [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): 
 connection timed out (exceeded timeout by 0ms)
 Doing some simple aggregations we see 130 errors in a ~13 minute sample 
 period. This behavior on thousands of clients appears to have been a DDoS 
 attack against Zookeeper. Using exponential behavior as the default behavior 
 seems appropriate looking at this data.
 Fulltext of the client errors is attached. I grepped errno from a Python 
 client log; I believe it uses the same underlying C library, so I did not 
 include example output from a C program (though I can if needed). It looks 
 basically the same.
 GOING FORWARD
 The long-GC pause causing clients to dogpile sounds like the most plausible 
 explanation at this time. GC logging/tuning is clearly where I dropped the 
 ball, just using the defaults; I don't think any changes should be made 
 related to lack of tuning.
 Exponential backoffs does seem like a good idea, and generally useful for 
 most people. There will always be service interruptions and backoffs would be 
 a great preventive measure to get out of a dogpile situation.
 Patrick's message:
 
 Hi Travis, as Flavio suggested would be great to get the logs. A few 
 questions:
 1) how did you eventually recover, restart the zk servers?
 2) was the cluster losing quorum during this time? leader re-election?
 3) Any chance this could have been initially triggered by a long GC pause on 
 one of the servers? (is gc logging turned on, any sort of heap monitoring?) 
 Has the GC been tuned on the servers, for example CMS and incremental?
 4) what are the clients using for timeout 

[jira] Resolved: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-02 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford resolved ZOOKEEPER-801.
---

  Assignee: Travis Crawford
Resolution: Not A Problem

 Zookeeper outage post-mortem
 

 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java
Reporter: Travis Crawford
Assignee: Travis Crawford
 Attachments: client-errors.txt, zookeeper-out-of-files.netstat


 [Moving a thread from the zookeeper-user]
 RECOVERY
 We eventually recovered from this situation by shutting down clients. 
 Initially I tried restarting the Zookeepers, however, they were getting 
 hammered and I believe sessions timing out. I shut down ~2k clients 
 (lightweight python app; simply sets one data watch and takes an action when 
 it changes) at which point zktop could make a connection and a leader 
 election was verified. After resetting latency stats the numbers were very 
 good. I do not believe it would have ever recovered without shedding load.
 QUORUM/ELECTIONS DURING EVENT
 Unfortunately I do not have logs from the event :( We had debug logging on, 
 and logrotate configured to keep 10 100MB files, and the interesting parts 
 rotated away. I have already switched to info logging so we don't lose this 
 data again.
 During the incident I was not able to view cluster status with zktop, and 
 never observed a successful operation beyond connections, which quickly timed 
 out.
 GC PAUSE/LOGGING
 This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
 the defaults are in the start scripts. GC logging is not enabled either. I 
 filed an internal bug against myself to enable logging, and tune GC.
 CLIENT SESSION TIMEOUTS
 Clients are not explicitly setting timeouts, and I believe sessionTimeout is 
 10 seconds based on this log entry when initially connecting.
 2010-07-01 05:14:00,260:44267(0x2af330240110):zoo_i...@zookeeper_init@727: 
 Initiating client connection, 
 host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
 sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
 context=(nil) flags=0
 CLIENT BACKOFFS
 Looking in application logs, we see lots of the following:
 2010-07-01 
 05:13:14,674:41491(0x41ebf940):zoo_er...@handle_socket_error_msg@1528: Socket 
 [10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): 
 connection timed out (exceeded timeout by 0ms)
 Doing some simple aggregations we see 130 errors in a ~13 minute sample 
 period. This behavior on thousands of clients appears to have been a DDoS 
 attack against Zookeeper. Using exponential behavior as the default behavior 
 seems appropriate looking at this data.
 Fulltext of the client errors is attached. I grepped errno from a Python 
 client log; I believe it uses the same underlying C library, so I did not 
 include example output from a C program (though I can if needed). It looks 
 basically the same.
 GOING FORWARD
 The long-GC pause causing clients to dogpile sounds like the most plausible 
 explanation at this time. GC logging/tuning is clearly where I dropped the 
 ball, just using the defaults; I don't think any changes should be made 
 related to lack of tuning.
 Exponential backoffs does seem like a good idea, and generally useful for 
 most people. There will always be service interruptions and backoffs would be 
 a great preventive measure to get out of a dogpile situation.
 Patrick's message:
 
 Hi Travis, as Flavio suggested would be great to get the logs. A few 
 questions:
 1) how did you eventually recover, restart the zk servers?
 2) was the cluster losing quorum during this time? leader re-election?
 3) Any chance this could have been initially triggered by a long GC pause on 
 one of the servers? (is gc logging turned on, any sort of heap monitoring?) 
 Has the GC been tuned on the servers, for example CMS and incremental?
 4) what are the clients using for timeout on the sessions?
 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a 
 few weeks to fix a couple critical issues (which don't seem related to what 
 you saw). If we can identify the problem here we should be able to include it 
 in any fix release we do.
 fixing something like 517 might help, but it's not clear how we got to this 
 state in the first place. fixing 517 might not have any effect if the root 
 cause is not addressed. 662 has only ever been reported once afaik, and we 
 weren't able to identify the root cause for that one.
 One thing we might also consider is 

[jira] Created: (ZOOKEEPER-803) Improve defenses against misbehaving clients

2010-07-02 Thread Travis Crawford (JIRA)
Improve defenses against misbehaving clients


 Key: ZOOKEEPER-803
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-803
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Travis Crawford


This issue is in response to ZOOKEEPER-801. The short version is that a small 
number of buggy clients opened thousands of connections and caused Zookeeper to 
fail.

The misbehaving client did not correctly handle expired sessions and created a 
new connection each time. The huge number of connections exacerbated the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-803) Improve defenses against misbehaving clients

2010-07-02 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-803:
--

Attachment: connection-bugfix-diff.png

This diff shows a bug where the client developer confused disconnections with 
expired sessions. In the zookeeper programming model, clients reconnect 
automatically when disconnected; however, should the session expire, the 
application is responsible for creating a new session.

In this case the developer attempted to throttle reconnects, but due to a bug 
the application created a new connection each time.

A small number of clients running the buggy code took down a 3-node Zookeeper 
cluster by exhausting the 65k file descriptor limit. It only recovered after 
shutting down the clients, restarting the Zookeepers, and then restarting the 
well-behaved clients.
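
For comparison, a sketch of the handling the client should have had, again 
assuming the zkpython bindings: plain disconnects are left to the library, and a 
brand-new handle is only created when the session has actually expired. The 
ensemble string and class are placeholders:

{code}
# Sketch of correct expiry handling with the zkpython bindings: do nothing on
# a plain disconnect (the library reconnects and the session survives), and
# only create a new handle once the session has actually expired.
import zookeeper

ZK_HOSTS = "localhost:2181"   # placeholder ensemble

class Client(object):
    def __init__(self):
        self.handle = zookeeper.init(ZK_HOSTS, self.watcher)

    def watcher(self, handle, event_type, state, path):
        if event_type != zookeeper.SESSION_EVENT:
            return
        if state == zookeeper.EXPIRED_SESSION_STATE:
            # Session is gone: close the old handle and start a new session.
            zookeeper.close(self.handle)
            self.handle = zookeeper.init(ZK_HOSTS, self.watcher)
        # On connecting/connected state changes do nothing: the library
        # reconnects on its own, and creating a new handle here is exactly
        # the bug that flooded the cluster with connections.
{code}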

 Improve defenses against misbehaving clients
 

 Key: ZOOKEEPER-803
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-803
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Travis Crawford
 Attachments: connection-bugfix-diff.png


 This issue is in response to ZOOKEEPER-801. Short version is a small number 
 of buggy clients opened thousands of connections and caused Zookeeper to fail.
 The misbehaving client did not correctly handle expired sessions, creating a 
 new connection each time. The huge number of connections exacerbated the 
 issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-803) Improve defenses against misbehaving clients

2010-07-02 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884829#action_12884829
 ] 

Travis Crawford commented on ZOOKEEPER-803:
---

Maybe 8-10 clients were running the buggy code. Not too many.

 Improve defenses against misbehaving clients
 

 Key: ZOOKEEPER-803
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-803
 Project: Zookeeper
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Travis Crawford
 Attachments: connection-bugfix-diff.png


 This issue is in response to ZOOKEEPER-801. Short version is a small number 
 of buggy clients opened thousands of connections and caused Zookeeper to fail.
 The misbehaving client did not correctly handle expired sessions, creating a 
 new connection each time. The huge number of connections exacerbated the 
 issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership

2010-07-06 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885738#action_12885738
 ] 

Travis Crawford commented on ZOOKEEPER-790:
---

@vishal - How did your test go? I came across this JIRA while investigating an 
issue and believe it is the root cause.

If your test went well I can also help test.

 Last processed zxid set prematurely while establishing leadership
 -

 Key: ZOOKEEPER-790
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-790.patch


 The leader code is setting the last processed zxid to the first of the new 
 epoch even before connecting to a quorum of followers. Because the leader 
 code sets this value before connecting to a quorum of followers 
 (Leader.java:281) and the follower code throws an IOException 
 (Follower.java:73) if the leader epoch is smaller, we have that when the 
 false leader drops leadership and becomes a follower, it finds a smaller 
 epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership

2010-07-14 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888576#action_12888576
 ] 

Travis Crawford commented on ZOOKEEPER-790:
---

Sure, I'll help test this too.

 Last processed zxid set prematurely while establishing leadership
 -

 Key: ZOOKEEPER-790
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-790.patch


 The leader code is setting the last processed zxid to the first of the new 
 epoch even before connecting to a quorum of followers. Because the leader 
 code sets this value before connecting to a quorum of followers 
 (Leader.java:281) and the follower code throws an IOException 
 (Follower.java:73) if the leader epoch is smaller, we have that when the 
 false leader drops leadership and becomes a follower, it finds a smaller 
 epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-07-14 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888684#action_12888684
 ] 

Travis Crawford commented on ZOOKEEPER-335:
---

Unfortunately I still observed the Leader epoch issue and needed to manually 
force a leader election for the cluster to recover. This test was performed 
with the following base+patches, applied in the order listed.

Zookeeper 3.3.1
ZOOKEEPER-744
ZOOKEEPER-790


{code}
2010-07-15 02:43:57,181 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] 
- Reading snapshot /data/zookeeper/version-2/snapshot.231ac2
2010-07-15 02:43:57,384 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id 
=  1, Proposed zxid = 154618826848
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
154618826848, 4, 1, LOOKING, LOOKING, 1
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 
146030952153, 3, 1, LOOKING, LEADING, 2
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 
146030952153, 3, 1, LOOKING, FOLLOWING, 3
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/data/zookeeper/txlog/version-2 snapdir /data/zookeeper/version-2
2010-07-15 02:43:57,387 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch 23 is less than our epoch 24
2010-07-15 02:43:57,387 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] 
- Exception when following the leader 
java.io.IOException: Error: Epoch of leader is lower
at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-07-15 02:43:57,387 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] 
- shutdown called 
java.lang.Exception: shutdown Follower
at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
{code}


I followed the recipe @vishal provided for recreating.

(a) Stop one follower in a three node cluster
(b) Get some tea while it falls behind
(c) Start the node stopped in (a).


These timestamps show when the follower was stopped and when it was turned back 
on.

{code}
2010-07-15 02:35:36,398 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1661] - Established session 
0x229aa13cfc6276b with negotiated timeout 1 for client /10.209.45.114:34562
2010-07-15 02:39:18,907 - INFO  [main:quorumpeercon...@90] - Reading 
configuration from: /etc/zookeeper/conf/zoo.cfg
{code}


This timestamp is the first ``Leader epoch`` line. Everything between these two 
points will be the interesting bits.

{code}
2010-07-15 02:39:43,339 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch 23 is less than our epoch 24
{code}

 zookeeper servers should commit the new leader txn to their logs.
 -

 Key: ZOOKEEPER-335
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.1.0
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
 Fix For: 3.4.0

 Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz


 currently the zookeeper followers do not commit the new leader election. This 
 will cause problems in a failure scenarios with a follower acking to the same 
 leader txn id twice, which might be two different intermittent leaders and 
 allowing them to propose two different txn's of the same zxid.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-335) zookeeper servers should commit the new leader txn to their logs.

2010-07-14 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-335:
--

Attachment: ZOOKEEPER-790.travis.log.bz2

Please see my most recent comment for a summary of where the interesting lines 
are in this log file.

 zookeeper servers should commit the new leader txn to their logs.
 -

 Key: ZOOKEEPER-335
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-335
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.1.0
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
 Fix For: 3.4.0

 Attachments: faultynode-vishal.txt, zk.log.gz, zklogs.tar.gz, 
 ZOOKEEPER-790.travis.log.bz2


 currently the zookeeper followers do not commit the new leader election. This 
 will cause problems in a failure scenarios with a follower acking to the same 
 leader txn id twice, which might be two different intermittent leaders and 
 allowing them to propose two different txn's of the same zxid.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership

2010-07-14 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-790:
--

Attachment: ZOOKEEPER-790.travis.log.bz2

 Last processed zxid set prematurely while establishing leadership
 -

 Key: ZOOKEEPER-790
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2


 The leader code is setting the last processed zxid to the first of the new 
 epoch even before connecting to a quorum of followers. Because the leader 
 code sets this value before connecting to a quorum of followers 
 (Leader.java:281) and the follower code throws an IOException 
 (Follower.java:73) if the leader epoch is smaller, we have that when the 
 false leader drops leadership and becomes a follower, it finds a smaller 
 epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership

2010-07-14 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888688#action_12888688
 ] 

Travis Crawford commented on ZOOKEEPER-790:
---

I accidentally posted this in ZOOKEEPER-335 -- reposting here. Sorry for the 
posting mixup -- the content is correct though :)


Unfortunately I still observed the Leader epoch issue and needed to manually 
force a leader election for the cluster to recover. This test was performed 
with the following base+patches, applied in the order listed.

Zookeeper 3.3.1
ZOOKEEPER-744
ZOOKEEPER-790

{code}
2010-07-15 02:43:57,181 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:files...@82] 
- Reading snapshot /data/zookeeper/version-2/snapshot.231ac2
2010-07-15 02:43:57,384 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@649] - New election. My id 
=  1, Proposed zxid = 154618826848
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@689] - Notification: 1, 
154618826848, 4, 1, LOOKING, LOOKING, 1
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 
146030952153, 3, 1, LOOKING, LEADING, 2
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@799] - Notification: 2, 
146030952153, 3, 1, LOOKING, FOLLOWING, 3
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@642] - FOLLOWING
2010-07-15 02:43:57,385 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@151] - Created server with 
tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 4 datadir 
/data/zookeeper/txlog/version-2 snapdir /data/zookeeper/version-2
2010-07-15 02:43:57,387 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch 23 is less than our epoch 24
2010-07-15 02:43:57,387 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] 
- Exception when following the leader 
java.io.IOException: Error: Epoch of leader is lower
at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:644)
2010-07-15 02:43:57,387 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@166] 
- shutdown called 
java.lang.Exception: shutdown Follower
at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:648)
{code}


I followed the recipe @vishal provided for recreating.

(a) Stop one follower in a three node cluster
(b) Get some tea while it falls behind
(c) Start the node stopped in (a).

These timestamps show when the follower was stopped and when it was turned back 
on.

{code}
2010-07-15 02:35:36,398 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@1661] - Established session 
0x229aa13cfc6276b with negotiated timeout 1 for client /10.209.45.114:34562
2010-07-15 02:39:18,907 - INFO  [main:quorumpeercon...@90] - Reading 
configuration from: /etc/zookeeper/conf/zoo.cfg
{code}


This timestamp is the first ``Leader epoch`` line. Everything between these two 
points will be the interesting bits.


{code}
2010-07-15 02:39:43,339 - FATAL [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] 
- Leader epoch 23 is less than our epoch 24
{code}

 Last processed zxid set prematurely while establishing leadership
 -

 Key: ZOOKEEPER-790
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2


 The leader code is setting the last processed zxid to the first of the new 
 epoch even before connecting to a quorum of followers. Because the leader 
 code sets this value before connecting to a quorum of followers 
 (Leader.java:281) and the follower code throws an IOException 
 (Follower.java:73) if the leader epoch is smaller, we have that when the 
 false leader drops leadership and becomes a follower, it finds a smaller 
 epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership

2010-07-15 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1242#action_1242
 ] 

Travis Crawford commented on ZOOKEEPER-790:
---

Good find! I can certainly help test a patch, or collect more info if needed.

 Last processed zxid set prematurely while establishing leadership
 -

 Key: ZOOKEEPER-790
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2


 The leader code is setting the last processed zxid to the first of the new 
 epoch even before connecting to a quorum of followers. Because the leader 
 code sets this value before connecting to a quorum of followers 
 (Leader.java:281) and the follower code throws an IOException 
 (Follower.java:73) if the leader epoch is smaller, we have that when the 
 false leader drops leadership and becomes a follower, it finds a smaller 
 epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership

2010-07-20 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890512#action_12890512
 ] 

Travis Crawford commented on ZOOKEEPER-790:
---

I tested this patch on a build with the following, applied in the listed order: 
3.3.1 release + ZOOKEEPER-744.patch + ZOOKEEPER-790-3.3.patch

Looks good!

{code}
2010-07-20 23:43:34,229 - INFO  [Thread-2545:NIOServerCnxn@1516] - Closed 
socket connection for client /10.209.21.181:53743 (no session established for 
client)
2010-07-20 23:43:34,659 - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - 
EndOfStreamException: Unable to read additional data from client sessionid 
0x129d3fcb5a6f60d, likely client has closed socket
2010-07-20 23:43:34,660 - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1516] - Closed socket 
connection for client /10.209.21.204:59727 which had sessionid 0x129d3fcb5a6f60d
2010-07-20 23:43:34,684 - INFO  [ProcessThread:-1:PrepRequestProcessor@385] - 
Processed session termination for sessionid: 0x329d3fcb6594e53
2010-07-20 23:52:14,522 - INFO  [main:QuorumPeerConfig@90] - Reading 
configuration from: /etc/zookeeper/conf/zoo.cfg
2010-07-20 23:52:14,529 - INFO  [main:QuorumPeerConfig@287] - Defaulting to 
majority quorums
2010-07-20 23:52:14,540 - INFO  [main:QuorumPeerMain@119] - Starting quorum peer
2010-07-20 23:52:14,562 - INFO  [main:NIOServerCnxn$Factory@149] - binding to 
port 0.0.0.0/0.0.0.0:2181
2010-07-20 23:52:14,578 - INFO  [main:QuorumPeer@818] - tickTime set to 2000
2010-07-20 23:52:14,579 - INFO  [main:QuorumPeer@829] - minSessionTimeout set 
to -1
2010-07-20 23:52:14,579 - INFO  [main:QuorumPeer@840] - maxSessionTimeout set 
to -1
2010-07-20 23:52:14,579 - INFO  [main:QuorumPeer@855] - initLimit set to 10
2010-07-20 23:52:14,798 - INFO  [main:FileSnap@82] - Reading snapshot 
/data/zookeeper/version-2/snapshot.2500197ee5
2010-07-20 23:52:15,660 - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@256] - 
Accepted socket connection from /10.209.45.76:57030
2010-07-20 23:52:15,661 - INFO  [Thread-1:QuorumCnxManager$Listener@436] - My 
election bind port: 3888
2010-07-20 23:52:15,663 - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@644] - Exception 
causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
running
2010-07-20 23:52:15,664 - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1516] - Closed socket 
connection for client /10.209.45.76:57030 (no session established for client)
2010-07-20 23:52:15,670 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - LOOKING
2010-07-20 23:52:15,672 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id 
=  1, Proposed zxid = 158915472832
2010-07-20 23:52:15,674 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 
158915472832, 1, 1, LOOKING, LOOKING, 1
2010-07-20 23:52:15,674 - INFO  [WorkerSender Thread:QuorumCnxManager@162] - 
Have smaller server identifier, so dropping the connection: (2, 1)
2010-07-20 23:52:15,675 - INFO  [WorkerSender Thread:QuorumCnxManager@162] - 
Have smaller server identifier, so dropping the connection: (3, 1)
2010-07-20 23:52:15,676 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 2, 
158915472832, 5, 1, LOOKING, LOOKING, 2
2010-07-20 23:52:15,676 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 3, 
158915472832, 5, 1, LOOKING, LOOKING, 2
2010-07-20 23:52:15,676 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@711] - Updating proposal
2010-07-20 23:52:15,677 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@799] - Notification: 3, 
158915472832, 5, 1, LOOKING, FOLLOWING, 2
2010-07-20 23:52:15,677 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 3, 
158915472832, 5, 1, LOOKING, LOOKING, 3
2010-07-20 23:52:15,879 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@642] - FOLLOWING
2010-07-20 23:52:15,885 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@72] - 
TCP NoDelay set to: true
2010-07-20 23:52:15,893 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server 
environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT
2010-07-20 23:52:15,893 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server 
environment:host.name=sjc1k029.twitter.com
2010-07-20 23:52:15,894 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server 
environment:java.version=1.6.0_16
2010-07-20 23:52:15,894 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server 
environment:java.vendor=Sun Microsystems Inc.
2010-07-20 23:52:15,894 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server 
environment:java.home=/usr/java/jdk1.6.0_16/jre
2010-07-20 23:52:15,894 - INFO  
{code}

[jira] Commented: (ZOOKEEPER-790) Last processed zxid set prematurely while establishing leadership

2010-07-20 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890513#action_12890513
 ] 

Travis Crawford commented on ZOOKEEPER-790:
---

Just to double-check: I'm really, really sure the running jar was freshly 
built. For example, lsof says:

java    4473 root  mem    REG  104,1  1012397  10092999 
/usr/local/zookeeper/zookeeper-3.3.2-dev.jar

Yes, the version number is 3.3.2, but this was built from the 3.3.1 release.

Looking in the log we see:

2010-07-20 23:52:15,893 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:Environment@97] - Server 
environment:zookeeper.version=3.3.1-942149, built on 05/07/2010 17:14 GMT

Should that say it was built this afternoon? I've double- and triple-checked 
and believe this is a newly built jar. Looking in the tarball, 
zookeeper-3.3.2-dev.jar does not appear until after building.

 Last processed zxid set prematurely while establishing leadership
 -

 Key: ZOOKEEPER-790
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-790
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum
Affects Versions: 3.3.1
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.2, 3.4.0

 Attachments: ZOOKEEPER-790-3.3.patch, ZOOKEEPER-790.patch, 
 ZOOKEEPER-790.patch, ZOOKEEPER-790.patch, ZOOKEEPER-790.travis.log.bz2


 The leader code is setting the last processed zxid to the first of the new 
 epoch even before connecting to a quorum of followers. Because the leader 
 code sets this value before connecting to a quorum of followers 
 (Leader.java:281) and the follower code throws an IOException 
 (Follower.java:73) if the leader epoch is smaller, we have that when the 
 false leader drops leadership and becomes a follower, it finds a smaller 
 epoch and kills itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)
Connection imbalance leads to overloaded ZK instances
-

 Key: ZOOKEEPER-856
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856
 Project: Zookeeper
  Issue Type: Bug
Reporter: Travis Crawford


We've experienced a number of issues lately where ruok requests would take 
upwards of 10 seconds to return and ZooKeeper instances were extremely 
sluggish. A sluggish instance requires a restart to become responsive again.

I believe the issue is that connections are very imbalanced, leading to certain 
instances carrying many thousands of connections while other instances are 
largely idle.

A potential solution is periodically disconnecting/reconnecting clients to 
balance connections over time; this seems safe because sessions should not be 
affected, and therefore ephemeral nodes and watches should not be affected 
either.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-856:
--

Attachment: zk_open_file_descriptor_count_total.gif
zk_open_file_descriptor_count_members.gif

Attached are two graphs showing:

- Total ZooKeeper connections to a 3 node cluster
- Connections per member in the cluster

In the totals graph, notice how the total is largely unchanged over time; this 
period represents steady-state usage.

In the members graph, notice how the number of connections differs 
significantly between machines. This cluster allows the leader to service 
reads, so the leader/follower distinction is not a factor when interpreting the 
connection counts.

These graphs look very similar to an issue I had with another service (scribe), 
which we solved by disconnecting every N±K messages. We tried getting fancy by 
publishing load metrics and using a smart selection algorithm, but in practice 
the periodic disconnect/reconnect was easier to implement and worked better, so 
I'm suggesting that approach as a potential solution here.
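
To make the N±K idea concrete, here is a minimal sketch of such a policy 
(illustrative only, not ZooKeeper code): count operations per connection and 
signal a disconnect after a randomized threshold, so clients spread their 
reconnects out rather than moving in lockstep.

{code}
import java.util.Random;

// Illustrative sketch of a "disconnect every N +/- K operations" policy.
// The caller consults shouldDisconnect() after each request and drops the
// TCP connection when it returns true; the session is left alone, so
// ephemeral nodes and watches survive the client's reconnect.
class PeriodicDisconnectPolicy {
    private final int base;    // N: average operations between disconnects
    private final int jitter;  // K: spread, so clients don't all reconnect at once
    private final Random random = new Random();
    private long remaining;

    PeriodicDisconnectPolicy(int base, int jitter) {
        this.base = base;
        this.jitter = jitter;
        this.remaining = nextThreshold();
    }

    private long nextThreshold() {
        // Uniformly pick a value in [base - jitter, base + jitter].
        return base - jitter + random.nextInt(2 * jitter + 1);
    }

    synchronized boolean shouldDisconnect() {
        if (--remaining > 0) {
            return false;
        }
        remaining = nextThreshold();
        return true;
    }
}
{code}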

 Connection imbalance leads to overloaded ZK instances
 -

 Key: ZOOKEEPER-856
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856
 Project: Zookeeper
  Issue Type: Bug
Reporter: Travis Crawford
 Attachments: zk_open_file_descriptor_count_members.gif, 
 zk_open_file_descriptor_count_total.gif


 We've experienced a number of issues lately where ruok requests would take 
 upwards of 10 seconds to return, and ZooKeeper instances were extremely 
 sluggish. The sluggish instance requires a restart to make it responsive 
 again.
 I believe the issue is connections are very imbalanced, leading to certain 
 instances having many thousands of connections, while other instances are 
 largely idle.
 A potential solution is periodically disconnecting/reconnecting to balance 
 connections over time; this seems fine because sessions should not be 
 affected, and therefore ephemeral nodes and watches should not be affected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903065#action_12903065
 ] 

Travis Crawford commented on ZOOKEEPER-856:
---

@mahadev - I would love to help test a patch :) I'm currently using 3.3.1 + 
ZOOKEEPER-744 + ZOOKEEPER-790, applied in that order.

If there's a knob for how frequently to disconnect/reconnect, I can try out 
different settings to see what a sensible default would be.

Do you think this should be a client or a server setting? I'm leaning toward a 
server setting because otherwise it's not possible to enforce the policy.
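
If it does end up as a server setting, one conceptual shape for it (purely 
illustrative, not ZooKeeper's actual server code; maxConnectionAgeMs is a 
hypothetical knob) would be to track when each connection was accepted and 
close connections older than a jittered maximum age, relying on the client 
library to re-establish the session on another server:

{code}
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch only. A server-side enforcer records when each client
// connection was accepted and flags connections that have outlived a
// (jittered) maximum age so they can be closed; the client then reconnects
// elsewhere with its session intact.
class ConnectionAgeEnforcer<C> {
    private final long maxConnectionAgeMs;  // hypothetical server-side knob
    private final Random random = new Random();
    private final Map<C, Long> acceptedAt = new ConcurrentHashMap<C, Long>();

    ConnectionAgeEnforcer(long maxConnectionAgeMs) {
        this.maxConnectionAgeMs = maxConnectionAgeMs;
    }

    void onAccept(C connection) {
        acceptedAt.put(connection, System.currentTimeMillis());
    }

    void onClose(C connection) {
        acceptedAt.remove(connection);
    }

    boolean shouldClose(C connection) {
        Long start = acceptedAt.get(connection);
        if (start == null) {
            return false;
        }
        // Up to 20% jitter so old connections are not all dropped at once.
        long limit = maxConnectionAgeMs
                + (long) (random.nextDouble() * 0.2 * maxConnectionAgeMs);
        return System.currentTimeMillis() - start > limit;
    }
}
{code}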

 Connection imbalance leads to overloaded ZK instances
 -

 Key: ZOOKEEPER-856
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856
 Project: Zookeeper
  Issue Type: Bug
Reporter: Travis Crawford
 Fix For: 3.4.0

 Attachments: zk_open_file_descriptor_count_members.gif, 
 zk_open_file_descriptor_count_total.gif


 We've experienced a number of issues lately where ruok requests would take 
 upwards of 10 seconds to return, and ZooKeeper instances were extremely 
 sluggish. The sluggish instance requires a restart to make it responsive 
 again.
 I believe the issue is connections are very imbalanced, leading to certain 
 instances having many thousands of connections, while other instances are 
 largely idle.
 A potential solution is periodically disconnecting/reconnecting to balance 
 connections over time; this seems fine because sessions should not be 
 affected, and therefore ephemeral nodes and watches should not be affected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-26 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903170#action_12903170
 ] 

Travis Crawford commented on ZOOKEEPER-856:
---

@patrick - We're using these settings, which I believe are based on what's 
recommended in the troubleshooting guide.

-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+UseConcMarkSweepGC

Looking at the logs I do see lots of GC activity. For example:

Total time for which application threads were stopped: 0.5599050 seconds
Application time: 0.0056590 seconds

I only see this on the hosts that became unresponsive after acquiring lots of 
connections.

Any suggestions for the GC flags? If there's something better, I can experiment 
with it and update the wiki if we discover something interesting.

http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

 Connection imbalance leads to overloaded ZK instances
 -

 Key: ZOOKEEPER-856
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856
 Project: Zookeeper
  Issue Type: Bug
Reporter: Travis Crawford
 Fix For: 3.4.0

 Attachments: zk_open_file_descriptor_count_members.gif, 
 zk_open_file_descriptor_count_total.gif


 We've experienced a number of issues lately where ruok requests would take 
 upwards of 10 seconds to return, and ZooKeeper instances were extremely 
 sluggish. The sluggish instance requires a restart to make it responsive 
 again.
 I believe the issue is connections are very imbalanced, leading to certain 
 instances having many thousands of connections, while other instances are 
 largely idle.
 A potential solution is periodically disconnecting/reconnecting to balance 
 connections over time; this seems fine because sessions should not be 
 affected, and therefore ephemeral nodes and watches should not be affected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-856) Connection imbalance leads to overloaded ZK instances

2010-08-27 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903592#action_12903592
 ] 

Travis Crawford commented on ZOOKEEPER-856:
---

Thanks for the suggestions! I'm trying out the following options:

-Xmx4000M
-Xms4000M
-Xloggc:/var/log/zookeeper/gc_20100827_201441.log
-XX:+PrintGC
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing

I still think the connection imbalance issue is worth addressing, though. Even 
properly tuned ZK servers can become overloaded as longer-running processes 
steadily accumulate connections.

 Connection imbalance leads to overloaded ZK instances
 -

 Key: ZOOKEEPER-856
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-856
 Project: Zookeeper
  Issue Type: Bug
Reporter: Travis Crawford
 Fix For: 3.4.0

 Attachments: zk_open_file_descriptor_count_members.gif, 
 zk_open_file_descriptor_count_total.gif


 We've experienced a number of issues lately where ruok requests would take 
 upwards of 10 seconds to return, and ZooKeeper instances were extremely 
 sluggish. The sluggish instance requires a restart to make it responsive 
 again.
 I believe the issue is connections are very imbalanced, leading to certain 
 instances having many thousands of connections, while other instances are 
 largely idle.
 A potential solution is periodically disconnecting/reconnecting to balance 
 connections over time; this seems fine because sessions should not be 
 affected, and therefore ephemeral nodes and watches should not be affected.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.