[jira] Commented: (ZOOKEEPER-719) Add throttling to BookKeeper client

2010-07-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884182#action_12884182
 ] 

Hadoop QA commented on ZOOKEEPER-719:
-

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12448478/ZOOKEEPER-719.patch
  against trunk revision 958096.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/92/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/92/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h7.grid.sp2.yahoo.net/92/console

This message is automatically generated.

 Add throttling to BookKeeper client
 ---

 Key: ZOOKEEPER-719
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-719
 Project: Zookeeper
  Issue Type: Bug
  Components: contrib-bookkeeper
Affects Versions: 3.3.0
Reporter: Flavio Paiva Junqueira
Assignee: Flavio Paiva Junqueira
 Fix For: 3.4.0

 Attachments: ZOOKEEPER-719.patch, ZOOKEEPER-719.patch, 
 ZOOKEEPER-719.patch, ZOOKEEPER-719.patch


 Add throttling to client to control the rate of operations to bookies. 
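
 The attached patch contains the real implementation; purely as a hedged 
 illustration of the idea (class and method names below are made up for the 
 example, not taken from the patch), a client-side throttle can be as small as 
 a counting semaphore that bounds the number of outstanding operations to 
 bookies:

    import java.util.concurrent.Semaphore;

    // Hypothetical sketch only -- not the code in ZOOKEEPER-719.patch.
    // Caps the number of in-flight operations a client sends to bookies.
    public class OpThrottle {
        private final Semaphore permits;

        public OpThrottle(int maxOutstandingOps) {
            this.permits = new Semaphore(maxOutstandingOps);
        }

        // Call before sending an operation; blocks once the cap is reached.
        public void acquire() throws InterruptedException {
            permits.acquire();
        }

        // Call from the operation's completion callback.
        public void release() {
            permits.release();
        }
    }

 A caller would acquire() before issuing each add/read and release() in its 
 completion callback, so a slow bookie backpressures the client instead of 
 letting requests pile up.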

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-801:
--

Attachment: client-errors.txt

client-errors.txt contains the zookeeper error messages logged during an outage. 
The interesting part is how frequently they occur. This suggests having clients 
back off during issues would be an improvement, as this many connections 
arriving so quickly makes things worse.

Of course, there's a tradeoff between reconnecting quickly and backing off. For 
this reason, perhaps exponential backoff should be the default behavior, but 
clients that require fast reconnections could optionally use the current mode.
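
For illustration only, here is a minimal sketch of what exponential backoff 
with jitter could look like in a client's reconnect loop (a hypothetical helper 
in Java, not part of any existing zk client library):

    import java.util.Random;

    // Hypothetical sketch of exponential backoff with jitter for reconnect
    // attempts; not part of the ZooKeeper client libraries.
    public class ReconnectBackoff {
        private static final long BASE_MS = 100;   // first retry delay
        private static final long MAX_MS = 30000;  // cap so recovery is still reasonably fast
        private final Random random = new Random();
        private int failures = 0;

        // Call after each failed connection attempt; returns how long to
        // sleep before the next attempt.
        public long nextDelayMs() {
            long exp = Math.min(MAX_MS, BASE_MS << Math.min(failures, 20));
            failures++;
            // Full jitter: spread clients out so they do not reconnect in lockstep.
            return (long) (random.nextDouble() * exp);
        }

        // Call once a connection succeeds.
        public void reset() {
            failures = 0;
        }
    }

The jitter matters as much as the exponential growth: without it, thousands of 
clients that lost their connections at the same moment would all retry on the 
same schedule and hammer the servers in waves.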


[jira] Created: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)
Zookeeper outage post-mortem


 Key: ZOOKEEPER-801
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-801
 Project: Zookeeper
  Issue Type: Improvement
Affects Versions: 3.3.0
 Environment: - RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java
Reporter: Travis Crawford


[Moving a thread from the zookeeper-user]

RECOVERY
We eventually recovered from this situation by shutting down clients. Initially 
I tried restarting the Zookeepers; however, they were getting hammered and I 
believe sessions were timing out. I shut down ~2k clients (a lightweight python 
app that simply sets one data watch and takes an action when it changes), at 
which point zktop could make a connection and a leader election was verified. 
After resetting latency stats the numbers were very good. I do not believe it 
would have ever recovered without shedding load.


QUORUM/ELECTIONS DURING EVENT
Unfortunately I do not have logs from the event :( We had debug logging on, and 
logrotate configured to keep ten 100MB files, and the interesting parts rotated 
away. I have already switched to info logging so we don't lose this data again.

During the incident I was not able to view cluster status with zktop, and never 
observed a successful operation beyond connections, which quickly timed out.


GC PAUSE/LOGGING
This is a very good question. No, Zookeeper GC is not tuned and uses whatever 
the defaults are in the start scripts. GC logging is not enabled either. I 
filed an internal bug against myself to enable logging, and tune GC.
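
For reference (a hedged example only, not the flags from our actual start 
scripts), GC logging plus the CMS/incremental collector Patrick asks about can 
be enabled with HotSpot options along these lines; the heap size, paths and jar 
names are placeholders:

    java -Xmx2g \
         -Xloggc:/path/to/zookeeper-gc.log \
         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
         -cp zookeeper.jar:lib/* \
         org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg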


CLIENT SESSION TIMEOUTS
Clients are not explicitly setting timeouts, and I believe sessionTimeout is 10 
seconds based on this log entry when initially connecting.

2010-07-01 05:14:00,260:44267(0x2af330240110):ZOO_INFO@zookeeper_init@727: 
Initiating client connection, 
host=10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181 
sessionTimeout=1 watcher=(nil) sessionId=0 sessionPasswd=null 
context=(nil) flags=0
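
(The clients in this outage are C and python, but just to illustrate the knob: 
the Java client takes the session timeout explicitly in its constructor, and 
the C client's zookeeper_init() has the equivalent recv_timeout argument. The 
timeout value below is a placeholder.)

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Illustration only: connect with an explicit session timeout (ms).
    public class ExplicitTimeout {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper(
                    "10.209.21.133:2181,10.209.21.175:2181,10.209.21.181:2181",
                    10000,  // 10 second session timeout
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            // react to connection state changes / watches here
                        }
                    });
            // ... use zk ...
            zk.close();
        }
    }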


CLIENT BACKOFFS
Looking in application logs, we see lots of the following:

2010-07-01 
05:13:14,674:41491(0x41ebf940):ZOO_ERROR@handle_socket_error_msg@1528: Socket 
[10.209.21.181:2181] zk retcode=-7, errno=110(Connection timed out): connection 
timed out (exceeded timeout by 0ms)

Doing some simple aggregations we see 130 errors in a ~13 minute sample period. 
This behavior across thousands of clients appears to have amounted to a DDoS 
attack against Zookeeper. Using exponential backoff as the default behavior 
seems appropriate looking at this data.

Fulltext of the client errors is attached. I grepped errno from a Python 
client log; I believe it uses the same underlying C library, so I did not 
include example output from a C program (though I can if needed). It looks 
basically the same.


GOING FORWARD
The long GC pause causing clients to dogpile sounds like the most plausible 
explanation at this time. GC logging/tuning is clearly where I dropped the 
ball, just using the defaults; I don't think any changes should be made related 
to lack of tuning.

Exponential backoff does seem like a good idea, and generally useful for most 
people. There will always be service interruptions, and backoff would be a 
great preventive measure for getting out of a dogpile situation.



Patrick's message:

Hi Travis, as Flavio suggested it would be great to get the logs. A few questions:

1) how did you eventually recover, restart the zk servers?

2) was the cluster losing quorum during this time? leader re-election?

3) Any chance this could have been initially triggered by a long GC pause on 
one of the servers? (is gc logging turned on, any sort of heap monitoring?) Has 
the GC been tuned on the servers, for example CMS and incremental?

4) what are the clients using for timeout on the sessions?

3.4 is probably not due for a few months yet, but we are planning a 3.3.2 in a 
few weeks to fix a couple of critical issues (which don't seem related to what 
you saw). If we can identify the problem here we should be able to include a 
fix in any fix release we do.

Fixing something like ZOOKEEPER-517 might help, but it's not clear how we got 
into this state in the first place; fixing 517 might not have any effect if the 
root cause is not addressed. ZOOKEEPER-662 has only ever been reported once 
afaik, and we weren't able to identify the root cause for that one.

One thing we might also consider is modifying the zk client lib to back off 
connection attempts if they keep failing (say, by timing out). Today the clients 
are pretty aggressive on reconnection attempts. Having some sort of backoff 
(exponential?) would give the server more breathing room in this situation.

Patrick


Flavio's message:

Hi Travis, Do you think it would be possible for you to open a jira and upload 
your logs?

Thanks,
-Flavio


My initial message:

Hey 

[jira] Updated: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Travis Crawford updated ZOOKEEPER-801:
--

Attachment: zookeeper-out-of-files.netstat

The attached netstat output shows a large number of connections in SYN_RECV and 
CLOSE_WAIT. The fast connection churn appears to have consumed all available 
file handles.


[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884363#action_12884363
 ] 

Mahadev konar commented on ZOOKEEPER-801:
-

The CLOSE_WAIT sockets would probably be caused by the clients connecting to 
servers and then disconnecting on their own. This way the server will take some 
time closing the TCP connections. Any reason that the clients would shuttle 
around the servers so frequently? 


[jira] Commented: (ZOOKEEPER-801) Zookeeper outage post-mortem

2010-07-01 Thread Travis Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884368#action_12884368
 ] 

Travis Crawford commented on ZOOKEEPER-801:
---

I believe clients were reconnecting so frequently because their
connections timed out.



[jira] Updated: (ZOOKEEPER-517) NIO factory fails to close connections when the number of file handles run out.

2010-07-01 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-517:
---

Fix Version/s: 3.3.2
  Component/s: server

It would be great to get this into 3.3.2.

 NIO factory fails to close connections when the number of file handles run 
 out.
 ---

 Key: ZOOKEEPER-517
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-517
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Reporter: Mahadev konar
Priority: Critical
 Fix For: 3.3.2, 3.4.0


 The code in the NIO factory is such that if we fail to accept a connection for 
 some reason (too many open file handles may be one of them), we do not close 
 the connections that are in CLOSE_WAIT. We need to explicitly close these 
 sockets. One of the solutions might be to move doIO before accept so that we 
 can still close connections even if we cannot accept new ones.
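
 As a rough sketch of the ordering being proposed (illustrative only, and much 
 simpler than the real NIOServerCnxn factory code), a selector loop can service 
 and reap existing connections before it attempts to accept new ones, and 
 tolerate accept failures:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Set;

    // Illustrative sketch only -- not the NIOServerCnxn.Factory code.
    // Existing connections are serviced (and dead ones closed) before
    // accept(), so sockets stuck in CLOSE_WAIT are still reaped even when
    // accept() fails with "too many open files".
    public class SelectorLoopSketch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.configureBlocking(false);
            server.socket().bind(new InetSocketAddress(2181));
            server.register(selector, SelectionKey.OP_ACCEPT);
            ByteBuffer buf = ByteBuffer.allocate(4096);

            while (true) {
                selector.select(1000);
                Set<SelectionKey> keys = selector.selectedKeys();

                // Pass 1: doIO on existing connections. A read of -1 or an
                // exception means the peer is gone, so close our end instead
                // of leaving it in CLOSE_WAIT.
                for (SelectionKey key : keys) {
                    if (key.isValid() && key.isReadable()) {
                        SocketChannel ch = (SocketChannel) key.channel();
                        try {
                            buf.clear();
                            if (ch.read(buf) < 0) {
                                key.cancel();
                                ch.close();
                            }
                            // (real code would hand the bytes to the request pipeline)
                        } catch (IOException e) {
                            key.cancel();
                            try { ch.close(); } catch (IOException ignored) { }
                        }
                    }
                }

                // Pass 2: only then try to accept, surviving failures such as
                // running out of file descriptors.
                for (SelectionKey key : keys) {
                    if (key.isValid() && key.isAcceptable()) {
                        try {
                            SocketChannel ch = server.accept();
                            if (ch != null) {
                                ch.configureBlocking(false);
                                ch.register(selector, SelectionKey.OP_READ);
                            }
                        } catch (IOException e) {
                            // e.g. out of file handles; keep looping so pass 1
                            // can free sockets on the next iteration
                        }
                    }
                }
                keys.clear();
            }
        }
    }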

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.