Re: Misbehaving zk servers

2010-04-29 Thread Patrick Hunt
Is there any good (simple/fast/bulletproof) way to monitor the FD use 
inside the jvm? If so we could stop accepting new client connections 
once we get close to the os imposed limit... The test would have to be a 
bulletproof one though - we wouldn't want to end up in some worse 
situation (where we refuse connection because we mistakenly believe that 
the limit has been reached).


Might be good to open a JIRA for this and add some tests. In particular 
we should verify the server handles this as gracefully as it can when 
the limit has been reached.


Patrick

On 04/29/2010 09:34 AM, Mahadev Konar wrote:

Hi Travis,

  How many clients did you have connected to this server? The default limit
is usually 8K file descriptors. Did you have more clients than that?

Also, if clients fail to attach to a server, they will run off to another
server. We do not do any blacklisting because we expect the server to heal,
and if it does not, it shuts itself down in most cases.

Thanks
mahadev


On 4/29/10 12:08 AM, Travis Crawford <traviscrawf...@gmail.com> wrote:


Hey zookeeper gurus -

We recently had a zookeeper outage after one ZK server was started with
a low file descriptor limit following an upgrade to 3.3.0. The outage
occurred several days later, when that node reached its file descriptor
limit and clients started having major issues.

Are there any circumstances when a ZK server will get blacklisted from
the ensemble? Something similar to how tasktrackers are blacklisted
when too many tasks fail.

Thanks!
Travis




Re: Misbehaving zk servers

2010-04-29 Thread Travis Crawford
On Thu, Apr 29, 2010 at 9:49 AM, Patrick Hunt ph...@apache.org wrote:
> Is there any good (simple/fast/bulletproof) way to monitor the FD use inside
> the jvm? If so we could stop accepting new client connections once we get
> close to the os imposed limit... The test would have to be a bulletproof one
> though - we wouldn't want to end up in some worse situation (where we refuse
> connection because we mistakenly believe that the limit has been reached).
>
> Might be good to open a JIRA for this and add some tests. In particular we
> should verify the server handles this as gracefully as it can when the limit
> has been reached.

Poking around with jconsole I found two stats that already measure FDs:

- java.lang.OperatingSystem.MaxFileDescriptorCount
- java.lang.OperatingSystem.OpenFileDescriptorCount

They're described (rather tersely) at:

http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html
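
For reference, a minimal sketch of reading those two attributes from inside
the JVM rather than through jconsole. The class and method names (FdUsage,
fdUsageRatio) are made up for illustration and are not part of ZooKeeper. It
assumes a Sun/Oracle-derived JVM on a Unix-like OS, and returns -1 when the
Unix-specific bean is not available.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Hypothetical helper, not ZooKeeper code.
final class FdUsage {
    /**
     * Returns open FDs divided by the FD limit, or -1.0 if the
     * Unix-specific MXBean is not available on this JVM/OS.
     */
    static double fdUsageRatio() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            long open = unix.getOpenFileDescriptorCount();
            long max = unix.getMaxFileDescriptorCount();
            if (max > 0) {
                return (double) open / max;
            }
        }
        return -1.0; // unknown: caller must decide how to treat this
    }

    public static void main(String[] args) {
        System.out.println("FD usage ratio: " + fdUsageRatio());
    }
}
```

Returning a sentinel instead of throwing keeps the "we don't know" case
explicit, so a caller can avoid refusing connections on a bad reading.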

So it sounds like the feature request would be: stop accepting new
client connections if OpenFileDescriptorCount > 95% of
MaxFileDescriptorCount, and only start accepting new connections again
once OpenFileDescriptorCount < 90% of MaxFileDescriptorCount. Basically
the high/low watermark thing.
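
A rough sketch of that high/low watermark check, just to make the idea
concrete. FdWatermarkGate and update() are hypothetical names, not existing
ZooKeeper code, and the 0.90/0.95 thresholds are simply the numbers proposed
above. The counts are passed in so the logic can be tested without touching
real file descriptors. If the limit is unknown (max <= 0), the gate keeps its
current state instead of refusing connections.

```java
// Hypothetical sketch of the proposed high/low watermark logic.
final class FdWatermarkGate {
    private final double low;   // resume accepting below this ratio
    private final double high;  // stop accepting above this ratio
    private boolean accepting = true;

    FdWatermarkGate(double low, double high) {
        if (!(0.0 <= low && low < high && high <= 1.0)) {
            throw new IllegalArgumentException("need 0 <= low < high <= 1");
        }
        this.low = low;
        this.high = high;
    }

    /**
     * Updates the gate from the current FD counts and returns true if
     * new client connections should be accepted.
     */
    synchronized boolean update(long open, long max) {
        if (max <= 0) {
            return accepting; // limit unknown: do not change state
        }
        double ratio = (double) open / max;
        if (accepting && ratio > high) {
            accepting = false;  // crossed the high watermark
        } else if (!accepting && ratio < low) {
            accepting = true;   // dropped back under the low watermark
        }
        return accepting;
    }

    public static void main(String[] args) {
        FdWatermarkGate gate = new FdWatermarkGate(0.90, 0.95);
        long[] samples = {100, 960, 920, 890, 940};
        for (long open : samples) {
            System.out.println(open + "/1000 -> accept=" + gate.update(open, 1000));
        }
    }
}
```

The gap between the two thresholds is what keeps the server from flapping
between accepting and refusing when usage hovers near the limit.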

Thoughts?

--travis