Hi all,

We had an event in our prod cluster where an OOM left a leader node
effectively corrupted while the rest of the ensemble still considered it
healthy. This permanently degraded the ensemble to read-only service on
existing sessions until a human intervened.

Exceptions in Critical Threads
==============================

As a tactical step, we've added an OOMHandler to bounce the node. However,
we're aware that other exceptions in critical threads could cause this issue
again. There is also an interesting interaction with Java 8, which I will get
to shortly.
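
For anyone wanting to try something similar, here is a minimal sketch of what
such a handler could look like. It is not our actual OOMHandler and not part
of ZooKeeper: the class name FatalErrorHandler, the exit code, and the
injectable exit action are all assumptions made for illustration. It relies on
an external supervisor to restart the process after it halts.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper (not a ZooKeeper class): stops the JVM when a thread dies from an OOM. */
final class FatalErrorHandler implements Thread.UncaughtExceptionHandler {
    /** Exit status reported to the supervisor; the value itself is arbitrary. */
    static final int EXIT_CODE = 3;
    /** Bound on cause-chain traversal, to guard against cyclic cause chains. */
    private static final int MAX_DEPTH = 32;

    private final IntConsumer exitAction;

    FatalErrorHandler(IntConsumer exitAction) {
        this.exitAction = exitAction;
    }

    /** Returns true if the throwable or one of its causes is an OutOfMemoryError. */
    static boolean isFatal(Throwable t) {
        Throwable cur = t;
        for (int depth = 0; cur != null && depth < MAX_DEPTH; depth++) {
            if (cur instanceof OutOfMemoryError) {
                return true;
            }
            cur = cur.getCause();
        }
        return false;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable error) {
        System.err.println("Thread " + thread.getName() + " died: " + error);
        if (isFatal(error)) {
            // Let the process die so a supervisor can restart it cleanly.
            exitAction.accept(EXIT_CODE);
        }
    }

    /** Installs the handler as the JVM-wide default for uncaught exceptions. */
    static void install() {
        // halt() skips shutdown hooks, which may themselves fail when memory is exhausted.
        Thread.setDefaultUncaughtExceptionHandler(
                new FatalErrorHandler(code -> Runtime.getRuntime().halt(code)));
    }
}
```

Calling FatalErrorHandler.install() early in startup would be the intended
use. Note that the "Thread ... died" log line in this mail suggests the server
installs a default handler of its own, so whichever is set last wins; that
ordering is something to verify rather than assume. The same pattern extends
to other exception types by widening isFatal().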

Bug #1 described in this article seems to apply to this issue:
http://arstechnica.com/information-technology/2015/05/the-discovery-of-apache-zookeepers-poison-packet/

I haven't gone through the server code extensively in some time, but will do
so again shortly. I'm wondering whether the ZooKeeper dev community sees this
as an issue and whether there are plans to address it.

OS: linux 64 bit
Zk: 3.4.6
jre: 1.8.31

2015-05-10 19:11:49,882 - ERROR [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2281:NIOServerCnxnFactory$1@44] - Thread Thread[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2281,5,main] died

java.lang.OutOfMemoryError: Compressed class space
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:455)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:605)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:798)

ZooKeeper and Java 8
====================

While all this was occurring, the compressed class space (CCS) in Java 8
filled up. This space is 1G by default, so for it to fill up is surprising.
Perhaps it was somehow due to a large number of connections. The full CCS
caused the OOM, which in turn caused the error in the leader thread. I can't
imagine what the ZK server does that would legitimately fill this space unless
instrumentation is involved somehow. Or maybe Java 8 has a bug. Any ideas on
this would be appreciated.
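
One way to gather evidence on this would be to sample the class-metadata
memory pools and the class-loading counters from inside the JVM (or over JMX).
Below is a rough sketch using the standard java.lang.management API. The class
name ClassSpaceProbe is made up, and the pool names "Metaspace" and
"Compressed Class Space" are what HotSpot on Java 8 reports as far as I know;
other JVMs may name them differently or omit them.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical diagnostic helper: reports class-loading counters and class-metadata pools. */
final class ClassSpaceProbe {

    /** Builds a multi-line report; call it periodically and compare successive samples. */
    static String report() {
        StringBuilder out = new StringBuilder();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        out.append("loaded classes: ").append(classes.getLoadedClassCount())
           .append(" (total loaded=").append(classes.getTotalLoadedClassCount())
           .append(", unloaded=").append(classes.getUnloadedClassCount())
           .append(")\n");

        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumed HotSpot pool names; adjust for other JVMs.
            if (!name.contains("Metaspace") && !name.contains("Compressed Class Space")) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when no maximum is defined.
            out.append(name)
               .append(": used=").append(usage.getUsed())
               .append(" committed=").append(usage.getCommitted())
               .append(" max=").append(usage.getMax())
               .append('\n');
        }
        return out.toString();
    }
}
```

A steadily growing "total loaded" count with few unloads would point at
something generating or reloading classes (for example instrumentation or
proxies), whereas a flat count alongside a full pool would look more like a
JVM-level problem.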

Austin

