Do you have GC logging turned on? With a 60GB heap I could pretty easily see stop-the-world GCs taking longer than the session timeout.
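In case it helps, here is a minimal sketch of turning GC logging on for the workers through storm.yaml. It assumes a HotSpot JVM of the Java 7/8 era (Java 9+ replaced these flags with -Xlog:gc*), and the log path is only a placeholder. As far as I know, Storm replaces %ID% in worker.childopts with the worker's port, so each worker should write its own file.

```yaml
# storm.yaml on each supervisor node (sketch; adjust heap size and path to your setup)
# The log directory below is a placeholder, not something taken from this thread.
worker.childopts: "-Xmx60g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/storm/gc-worker-%ID%.log"
```

With -XX:+PrintGCApplicationStoppedTime, the lines reading "Total time for which application threads were stopped: N seconds" are the ones to compare against the session timeout.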

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
mich...@fullcontact.com

On Thu, May 29, 2014 at 10:45 AM, Michael Dev <michael_...@outlook.com> wrote:

> Derek,
>
> We are currently running with -Xmx60G and only about 20-30G of that has
> been observed to be used. I'm still observing workers restarted every 2
> minutes.
>
> What timeout is relevant to increase for the heartbeats in question? Is it
> a config on the Zookeeper side we can increase to make our topology more
> resilient to these restarts?
>
> Michael
>
> > Date: Fri, 23 May 2014 15:50:50 -0500
> > From: der...@yahoo-inc.com
> > To: user@storm.incubator.apache.org
> > Subject: Re: Workers constantly restarted due to session timeout
> >
> > > 2) Is this expected behavior for Storm to be unable to keep up with
> > > heartbeat threads under high CPU or is our theory incorrect?
> >
> > Check your JVM max heap size (-Xmx). If you use too much, the JVM will
> > garbage-collect, and that will stop everything--including the thread whose
> > job it is to do the heartbeating.
> >
> > --
> > Derek
> >
> > On 5/23/14, 15:38, Michael Dev wrote:
> > > Hi all,
> > >
> > > We are seeing our workers constantly being killed by Storm with the
> > > following logs:
> > > worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, closing socket and attempting reconnect
> > > supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> > >
> > > Eventually Storm decides to just kill the worker and restart it as you
> > > see in the supervisor log. We theorize this is the Zookeeper heartbeat
> > > thread and it is being choked out due to very high CPU load on the machine
> > > (near 100%).
> > >
> > > I have increased the connection timeouts in the storm.yaml config file
> > > yet Storm seems to continue to use some unknown value for the above client
> > > session timeout messages:
> > > storm.zookeeper.connection.timeout: 300000
> > > storm.zookeeper.session.timeout: 300000
> > >
> > > 1) What timeout config is appropriate for the above timeout message?
> > > 2) Is this expected behavior for Storm to be unable to keep up with
> > > heartbeat threads under high CPU or is our theory incorrect?
> > >
> > > Thanks,
> > > Michael
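
On the question of which timeout applies: as far as I know, the session timeout a client asks for is only a request, and the ZooKeeper server clamps it to its own minSessionTimeout/maxSessionTimeout range (by default 2x and 20x tickTime). The sketch below works through the numbers under the assumption that the ensemble runs with tickTime=2000 and no explicit maxSessionTimeout. Neither value is stated in this thread, so check zoo.cfg before relying on it.

```yaml
# storm.yaml (client side): what the worker asks for
storm.zookeeper.session.timeout: 300000      # ms, a request only
storm.zookeeper.connection.timeout: 300000   # ms

# What the server would grant under the ASSUMED defaults (tickTime=2000):
#   maxSessionTimeout = 20 * tickTime      = 40000 ms
#   granted session   = min(300000, 40000) = 40000 ms
# The ZooKeeper client logs "have not heard from the server" once it has been
# idle for about 2/3 of the granted session timeout:
#   40000 * 2 / 3 = 26666 ms, in the same range as the 28105 ms in the worker log
#
# Raising the cap is a server-side change in zoo.cfg (not storm.yaml), e.g.:
#   maxSessionTimeout=300000
```

This only moves the threshold. A long stop-the-world pause can produce the same log line on its own, so the GC log is still the first thing to check.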