Do you have GC logging turned on? With a 60GB heap I could pretty easily
see stop-the-world GCs taking longer than the session timeout.
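
If turning on GC logging is not immediately possible (on the HotSpot JVMs current at the time of this thread, flags such as -verbose:gc and -XX:+PrintGCApplicationStoppedTime), a rough in-process check is to sample the JVM's GC counters and compare the time spent collecting in a window against the session timeout. The sketch below is only an illustration: the class name, the method names, and the 20000 ms timeout are assumptions, not values from this thread. The MXBean figure is accumulated collection time, not the length of a single pause, so GC logs remain the better evidence.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Storm or ZooKeeper.
class GcPauseCheck {

    // Sums the approximate accumulated collection time (ms) over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // True when the GC time observed in one window is at least the session timeout.
    static boolean exceedsTimeout(long gcMillisInWindow, long sessionTimeoutMillis) {
        return gcMillisInWindow >= sessionTimeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long assumedSessionTimeoutMs = 20000L; // assumption: substitute your effective timeout
        long before = totalGcMillis();
        Thread.sleep(1000L);                   // sampling window; lengthen for real use
        long delta = totalGcMillis() - before;
        System.out.println("GC time in window (ms): " + delta);
        System.out.println("Exceeds timeout: " + exceedsTimeout(delta, assumedSessionTimeoutMs));
    }
}
```

Running something like this inside the worker JVM (or reading the same beans over JMX) gives a quick signal; a window whose GC time approaches the timeout is a strong hint that collections are starving the heartbeat thread.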

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
mich...@fullcontact.com


On Thu, May 29, 2014 at 10:45 AM, Michael Dev <michael_...@outlook.com>
wrote:

> Derek,
>
> We are currently running with -Xmx60G and only about 20-30G of that has
> been observed to be used. I'm still observing workers restarted every 2
> minutes.
>
> What timeout is relevant to increase for the heartbeats in question? Is it
> a config on the Zookeeper side that we can increase to make our topology more
> resilient to these restarts?
>
> Michael
>
> > Date: Fri, 23 May 2014 15:50:50 -0500
> > From: der...@yahoo-inc.com
> > To: user@storm.incubator.apache.org
> > Subject: Re: Workers constantly restarted due to session timeout
>
> >
> > > 2) Is this expected behavior for Storm to be unable to keep up with
> heartbeat threads under high CPU or is our theory incorrect?
> >
> > Check your JVM max heap size (-Xmx). If you use too much, the JVM will
> garbage-collect, and that will stop everything--including the thread whose
> job it is to do the heartbeating.
> >
> >
> >
> > --
> > Derek
> >
> > On 5/23/14, 15:38, Michael Dev wrote:
> > > Hi all,
> > >
> > > We are seeing our workers constantly being killed by Storm with the
> following logs:
> > > worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed
> out, have not heard from the server in 28105ms for sessionid
> 0x14619bf2f4e0109, closing socket and attempting reconnect
> > > supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and
> clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current
> supervisor time: 1400876250. State: :disallowed, Heartbeat:
> #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249,
> :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> > >
> > > Eventually Storm decides to just kill the worker and restart it as you
> see in the supervisor log. We theorize this is the Zookeeper heartbeat
> thread and it is being choked out due to very high CPU load on the machine
> (near 100%).
> > >
> > > I have increased the connection timeouts in the storm.yaml config file
> yet Storm seems to continue to use some unknown value for the above client
> session timeout messages:
> > > storm.zookeeper.connection.timeout: 300000
> > > storm.zookeeper.session.timeout: 300000
> > >
> > > 1) What timeout config is appropriate for the above timeout message?
> > > 2) Is this expected behavior for Storm to be unable to keep up with
> heartbeat threads under high CPU or is our theory incorrect?
> > >
> > > Thanks,
> > > Michael
> > >
> > >
>
