OK, so GC is probably not the issue.

Specifically, this is a connection timeout to ZK from the worker, and it results in 
nimbus removing that worker from the assignments for the node.  In turn, 
the supervisor reads the new schedule and shoots the worker because it is no longer 
scheduled to be running.


The relevant config is nimbus.task.timeout.secs, and I think the default is 
30s.  What you could try is making the nimbus timeout longer than 
storm.zookeeper.session.timeout.  That would let a ZK connection time out and 
reconnect, and the worker get a heartbeat in, before nimbus decides it has died.
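
For example, a rough sketch of that in storm.yaml (values are illustrative only, 
and note the units differ: the nimbus setting is in seconds, the zookeeper one in 
milliseconds):

nimbus.task.timeout.secs: 60             # nimbus waits 60s before declaring the worker dead
storm.zookeeper.session.timeout: 30000   # ZK session can expire and be re-established within that window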


But the real question is why are the ZK sessions timing out at all?

Do you see this on several workers on that node?  What about the supervisor?  
What about other nodes?  What do the ZK logs say?
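
One more thing worth ruling out, though this is just a guess: the ZooKeeper server 
negotiates the session timeout within the bounds set in its own zoo.cfg, so a very 
large storm.zookeeper.session.timeout can get silently capped on the server side.  
The relevant server settings (values shown are the defaults) look like:

# zoo.cfg on the ZK servers
tickTime=2000
minSessionTimeout=4000    # defaults to 2 * tickTime
maxSessionTimeout=40000   # defaults to 20 * tickTime; no session can be longer than this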


--
Derek

On 5/29/14, 11:45, Michael Dev wrote:

Derek,

We are currently running with -Xmx60G, and only about 20-30G of that has been 
observed in use. I'm still seeing workers restarted every 2 minutes.

Which timeout is relevant to increase for the heartbeats in question? Is it a 
config on the ZooKeeper side we can increase to make our topology more 
resilient to these restarts?

Michael

Date: Fri, 23 May 2014 15:50:50 -0500
From: der...@yahoo-inc.com
To: user@storm.incubator.apache.org
Subject: Re: Workers constantly restarted due to session timeout

2) Is this expected behavior for Storm to be unable to keep up with heartbeat 
threads under high CPU or is our theory incorrect?

Check your JVM max heap size (-Xmx).  If the worker uses too much of it, the JVM will 
garbage-collect, and a long GC pause stops everything--including the thread whose job 
it is to do the heartbeating.
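
One quick way to check for that (just a sketch; the worker pid is whatever your 
setup shows) is to watch GC activity on a live worker with jstat, or to turn on 
GC logging through worker.childopts:

jstat -gcutil <worker-pid> 1000

# in storm.yaml, appended to whatever heap flags you already pass:
worker.childopts: "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

A full GC pause longer than the session timeout shows up as a multi-second jump 
in jstat's FGCT column, or as a long pause in the GC log, right before a restart.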



--
Derek

On 5/23/14, 15:38, Michael Dev wrote:
Hi all,

We are seeing our workers constantly being killed by Storm, with the 
following logs:
worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out, 
have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109, 
closing socket and attempting reconnect
supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing state for 
id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time: 1400876250. State: 
:disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 
1400876249, :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}

Eventually Storm decides to just kill the worker and restart it, as you see in 
the supervisor log. We theorize that the ZooKeeper heartbeat thread is being 
starved by the very high CPU load on the machine (near 100%).

I have increased the connection timeouts in the storm.yaml config file, yet 
Storm seems to continue to use some unknown value for the client session 
timeout in the messages above:
storm.zookeeper.connection.timeout: 300000
storm.zookeeper.session.timeout: 300000

1) What timeout config is appropriate for the above timeout message?
2) Is this expected behavior for Storm to be unable to keep up with heartbeat 
threads under high CPU or is our theory incorrect?

Thanks,
Michael