Michael R,

We don't have GC logging enabled yet. I lean towards agreeing with Derek that GC 
isn't the issue, but I will enable the logging on Monday just to verify.
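
When I do, the plan is roughly to turn it on through worker.childopts, something 
like the line below (heap size and log path are just placeholders for our setup, 
and I still need to confirm that the %ID% substitution gives one log per worker 
port):

worker.childopts: "-Xmx60G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/srv/storm/logs/gc-worker-%ID%.log"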

Derek D,

Are you certain that nimbus.task.timeout.secs is the correct config? Tracing 
through the GitHub code, it looks to me like that value is used as the timeout 
when making a Thrift connection to the Nimbus node. I thought the logs indicated 
the timeout was occurring in the session connection to ZooKeeper, as evidenced by 
ClientCnxn being a ZooKeeper class.

I discovered that we were running with ZooKeeper's default maxSessionTimeout of 
40 seconds. That would explain why our Storm setting of 5 minutes was not being 
honored (though obviously not the root problem, nor why the timeout messages 
report 14-20 second values). Typically we see the connection losses when the 
cluster gets very busy with a burst of data that pushes workers to near 100% CPU. 
I'm testing the configs below over the weekend to see if they at least prevent 
the chronic worker restarts during those brief high-CPU periods.
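
To spell out why the 5 minute setting was being ignored (as I understand it, 
worth double-checking): the ZooKeeper server clamps whatever session timeout a 
client requests into the [minSessionTimeout, maxSessionTimeout] range, which 
defaults to [2 * tickTime, 20 * tickTime]. With our tickTime of 2000 ms:

requested (storm.zookeeper.session.timeout) = 300000 ms
maxSessionTimeout (default, 20 * tickTime)  = 40000 ms
negotiated session timeout                  = 40000 ms (clamped)

The negotiated value should show up in the ZooKeeper server log when a worker 
connects, in a line like "Established session 0x... with negotiated timeout 
40000 for client ...", so I'll use that to confirm the new maxSessionTimeout of 
300000 is actually taking effect.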

Our current setup is as follows:
Storm 0.9.0.1
3 Storm node cluster
1 Supervisor running per Storm node
1-3 topologies deployed on the Storm cluster (depends on dev/prod/etc systems)
3 Workers per topology
Variable number of executors per component, depending on how slow that component 
is. For example, file I/O has many executors (say 12) while in-memory validation 
has only 3. We always keep executor counts a multiple of the number of workers 
for even distribution.
Kryo serialization with the Java serialization fallback disabled to ensure we're 
using 100% Kryo between bolts (the relevant settings are sketched below).
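
For completeness, the serialization point above corresponds roughly to the 
following topology config (the class names here are placeholders, not our real 
ones):

topology.fall.back.on.java.serialization: false
topology.kryo.register:
 - "com.example.MyRecord"
 - "com.example.MyEvent"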

zoo.cfg
tickTime=2000
dataDir=/srv/zookeeper/data
clientPort=2182
initLimit=5
syncLimit=2
skipACL=true
maxClientCnxns=1000
maxSessionTimeout=300000
server.1=node1
server.2=node2
server.3=node3
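
(As a sanity check, the effective values can be read back from a running server 
with the "conf" four-letter command, assuming those are enabled, e.g.:
echo conf | nc node1 2182
which should report minSessionTimeout and maxSessionTimeout among other settings.)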

storm.yaml
storm.zookeeper.port: 2182
storm.local.dir: "/srv/storm/data"
nimbus.host: "node1"
storm.zookeeper.servers:
 - "node1"
 - "node2"
 - "node3"
supervisor.slot.ports:
 - 6700
 - 6701
 - 6702
 - 6703
 - 6704
java.library.path: "/usr/lib:/srv/storm/lib"
#Storm 0.9 netty support
storm.messaging.transport: "backtype.storm.messaging.netty.Context" 
storm.messaging.netty.server_worker_threads: 1
storm.messaging.netty.client_worker_threads: 1 
storm.messaging.netty.buffer_size: 5242880 
storm.messaging.netty.max_retries: 100 
storm.messaging.netty.max_wait_ms: 1000 
storm.messaging.netty.min_wait_ms: 100 
# Timeout band-aids in testing
topology.receiver.buffer.size: 2
storm.zookeeper.connection.timeout: 300000
storm.zookeeper.session.timeout: 300000
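
If it does turn out that nimbus.task.timeout.secs is the relevant knob, Derek's 
suggestion of keeping the nimbus timeout longer than 
storm.zookeeper.session.timeout would presumably translate to something like the 
following (value just illustrative; note it's in seconds, unlike the ZK timeouts 
above which are in milliseconds):

nimbus.task.timeout.secs: 330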


> Date: Thu, 29 May 2014 12:56:19 -0500
> From: der...@yahoo-inc.com
> To: user@storm.incubator.apache.org
> Subject: Re: Workers constantly restarted due to session timeout
> 
> OK, so GC is probably not the issue.
> 
> 
> Specifically, this is a connection timeout to ZK from the worker, and it is 
> resulting in nimbus removing it from the assignments for that node.  In turn, 
> the supervisor reads the schedule and shoots the worker because it is no 
> longer scheduled to be running.
> 
> 
> The relevant config is nimbus.task.timeout.secs, and I think the default is 
> 30s.  What you could try is to make nimbus timeout longer than 
> storm.zookeeper.session.timeout.  This would allow the ZK connections to 
> timeout and get a heartbeat in before nimbus decides they have timed out.
> 
> 
> But the real question is why are the ZK sessions timing out at all?
> 
> Do you see this on several workers on that node?  What about the supervisor?  
> What about other nodes?  What do the ZK logs say?
> 
> 
> -- 
> Derek
> 
> On 5/29/14, 11:45, Michael Dev wrote:
> >
> >
> >
> > Derek,
> >
> > We are currently running with -Xmx60G and only about 20-30G of that has 
> > been observed to be used. I'm still observing workers restarted every 2 
> > minutes.
> >
> > What timeout is relevant to increase for the heartbeats in question? Is it 
> > a config on the Zookeeper side we can increase to make our topology more 
> > resilient to these restarts?
> >
> > Michael
> >
> >> Date: Fri, 23 May 2014 15:50:50 -0500
> >> From: der...@yahoo-inc.com
> >> To: user@storm.incubator.apache.org
> >> Subject: Re: Workers constantly restarted due to session timeout
> >>
> >>> 2) Is this expected behavior for Storm to be unable to keep up with 
> >>> heartbeat threads under high CPU or is our theory incorrect?
> >>
> >> Check your JVM max heap size (-Xmx).  If you use too much, the JVM will 
> >> garbage-collect, and that will stop everything--including the thread whose 
> >> job it is to do the heartbeating.
> >>
> >>
> >>
> >> --
> >> Derek
> >>
> >> On 5/23/14, 15:38, Michael Dev wrote:
> >>> Hi all,
> >>>
> >>> We are seeing our workers constantly being killed by Storm with the 
> >>> following logs:
> >>> worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed 
> >>> out, have not heard from the server in 28105ms for sessionid 
> >>> 0x14619bf2f4e0109, closing socket and attempting reconnect
> >>> supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and 
> >>> clearing state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current 
> >>> supervisor time: 1400876250. State: :disallowed, Heartbeat: 
> >>> #backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, 
> >>> :storm-id "test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
> >>>
> >>> Eventually Storm decides to just kill the worker and restart it as you 
> >>> see in the supervisor log. We theorize this is the Zookeeper heartbeat 
> >>> thread and it is being choked out due to very high CPU load on the 
> >>> machine (near 100%).
> >>>
> >>> I have increased the connection timeouts in the storm.yaml config file 
> >>> yet Storm seems to continue to use some unknown value for the above 
> >>> client session timeout messages:
> >>> storm.zookeeper.connection.timeout: 300000
> >>> storm.zookeeper.session.timeout: 300000
> >>>
> >>> 1) What timeout config is appropriate for the above timeout message?
> >>> 2) Is this expected behavior for Storm to be unable to keep up with 
> >>> heartbeat threads under high CPU or is our theory incorrect?
> >>>
> >>> Thanks,
> >>> Michael
> >>>                                           
> >>>
> >
> >                                     
> >

                                          
