Hi all,
We are seeing our workers constantly being killed by Storm, with the
following logs:
worker: 2014-05-23 20:15:08 INFO ClientCxn:1157 - Client session timed out,
have not heard from the server in 28105ms for sessionid 0x14619bf2f4e0109,
closing socket and attempting reconnect
supervisor: 2014-05-23 20:17:30 INFO supervisor:0 - Shutting down and clearing
state for id 94349373-74ec-484b-a9f8-a5076e17d474. Current supervisor time:
1400876250. State: :disallowed, Heartbeat:
#backtype.storm.daemon.common.WorkerHeartbeat{{:time-secs 1400876249, :storm-id
"test-46-1400863199", :executors #{[-1 -1]}, :port 6700}
Eventually Storm decides to just kill the worker and restart it, as you can see
in the supervisor log. We theorize that the ZooKeeper heartbeat thread is being
starved by very high CPU load on the machine (near 100%).
I have increased the ZooKeeper timeouts in the storm.yaml config file, yet
Storm still appears to use some other value, judging by the client session
timeout messages above:
storm.zookeeper.connection.timeout: 300000
storm.zookeeper.session.timeout: 300000
1) What timeout config is appropriate for the above timeout message?
2) Is it expected behavior for Storm to be unable to keep up with heartbeats
under high CPU load, or is our theory incorrect?
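On question 1, one possibility we have not ruled out is that the ZooKeeper server is clamping the session timeout we request. As far as we understand it, the server bounds the negotiated timeout between minSessionTimeout and maxSessionTimeout (by default 2x and 20x tickTime), and the client reports "have not heard from the server" once roughly two thirds of the negotiated timeout has passed. A rough sketch of that arithmetic follows. The function names are ours for illustration, and tickTime = 2000 ms is an assumption about our ZooKeeper setup, not something we have confirmed:

```python
# Sketch of how a requested ZooKeeper session timeout may be clamped by the
# server and turned into the client's read timeout. The function names are
# hypothetical, and tick_time_ms=2000 is an assumed server setting.

def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_ms=None, max_ms=None):
    """Clamp the requested timeout to the server's allowed range."""
    if min_ms is None:
        min_ms = 2 * tick_time_ms    # default minSessionTimeout
    if max_ms is None:
        max_ms = 20 * tick_time_ms   # default maxSessionTimeout
    return max(min_ms, min(requested_ms, max_ms))


def client_read_timeout_ms(negotiated_ms):
    """The client gives up on a silent server after 2/3 of the session timeout."""
    return negotiated_ms * 2 // 3


if __name__ == "__main__":
    requested = 300000  # value we set for storm.zookeeper.session.timeout
    negotiated = negotiated_session_timeout_ms(requested)
    print("negotiated session timeout (ms):", negotiated)
    print("client read timeout (ms):", client_read_timeout_ms(negotiated))
```

Under those assumptions, our 300000 ms request would be cut down to 40000 ms, so the client would start logging timeouts after about 26.7 seconds no matter what storm.yaml says. That is in the same range as the 28105 ms in the worker log, but it does not prove this is what is happening.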
Thanks,
Michael