AFAIK, Nimbus doesn't reassign tasks to other workers under normal circumstances. Either the worker failed to send its heartbeats in time, or Nimbus decided the worker was dead; those are really the same thing, since Nimbus detects a dead worker by checking its heartbeats.
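If you want to see which timeouts drive that detection on your cluster, a minimal sketch like the one below will print them (illustration only; the ShowWorkerTimeouts class name is mine, and it assumes storm-core 0.9.3 on the classpath):

    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.utils.Utils;

    // Minimal sketch: print the heartbeat/timeout settings that decide when
    // Nimbus or a supervisor treats a worker as dead. These are cluster-level
    // values, so changing them means editing storm.yaml on the cluster nodes,
    // not the topology config.
    public class ShowWorkerTimeouts {
        public static void main(String[] args) {
            Map conf = Utils.readStormConfig();
            // How often executors write their heartbeats to ZooKeeper.
            print(conf, Config.TASK_HEARTBEAT_FREQUENCY_SECS);
            // How long Nimbus waits without a heartbeat before marking
            // executors "not alive" and computing a new assignment.
            print(conf, Config.NIMBUS_TASK_TIMEOUT_SECS);
            // How long the supervisor waits on a worker's local heartbeat
            // before shooting the worker itself.
            print(conf, Config.SUPERVISOR_WORKER_TIMEOUT_SECS);
        }

        private static void print(Map conf, String key) {
            System.out.println(key + " = " + conf.get(key));
        }
    }

If a worker pauses long enough (a long GC, for example) to miss nimbus.task.timeout.secs, Nimbus marks its executors "not alive" and reassigns them, and the supervisor then kills the old worker as :disallowed, which would match the supervisor log further down the thread.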
Hope this helps.

Thanks.
Jungtaek Lim (HeartSaVioR)

2015-05-11 13:08 GMT+09:00 SuXingyu <[email protected]>:

> It seems that this issue matches our situation best. I am wondering under
> what conditions a worker will be moved to other slots, apart from a rebalance?
> Is moving a worker a normal operation for Nimbus?
>
> Anyway, sincerely thank you for your help!
>
> ------------------------------
> Date: Fri, 8 May 2015 20:35:50 +0900
> Subject: Re: Storm worker died - Client is being closed, and does not take requests any more
> From: [email protected]
> To: [email protected]
>
> If one or more workers were moved to other slots, you may hit this issue.
>
> https://issues.apache.org/jira/browse/STORM-737
>
> Hope this helps.
>
> 2015-05-08 18:53 GMT+09:00 SuXingyu <[email protected]>:
>
> I am running my cluster using Storm 0.9.3. This exception was caused by
> sending a message while the Netty client was closing. But I still don't know
> why the Netty client closed.
>
> ------------------------------
> From: [email protected]
> Date: Thu, 7 May 2015 23:34:16 -0700
> Subject: Re: Storm worker died - Client is being closed, and does not take requests any more
> To: [email protected]
>
> What version of Storm are you running? I had significant issues with
> Storm v0.9.3 having what I'll term "congestive collapse" related to the
> Netty implementation. Workers would continually die. The problem was
> resolved by switching the transport layer from Netty back to the original
> ZeroMQ implementation, but that was not very trivial. [See the transport
> sketch at the end of this message.]
>
> - Erik
>
> On Thu, May 7, 2015 at 11:28 PM, SuXingyu <[email protected]> wrote:
>
> Yes, I've found the logs you mentioned in the Nimbus log. But I still can't find
> any indication in the worker logs. The first log generated by Storm after the
> worker starts is "Async loop died".
>
> ------------------------------
> From: [email protected]
> Date: Thu, 7 May 2015 22:56:41 -0700
> Subject: Re: Storm worker died - Client is being closed, and does not take requests any more
> To: [email protected]
>
> The "State" being ":disallowed" usually indicates that the worker's
> executors' slot assignment has changed. If you look at the Nimbus logs, I
> would expect to see logs stating that the affected executors were noticed
> to be "not alive" by the Nimbus, and then an immediate "Setting new
> assignment for topology id" log line as the Nimbus creates a new assignment
> for this topology. As for why the Worker itself died -- is there no
> indication in the worker logs?
>
> - Erik
>
> On Thu, May 7, 2015 at 8:36 PM, SuXingyu <[email protected]> wrote:
>
> I checked my supervisor log, and found logs like "Shutting down and
> clearing state for id {worker.id}". But I still can't find the cause of the
> worker shutting down.
>
> Here are the logs from the supervisor. It seems that the worker's JVM exited
> before the supervisor killed it.
>
> 2015-05-06 17:21:47:847 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "message_storm-101-17-1430903865", :executors ...} for this supervisor 95d77940-08b2-4f7a-9660-03d8f61a2529 on port 6701 with id 95722030-7785-40d6-a47a-cc072c4c9d83
> 2015-05-06 17:21:47:849 b.s.d.supervisor [INFO] Launching worker with command: '/usr/java/bin/java' '-server' '-Xmx8g' '-Xms8g' '-Xmn2g' '-DServer=mblog' '-DNODE_POOL_NAME=openapi_storm-yf' '-XX:PermSize=128m' '-XX:MaxPermSize=128m' '-XX:MaxTenuringThreshold=4' '-XX:+UseConcMarkSweepGC' '-XX:SurvivorRatio=8' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:+ExplicitGCInvokesConcurrent' '-XX:+PrintCommandLineFlags' '-XX:+PrintGCDateStamps' '-XX:+PrintTenuringDistribution' '-XX:+PrintGCDetails' '-XX:+PrintGCTimeStamps' '-XX:+PrintGCApplicationStoppedTime' '-XX:+PrintGCApplicationConcurrentTime' '-Xloggc:/data0/logs/storm/gc.log' '-Djava.library.path=storm-local/supervisor/stormdist/message_storm-101-17-1430903865/resources/Linux-amd64:storm-local/supervisor/stormdist/message_storm-101-17-1430903865/resources:/usr/local/lib:/opt/local/lib:/usr/lib' '-Dlogfile.name=worker-6701.log' '-Dstorm.home=/data0/storm' '-Dstorm.conf.file=' '-Dstorm.options=' '-Dstorm.log.dir=/data0/storm/logs' '-Dlogback.configurationFile=/data0/storm/logback/cluster.xml' '-Dstorm.id=message_storm-101-17-1430903865' '-Dworker.id=95722030-7785-40d6-a47a-cc072c4c9d83' '-Dworker.port=6701' '-cp' '/data0/storm/lib/logback-core-1.0.13.jar:/data0/storm/lib/tools.logging-0.2.3.jar:/data0/storm/lib/commons-codec-1.6.jar:/data0/storm/lib/json-simple-1.1.jar:/data0/storm/lib/commons-logging-1.1.3.jar:/data0/storm/lib/tools.cli-0.2.4.jar:/data0/storm/lib/jetty-util-6.1.26.jar:/data0/storm/lib/kryo-2.21.jar:/data0/storm/lib/ring-devel-0.3.11.jar:/data0/storm/lib/commons-lang-2.5.jar:/data0/storm/lib/commons-exec-1.1.jar:/data0/storm/lib/clout-1.0.1.jar:/data0/storm/lib/servlet-api-2.5.jar:/data0/storm/lib/ring-jetty-adapter-0.3.11.jar:/data0/storm/lib/clj-time-0.4.1.jar:/data0/storm/lib/jgrapht-core-0.9.0.jar:/data0/storm/lib/chill-java-0.3.5.jar:/data0/storm/lib/jline-2.11.jar:/data0/storm/lib/tools.macro-0.1.0.jar:/data0/storm/lib/log4j-over-slf4j-1.6.6.jar:/data0/storm/lib/minlog-1.2.jar:/data0/storm/lib/core.incubator-0.1.0.jar:/data0/storm/lib/jetty-6.1.26.jar:/data0/storm/lib/compojure-1.1.3.jar:/data0/storm/lib/math.numeric-tower-0.0.1.jar:/data0/storm/lib/slf4j-api-1.7.5.jar:/data0/storm/lib/hiccup-0.3.6.jar:/data0/storm/lib/objenesis-1.2.jar:/data0/storm/lib/storm-core-0.9.3.jar:/data0/storm/lib/snakeyaml-1.11.jar:/data0/storm/lib/asm-4.0.jar:/data0/storm/lib/carbonite-1.4.0.jar:/data0/storm/lib/commons-fileupload-1.2.1.jar:/data0/storm/lib/ring-core-1.1.5.jar:/data0/storm/lib/logback-classic-1.0.13.jar:/data0/storm/lib/clojure-1.5.1.jar:/data0/storm/lib/reflectasm-1.07-shaded.jar:/data0/storm/lib/ring-servlet-0.3.11.jar:/data0/storm/lib/disruptor-2.10.1.jar:/data0/storm/lib/joda-time-2.0.jar:/data0/storm/lib/commons-io-2.4.jar:/data0/storm/lib/clj-stacktrace-0.2.2.jar:/data0/storm/conf:storm-local/supervisor/stormdist/message_storm-101-17-1430903865/stormjar.jar' 'backtype.storm.daemon.worker' 'message_storm-101-17-1430903865' '95d77940-08b2-4f7a-9660-03d8f61a2529' '6701' '95722030-7785-40d6-a47a-cc072c4c9d83'
> 2015-05-06 17:21:47:850 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:48:350 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:48:850 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:49:351 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:49:851 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:50:352 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:50:852 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:51:352 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:51:853 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:21:52:353 b.s.d.supervisor [INFO] 95722030-7785-40d6-a47a-cc072c4c9d83 still hasn't started
> 2015-05-06 17:25:37:845 b.s.d.supervisor [INFO] Shutting down and clearing state for id 95722030-7785-40d6-a47a-cc072c4c9d83. Current supervisor time: 1430904337. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1430904305, :storm-id "message_storm-101-17-1430903865", :executors #{...}, :port 6701}
> 2015-05-06 17:25:37:845 b.s.d.supervisor [INFO] Shutting down 95d77940-08b2-4f7a-9660-03d8f61a2529:95722030-7785-40d6-a47a-cc072c4c9d83
> 2015-05-06 17:25:37:847 b.s.util [INFO] Error when trying to kill 27187. Process is probably already dead.
> 2015-05-06 17:25:38:849 b.s.util [INFO] Error when trying to kill 27187. Process is probably already dead.
> 2015-05-06 17:25:38:853 b.s.d.supervisor [INFO] Shut down 95d77940-08b2-4f7a-9660-03d8f61a2529:95722030-7785-40d6-a47a-cc072c4c9d83
>
> ------------------------------
> From: [email protected]
> To: [email protected]
> Subject: Re: Storm worker died - Client is being closed, and does not take requests any more
> Date: Thu, 7 May 2015 15:15:50 +0000
>
> Look into your supervisor logs, with them set to INFO.
>
> Look for something like:
> “Shutting down and clearing state for id {worker.id}”
>
> That id will match your worker’s logs.
>
> The reason for the shutdown will be in the supervisor’s logs.
>
> From: SuXingyu <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: 2015, Thursday, May 7 at 02:44
> To: "[email protected]" <[email protected]>
> Subject: Storm worker died - Client is being closed, and does not take requests any more
>
> Hi all,
>
> I am running a Kafka-Storm cluster in distributed mode. There are four
> supervisors in my Storm cluster. Most of the time it works fine. But after
> I redeployed the topology yesterday, I found that all four workers in the
> cluster restarted after running for several minutes.
>
> I checked the Storm logs and found this:
>
> 2015-05-06 17:23:08:934 b.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.disruptor$consume_loop_STAR_$fn__1460.invoke(disruptor.clj:94) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.util$async_loop$fn__464.invoke(util.clj:463) ~[storm-core-0.9.3.jar:0.9.3]
>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>     at backtype.storm.messaging.netty.Client.send(Client.java:185) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__3730$fn__3731.invoke(worker.clj:330) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__3730.invoke(worker.clj:328) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.disruptor$clojure_handler$reify__1447.onEvent(disruptor.clj:58) ~[storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.3.jar:0.9.3]
>     ... 6 common frames omitted
> 2015-05-06 17:23:08:938 b.s.util [ERROR] Halting process: ("Async loop died!")
> java.lang.RuntimeException: ("Async loop died!")
>     at backtype.storm.util$exit_process_BANG_.doInvoke(util.clj:325) [storm-core-0.9.3.jar:0.9.3]
>     at clojure.lang.RestFn.invoke(RestFn.java:423) [clojure-1.5.1.jar:na]
>     at backtype.storm.disruptor$consume_loop_STAR_$fn__1458.invoke(disruptor.clj:92) [storm-core-0.9.3.jar:0.9.3]
>     at backtype.storm.util$async_loop$fn__464.invoke(util.clj:473) [storm-core-0.9.3.jar:0.9.3]
>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_40]
> 2015-05-06 17:23:08:957 b.s.d.worker [INFO] Shutting down worker message_storm-101-17-1430903865 95d77940-08b2-4f7a-9660-03d8f61a2529 6700
> 2015-05-06 17:23:45:487 o.a.s.z.ZooKeeper [INFO] Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
>
> These logs appeared several times over about twenty minutes, and the cluster
> recovered after a rebalance operation in the Storm UI.
>
> Any idea why this may be happening?
>
> Best regards!
>
> Xingyu
>
> --
> Name : 임 정택
> Blog : http://www.heartsavior.net / http://dev.heartsavior.net
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior

--
Name : 임 정택
Blog : http://www.heartsavior.net / http://dev.heartsavior.net
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior
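A note on the transport switch Erik mentions in the quoted thread: the setting involved is storm.messaging.transport. The sketch below (illustration only; ShowTransport is a name I made up, and it assumes storm-core 0.9.3 on the classpath) just prints which transport your cluster is currently configured to use:

    import java.util.Map;

    import backtype.storm.Config;
    import backtype.storm.utils.Utils;

    // Minimal sketch: print which messaging transport the cluster config
    // resolves to. Storm 0.9.3 defaults to the Netty transport
    // (backtype.storm.messaging.netty.Context); switching back to ZeroMQ
    // means pointing storm.messaging.transport at a zmq context class in
    // storm.yaml and installing the native zeromq/jzmq libraries, which is
    // why it is described above as non-trivial.
    public class ShowTransport {
        public static void main(String[] args) {
            Map conf = Utils.readStormConfig();
            System.out.println(Config.STORM_MESSAGING_TRANSPORT + " = "
                    + conf.get(Config.STORM_MESSAGING_TRANSPORT));
        }
    }

As far as I remember, the pre-Netty default value was backtype.storm.messaging.zmq, but the exact class name depends on which ZeroMQ module you install, so verify it against your build before changing storm.yaml.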
