You're right, there is a timeout! The last heartbeat doesn't update properly. I'll try to adjust the worker log format to determine which thread it is. Thanks.
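For the log format change: I'll update the layout pattern in my worker.xml to match the 1.x-branch file Jungtaek linked. If I read that file correctly, the pattern there is something like

    %d{yyyy-MM-dd HH:mm:ss.SSS} %c{1.} %t [%p] %msg%n

where %t adds the name of the logging thread, so each "BOLT - current time ..." / "sending heartbeat request" line should show which ShellBolt it came from.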
2016-10-24 17:14 GMT+08:00 Jungtaek Lim <[email protected]>:
> Sorry, but you seem to run multiple ShellBolts in one worker, so I can't
> tell which thread printed which log line.
> Could you add %t to the worker log format? You can just copy the content
> from the link below to your worker.xml.
> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
>
> And an actual heartbeat timeout did occur:
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>
> 1477297083673 - 1477296962746 = 120927 > 120000
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
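So one of the ShellBolts in that worker has its last heartbeat stuck at 1477296962746 (16:16:02.746) while the others keep advancing, and the watchdog fires as soon as the gap exceeds the 120000 ms timeout. Restating the check as a small Python sketch just to make sure I understand it (this is only a paraphrase of what the BoltHeartbeatTimerTask stack frames suggest, not the actual ShellBolt source):

    # Paraphrase of the subprocess-heartbeat watchdog, for illustration only.
    def check_subprocess_heartbeat(current_ms, last_heartbeat_ms, timeout_ms):
        # last_heartbeat_ms is refreshed when the subprocess responds (e.g.
        # answers a heartbeat request); if it stops responding, the value
        # goes stale and eventually trips this check.
        if current_ms - last_heartbeat_ms > timeout_ms:
            raise RuntimeError("subprocess heartbeat timeout")

    check_subprocess_heartbeat(1477297083673, 1477296962746, 120000)
    # 1477297083673 - 1477296962746 = 120927 > 120000 -> RuntimeError

If that reading is right, the real question is why that one subprocess stopped responding after 16:16:02.746, which is what the %t change should help narrow down.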
> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <[email protected]> wrote:
>
>> I find something strange. Looking at one worker.log, there are three
>> subprocess heartbeats, at .673, .678, and .850. Everything is OK until
>> 16:18:03.
>>
>> ...
>> ...
>> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>>
>> And then the heartbeat timeout hits at 16:18:03.675, before the third
>> heartbeat (.850) is sent out. However, the timeout is set to 120000 ms,
>> i.e. 120 s. How could that happen?
>>
>> ...
>> ...
>> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>>   at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>   at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
>> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>>   ... 6 more
>> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>   ... 1 more
>> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>   ... 1 more
>> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>   ...
>>
>> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <[email protected]>:
>>
>> I enabled DEBUG-level logging and can see the BOLT heartbeat information;
>> the timeout is 30000 ms and everything looks OK.
>>
>> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <[email protected]>:
>>
>> I will try to do this and reply later. Thanks.
>>
>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <[email protected]>:
>>
>> Could you modify your log level to DEBUG and look at the worker's log? If
>> you use Storm 1.x you can modify the log level from the UI on the fly.
>> ShellBolt logs the subprocess heartbeat activity, but at DEBUG level since
>> it could produce lots of logs.
>>
>> Two lines:
>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>> - BOLT - sending heartbeat request to subprocess
>>
>> The two lines are logged about once per second.
>> Please check that the logs exist, that 'last heartbeat' is updated
>> properly, and that the worker timeout is set properly.
>>
>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <[email protected]> wrote:
>>
>> I do not set "topology.subprocess.timeout.secs", so
>> "supervisor.worker.timeout.secs" is used according to STORM-1314, and
>> that is set to 30 for my cluster.
>> 30 seconds is a very, very big value; it will never take more than 30
>> seconds to process one of my tuples.
>> I think the problem must be somewhere else.
>>
>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:
>>
>> There are many situations in which ShellBolt can run into the heartbeat
>> issue, and at least STORM-1946 is not the case here.
>>
>> How long does one of your tuples take to be processed? You need to set
>> the subprocess timeout ("topology.subprocess.timeout.secs") higher than
>> the maximum processing time. You can even set it to a fairly big value so
>> that the subprocess heartbeat issue will not happen.
>>
>> ShellBolt requires that each tuple is handled and acked within the
>> heartbeat timeout. I struggled to change this behavior so that the
>> subprocess periodically sends heartbeats on its own, but had no luck
>> because of the GIL - the global interpreter lock (same for Ruby). We need
>> to choose one: keep this restriction, or disable the subprocess heartbeat.
>>
>> I hope we can resolve this issue cleanly, but I guess a multi-threaded
>> approach doesn't work for Python, Ruby, or any other language with a GIL,
>> and I have no idea about alternatives.
>>
>> - Jungtaek Lim (HeartSaVioR).
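This matches my understanding of the Python side: the multilang subprocess answers ShellBolt's heartbeat request from the same single-threaded read loop that runs the tuple-handling code, so while one tuple is being processed (or the loop is blocked for any other reason) no heartbeat reply, or anything else, can go out. A rough sketch of that loop, simplified from the storm.py multilang helper that ships with Storm (the function names here are illustrative, not the real API):

    import json
    import sys

    def read_message():
        # Multilang messages are JSON followed by a line containing only "end".
        lines = []
        while True:
            line = sys.stdin.readline()
            if not line:
                raise EOFError("stdin closed by ShellBolt")
            line = line.rstrip("\n")
            if line == "end":
                break
            lines.append(line)
        return json.loads("\n".join(lines))

    def send_message(msg):
        sys.stdout.write(json.dumps(msg) + "\nend\n")
        sys.stdout.flush()

    def bolt_loop(process_tuple):
        while True:
            msg = read_message()
            # ShellBolt's heartbeat request arrives as a tuple on the
            # "__heartbeat" stream with task id -1; the expected reply
            # is a "sync" command.
            if msg.get("task") == -1 and msg.get("stream") == "__heartbeat":
                send_message({"command": "sync"})
            else:
                # While this call runs, nothing is sent back, so the
                # "last heartbeat" value on the Java side stops advancing.
                process_tuple(msg)

So if one of my bolts occasionally blocks in its tuple handler for longer than the timeout, or stops reading stdin entirely, that would explain a stuck "last heartbeat" for just that one subprocess.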
>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>>
>> I filed an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can
>> anyone help?
>>
>> I've got a simple topology running on Storm 1.0.1. The topology consists
>> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
>> the following exception.
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout
>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> More information:
>> 1. The topology runs with ACK mode.
>> 2. The topology has 40 workers.
>> 3. The topology emits about 10 million tuples every 10 minutes.
>>
>> Every time the subprocess heartbeat times out, workers restart and the
>> Python processes exit with exitCode:-1, which affects the processing
>> capacity and stability of the topology.
>>
>> I've checked some related issues in the Storm JIRA. I first found that
>> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946> reported a
>> bug related to this problem and said it had been fixed in Storm 1.0.2.
>> However, I got the same exception even after I upgraded Storm to 1.0.2.
>>
>> I checked other related issues. Let's look at the history of this
>> problem. DashengJu first reported this problem with non-ACK mode in
>> STORM-738 <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed this problem
>> with ACK mode, and it seemed that the bug had been fixed in 0.10.0. I
>> don't know whether this patch is included in the storm-1.x branch. In
>> short, this problem still exists in the latest stable version.
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>> --
>> Thanks
>> Zhechao Ma

--
Thanks
Zhechao Ma
