Sorry, but you seem to run multiple ShellBolts in one worker, so I can't tell which thread prints which log line. Could you add %t to the worker log format? You can just copy the content from the link below into your worker.xml.
https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
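For reference, the relevant part of that file is the pattern property; %t is the log4j2 conversion that prints the thread name. It should look something like this (check the linked file for the exact layout):

    <property name="pattern">%d{yyyy-MM-dd HH:mm:ss.SSS} %c{1.} %t [%p] %msg%n</property>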
And an actual heartbeat timeout did occur:

2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000

1477297083673 - 1477296962746 = 120927 > 120000
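To make the check explicit, here is a self-contained sketch of what BoltHeartbeatTimerTask does with the numbers from your log (names are paraphrased, not the actual ShellBolt source; in the real code the failure path calls die() instead of throwing from main):

    // Sketch of the subprocess heartbeat timeout rule, using the stale
    // "last heartbeat" value from the log above. Not the real ShellBolt code.
    public class HeartbeatCheckSketch {
        static final long WORKER_TIMEOUT_MILLS = 120_000L;
        // In ShellBolt this is refreshed whenever the subprocess answers a
        // heartbeat request; for this bolt it stayed stuck at 16:16:02.746.
        static final long LAST_HEARTBEAT = 1477296962746L;

        static void checkHeartbeat(long currentTimeMillis) {
            System.out.printf("BOLT - current time : %d, last heartbeat : %d, worker timeout (ms) : %d%n",
                    currentTimeMillis, LAST_HEARTBEAT, WORKER_TIMEOUT_MILLS);
            if (currentTimeMillis - LAST_HEARTBEAT > WORKER_TIMEOUT_MILLS) {
                throw new RuntimeException("subprocess heartbeat timeout");
            }
        }

        public static void main(String[] args) {
            checkHeartbeat(1477297082673L); // 119927 ms elapsed: still within the timeout
            checkHeartbeat(1477297083673L); // 120927 ms elapsed: fails, matching the trace below
        }
    }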
Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <[email protected]> wrote:

> I found something strange. Looking at one worker.log, there are three
> subprocess heartbeats, firing at the 673 ms, 678 ms, and 850 ms offsets
> within each second. Everything is OK until 16:18:03.
>
> ...
> ...
> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>
> And then the heartbeat timed out at 16:18:03.675, before the third
> heartbeat (the 850 ms one) was sent out. However, the timeout is set to
> 120000 ms, i.e. 120 s. How could that happen?
>
> ...
> ...
> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>     at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>     at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>     at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>     ... 6 more
> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>     at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>     ... 1 more
> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>     ... 1 more
> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>     at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>     at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>
> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <[email protected]>:
>
> I turned on DEBUG-level logging and can see the BOLT heartbeat
> information. The timeout is 30000 ms, and everything looks OK.
>
> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <[email protected]>:
>
> I will try to do this and reply later. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <[email protected]>:
>
> Could you change your log level to DEBUG and look at the worker's log? If
> you use Storm 1.x you can modify the log level from the UI on the fly.
> ShellBolt logs the subprocess heartbeat, but at DEBUG level, since it
> could produce lots of log output.
>
> Two lines:
> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
> - BOLT - sending heartbeat request to subprocess
>
> These two lines are each logged once per second. Please check that the
> logs exist, that 'last heartbeat' is updated properly, and that the
> worker timeout is set properly.
>
> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <[email protected]> wrote:
>
> I do not set "topology.subprocess.timeout.secs", so
> "supervisor.worker.timeout.secs" will be used according to STORM-1314,
> which is set to 30 for my cluster. 30 seconds is a very, very big value;
> it will never take more than 30 seconds to process one of my tuples.
> I think the problem must be somewhere else.
>
> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:
>
> There are many situations in which ShellBolt can hit the heartbeat issue,
> and at least STORM-1946 is not the case here.
>
> How long do your tuples take to be processed? You need to set the
> subprocess timeout ("topology.subprocess.timeout.secs") higher than the
> maximum processing time. You can even set it to a fairly big value so
> that the subprocess heartbeat issue will not happen.
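> A minimal submission sketch (the topology name and builder contents are
> placeholders, not from this thread; only the config key is the one
> discussed above):
>
>     import org.apache.storm.Config;
>     import org.apache.storm.StormSubmitter;
>     import org.apache.storm.topology.TopologyBuilder;
>
>     public class SubmitWithSubprocessTimeout {
>         public static void main(String[] args) throws Exception {
>             TopologyBuilder builder = new TopologyBuilder();
>             // ... declare the KafkaSpout and the Python ShellBolts here ...
>
>             Config conf = new Config();
>             // Raise the subprocess heartbeat timeout well above the slowest
>             // expected tuple; when unset it falls back to
>             // supervisor.worker.timeout.secs (STORM-1314).
>             conf.put("topology.subprocess.timeout.secs", 120);
>
>             StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
>         }
>     }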
> ShellBolt requires that each tuple is handled and acked within the
> heartbeat timeout. I tried to change this behavior so that the subprocess
> periodically sends heartbeats on its own, but had no luck because of the
> GIL - the global interpreter lock (the same applies to Ruby). We need to
> choose one: stick with this restriction, or disable the subprocess
> heartbeat.
>
> I hope we can resolve this issue cleanly, but I guess a multi-threaded
> approach doesn't work for Python, Ruby, or any other language that uses a
> GIL, and I have no idea about alternatives.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>
> I filed an issue (STORM-2150
> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can
> anyone help?
>
> I've got a simple topology running on Storm 1.0.1. The topology consists
> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
> the following exception:
>
> java.lang.RuntimeException: subprocess heartbeat timeout
>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> More information:
> 1. The topology runs in ACK mode.
> 2. The topology has 40 workers.
> 3. The topology emits about 10 million tuples every 10 minutes.
>
> Every time the subprocess heartbeat timed out, workers restarted and the
> Python processes exited with exitCode:-1, which hurt the processing
> capacity and the stability of the topology.
>
> I've checked some related issues in the Storm JIRA. I first found
> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946>, which
> reported a bug related to this problem and said it had been fixed in
> Storm 1.0.2. However, I got the same exception even after upgrading to
> Storm 1.0.2.
>
> I also checked the history of this problem in other issues. DashengJu
> first reported it in non-ACK mode in STORM-738
> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
> <https://issues.apache.org/jira/browse/STORM-742> discussed an approach
> to the problem in ACK mode, and that bug seemed to have been fixed in
> 0.10.0. I don't know whether the patch is included in the storm-1.x
> branch. In a word, this problem still exists in the latest stable
> version.
>
> --
> Thanks
> Zhechao Ma
