You're right, there is a timeout! The last heartbeat doesn't update properly. I'll try to adjust the worker log format to determine which thread it is. Thanks.
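For the log format change: I'll update the layout pattern in my worker.xml to match the 1.x-branch file Jungtaek linked. If I read that file correctly, the pattern there is something like

    %d{yyyy-MM-dd HH:mm:ss.SSS} %c{1.} %t [%p] %msg%n

where %t adds the name of the logging thread, so each "BOLT - current time ..." / "sending heartbeat request" line should show which ShellBolt it came from.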
2016-10-24 17:14 GMT+08:00 Jungtaek Lim <[email protected]>:
> Sorry, but you seem to run multiple ShellBolts in one worker, so I can't
> tell which thread printed which log line.
> Could you add %t to the worker log format? You can just copy the content
> from the link below to your worker.xml.
> https://github.com/apache/storm/blob/1.x-branch/log4j2/worker.xml
>
> And an actual heartbeat timeout did occur:
> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>
> 1477297083673 - 1477296962746 = 120927 > 120000
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
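So one of the ShellBolts in that worker has its last heartbeat stuck at 1477296962746 (16:16:02.746) while the others keep advancing, and the watchdog fires as soon as the gap exceeds the 120000 ms timeout. Restating the check as a small Python sketch just to make sure I understand it (this is only a paraphrase of what the BoltHeartbeatTimerTask stack frames suggest, not the actual ShellBolt source):

    # Paraphrase of the subprocess-heartbeat watchdog, for illustration only.
    def check_subprocess_heartbeat(current_ms, last_heartbeat_ms, timeout_ms):
        # last_heartbeat_ms is refreshed when the subprocess responds (e.g.
        # answers a heartbeat request); if it stops responding, the value
        # goes stale and eventually trips this check.
        if current_ms - last_heartbeat_ms > timeout_ms:
            raise RuntimeError("subprocess heartbeat timeout")

    check_subprocess_heartbeat(1477297083673, 1477296962746, 120000)
    # 1477297083673 - 1477296962746 = 120927 > 120000 -> RuntimeError

If that reading is right, the real question is why that one subprocess stopped responding after 16:16:02.746, which is what the %t change should help narrow down.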
> On Mon, Oct 24, 2016 at 6:01 PM, Zhechao Ma <[email protected]> wrote:
>
>> I find something strange. Looking at one worker.log, there are three
>> subprocess heartbeats, at .673, .678, and .850. Everything is OK until
>> 16:18:03.
>>
>> ...
>> ...
>> 2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
>> 2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
>> 2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
>> 2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
>> 2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
>> 2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
>> 2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000
>>
>> And then the heartbeat timeout hits at 16:18:03.675, before the third
>> heartbeat (.850) is sent out. However, the timeout is set to 120000 ms,
>> i.e. 120 s. How could that happen?
>>
>> ...
>> ...
>> 2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
>>   at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
>>   at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
>> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
>>   ... 6 more
>> Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>   ... 1 more
>> Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
>>   ... 1 more
>> 2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
>> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
>>   at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
>>   ...
>>
>> 2016-10-24 15:23 GMT+08:00 Zhechao Ma <[email protected]>:
>>
>> I enabled DEBUG-level logging and can see the BOLT heartbeat information;
>> the timeout is 30000 ms and everything looks OK.
>>
>> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <[email protected]>:
>>
>> I will try to do this and reply later. Thanks.
>>
>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <[email protected]>:
>>
>> Could you modify your log level to DEBUG and look at the worker's log? If
>> you use Storm 1.x you can modify the log level from the UI on the fly.
>> ShellBolt logs the subprocess heartbeat activity, but at DEBUG level since
>> it could produce lots of logs.
>>
>> Two lines:
>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>> - BOLT - sending heartbeat request to subprocess
>>
>> The two lines are logged about once per second.
>> Please check that the logs exist, that 'last heartbeat' is updated
>> properly, and that the worker timeout is set properly.
>>
>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <[email protected]> wrote:
>>
>> I do not set "topology.subprocess.timeout.secs", so
>> "supervisor.worker.timeout.secs" is used according to STORM-1314, and
>> that is set to 30 for my cluster.
>> 30 seconds is a very, very big value; it will never take more than 30
>> seconds to process one of my tuples.
>> I think the problem must be somewhere else.
>>
>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:
>>
>> There are many situations in which ShellBolt can run into the heartbeat
>> issue, and at least STORM-1946 is not the case here.
>>
>> How long does one of your tuples take to be processed? You need to set
>> the subprocess timeout ("topology.subprocess.timeout.secs") higher than
>> the maximum processing time. You can even set it to a fairly big value so
>> that the subprocess heartbeat issue will not happen.
>>
>> ShellBolt requires that each tuple is handled and acked within the
>> heartbeat timeout. I struggled to change this behavior so that the
>> subprocess periodically sends heartbeats on its own, but had no luck
>> because of the GIL - the global interpreter lock (same for Ruby). We need
>> to choose one: keep this restriction, or disable the subprocess heartbeat.
>>
>> I hope we can resolve this issue cleanly, but I guess a multi-threaded
>> approach doesn't work for Python, Ruby, or any other language with a GIL,
>> and I have no idea about alternatives.
>>
>> - Jungtaek Lim (HeartSaVioR).
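This matches my understanding of the Python side: the multilang subprocess answers ShellBolt's heartbeat request from the same single-threaded read loop that runs the tuple-handling code, so while one tuple is being processed (or the loop is blocked for any other reason) no heartbeat reply, or anything else, can go out. A rough sketch of that loop, simplified from the storm.py multilang helper that ships with Storm (the function names here are illustrative, not the real API):

    import json
    import sys

    def read_message():
        # Multilang messages are JSON followed by a line containing only "end".
        lines = []
        while True:
            line = sys.stdin.readline()
            if not line:
                raise EOFError("stdin closed by ShellBolt")
            line = line.rstrip("\n")
            if line == "end":
                break
            lines.append(line)
        return json.loads("\n".join(lines))

    def send_message(msg):
        sys.stdout.write(json.dumps(msg) + "\nend\n")
        sys.stdout.flush()

    def bolt_loop(process_tuple):
        while True:
            msg = read_message()
            # ShellBolt's heartbeat request arrives as a tuple on the
            # "__heartbeat" stream with task id -1; the expected reply
            # is a "sync" command.
            if msg.get("task") == -1 and msg.get("stream") == "__heartbeat":
                send_message({"command": "sync"})
            else:
                # While this call runs, nothing is sent back, so the
                # "last heartbeat" value on the Java side stops advancing.
                process_tuple(msg)

So if one of my bolts occasionally blocks in its tuple handler for longer than the timeout, or stops reading stdin entirely, that would explain a stuck "last heartbeat" for just that one subprocess.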
>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>>
>> I filed an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can
>> anyone help?
>>
>> I've got a simple topology running on Storm 1.0.1. The topology consists
>> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
>> the following exception.
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout
>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> More information:
>> 1. The topology runs with ACK mode.
>> 2. The topology has 40 workers.
>> 3. The topology emits about 10 million tuples every 10 minutes.
>>
>> Every time the subprocess heartbeat times out, workers restart and the
>> Python processes exit with exitCode:-1, which affects the processing
>> capacity and stability of the topology.
>>
>> I've checked some related issues in the Storm JIRA. I first found that
>> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946> reported a
>> bug related to this problem and said it had been fixed in Storm 1.0.2.
>> However, I got the same exception even after I upgraded Storm to 1.0.2.
>>
>> I checked other related issues. Let's look at the history of this
>> problem. DashengJu first reported this problem with non-ACK mode in
>> STORM-738 <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed this problem
>> with ACK mode, and it seemed that the bug had been fixed in 0.10.0. I
>> don't know whether this patch is included in the storm-1.x branch. In
>> short, this problem still exists in the latest stable version.
>>
>> --
>> Thanks
>> Zhechao Ma
>>
>> --
>> Thanks
>> Zhechao Ma

--
Thanks
Zhechao Ma
