I found something strange. Looking at a worker.log, I see three subprocess
heartbeat entries each second, logged at roughly .673, .678, and .850 past the
second. Everything looks OK until 16:18:03 (although the entry logged at .673
shows the same "last heartbeat", 1477296962746, the whole time).


...
...
2016-10-24 16:17:59.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:17:59.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
2016-10-24 16:17:59.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079678, last heartbeat : 1477297079165, worker timeout (ms) : 120000
2016-10-24 16:17:59.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:17:59.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297079850, last heartbeat : 1477297079644, worker timeout (ms) : 120000
2016-10-24 16:17:59.901 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:00.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:00.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
2016-10-24 16:18:00.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080678, last heartbeat : 1477297080165, worker timeout (ms) : 120000
2016-10-24 16:18:00.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297080850, last heartbeat : 1477297080814, worker timeout (ms) : 120000
2016-10-24 16:18:00.893 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:00.913 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:01.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:01.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
2016-10-24 16:18:01.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081678, last heartbeat : 1477297081165, worker timeout (ms) : 120000
2016-10-24 16:18:01.706 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:01.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297081850, last heartbeat : 1477297081816, worker timeout (ms) : 120000
2016-10-24 16:18:01.885 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:02.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:02.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
2016-10-24 16:18:02.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082678, last heartbeat : 1477297082166, worker timeout (ms) : 120000
2016-10-24 16:18:02.704 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:02.850 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297082850, last heartbeat : 1477297082843, worker timeout (ms) : 120000
2016-10-24 16:18:02.861 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:03.165 o.a.s.t.ShellBolt [DEBUG] BOLT - sending heartbeat request to subprocess
2016-10-24 16:18:03.673 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083673, last heartbeat : 1477296962746, worker timeout (ms) : 120000
2016-10-24 16:18:03.678 o.a.s.t.ShellBolt [DEBUG] BOLT - current time : 1477297083678, last heartbeat : 1477297083166, worker timeout (ms) : 120000


And then the heartbeat timeout fires at 16:18:03.675, before the third
heartbeat entry (the .850 one) is logged. However, the timeout is set to
120000 ms, i.e. 120 s. How could that happen?

...
...
2016-10-24 16:18:03.675 o.a.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
        at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.util$async_loop$fn__625.invoke(util.clj:484) [storm-core-1.0.1.jar:1.0.1]
        at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
        at java.lang.Thread.run(Thread.java:745) [?:1.7.0_79]
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
        at org.apache.storm.task.ShellBolt.execute(ShellBolt.java:150) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$fn__7953$tuple_action_fn__7955.invoke(executor.clj:728) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$mk_task_receiver$fn__7874.invoke(executor.clj:461) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.disruptor$clojure_handler$reify__7390.onEvent(disruptor.clj:40) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:439) ~[storm-core-1.0.1.jar:1.0.1]
        ... 6 more
Caused by: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
        at org.apache.storm.task.ShellBolt.die(ShellBolt.java:295) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.task.ShellBolt.access$400(ShellBolt.java:70) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
        ... 1 more
Caused by: java.lang.RuntimeException: subprocess heartbeat timeout
        at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322) ~[storm-core-1.0.1.jar:1.0.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[?:1.7.0_79]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) ~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) ~[?:1.7.0_79]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[?:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) ~[?:1.7.0_79]
        ... 1 more
2016-10-24 16:18:03.681 o.a.s.d.executor [ERROR]
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: pid:7243, name:normalizeLog exitCode:-1, errorString:
        at org.apache.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:452) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:418) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:73) ~[storm-core-1.0.1.jar:1.0.1]
        at org.apache.storm.daemon.executor$fn__7953$fn__7966$fn__8019.invoke(executor.clj:847) ~[storm-core-1.0.1.jar:1.0.1]



2016-10-24 15:23 GMT+08:00 Zhechao Ma <[email protected]>:

> I turned on DEBUG-level logging and looked at the BOLT heartbeat
> information; the timeout is 30000 ms and everything looks OK.
>
> 2016-10-21 13:14 GMT+08:00 Zhechao Ma <[email protected]>:
>
>> I will try to do this and reply later. Thanks.
>>
>> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <[email protected]>:
>>
>>> Could you change your log level to DEBUG and look at the worker's log? If
>>> you use Storm 1.x, you can change the log level from the UI on the fly.
>>> ShellBolt writes logs about the subprocess heartbeat, but at DEBUG level,
>>> since it can produce a lot of output.
>>>
>>> Two lines:
>>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>>> - BOLT - sending heartbeat request to subprocess
>>>
>>> These two lines are logged every second. Please check that the logs exist,
>>> that 'last heartbeat' is updated properly, and that the worker timeout is
>>> set properly.
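>>>
>>> If scanning by eye gets tedious, here is a small hypothetical helper
>>> (Python 3; the regex just matches the log format above) that flags any
>>> entry whose gap exceeds the timeout:
>>>
>>> import re, sys
>>>
>>> pat = re.compile(r"current time : (\d+), last heartbeat : (\d+), "
>>>                  r"worker timeout \(ms\) : (\d+)")
>>> for line in open(sys.argv[1]):  # path to the worker.log
>>>     m = pat.search(line)
>>>     if m:
>>>         now, hb, timeout = map(int, m.groups())
>>>         if now - hb > timeout:
>>>             print("timeout exceeded:", line.rstrip())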
>>>
>>> On Fri, 21 Oct 2016 at 13:59, Zhechao Ma <[email protected]> wrote:
>>>
>>>> I do not set "topology.subprocess.timeout.secs", so according to
>>>> STORM-1314 "supervisor.worker.timeout.secs" is used, which is set to 30
>>>> for my cluster.
>>>> 30 seconds is a very, very big value; processing one of my tuples never
>>>> takes more than 30 seconds.
>>>> I think the problem must be somewhere else.
>>>>
>>>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:
>>>>
>>>> There are many situations in which ShellBolt can hit the heartbeat issue,
>>>> and at least STORM-1946 is not the case here.
>>>>
>>>> How long does one of your tuples take to process? You need to set the
>>>> subprocess timeout ("topology.subprocess.timeout.secs") higher than the
>>>> maximum processing time. You can even set it to a fairly big value so that
>>>> the subprocess heartbeat issue never happens.
>>>>
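>>>> If it's unclear which value actually reaches the subprocess, note that
>>>> the multilang handshake passes the merged topology config down to the
>>>> shell component, so you can log it from the Python side. A minimal
>>>> sketch (the bolt name is illustrative):
>>>>
>>>> import storm
>>>>
>>>> class PeekTimeoutBolt(storm.BasicBolt):
>>>>     def initialize(self, stormconf, context):
>>>>         # stormconf is the topology config delivered in the handshake
>>>>         storm.log("effective topology.subprocess.timeout.secs = %s"
>>>>                   % stormconf.get("topology.subprocess.timeout.secs"))
>>>>
>>>>     def process(self, tup):
>>>>         storm.emit(tup.values)  # pass-through; BasicBolt acks for us
>>>>
>>>> PeekTimeoutBolt().run()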
>>>>
>>>> ShellBolt requires that each tuple is handled and acked within the
>>>> heartbeat timeout. I tried to change this behavior so that the subprocess
>>>> sends heartbeats periodically on its own, but had no luck because of the
>>>> GIL (global interpreter lock; the same applies to Ruby). We have to choose
>>>> one: keep this restriction, or disable the subprocess heartbeat.
>>>>
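>>>> To make the restriction concrete: storm.py answers the heartbeat tuple
>>>> in the same single-threaded loop that calls process(), so one process()
>>>> call that blocks longer than the timeout trips "subprocess heartbeat
>>>> timeout" even when the bolt is otherwise healthy. A minimal sketch
>>>> (illustrative, not anyone's actual bolt):
>>>>
>>>> import time
>>>> import storm
>>>>
>>>> class SlowBolt(storm.BasicBolt):
>>>>     def process(self, tup):
>>>>         # While this call blocks, the loop in storm.py cannot read the
>>>>         # next heartbeat tuple, so ShellBolt never sees a reply.
>>>>         time.sleep(40)  # longer than a 30 s timeout -> worker dies
>>>>         storm.emit([tup.values[0]])
>>>>
>>>> SlowBolt().run()
>>>>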
>>>> I hope we can resolve this issue cleanly, but I suspect a multi-threaded
>>>> approach doesn't work in Python, Ruby, or any other language with a GIL,
>>>> and I have no ideas for alternatives.
>>>>
>>>> - Jungtaek Lim (HeartSaVioR).
>>>>
>>>> On Fri, 21 Oct 2016 at 11:44, Zhechao Ma <[email protected]> wrote:
>>>>
>>>> I filed an issue (STORM-2150
>>>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can anyone
>>>> help?
>>>>
>>>> I've got a simple topology running on Storm 1.0.1. The topology consists
>>>> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
>>>> the following exception.
>>>>
>>>> java.lang.RuntimeException: subprocess heartbeat timeout
>>>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>   at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> More information:
>>>> 1. The topology runs with ACK mode.
>>>> 2. The topology has 40 workers.
>>>> 3. The topology emits about 10 million tuples every 10 minutes.
>>>>
>>>> Every time the subprocess heartbeat times out, workers restart and the
>>>> Python processes exit with exitCode:-1, which hurts the processing
>>>> capacity and stability of the topology.
>>>>
>>>> I've checked some related issues in the Storm JIRA. I first found
>>>> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946>, which
>>>> reported a bug related to this problem and said it had been fixed in Storm
>>>> 1.0.2. However, I got the same exception even after upgrading Storm to
>>>> 1.0.2.
>>>>
>>>> I checked other related issues; here is the history of this problem.
>>>> DashengJu first reported it in non-ACK mode in STORM-738
>>>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>>>> <https://issues.apache.org/jira/browse/STORM-742> discussed an approach
>>>> for ACK mode, and that bug seemed to be fixed in 0.10.0. I don't know
>>>> whether that patch is included in the storm-1.x branch. In short, the
>>>> problem still exists in the latest stable version.
>>>>
>>>>
>>>>
>>
>
>
> --
> Thanks
> Zhechao Ma
>



-- 
Thanks
Zhechao Ma
