I opened the DEBUG-level log and can see the BOLT heartbeat information; the timeout is 30000ms and everything looks OK.
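For reference, a minimal sketch (assuming a plain Java topology submitted with StormSubmitter; the class name, topology name, and the 120-second value are placeholders, not from this thread) of how the "topology.subprocess.timeout.secs" key discussed below could be raised explicitly:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubprocessTimeoutExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... KafkaSpout and ShellBolt wiring elided ...

        Config conf = new Config();
        conf.setNumWorkers(40);
        // When this key is unset, ShellBolt falls back to
        // supervisor.worker.timeout.secs (30s on this cluster), per STORM-1314.
        conf.put("topology.subprocess.timeout.secs", 120);

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}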
2016-10-21 13:14 GMT+08:00 Zhechao Ma <[email protected]>:

> I will try to do this and reply later. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <[email protected]>:
>
>> Could you modify your log level to DEBUG and check the worker's log? If you
>> use Storm 1.x you can modify the log level from the UI on the fly.
>> ShellBolt writes logs regarding the subprocess heartbeat, but at DEBUG
>> level since it could produce lots of logs.
>>
>> Two lines:
>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>> - BOLT - sending heartbeat request to subprocess
>>
>> These two lines are logged every second. Please check that the logs are
>> present, that 'last heartbeat' is updated properly, and that the worker
>> timeout is set properly.
>>
>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <[email protected]> wrote:
>>
>>> I do not set "topology.subprocess.timeout.secs", so
>>> "supervisor.worker.timeout.secs" will be used according to STORM-1314,
>>> which is set to 30 for my cluster.
>>> 30 seconds is a very, very big value; it will never take more than
>>> 30 seconds to process my tuple.
>>> I think the problem must be somewhere else.
>>>
>>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:
>>>
>>> There are many situations in which ShellBolt can hit the heartbeat issue,
>>> and at least STORM-1946 is not the case here.
>>>
>>> How long does your tuple take to be processed? You need to set the
>>> subprocess timeout ("topology.subprocess.timeout.secs") higher than the
>>> maximum processing time. You can even set it to a fairly big value so
>>> that the subprocess heartbeat issue will not happen.
>>>
>>> ShellBolt requires that each tuple is handled and acked within the
>>> heartbeat timeout. I tried to change this behavior so that the subprocess
>>> periodically sends heartbeats itself, but had no luck because of the GIL -
>>> global interpreter lock (same for Ruby). We need to choose one: keep this
>>> restriction, or disable the subprocess heartbeat.
>>>
>>> I hope that we can resolve this issue cleanly, but I guess a multi-threaded
>>> approach doesn't work on Python, Ruby, or any other language which uses a
>>> GIL, and I have no idea about alternatives.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>>>
>>> I made an issue (STORM-2150
>>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can
>>> anyone help?
>>>
>>> I've got a simple topology running with Storm 1.0.1. The topology
>>> consists of a KafkaSpout and several Python multilang ShellBolts. I
>>> frequently get the following exception:
>>>
>>> java.lang.RuntimeException: subprocess heartbeat timeout
>>>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> More information here:
>>> 1. The topology runs with ACK mode.
>>> 2. The topology has 40 workers.
>>> 3. The topology emits about 10 million tuples every 10 minutes.
>>>
>>> Every time the subprocess heartbeat timed out, workers would restart and
>>> the Python processes exited with exitCode:-1, which affected the
>>> processing capacity and stability of the topology.
>>>
>>> I've checked some related issues in the Storm JIRA. I first found that
>>> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946> reported a
>>> bug related to this problem and said the bug had been fixed in Storm
>>> 1.0.2. However, I got the same exception even after I upgraded Storm to
>>> 1.0.2.
>>>
>>> I checked other related issues; let's look at the history of this
>>> problem. DashengJu first reported this problem with non-ACK mode in
>>> STORM-738 <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>>> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
>>> to this problem with ACK mode, and it seemed that the bug had been fixed
>>> in 0.10.0. I don't know whether that patch is included in the storm-1.x
>>> branch. In a word, this problem still exists in the latest stable
>>> version.
>>>
>
> --
> Thanks
> Zhechao Ma
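As a side note to the thread above, a minimal sketch (the class name and script name are hypothetical, not taken from this topology) of the usual Java wrapper for a Python multilang bolt, with the per-tuple constraint Jungtaek describes spelled out in comments:

import java.util.Map;

import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Java-side wrapper that launches the Python subprocess. Whatever the script
// does for one input tuple (process + ack) has to finish within the subprocess
// timeout, otherwise ShellBolt's BoltHeartbeatTimerTask throws the
// "subprocess heartbeat timeout" RuntimeException seen above and the worker restarts.
public class MyPythonBolt extends ShellBolt implements IRichBolt {

    public MyPythonBolt() {
        // "my_bolt.py" is a placeholder; multilang scripts are packaged under
        // the topology jar's resources/ directory.
        super("python", "my_bolt.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}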
