I do not set "topology.subprocess.timeout.secs", so "supervisor.worker.timeout.secs" will be used according to STORM-1314, which is set to 30 for my cluster. 30 seconds is a very, very big value; processing my tuple will never take more than 30 seconds. I think there must be a problem somewhere else.
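(For reference, here is one way to override the subprocess timeout for a single topology at submit time, assuming the topology's main class submits via StormSubmitter so that `-c` options are merged into the topology conf. The jar name, main class, and topology name below are placeholders.)

```shell
# Override the subprocess heartbeat timeout for this one topology at
# submit time; "mytopology.jar", "com.example.MyTopology", and
# "my-topology" are placeholder names.
storm jar mytopology.jar com.example.MyTopology my-topology \
    -c topology.subprocess.timeout.secs=120
```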
2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:

> There are many situations in which ShellBolt can trigger the heartbeat
> issue, and at least STORM-1946 is not the case here.
>
> How long does your tuple take to be processed? You need to set the
> subprocess timeout ("topology.subprocess.timeout.secs") higher than the
> maximum processing time. You can even set it to a fairly big value so
> that the subprocess heartbeat issue will never happen.
>
> ShellBolt requires that each tuple is handled and acked within the
> heartbeat timeout. I struggled to change this behavior so that the
> subprocess periodically sends heartbeats instead, but had no luck because
> of the GIL - global interpreter lock (same for Ruby). We need to choose
> one: stick with this restriction, or disable the subprocess heartbeat.
>
> I hope that we can resolve this issue cleanly, but I guess the
> multi-threaded approach doesn't work in Python, Ruby, or any other
> language which uses a GIL, and I have no idea on alternatives.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>
>> I made an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can
>> anyone help?
>>
>> I've got a simple topology running with Storm 1.0.1. The topology
>> consists of a KafkaSpout and several Python multilang ShellBolts. I
>> frequently get the following exception:
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout
>>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> More information:
>> 1. The topology runs in ACK mode.
>> 2. The topology has 40 workers.
>> 3. The topology emits about 10 million tuples every 10 minutes.
>>
>> Every time the subprocess heartbeat times out, the workers restart and
>> the Python processes exit with exitCode:-1, which affects the processing
>> capacity and stability of the topology.
>>
>> I've checked some related issues in the Storm Jira. I first found that
>> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946> reported a
>> bug related to this problem and said the bug had been fixed in Storm
>> 1.0.2. However, I got the same exception even after I upgraded Storm to
>> 1.0.2.
>>
>> I checked other related issues; let's look at the history of this
>> problem. DashengJu first reported this problem in non-ACK mode in
>> STORM-738 <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed an approach
>> to this problem in ACK mode, and it seems that bug was fixed in 0.10.0.
>> I don't know whether that patch is included in the storm-1.x branch. In
>> a word, this problem still exists in the latest stable version.
>>
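(A minimal illustration of the periodic-heartbeat idea discussed above - this is NOT Storm's actual multilang protocol, just a generic Python sketch: a daemon thread emits heartbeats on a timer while the main thread does the "tuple processing". When the main work blocks on I/O the heartbeats flow fine; the difficulty the thread describes arises when long pure-Python CPU-bound processing holds the GIL and the synchronous multilang stdin/stdout protocol cannot be serviced concurrently.)

```python
import threading
import time

heartbeats = []  # timestamps of each "heartbeat" sent

def heartbeat_loop(stop, interval=0.05):
    # Background heartbeat thread, analogous to the periodic
    # subprocess heartbeat approach described in the email.
    while not stop.is_set():
        heartbeats.append(time.monotonic())
        stop.wait(interval)  # sleep, but wake early if asked to stop

stop = threading.Event()
t = threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True)
t.start()

# Simulate tuple processing that blocks on I/O; time.sleep releases
# the GIL, so the heartbeat thread keeps firing during the "work".
time.sleep(0.5)

stop.set()
t.join()
print(len(heartbeats))  # several heartbeats fired during the work
```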
