I filed an issue (STORM-2150
<https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago. Can anyone
help?

I've got a simple topology running on Storm 1.0.1. The topology consists
of a KafkaSpout and several Python multilang ShellBolts. I frequently get
the following exception:

java.lang.RuntimeException: subprocess heartbeat timeout
    at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

More information:
1. The topology runs in ACK mode.
2. The topology has 40 workers.
3. The topology emits about 10 million tuples every 10 minutes.

Every time a subprocess heartbeat timeout occurs, the workers restart and
the Python processes exit with exitCode:-1, which hurts the processing
capacity and stability of the topology.
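
To make the failure mode concrete, here is a minimal sketch of how I
understand the check behind this exception: a timer periodically compares
the time since the subprocess last responded against a timeout and raises
once it is exceeded. This is a toy model, not Storm's actual ShellBolt
code; the class name, method names, and the 30-second value are
hypothetical.

```python
import time


class HeartbeatWatchdog:
    """Toy model of a heartbeat watchdog (hypothetical, not Storm code)."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_heartbeat = clock()

    def beat(self):
        # Record that the subprocess just responded.
        self.last_heartbeat = self.clock()

    def check(self):
        # Meant to be called periodically by a timer; raises once the
        # subprocess has been silent for longer than the timeout.
        if self.clock() - self.last_heartbeat > self.timeout_secs:
            raise RuntimeError("subprocess heartbeat timeout")


if __name__ == "__main__":
    fake_now = [0.0]
    watchdog = HeartbeatWatchdog(30, clock=lambda: fake_now[0])
    fake_now[0] = 31.0  # subprocess silent for 31 s, e.g. stuck on a slow tuple
    try:
        watchdog.check()
    except RuntimeError as err:
        print(err)
```

Under this model, a Python bolt that cannot respond for longer than the
timeout would trip the check even though the process is still alive.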

I've checked some related issues in the Storm Jira. I first found STORM-1946
<https://issues.apache.org/jira/browse/STORM-1946>, which reported a bug
related to this problem and said it had been fixed in Storm 1.0.2. However,
I got the same exception even after upgrading Storm to 1.0.2.

I checked other related issues as well. Here is the history of this problem.
DashengJu first reported it in non-ACK mode in STORM-738
<https://issues.apache.org/jira/browse/STORM-738>. STORM-742
<https://issues.apache.org/jira/browse/STORM-742> discussed how to handle
the problem in ACK mode, and it seemed the bug had been fixed in
0.10.0. I don't know whether that patch is included in the storm-1.x branch.
In short, the problem still exists in the latest stable version.
