I do not set "topology.subprocess.timeout.secs", so "supervisor.worker.timeout.secs" will be used according to STORM-1314, which is set to 30 for my cluster. 30 seconds is a very, very big value; processing my tuple will never take more than 30 seconds. I think there must be a problem somewhere else.
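(For reference, here is one way to override the subprocess timeout for a single topology at submit time, assuming the topology's main class submits via StormSubmitter so that `-c` options are merged into the topology conf. The jar name, main class, and topology name below are placeholders.)

```shell
# Override the subprocess heartbeat timeout for this one topology at
# submit time; "mytopology.jar", "com.example.MyTopology", and
# "my-topology" are placeholder names.
storm jar mytopology.jar com.example.MyTopology my-topology \
    -c topology.subprocess.timeout.secs=120
```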
2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:

> There are many situations in which ShellBolt can trigger the heartbeat
> issue, and at least STORM-1946 is not the case here.
>
> How long does your tuple take to be processed? You need to set the
> subprocess timeout ("topology.subprocess.timeout.secs") higher than the
> maximum processing time. You can even set it to a fairly big value so
> that the subprocess heartbeat issue will never happen.
>
> ShellBolt requires that each tuple is handled and acked within the
> heartbeat timeout. I struggled to change this behavior so that the
> subprocess periodically sends heartbeats instead, but had no luck because
> of the GIL - global interpreter lock (same for Ruby). We need to choose
> one: stick with this restriction, or disable the subprocess heartbeat.
>
> I hope that we can resolve this issue cleanly, but I guess the
> multi-threaded approach doesn't work in Python, Ruby, or any other
> language which uses a GIL, and I have no idea on alternatives.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>
>> I made an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can
>> anyone help?
>>
>> I've got a simple topology running with Storm 1.0.1. The topology
>> consists of a KafkaSpout and several Python multilang ShellBolts. I
>> frequently get the following exception:
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout
>>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> More information:
>> 1. The topology runs in ACK mode.
>> 2. The topology has 40 workers.
>> 3. The topology emits about 10 million tuples every 10 minutes.
>>
>> Every time the subprocess heartbeat times out, the workers restart and
>> the Python processes exit with exitCode:-1, which affects the processing
>> capacity and stability of the topology.
>>
>> I've checked some related issues in the Storm Jira. I first found that
>> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946> reported a
>> bug related to this problem and said the bug had been fixed in Storm
>> 1.0.2. However, I got the same exception even after I upgraded Storm to
>> 1.0.2.
>>
>> I checked other related issues; let's look at the history of this
>> problem. DashengJu first reported this problem in non-ACK mode in
>> STORM-738 <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed an approach
>> to this problem in ACK mode, and it seems that bug was fixed in 0.10.0.
>> I don't know whether that patch is included in the storm-1.x branch. In
>> a word, this problem still exists in the latest stable version.
>>
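(A minimal illustration of the periodic-heartbeat idea discussed above - this is NOT Storm's actual multilang protocol, just a generic Python sketch: a daemon thread emits heartbeats on a timer while the main thread does the "tuple processing". When the main work blocks on I/O the heartbeats flow fine; the difficulty the thread describes arises when long pure-Python CPU-bound processing holds the GIL and the synchronous multilang stdin/stdout protocol cannot be serviced concurrently.)

```python
import threading
import time

heartbeats = []  # timestamps of each "heartbeat" sent

def heartbeat_loop(stop, interval=0.05):
    # Background heartbeat thread, analogous to the periodic
    # subprocess heartbeat approach described in the email.
    while not stop.is_set():
        heartbeats.append(time.monotonic())
        stop.wait(interval)  # sleep, but wake early if asked to stop

stop = threading.Event()
t = threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True)
t.start()

# Simulate tuple processing that blocks on I/O; time.sleep releases
# the GIL, so the heartbeat thread keeps firing during the "work".
time.sleep(0.5)

stop.set()
t.join()
print(len(heartbeats))  # several heartbeats fired during the work
```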
