I opened the DEBUG-level log and can see the BOLT heartbeat information; the timeout is 30000ms and everything looks OK.
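For reference, a minimal sketch (assuming a plain Java topology submitted with StormSubmitter; the class name, topology name, and the 120-second value are placeholders, not from this thread) of how the "topology.subprocess.timeout.secs" key discussed below could be raised explicitly:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubprocessTimeoutExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... KafkaSpout and ShellBolt wiring elided ...

        Config conf = new Config();
        conf.setNumWorkers(40);
        // When this key is unset, ShellBolt falls back to
        // supervisor.worker.timeout.secs (30s on this cluster), per STORM-1314.
        conf.put("topology.subprocess.timeout.secs", 120);

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}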
2016-10-21 13:14 GMT+08:00 Zhechao Ma <[email protected]>:

> I will try to do this and reply later. Thanks.
>
> 2016-10-21 13:09 GMT+08:00 Jungtaek Lim <[email protected]>:
>
>> Could you modify your log level to DEBUG and check the worker's log? If you
>> use Storm 1.x you can modify the log level from the UI on the fly.
>> ShellBolt writes logs regarding the subprocess heartbeat, but at DEBUG
>> level since it could produce lots of logs.
>>
>> Two lines:
>> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
>> - BOLT - sending heartbeat request to subprocess
>>
>> These two lines are logged every second. Please check that the logs are
>> present, that 'last heartbeat' is updated properly, and that the worker
>> timeout is set properly.
>>
>> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <[email protected]> wrote:
>>
>>> I do not set "topology.subprocess.timeout.secs", so
>>> "supervisor.worker.timeout.secs" will be used according to STORM-1314,
>>> which is set to 30 for my cluster.
>>> 30 seconds is a very, very big value; it will never take more than
>>> 30 seconds to process my tuple.
>>> I think the problem must be somewhere else.
>>>
>>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:
>>>
>>> There are many situations in which ShellBolt can hit the heartbeat issue,
>>> and at least STORM-1946 is not the case here.
>>>
>>> How long does your tuple take to be processed? You need to set the
>>> subprocess timeout ("topology.subprocess.timeout.secs") higher than the
>>> maximum processing time. You can even set it to a fairly big value so
>>> that the subprocess heartbeat issue will not happen.
>>>
>>> ShellBolt requires that each tuple is handled and acked within the
>>> heartbeat timeout. I tried to change this behavior so that the subprocess
>>> periodically sends heartbeats itself, but had no luck because of the GIL -
>>> global interpreter lock (same for Ruby). We need to choose one: keep this
>>> restriction, or disable the subprocess heartbeat.
>>>
>>> I hope that we can resolve this issue cleanly, but I guess a multi-threaded
>>> approach doesn't work on Python, Ruby, or any other language which uses a
>>> GIL, and I have no idea about alternatives.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>>>
>>> I made an issue (STORM-2150
>>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago; can
>>> anyone help?
>>>
>>> I've got a simple topology running with Storm 1.0.1. The topology
>>> consists of a KafkaSpout and several Python multilang ShellBolts. I
>>> frequently get the following exception:
>>>
>>> java.lang.RuntimeException: subprocess heartbeat timeout
>>>     at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> More information here:
>>> 1. The topology runs with ACK mode.
>>> 2. The topology has 40 workers.
>>> 3. The topology emits about 10 million tuples every 10 minutes.
>>>
>>> Every time the subprocess heartbeat timed out, workers would restart and
>>> the Python processes exited with exitCode:-1, which affected the
>>> processing capacity and stability of the topology.
>>>
>>> I've checked some related issues in the Storm JIRA. I first found that
>>> STORM-1946 <https://issues.apache.org/jira/browse/STORM-1946> reported a
>>> bug related to this problem and said the bug had been fixed in Storm
>>> 1.0.2. However, I got the same exception even after I upgraded Storm to
>>> 1.0.2.
>>>
>>> I checked other related issues; let's look at the history of this
>>> problem. DashengJu first reported this problem with non-ACK mode in
>>> STORM-738 <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>>> <https://issues.apache.org/jira/browse/STORM-742> discussed the approach
>>> to this problem with ACK mode, and it seemed that the bug had been fixed
>>> in 0.10.0. I don't know whether that patch is included in the storm-1.x
>>> branch. In a word, this problem still exists in the latest stable
>>> version.
>>>
>
> --
> Thanks
> Zhechao Ma
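As a side note to the thread above, a minimal sketch (the class name and script name are hypothetical, not taken from this topology) of the usual Java wrapper for a Python multilang bolt, with the per-tuple constraint Jungtaek describes spelled out in comments:

import java.util.Map;

import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Java-side wrapper that launches the Python subprocess. Whatever the script
// does for one input tuple (process + ack) has to finish within the subprocess
// timeout, otherwise ShellBolt's BoltHeartbeatTimerTask throws the
// "subprocess heartbeat timeout" RuntimeException seen above and the worker restarts.
public class MyPythonBolt extends ShellBolt implements IRichBolt {

    public MyPythonBolt() {
        // "my_bolt.py" is a placeholder; multilang scripts are packaged under
        // the topology jar's resources/ directory.
        super("python", "my_bolt.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}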
