I will try this and reply later. Thanks.

2016-10-21 13:09 GMT+08:00 Jungtaek Lim <[email protected]>:

> Could you set your log level to DEBUG and check the worker's log? If you use
> Storm 1.x, you can change the log level from the UI on the fly.
> ShellBolt logs the subprocess heartbeat, but only at DEBUG level,
> since it can produce a lot of output.
>
> Two lines:
> - BOLT - current time : {}, last heartbeat : {}, worker timeout (ms) : {}
> - BOLT - sending heartbeat request to subprocess
>
> Both lines are logged once per second. Please check that the logs exist,
> that 'last heartbeat' is updated properly, and that the worker timeout is
> set properly.
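
A minimal sketch of that check, assuming the two timestamps are epoch milliseconds (as the "(ms)" label suggests) and that the worker log line matches the template quoted above exactly. `check_heartbeats` and the sample lines are hypothetical, not taken from Storm or from a real worker log:

```python
import re

# Pattern for the DEBUG line quoted above; numeric epoch-ms values are an assumption.
HEARTBEAT_RE = re.compile(
    r"BOLT - current time : (\d+), last heartbeat : (\d+), "
    r"worker timeout \(ms\) : (\d+)"
)


def check_heartbeats(lines):
    """Return (current, last, timeout) tuples whose heartbeat gap exceeds the timeout."""
    stale = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match is None:
            continue  # not a heartbeat status line
        current, last, timeout = (int(g) for g in match.groups())
        if current - last > timeout:
            stale.append((current, last, timeout))
    return stale


# Hypothetical sample lines for illustration only.
sample = [
    "BOLT - current time : 1477026000000, last heartbeat : 1477025999000, worker timeout (ms) : 30000",
    "BOLT - sending heartbeat request to subprocess",
    "BOLT - current time : 1477026040000, last heartbeat : 1477025999000, worker timeout (ms) : 30000",
]
print(check_heartbeats(sample))
```

Any tuple it returns marks a moment where 'last heartbeat' had not advanced within the configured timeout.
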
>
> On Fri, Oct 21, 2016 at 1:59 PM, Zhechao Ma <[email protected]> wrote:
>
>> I have not set "topology.subprocess.timeout.secs", so
>> "supervisor.worker.timeout.secs" is used instead, according to STORM-1314;
>> it is set to 30 in my cluster.
>> 30 seconds is a very large value; processing a tuple never takes more
>> than 30 seconds.
>> I think the problem must be somewhere else.
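
The fallback described above can be sketched as follows. This only mirrors the behaviour as described for STORM-1314, not Storm's actual code; `subprocess_timeout_ms` and the plain dict standing in for the cluster configuration are hypothetical:

```python
def subprocess_timeout_ms(conf):
    """Pick the subprocess heartbeat timeout, in milliseconds, from a config mapping."""
    # Prefer the topology-level setting when it is present.
    secs = conf.get("topology.subprocess.timeout.secs")
    if secs is None:
        # Otherwise fall back to the supervisor-level worker timeout.
        secs = conf["supervisor.worker.timeout.secs"]
    return int(secs) * 1000


# Hypothetical stand-in for the cluster configuration described above.
cluster_conf = {"supervisor.worker.timeout.secs": 30}
print(subprocess_timeout_ms(cluster_conf))
```
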
>>
>> 2016-10-21 11:11 GMT+08:00 Jungtaek Lim <[email protected]>:
>>
>> There are many situations in which ShellBolt can trigger a heartbeat issue,
>> and STORM-1946, at least, is not the cause here.
>>
>> How long does a tuple take to be processed? You need to set the subprocess
>> timeout ("topology.subprocess.timeout.secs") higher than the maximum
>> processing time. You can even set it to a fairly large value so that the
>> subprocess heartbeat issue does not occur.
>>
>>
>> ShellBolt requires that each tuple be handled and acked within the heartbeat
>> timeout. I tried hard to change this behavior so that the subprocess sends
>> heartbeats periodically, but had no luck because of the GIL, the global
>> interpreter lock (the same applies to Ruby). We need to choose one: stick
>> with this restriction, or disable the subprocess heartbeat.
>>
>> I hope we can resolve this issue cleanly, but I guess a multi-threaded
>> approach doesn't work in Python, Ruby, or any language that uses a GIL, and
>> I have no ideas for alternatives.
>>
>> - Jungtaek Lim (HeartSaVioR).
>>
>> On Fri, Oct 21, 2016 at 11:44 AM, Zhechao Ma <[email protected]> wrote:
>>
>> I filed an issue (STORM-2150
>> <https://issues.apache.org/jira/browse/STORM-2150>) 3 days ago. Can anyone
>> help?
>>
>> I have a simple topology running on Storm 1.0.1. The topology consists
>> of a KafkaSpout and several Python multilang ShellBolts. I frequently get
>> the following exception.
>>
>> java.lang.RuntimeException: subprocess heartbeat timeout
>>   at org.apache.storm.task.ShellBolt$BoltHeartbeatTimerTask.run(ShellBolt.java:322)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> More information:
>> 1. The topology runs in ACK mode.
>> 2. The topology has 40 workers.
>> 3. The topology emits about 10 million tuples every 10 minutes.
>>
>> Every time the subprocess heartbeat times out, the workers restart and the
>> Python processes exit with exitCode:-1, which hurts the processing capacity
>> and stability of the topology.
>>
>> I've checked some related issues in the Storm Jira. I first found that STORM-1946
>> <https://issues.apache.org/jira/browse/STORM-1946> reported a bug related
>> to this problem and said it had been fixed in Storm 1.0.2. However, I got
>> the same exception even after upgrading Storm to 1.0.2.
>>
>> I checked other related issues. Let's look at the history of this problem.
>> DashengJu first reported it in non-ACK mode in STORM-738
>> <https://issues.apache.org/jira/browse/STORM-738>. STORM-742
>> <https://issues.apache.org/jira/browse/STORM-742> discussed how to handle
>> this problem in ACK mode, and it seemed that the bug had been fixed in
>> 0.10.0. I don't know whether that patch is included in the storm-1.x branch.
>> In short, this problem still exists in the latest stable version.
>>
>>
>>
