The spout doesn't need an additional heartbeat tuple, since the sync message from the subprocess (PHP) side should be sent continuously if it's working correctly. ShellSpout treats any message from the subprocess as a heartbeat, so I recommend changing the log level to DEBUG and checking the log messages related to multilang heartbeats from ShellSpout.
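For context, this is roughly what that sync traffic looks like from the subprocess side: a minimal Python sketch of the multilang JSON-over-stdio exchange (a PHP adapter would do the equivalent). The `next_tuple` callback is a hypothetical placeholder of mine, and I'm omitting the initial handshake and ack/fail handling:

```python
import json
import sys

def send(msg):
    # Multilang messages are JSON followed by a line containing only "end".
    print(json.dumps(msg))
    print("end")
    sys.stdout.flush()

def read_message():
    # Read lines until the "end" terminator, then parse the JSON payload.
    lines = []
    while True:
        line = sys.stdin.readline().rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))

def spout_loop(next_tuple):
    # Simplified: handshake, ack and fail commands are omitted.
    while True:
        msg = read_message()
        if msg.get("command") == "next":
            for values in next_tuple():
                send({"command": "emit", "tuple": values})
        # The trailing sync is the "sign of life" ShellSpout observes:
        # as long as the subprocess keeps answering, the heartbeat stays fresh.
        send({"command": "sync"})
```

So a spout subprocess that is processing commands normally produces a steady stream of messages, which is why no separate heartbeat tuple is needed on the spout side.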
Please let me know if it doesn't help. Sharing your PHP code would be helpful, too.

Jungtaek Lim (HeartSaVioR)

On Friday, November 14, 2014, William Oberman <[email protected]> wrote:

> I tried applying https://github.com/apache/storm/pull/286 for STORM-513,
> but my ShellSpout claimed to have a heartbeat timeout and killed itself
> (even though the PHP spout was emitting messages). I looked at the
> py/js/rb examples in that patch, and the heartbeat only seems to apply to
> the Bolts. So, no idea...
>
> I then applied the patch for STORM-329:
> https://github.com/apache/storm/pull/268.patch
>
> And killed a random PHP Bolt, and the 30 * 3 min retry loop went entirely
> away!!! So, I *do* have a netty issue solved by that patch. But then
> that subgraph just sat there for 16 minutes and 30 seconds doing nothing
> before starting again. If 6 minutes 30 seconds makes more sense, I do have
> a large message timeout on this topology (10 minutes).
>
> Other than setting debug on the topology (which I have on), is there any
> other kind of debugging I can enable? During that period of 990 seconds
> the Storm UI doesn't show anything wrong, and there is no activity in the
> storm logs (other than __metrics).
>
> Thanks again for all of the help!
>
> will
>
>
> On Thu, Nov 13, 2014 at 1:59 PM, Itai Frenkel <[email protected]> wrote:
>
>> If the problems are related to netty, try using a semi-distributed
>> topology. For example, instead of deploying 1 topology on 3 machines, use
>> 3 topologies, each one on its own machine (isolation scheduler). Assuming
>> the 3 spouts could live side by side (1 on each machine), you do not have
>> a netty issue, since each machine is on its own.
>>
>>
>> ------------------------------
>> *From:* William Oberman <[email protected]>
>> *Sent:* Thursday, November 13, 2014 8:39 PM
>> *To:* [email protected]
>> *Subject:* Re: "graceful" multilang failures
>>
>> STORM-513 is really interesting.
>> Unless I'm totally misunderstanding, heartbeat was added to the protocol.
>> My only concern there is that PHP is single threaded, so I wonder how I'll
>> handle cases where the tuples take a long time to process (I have 5
>> different types of background processing, some high volume / low latency,
>> but others low volume / high latency). I guess I'll just have to make sure
>> heartbeat timeout >= maximum expected tuple processing time...
>>
>> Option 1 is kind of happening now (the OOM breaks STDIN). I've just
>> been unlucky (?) in that I've hit bugs that either grind to a halt forever
>> (the 0.9.2 issue that I still don't have a reference for), or grind for a
>> very long time (the 0.9.3 issue involves Netty retries, which take ~90
>> minutes before everything starts moving at full speed... as far as I can
>> tell based on logging, 90 minutes = 30 retries of something that takes 3
>> minutes to time out).
>>
>> will
>>
>>
>> On Thu, Nov 13, 2014 at 12:47 PM, Itai Frenkel <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> This is what I would do if I were you:
>>>
>>> Option 1:
>>>
>>> Self-destruct the PHP process if the PHP code detects it is not healthy.
>>> This would cause the worker to crash (since it cannot write to stdout),
>>> which would be restarted by the supervisor (eventually; depending on the
>>> settings it could take a minute even)... and the new worker would also
>>> start a new PHP process.
>>>
>>> Option 2:
>>>
>>> Compile 0.9.2 with this patch https://github.com/apache/storm/pull/286
>>> and https://issues.apache.org/jira/browse/STORM-513
>>>
>>> If it works as advertised, when the PHP stops responding to heartbeats
>>> in a timely manner, the worker would kill the PHP process and then
>>> self-destruct (a new worker would be restarted by the supervisor...
>>> etc.). You would also need to update the PHP implementation to support
>>> heartbeats, like the nodejs/ruby/python ones were updated.
>>> In any case, I would advise that the PHP process listen for SIGTERM,
>>> which means the parent process died - like so:
>>> http://stackoverflow.com/questions/24930670/execute-function-in-php-before-sigterm
>>> and in the handler, self-destruct the PHP process. This ensures that
>>> there are no zombie PHP processes running wild.
>>>
>>> Regards,
>>>
>>> Itai
>>>
>>>
>>> ------------------------------
>>> *From:* William Oberman <[email protected]>
>>> *Sent:* Thursday, November 13, 2014 7:05 PM
>>> *To:* [email protected]
>>> *Subject:* "graceful" multilang failures
>>>
>>> I was wondering if there is a way to force a graceful failure for a
>>> multilang Bolt.
>>>
>>> Basically, I have a web app written in PHP (that's never going to
>>> change, unfortunately). It has highly parallelizable backend processing
>>> (also written in PHP, of course, to reuse the biz logic) that I used to
>>> self-manage (e.g. running N PHP processes across M machines). Wrapping
>>> this system in Storm simplifies my IT world.
>>>
>>> So far so good.
>>>
>>> But the PHP code has memory leaks. Previously I didn't care; I just had
>>> the PHP restart itself on OOM failure. But in Storm, things keep
>>> grinding to a halt. I started using Storm in 0.9.2 and hit one bug (I
>>> forgot the case #, but basically the ShellBolt wouldn't die even though
>>> the managed PHP process did, eventually starving the cluster). I'm
>>> trying 0.9.3-rc1 and I hit another bug (STORM-404, which solves
>>> STORM-329... I think...).
>>>
>>> I'm wondering if there is a way to have PHP terminate itself in a
>>> controlled way that allows Storm to quickly heal itself (since I can
>>> have PHP watch itself in terms of memory usage). Looking at the
>>> multilang protocol, I don't see a "stop" concept. And since ShellBolt
>>> "pushes" to PHP, I *think* calling exit() will be no different than
>>> OOM'ing (as both just break the STDIN/STDOUT pipes), other than the
>>> status code returned.
>>> Yes, in parallel I'm trying to solve all the memory leaks, but that's
>>> looking to be a big chore.
>>>
>>> And maybe I'm just getting unlucky on hitting edge cases where Storm
>>> grinds to a halt?
>>>
>>> will

--
Name : 임 정택 (Jungtaek Lim)
Blog : http://www.heartsavior.net / http://dev.heartsavior.net
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior
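The self-destruct approach discussed in the thread (watch your own memory, finish the current tuple, then exit at a safe point and let the supervisor restart things) can be sketched like this. This is a minimal Python illustration, not the actual Storm or PHP code: the `handle_tuple` callback and the 256 MB threshold are illustrative assumptions, and a PHP port would use `pcntl_signal` and `memory_get_usage` instead, per the Stack Overflow link above:

```python
import resource  # Unix-only; used to read this process's own peak RSS
import signal
import sys

MEMORY_LIMIT_BYTES = 256 * 1024 * 1024  # illustrative threshold, not a Storm setting

def install_sigterm_handler():
    # Exit cleanly if the parent (the Storm worker) dies and we get SIGTERM,
    # so no zombie subprocess keeps running wild.
    def on_sigterm(signum, frame):
        sys.exit(0)
    signal.signal(signal.SIGTERM, on_sigterm)

def memory_exceeded():
    # ru_maxrss is reported in KiB on Linux; a leaky process only grows,
    # so peak RSS is a reasonable trigger for a voluntary restart.
    usage_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
    return usage_bytes > MEMORY_LIMIT_BYTES

def process_loop(handle_tuple):
    install_sigterm_handler()
    while True:
        handle_tuple()         # finish the current tuple first...
        if memory_exceeded():  # ...then self-destruct at a safe point,
            sys.exit(0)        # letting the supervisor restart the worker
```

Exiting between tuples (rather than mid-tuple on OOM) at least makes the failure deterministic, though as noted above, from ShellBolt's point of view it still looks like a broken STDIN/STDOUT pipe.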
