STORM-513 is really interesting. Unless I'm totally misunderstanding, a heartbeat was added to the multilang protocol. My only concern there is that PHP is single-threaded, so I wonder how I'll handle cases where tuples take a long time to process (I have 5 different types of background processing, some high volume / low latency, but others low volume / high latency). I guess I'll just have to make sure the heartbeat timeout >= the maximum expected tuple processing time...
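For anyone following along: as I understand the STORM-513 change, the worker sends heartbeat tuples to the shell component on the `__heartbeat__` stream (task id -1), and the component must answer with a `sync` command. A minimal sketch of what that read loop might look like (Python standing in for the PHP adapter; the message framing follows the multilang spec, the function names are mine):

```python
import json
import sys

def read_message(stream=sys.stdin):
    """Read one multilang message: JSON lines terminated by a lone 'end'."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            # stdin closed almost certainly means the parent worker died
            raise EOFError("stdin closed by parent worker")
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))

def send_message(msg, stream=sys.stdout):
    """Write one multilang message in the same JSON-plus-'end' framing."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()

def run_bolt(process_tuple):
    """Main loop: answer heartbeats promptly, hand real tuples to the bolt."""
    while True:
        msg = read_message()
        # Heartbeat tuples arrive on the __heartbeat__ stream with task -1;
        # replying with sync before the timeout is what keeps the worker
        # from killing this process.
        if msg.get("task") == -1 and msg.get("stream") == "__heartbeat__":
            send_message({"command": "sync"})
            continue
        process_tuple(msg)
```

The catch for a single-threaded runtime is visible right in the loop: while `process_tuple` runs, no heartbeat can be answered, which is exactly why the timeout has to exceed the slowest expected tuple.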
Option 1 is kind of happening now (the OOM breaks STDIN). I've just been unlucky (?) in that I've hit bugs that either grind to a halt forever (the 0.92 issue that I still don't have a reference for), or grind for a very long time (the 0.93 issue involves Netty retries, which take ~90 minutes before everything starts moving at full speed... as far as I can tell from logging, 90 minutes = 30 retries of something that takes 3 minutes to time out).

will

On Thu, Nov 13, 2014 at 12:47 PM, Itai Frenkel <[email protected]> wrote:

> Hi,
>
> This is what I would do if I were you:
>
> Option 1:
> Self-destruct the PHP process if the PHP detects it is not healthy. This
> would cause the worker to crash (since it cannot write to stdout), which
> would be restarted by the supervisor (eventually; depending on the
> settings it could even take a minute)... and the new worker would also
> start a new PHP process.
>
> Option 2:
> Compile 0.9.2 with this patch https://github.com/apache/storm/pull/286
> and https://issues.apache.org/jira/browse/STORM-513
>
> If it works as advertised, when the PHP stops responding to heartbeats in
> a timely manner, the worker would kill the PHP process and then
> self-destruct (a new worker would be restarted by the supervisor...
> etc.). You would also need to update the PHP implementation to support
> heartbeats, like the nodejs/ruby/python implementations were updated.
>
> In any case, I would advise that the PHP process listen for SIGTERM,
> which means the parent process died - like so:
> http://stackoverflow.com/questions/24930670/execute-function-in-php-before-sigterm
> - and self-destruct in the handler. This ensures that there are no zombie
> PHP processes running wild.
>
> Regards,
> Itai
>
> ------------------------------
> *From:* William Oberman <[email protected]>
> *Sent:* Thursday, November 13, 2014 7:05 PM
> *To:* [email protected]
> *Subject:* "graceful" multilang failures
>
> I was wondering if there is a way to force a graceful failure for a
> multilang Bolt.
>
> Basically, I have a web app written in PHP (that's never going to change,
> unfortunately). It has highly parallelizable backend processing (also
> written in PHP, of course, to reuse the biz logic) that I used to manage
> myself (e.g. running N PHP processes across M machines). Wrapping this
> system in Storm simplifies my IT world.
>
> So far so good.
>
> But the PHP code has memory leaks. Previously I didn't care; I just had
> the PHP restart itself on OOM failure. But in Storm, things keep grinding
> to a halt. I started using Storm in 0.92 and hit one bug (I forget the
> case #, but basically the ShellBolt wouldn't die even though the managed
> PHP process did, eventually starving the cluster). I'm trying 0.93-rc1
> and I hit another bug (STORM-404, which solves STORM-329... I think...).
>
> I'm wondering if there is a way to have PHP terminate itself in a
> controlled way that allows Storm to quickly heal itself (since I can have
> PHP watch its own memory usage). Looking at the multilang protocol, I
> don't see a "stop" concept. And since ShellBolt "pushes" to PHP, I
> *think* calling exit() will be no different than OOM'ing (as both just
> break the STDIN/STDOUT pipes), other than the status code returned.
>
> Yes, in parallel I'm trying to solve all the memory leaks, but that's
> looking to be a big chore.
>
> And maybe I'm just getting unlucky, hitting edge cases where Storm
> grinds to a halt?
>
> will
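Itai's two suggestions above - self-destruct once the process is unhealthy, and exit on SIGTERM so no zombie processes outlive a dead worker - could be sketched roughly like this (Python standing in for the PHP worker; the memory ceiling and helper names are my assumptions, not anything from Storm itself):

```python
import resource
import signal
import sys

# Assumed budget; tune to the real workload's leak rate.
MEMORY_CEILING_KB = 512 * 1024  # 512 MB

def install_sigterm_handler():
    """Exit cleanly when the parent worker dies and SIGTERM arrives."""
    def on_sigterm(signum, frame):
        sys.exit(143)  # conventional 128 + SIGTERM
    signal.signal(signal.SIGTERM, on_sigterm)

def over_memory_budget():
    """True once this process's peak RSS exceeds the ceiling.

    Note: ru_maxrss is reported in KB on Linux but bytes on macOS.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_maxrss > MEMORY_CEILING_KB

def maybe_self_destruct():
    # Called between tuples: exiting here breaks the STDIN/STDOUT pipes,
    # so the ShellBolt fails fast and the supervisor restarts the worker
    # with a fresh subprocess - a controlled version of the OOM crash.
    if over_memory_budget():
        sys.exit(1)
```

The key design point, per the thread, is to do the check *between* tuples rather than mid-processing, so the exit never leaves a half-emitted multilang message on the pipe.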
