I tried applying https://github.com/apache/storm/pull/286 for STORM-513, but my ShellSpout claimed to have a heartbeat timeout and killed itself (even though the PHP spout was emitting messages). I looked at the py/js/rb examples in that patch, and the heartbeat only seems to apply to the Bolts. So, no idea...
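For what it's worth, the bolt-side heartbeat handling in the py/js/rb examples boils down to the subprocess replying whenever the parent sends a heartbeat message, so the parent can tell "alive but idle" from "hung". A rough Python sketch of that idea (the framing is the standard multilang JSON-plus-"end" convention, but the exact command names should be checked against the storm.py shipped with the patch, not taken from here):

```python
import json

# Illustrative sketch of bolt-side heartbeat handling in a multilang
# subprocess. The multilang protocol frames each JSON message with a
# line containing "end"; the "heartbeat"/"sync" command names here
# mirror the patch's py/js/rb examples but are assumptions, not a
# verified wire format.

def encode(msg):
    """Frame a message the way the multilang protocol does: JSON, then 'end'."""
    return json.dumps(msg) + "\nend\n"

def handle_message(raw):
    """Return the reply (if any) for one decoded multilang message."""
    msg = json.loads(raw)
    if msg.get("command") == "heartbeat":
        # Answer immediately so the parent ShellBolt knows we're alive,
        # even when no tuples are flowing.
        return encode({"command": "sync"})
    return None  # normal tuples are handled by the regular process() path

if __name__ == "__main__":
    print(handle_message('{"command": "heartbeat"}'), end="")
```

If the spout side really has no heartbeat in the patch, a loop like this would only help the Bolts, which would be consistent with the ShellSpout timeout above.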
I then applied the patch for STORM-329: https://github.com/apache/storm/pull/268.patch, killed a random PHP Bolt, and the 30*3min retry loop went entirely away! So I *do* have a netty issue solved by that patch.

But then that subgraph just sat there for 16 minutes and 30 seconds doing nothing before starting again. If 6 minutes 30 seconds makes more sense, I do have a large message timeout on this topology (10 minutes). Other than setting debug on the topology (which I have on), is there any other kind of debugging I can enable? During that period of 990 seconds the Storm UI doesn't show anything wrong, and there is no activity in the storm logs (other than __metrics).

Thanks again for all of the help!

will

On Thu, Nov 13, 2014 at 1:59 PM, Itai Frenkel <[email protected]> wrote:

> If the problems are related to netty, try to use a semi-distributed
> topology. For example, instead of deploying 1 topology on 3 machines, use 3
> topologies, each on its own machine (isolation scheduler). Assuming 3
> spouts could live side by side (1 on each machine), you do not have netty
> issues since each machine is on its own.
>
> ------------------------------
> *From:* William Oberman <[email protected]>
> *Sent:* Thursday, November 13, 2014 8:39 PM
> *To:* [email protected]
> *Subject:* Re: "graceful" multilang failures
>
> STORM-513 is really interesting. Unless I'm totally misunderstanding,
> heartbeat was added to the protocol. My only concern there is that PHP is
> single-threaded, so I wonder how I'll handle cases where tuples take a long
> time to process (I have 5 different types of background processing, some
> high volume / low latency, but others low volume / high latency). I guess
> I'll just have to make sure heartbeat timeout >= maximum expected tuple
> processing time...
>
> Option 1 is kind of happening now (the OOM breaks STDIN). I've just been
> unlucky (?) that I've hit bugs that either grind to a halt forever (the
> 0.92 issue that I still don't have a reference for), or grind for a very
> long time (the 0.93 issue involves Netty retries, which take ~90 minutes
> before everything starts moving at full speed... as far as I can tell based
> on logging, 90 minutes = 30 retries of something that takes 3 minutes to
> time out).
>
> will
>
> On Thu, Nov 13, 2014 at 12:47 PM, Itai Frenkel <[email protected]> wrote:
>
>> Hi,
>>
>> This is what I would do if I were you:
>>
>> Option 1:
>>
>> Self-destruct the PHP process if the PHP detects it is not healthy.
>> This would cause the worker to crash (since it cannot write to stdout),
>> which would be restarted by the supervisor (depending on the settings it
>> could even take a minute)... and the new worker would also start a new
>> PHP process.
>>
>> Option 2:
>>
>> Compile 0.9.2 with this patch: https://github.com/apache/storm/pull/286
>> and https://issues.apache.org/jira/browse/STORM-513
>>
>> If it works as advertised, when the PHP stops responding to heartbeats in
>> a timely manner, the worker will kill the PHP process and then
>> self-destruct (a new worker would be restarted by the supervisor...
>> etc.). You would also need to update the PHP implementation to support
>> heartbeats, like the nodejs/ruby/python implementations were updated.
>>
>> In any case, I would advise that the PHP process listen for SIGTERM,
>> which means the parent process died - like so:
>> http://stackoverflow.com/questions/24930670/execute-function-in-php-before-sigterm
>> - and in the handler self-destruct the PHP process. This ensures that
>> there are no zombie PHP processes running wild.
>>
>> Regards,
>>
>> Itai
>>
>> ------------------------------
>> *From:* William Oberman <[email protected]>
>> *Sent:* Thursday, November 13, 2014 7:05 PM
>> *To:* [email protected]
>> *Subject:* "graceful" multilang failures
>>
>> I was wondering if there is a way to force a graceful failure for a
>> multilang Bolt.
>>
>> Basically, I have a web app written in PHP (that's never going to
>> change, unfortunately). It has highly parallelizable backend processing
>> (also written in PHP, of course, to reuse the biz logic) that I used to
>> self-manage (e.g. running N PHP processes across M machines). Wrapping
>> this system in Storm simplifies my IT world.
>>
>> So far so good.
>>
>> But the PHP code has memory leaks. Previously I didn't care; I just had
>> the PHP restart itself on OOM failure. But in Storm, things keep grinding
>> to a halt. I started using Storm in 0.92 and hit one bug (I forgot the
>> case #, but basically the ShellBolt wouldn't die even though the managed
>> PHP process did, eventually starving the cluster). I'm trying 0.93-rc1
>> and I hit another bug (STORM-404, that solves STORM-329... I think...).
>>
>> I'm wondering if there is a way to have PHP terminate itself in a
>> controlled way that allows Storm to quickly heal itself (since I can have
>> PHP watch itself in terms of memory usage). Looking at the multilang
>> protocol, I don't see a "stop" concept. And since ShellBolt "pushes" to
>> PHP, I *think* calling exit() will be no different than OOM'ing (as both
>> just break the STDIN/STDOUT pipes), other than the status code returned.
>>
>> Yes, in parallel I'm trying to solve all the memory leaks, but that's
>> looking to be a big chore.
>>
>> And maybe I'm just getting unlucky, hitting edge cases where Storm
>> grinds to a halt?
>>
>> will
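[Absent a "stop" concept in the protocol, the "have PHP watch its own memory" idea from the original question might look like the sketch below. Shown in Python; in PHP you would check memory_get_usage(true) instead. The 512 MB threshold and exit code 42 are arbitrary examples, and ru_maxrss units are as reported on Linux (kilobytes).]

```python
import resource
import sys

# Sketch of the self-watchdog idea from the thread: check our own
# memory after each tuple and bail out deliberately once it crosses a
# threshold, rather than waiting for an uncontrolled OOM to break the
# stdin/stdout pipes. The limit is an arbitrary example value.

LIMIT_KB = 512 * 1024  # ru_maxrss is reported in kilobytes on Linux

def memory_exceeded(limit_kb=LIMIT_KB):
    """True once our peak resident set size has crossed the limit."""
    usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return usage_kb > limit_kb

def after_each_tuple():
    # Called from the bolt's process() loop: finish the current tuple,
    # then exit so the supervisor restarts a fresh worker/subprocess.
    if memory_exceeded():
        sys.exit(42)  # arbitrary non-zero code, distinguishable from a crash
```

As the thread notes, to the parent this still just looks like a broken pipe; the only real difference from an OOM is that the exit is taken at a tuple boundary and with a chosen status code.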
