If the problems are related to Netty, try a semi-distributed topology. 
For example, instead of deploying 1 topology on 3 machines, deploy 3 topologies, 
each on its own machine (isolation scheduler). Assuming 3 spouts can live side 
by side (1 on each machine), you avoid the Netty issues, since each machine 
works on its own.


________________________________
From: William Oberman <[email protected]>
Sent: Thursday, November 13, 2014 8:39 PM
To: [email protected]
Subject: Re: "graceful" multilang failures

STORM-513 is really interesting.  Unless I'm totally misunderstanding, 
heartbeat was added to the protocol.  My only concern there is PHP is single 
threaded, so I wonder how I'll handle cases where the tuples take a long time 
to process (I have 5 different types of background processing, some that are 
high volume low latency, but others that are low volume high latency).  I guess 
I'll just have to make sure heartbeat timeout >= maximum expected tuple 
processing time...
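
A minimal sketch of that constraint, written in Python for brevity rather than 
PHP. The job names and durations are hypothetical placeholders, not measured 
values, and the function is only an illustration, not part of the Storm 
protocol:

```python
# Hypothetical worst-case processing time per job type, in seconds.
MAX_PROCESSING_SECONDS = {
    "high_volume_low_latency": 2,
    "low_volume_high_latency": 240,
}


def heartbeat_timeout_is_safe(timeout_seconds, max_processing_seconds):
    """Return True when the timeout covers the slowest expected tuple."""
    return timeout_seconds >= max(max_processing_seconds.values())
```

With these placeholder numbers, a 300 second timeout passes the check and a 
30 second timeout does not.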

Option 1 is kind of happening now (the OOM breaks STDIN).  I've just been 
unlucky (?) that I've hit bugs that either grind to a halt forever (the 0.9.2 
issue that I still don't have a reference for), or grind for a very long time 
(the 0.9.3 issue involves Netty retries, which take ~90 minutes before 
everything starts moving at full speed... as far as I can tell based on 
logging, 90 minutes = 30 retries of something that takes 3 minutes to time out).

will


On Thu, Nov 13, 2014 at 12:47 PM, Itai Frenkel <[email protected]> wrote:

Hi,


This is what I would do if I were you:


Option 1:

Self-destruct the PHP process if it detects that it is not healthy.

This would cause the worker to crash (since it cannot write to stdout); the 
supervisor would then restart it (eventually; depending on the settings it 
could even take a minute)... and the new worker would also start a new PHP 
process.
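
A minimal sketch of such a health check, written in Python for brevity rather 
than PHP. The 256 MiB limit and all function names are assumptions made for 
illustration. The check uses peak resident memory, which is a reasonable proxy 
for a process whose memory only grows because of leaks:

```python
import resource
import sys

# Hypothetical threshold; tune it to the memory actually available per process.
MEMORY_LIMIT_BYTES = 256 * 1024 * 1024


def peak_rss_bytes():
    """Peak resident set size of this process, in bytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return rss if sys.platform == "darwin" else rss * 1024


def is_unhealthy(used_bytes, limit_bytes=MEMORY_LIMIT_BYTES):
    """Return True when memory use has crossed the limit."""
    return used_bytes > limit_bytes


def exit_if_unhealthy():
    """Call between tuples; exits with a non-zero status when over the limit."""
    if is_unhealthy(peak_rss_bytes()):
        sys.exit(1)
```

Calling exit_if_unhealthy() after each processed tuple keeps the exit at a 
tuple boundary instead of in the middle of a unit of work.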


Option 2:

Compile 0.9.2 with this patch https://github.com/apache/storm/pull/286 and 
https://issues.apache.org/jira/browse/STORM-513

If it works as advertised, when the PHP process stops responding to heartbeats 
in a timely manner, the worker would kill it and then self-destruct (the 
supervisor would then start a new worker, etc.). You would also need to update 
the PHP implementation to support heartbeats, as the nodejs/ruby/python 
implementations were.



In any case, I would advise that the PHP process listen for SIGTERM, which 
means the parent process died, like so: 
http://stackoverflow.com/questions/24930670/execute-function-in-php-before-sigterm
In the handler, self-destruct the PHP process. This ensures that there are no 
zombie PHP processes running wild.
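
The same idea as a minimal sketch in Python rather than PHP, assuming the 
process really does receive SIGTERM when the worker goes away. The function 
names are hypothetical:

```python
import signal
import sys


def handle_sigterm(signum, frame):
    """Exit as soon as SIGTERM arrives so no orphaned process lingers."""
    sys.exit(0)


def install_sigterm_handler():
    """Register the handler once, at process start-up."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

install_sigterm_handler() should run before the main tuple-processing loop 
starts.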


Regards,

Itai



________________________________
From: William Oberman <[email protected]>
Sent: Thursday, November 13, 2014 7:05 PM
To: [email protected]
Subject: "graceful" multilang failures

I was wondering if there is a way to force a graceful failure for a multilang 
Bolt.

Basically, I have a web app written in PHP (that's never going to change 
unfortunately).  It has highly parallelizable backend processing (also written 
in PHP of course to reuse the biz logic) that I used to self-manage (e.g. 
running N PHP processes across M machines).  Wrapping this system in Storm 
simplifies my IT world.

So far so good.

But the PHP code has memory leaks.  Previously I didn't care; I just had PHP 
restart itself on OOM failure.  But in Storm, things keep grinding to a 
halt.  I started using Storm in 0.9.2 and hit one bug (I forgot the case #, but 
basically the ShellBolt wouldn't die even though the managed PHP process did, 
eventually starving the cluster).  I'm trying 0.9.3-rc1 and I hit another bug 
(STORM-404 that solves STORM-329... I think...).

I'm wondering if there is a way to have PHP terminate itself in a controlled 
way that allows Storm to quickly heal itself (since I can have PHP watch itself 
in terms of memory usage).  Looking at the multilang protocol, I don't see a 
"stop" concept.  And, since ShellBolt "pushes" to PHP, I *think* calling exit() 
will be no different than OOM'ing (as both just break the STDIN/STDOUT pipes), 
other than the status code returned.

Yes, in parallel I'm trying to solve all memory leaks, but that's looking to be 
a big chore.

And, maybe I'm just getting unlucky on hitting edge cases on Storm grinding to 
a halt?

will

