STORM-513 is really interesting. Unless I'm totally misunderstanding, a heartbeat was added to the multilang protocol. My only concern there is that PHP is single-threaded, so I wonder how I'll handle cases where tuples take a long time to process (I have 5 different types of background processing, some high volume / low latency, but others low volume / high latency). I guess I'll just have to make sure the heartbeat timeout >= the maximum expected tuple processing time...
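For anyone following along: as I understand the STORM-513 change, the worker sends heartbeat tuples to the shell component on the `__heartbeat__` stream (task id -1), and the component must answer with a `sync` command. A minimal sketch of what that read loop might look like (Python standing in for the PHP adapter; the message framing follows the multilang spec, the function names are mine):

```python
import json
import sys

def read_message(stream=sys.stdin):
    """Read one multilang message: JSON lines terminated by a lone 'end'."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            # stdin closed almost certainly means the parent worker died
            raise EOFError("stdin closed by parent worker")
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))

def send_message(msg, stream=sys.stdout):
    """Write one multilang message in the same JSON-plus-'end' framing."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()

def run_bolt(process_tuple):
    """Main loop: answer heartbeats promptly, hand real tuples to the bolt."""
    while True:
        msg = read_message()
        # Heartbeat tuples arrive on the __heartbeat__ stream with task -1;
        # replying with sync before the timeout is what keeps the worker
        # from killing this process.
        if msg.get("task") == -1 and msg.get("stream") == "__heartbeat__":
            send_message({"command": "sync"})
            continue
        process_tuple(msg)
```

The catch for a single-threaded runtime is visible right in the loop: while `process_tuple` runs, no heartbeat can be answered, which is exactly why the timeout has to exceed the slowest expected tuple.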
Option 1 is kind of happening now (the OOM breaks STDIN). I've just been unlucky (?) in that I've hit bugs that either grind to a halt forever (the 0.92 issue that I still don't have a reference for), or grind for a very long time (the 0.93 issue involves Netty retries, which take ~90 minutes before everything starts moving at full speed... as far as I can tell from logging, 90 minutes = 30 retries of something that takes 3 minutes to time out).

will

On Thu, Nov 13, 2014 at 12:47 PM, Itai Frenkel <[email protected]> wrote:

> Hi,
>
> This is what I would do if I were you:
>
> Option 1:
> Self-destruct the PHP process if the PHP detects it is not healthy. This
> would cause the worker to crash (since it cannot write to stdout), which
> would be restarted by the supervisor (eventually; depending on the
> settings it could even take a minute)... and the new worker would also
> start a new PHP process.
>
> Option 2:
> Compile 0.9.2 with this patch https://github.com/apache/storm/pull/286
> and https://issues.apache.org/jira/browse/STORM-513
>
> If it works as advertised, when the PHP stops responding to heartbeats in
> a timely manner, the worker would kill the PHP process and then
> self-destruct (a new worker would be restarted by the supervisor...
> etc.). You would also need to update the PHP implementation to support
> heartbeats, like the nodejs/ruby/python implementations were updated.
>
> In any case, I would advise that the PHP process listen for SIGTERM,
> which means the parent process died - like so:
> http://stackoverflow.com/questions/24930670/execute-function-in-php-before-sigterm
> - and self-destruct in the handler. This ensures that there are no zombie
> PHP processes running wild.
>
> Regards,
> Itai
>
> ------------------------------
> *From:* William Oberman <[email protected]>
> *Sent:* Thursday, November 13, 2014 7:05 PM
> *To:* [email protected]
> *Subject:* "graceful" multilang failures
>
> I was wondering if there is a way to force a graceful failure for a
> multilang Bolt.
>
> Basically, I have a web app written in PHP (that's never going to change,
> unfortunately). It has highly parallelizable backend processing (also
> written in PHP, of course, to reuse the biz logic) that I used to manage
> myself (e.g. running N PHP processes across M machines). Wrapping this
> system in Storm simplifies my IT world.
>
> So far so good.
>
> But the PHP code has memory leaks. Previously I didn't care; I just had
> the PHP restart itself on OOM failure. But in Storm, things keep grinding
> to a halt. I started using Storm in 0.92 and hit one bug (I forget the
> case #, but basically the ShellBolt wouldn't die even though the managed
> PHP process did, eventually starving the cluster). I'm trying 0.93-rc1
> and I hit another bug (STORM-404, which solves STORM-329... I think...).
>
> I'm wondering if there is a way to have PHP terminate itself in a
> controlled way that allows Storm to quickly heal itself (since I can have
> PHP watch its own memory usage). Looking at the multilang protocol, I
> don't see a "stop" concept. And since ShellBolt "pushes" to PHP, I
> *think* calling exit() will be no different than OOM'ing (as both just
> break the STDIN/STDOUT pipes), other than the status code returned.
>
> Yes, in parallel I'm trying to solve all the memory leaks, but that's
> looking to be a big chore.
>
> And maybe I'm just getting unlucky, hitting edge cases where Storm
> grinds to a halt?
>
> will
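Itai's two suggestions above - self-destruct once the process is unhealthy, and exit on SIGTERM so no zombie processes outlive a dead worker - could be sketched roughly like this (Python standing in for the PHP worker; the memory ceiling and helper names are my assumptions, not anything from Storm itself):

```python
import resource
import signal
import sys

# Assumed budget; tune to the real workload's leak rate.
MEMORY_CEILING_KB = 512 * 1024  # 512 MB

def install_sigterm_handler():
    """Exit cleanly when the parent worker dies and SIGTERM arrives."""
    def on_sigterm(signum, frame):
        sys.exit(143)  # conventional 128 + SIGTERM
    signal.signal(signal.SIGTERM, on_sigterm)

def over_memory_budget():
    """True once this process's peak RSS exceeds the ceiling.

    Note: ru_maxrss is reported in KB on Linux but bytes on macOS.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_maxrss > MEMORY_CEILING_KB

def maybe_self_destruct():
    # Called between tuples: exiting here breaks the STDIN/STDOUT pipes,
    # so the ShellBolt fails fast and the supervisor restarts the worker
    # with a fresh subprocess - a controlled version of the OOM crash.
    if over_memory_budget():
        sys.exit(1)
```

The key design point, per the thread, is to do the check *between* tuples rather than mid-processing, so the exit never leaves a half-emitted multilang message on the pipe.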
