If the problems are related to Netty, try using a semi-distributed topology. For example, instead of deploying 1 topology across 3 machines, deploy 3 topologies, each on its own machine (isolation scheduler). Assuming the 3 spouts can live side by side (1 on each machine), you would not have Netty issues, since each machine is on its own.
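A rough sketch of what that isolation-scheduler setup could look like in the Nimbus storm.yaml (the topology names are hypothetical placeholders; the scheduler class name shown is the 0.9.x one):

```yaml
# Nimbus storm.yaml -- assumes Storm 0.9.x package names
storm.scheduler: "backtype.storm.scheduler.IsolationScheduler"

# Pin each (hypothetical) topology to its own dedicated machine
isolation.scheduler.machines:
    "topology-a": 1
    "topology-b": 1
    "topology-c": 1
```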
________________________________
From: William Oberman <[email protected]>
Sent: Thursday, November 13, 2014 8:39 PM
To: [email protected]
Subject: Re: "graceful" multilang failures

STORM-513 is really interesting. Unless I'm totally misunderstanding, a heartbeat was added to the protocol. My only concern there is that PHP is single threaded, so I wonder how I'll handle cases where tuples take a long time to process (I have 5 different types of background processing, some high volume / low latency, others low volume / high latency). I guess I'll just have to make sure the heartbeat timeout >= the maximum expected tuple processing time...

Option 1 is kind of happening now (the OOM breaks STDIN). I've just been unlucky (?) in that I've hit bugs that either grind to a halt forever (the 0.9.2 issue that I still don't have a reference for), or grind for a very long time (the 0.9.3 issue involves Netty retries, which take ~90 minutes before everything starts moving at full speed... as far as I can tell based on logging, 90 minutes = 30 retries of something that takes 3 minutes to time out).

will

On Thu, Nov 13, 2014 at 12:47 PM, Itai Frenkel <[email protected]<mailto:[email protected]>> wrote:

Hi,

This is what I would do if I were you:

Option 1: Have the PHP process self-destruct when it detects it is not healthy. This would cause the worker to crash (since it cannot write to stdout), and the worker would be restarted by the supervisor (depending on the settings, this could take a minute even)... and the new worker would also start a new PHP process.

Option 2: Compile 0.9.2 with this patch https://github.com/apache/storm/pull/286 and https://issues.apache.org/jira/browse/STORM-513. If it works as advertised, when the PHP process stops responding to heartbeats in a timely manner, the worker will kill the PHP process and then self-destruct (a new worker would be restarted by the supervisor... etc.).
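Option 1 (self-destruct when unhealthy) can be sketched in a few lines. This is an illustration in Python rather than PHP, and the memory threshold and function names here are assumptions, not anything from the Storm codebase:

```python
import resource
import sys

# Assumed per-process memory budget; tune to your worker sizing.
MEM_LIMIT_MB = 512

def memory_exceeded(limit_mb=MEM_LIMIT_MB):
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS);
    # this sketch assumes Linux.
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss_kb / 1024.0 > limit_mb

def check_and_self_destruct():
    # Call this between tuples. Exiting breaks the stdin/stdout pipes,
    # the ShellBolt errors out, and the supervisor eventually restarts
    # the worker, which spawns a fresh subprocess -- i.e. Option 1.
    if memory_exceeded():
        sys.stderr.write("memory budget exceeded, self-destructing\n")
        sys.exit(1)
```

Note that exiting between tuples (rather than mid-tuple) keeps the failure clean: the in-flight tuple times out and is replayed, rather than being half-processed.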
You would also need to update the PHP implementation to support heartbeats, the way the nodejs/ruby/python implementations were updated.

In any case, I would advise that the PHP process listen for SIGTERM (which means the parent process died), like so: http://stackoverflow.com/questions/24930670/execute-function-in-php-before-sigterm, and self-destruct in the handler. This ensures there are no zombie PHP processes running wild.

Regards,
Itai

________________________________
From: William Oberman <[email protected]<mailto:[email protected]>>
Sent: Thursday, November 13, 2014 7:05 PM
To: [email protected]<mailto:[email protected]>
Subject: "graceful" multilang failures

I was wondering if there is a way to force a graceful failure for a multilang Bolt.

Basically, I have a web app written in PHP (that's never going to change, unfortunately). It has highly parallelizable backend processing (also written in PHP, of course, to reuse the business logic) that I used to self-manage (e.g. running N PHP processes across M machines). Wrapping this system in Storm simplifies my IT world. So far so good.

But the PHP code has memory leaks. Previously I didn't care; I just had the PHP restart itself on OOM failure. But in Storm, things keep grinding to a halt. I started using Storm on 0.9.2 and hit one bug (I forget the case #, but basically the ShellBolt wouldn't die even though the managed PHP process did, eventually starving the cluster). I'm trying 0.9.3-rc1 and I hit another bug (STORM-404, which solves STORM-329... I think...).

I'm wondering if there is a way to have PHP terminate itself in a controlled way that allows Storm to quickly heal itself (since I can have PHP watch its own memory usage). Looking at the multilang protocol, I don't see a "stop" concept. And since the ShellBolt "pushes" to PHP, I *think* calling exit() will be no different than OOM'ing (as both just break the STDIN/STDOUT pipes), other than the status code returned.
Yes, in parallel I'm trying to fix all the memory leaks, but that's looking to be a big chore. And maybe I'm just getting unlucky, hitting edge cases where Storm grinds to a halt?

will
