I tried applying https://github.com/apache/storm/pull/286 for STORM-513, but my ShellSpout claimed to have a heartbeat timeout and killed itself (even though the PHP spout was emitting messages). I looked at the py/js/rb examples in that patch, and the heartbeat only seems to apply to the Bolts. So, no idea...
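For what it's worth, the bolt-side heartbeat handling in the py/js/rb examples boils down to the subprocess replying whenever the parent sends a heartbeat message, so the parent can tell "alive but idle" from "hung". A rough Python sketch of that idea (the framing is the standard multilang JSON-plus-"end" convention, but the exact command names should be checked against the storm.py shipped with the patch, not taken from here):

```python
import json

# Illustrative sketch of bolt-side heartbeat handling in a multilang
# subprocess. The multilang protocol frames each JSON message with a
# line containing "end"; the "heartbeat"/"sync" command names here
# mirror the patch's py/js/rb examples but are assumptions, not a
# verified wire format.

def encode(msg):
    """Frame a message the way the multilang protocol does: JSON, then 'end'."""
    return json.dumps(msg) + "\nend\n"

def handle_message(raw):
    """Return the reply (if any) for one decoded multilang message."""
    msg = json.loads(raw)
    if msg.get("command") == "heartbeat":
        # Answer immediately so the parent ShellBolt knows we're alive,
        # even when no tuples are flowing.
        return encode({"command": "sync"})
    return None  # normal tuples are handled by the regular process() path

if __name__ == "__main__":
    print(handle_message('{"command": "heartbeat"}'), end="")
```

If the spout side really has no heartbeat in the patch, a loop like this would only help the Bolts, which would be consistent with the ShellSpout timeout above.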
I then applied the patch for STORM-329: https://github.com/apache/storm/pull/268.patch, killed a random PHP Bolt, and the 30*3min retry loop went entirely away! So I *do* have a netty issue solved by that patch.

But then that subgraph just sat there for 16 minutes and 30 seconds doing nothing before starting again. If 6 minutes 30 seconds makes more sense, I do have a large message timeout on this topology (10 minutes). Other than setting debug on the topology (which I have on), is there any other kind of debugging I can enable? During that period of 990 seconds the Storm UI doesn't show anything wrong, and there is no activity in the storm logs (other than __metrics).

Thanks again for all of the help!

will

On Thu, Nov 13, 2014 at 1:59 PM, Itai Frenkel <[email protected]> wrote:

> If the problems are related to netty, try to use a semi-distributed
> topology. For example, instead of deploying 1 topology on 3 machines, use 3
> topologies, each on its own machine (isolation scheduler). Assuming 3
> spouts could live side by side (1 on each machine), you do not have netty
> issues since each machine is on its own.
>
> ------------------------------
> *From:* William Oberman <[email protected]>
> *Sent:* Thursday, November 13, 2014 8:39 PM
> *To:* [email protected]
> *Subject:* Re: "graceful" multilang failures
>
> STORM-513 is really interesting. Unless I'm totally misunderstanding,
> heartbeat was added to the protocol. My only concern there is that PHP is
> single-threaded, so I wonder how I'll handle cases where tuples take a long
> time to process (I have 5 different types of background processing, some
> high volume / low latency, but others low volume / high latency). I guess
> I'll just have to make sure heartbeat timeout >= maximum expected tuple
> processing time...
>
> Option 1 is kind of happening now (the OOM breaks STDIN). I've just been
> unlucky (?) that I've hit bugs that either grind to a halt forever (the
> 0.92 issue that I still don't have a reference for), or grind for a very
> long time (the 0.93 issue involves Netty retries, which take ~90 minutes
> before everything starts moving at full speed... as far as I can tell based
> on logging, 90 minutes = 30 retries of something that takes 3 minutes to
> time out).
>
> will
>
> On Thu, Nov 13, 2014 at 12:47 PM, Itai Frenkel <[email protected]> wrote:
>
>> Hi,
>>
>> This is what I would do if I were you:
>>
>> Option 1:
>>
>> Self-destruct the PHP process if the PHP detects it is not healthy.
>> This would cause the worker to crash (since it cannot write to stdout),
>> which would be restarted by the supervisor (depending on the settings it
>> could even take a minute)... and the new worker would also start a new
>> PHP process.
>>
>> Option 2:
>>
>> Compile 0.9.2 with this patch: https://github.com/apache/storm/pull/286
>> and https://issues.apache.org/jira/browse/STORM-513
>>
>> If it works as advertised, when the PHP stops responding to heartbeats in
>> a timely manner, the worker will kill the PHP process and then
>> self-destruct (a new worker would be restarted by the supervisor...
>> etc.). You would also need to update the PHP implementation to support
>> heartbeats, like the nodejs/ruby/python implementations were updated.
>>
>> In any case, I would advise that the PHP process listen for SIGTERM,
>> which means the parent process died - like so:
>> http://stackoverflow.com/questions/24930670/execute-function-in-php-before-sigterm
>> - and in the handler self-destruct the PHP process. This ensures that
>> there are no zombie PHP processes running wild.
>>
>> Regards,
>>
>> Itai
>>
>> ------------------------------
>> *From:* William Oberman <[email protected]>
>> *Sent:* Thursday, November 13, 2014 7:05 PM
>> *To:* [email protected]
>> *Subject:* "graceful" multilang failures
>>
>> I was wondering if there is a way to force a graceful failure for a
>> multilang Bolt.
>>
>> Basically, I have a web app written in PHP (that's never going to
>> change, unfortunately). It has highly parallelizable backend processing
>> (also written in PHP, of course, to reuse the biz logic) that I used to
>> self-manage (e.g. running N PHP processes across M machines). Wrapping
>> this system in Storm simplifies my IT world.
>>
>> So far so good.
>>
>> But the PHP code has memory leaks. Previously I didn't care; I just had
>> the PHP restart itself on OOM failure. But in Storm, things keep grinding
>> to a halt. I started using Storm in 0.92 and hit one bug (I forgot the
>> case #, but basically the ShellBolt wouldn't die even though the managed
>> PHP process did, eventually starving the cluster). I'm trying 0.93-rc1
>> and I hit another bug (STORM-404, that solves STORM-329... I think...).
>>
>> I'm wondering if there is a way to have PHP terminate itself in a
>> controlled way that allows Storm to quickly heal itself (since I can have
>> PHP watch itself in terms of memory usage). Looking at the multilang
>> protocol, I don't see a "stop" concept. And since ShellBolt "pushes" to
>> PHP, I *think* calling exit() will be no different than OOM'ing (as both
>> just break the STDIN/STDOUT pipes), other than the status code returned.
>>
>> Yes, in parallel I'm trying to solve all the memory leaks, but that's
>> looking to be a big chore.
>>
>> And maybe I'm just getting unlucky, hitting edge cases where Storm
>> grinds to a halt?
>>
>> will
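[Absent a "stop" concept in the protocol, the "have PHP watch its own memory" idea from the original question might look like the sketch below. Shown in Python; in PHP you would check memory_get_usage(true) instead. The 512 MB threshold and exit code 42 are arbitrary examples, and ru_maxrss units are as reported on Linux (kilobytes).]

```python
import resource
import sys

# Sketch of the self-watchdog idea from the thread: check our own
# memory after each tuple and bail out deliberately once it crosses a
# threshold, rather than waiting for an uncontrolled OOM to break the
# stdin/stdout pipes. The limit is an arbitrary example value.

LIMIT_KB = 512 * 1024  # ru_maxrss is reported in kilobytes on Linux

def memory_exceeded(limit_kb=LIMIT_KB):
    """True once our peak resident set size has crossed the limit."""
    usage_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return usage_kb > limit_kb

def after_each_tuple():
    # Called from the bolt's process() loop: finish the current tuple,
    # then exit so the supervisor restarts a fresh worker/subprocess.
    if memory_exceeded():
        sys.exit(42)  # arbitrary non-zero code, distinguishable from a crash
```

As the thread notes, to the parent this still just looks like a broken pipe; the only real difference from an OOM is that the exit is taken at a tuple boundary and with a chosen status code.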
