Really good news here. I dug into this issue more -- including doing a detailed analysis with stack traces that I put up on Github:
https://gist.github.com/amontalenti/8ff0c31a7b95a6dea3d2 (now updated with the text of the e-mail below, since that exploration was obsoleted by the fix)

The issue had nothing to do with Storm and everything to do with Ubuntu 14.04 and its interaction with the Xen network kernel drivers in EC2.

I was staring at the results of this research and thinking, "What could possibly cause the network subsystem of Storm to just hang?"

My first impulse: firewalls. Maybe as the network traffic ramped up, I was running up against a firewall rule? I checked our munin monitoring graphs and noticed a bunch of eth0 errors correlated with our topologies running. I checked our production Storm 0.8.2 cluster -- no errors there. Ah hah! It must be firewall rules or something!

That led me to run dmesg on the supervisor nodes, where I found a bunch of entries like this:

xen_netfront: xennet: skb rides the rocket: 20 slots
xen_netfront: xennet: skb rides the rocket: 19 slots

That's odd. I also saw some entries related to ufw (Ubuntu's firewall service), so I tried running `ufw disable`. No change.

I then dug deeper into these error messages and came across this open bug on Launchpad:

https://bugs.launchpad.net/ubuntu/+source/linux-lts-raring/+bug/1195474

The current workaround described there is to run:

sudo ethtool -K eth0 sg off

on the affected server. I issued that command, restarted my topology, and VOILA, the Storm topology is *now running at full performance*. (See the p.p.s. below the quoted thread at the bottom of this mail for a sketch of making that setting persist across reboots.)

Back in my earliest days as a professional programmer, I had a friend named Jimmy. I once spent 3 days debugging a JVM garbage collection issue with him. We ran profilers, did detailed code traces, extensive logging, etc. In the end, the fix was a one-line code change -- a mistaken allocation of an expensive object in a tight loop.

At that moment, I coined "Jimmy's Law": "The amount of time it takes to discover a bug's fix is inversely proportional to the number of lines changed by the fix, with infinite time converging to one line."

After hours of investigating and debugging this issue, that's certainly how I feel. Shame on me for upgrading my Storm cluster and my Ubuntu version simultaneously!

Now for the really good news: I'm 14 million tuples into my Storm 0.9.2-incubating cluster (running with Netty) and everything is humming along, running fast. 92 tasks, 8 workers, 2 supervisors. My simplest Python bolt has 1.10ms process latencies -- some of the fastest I've seen.

Thanks for the help investigating, and here's to an awesome 0.9.2 release!

p.s. I'm glad that *something* positive came out of this, at least -- my contribution to sync up the storm-0mq driver for those who prefer it. I'm glad to continue helping with that, if for no other reason than to have a reliable second transport so Storm community members can debug *actual* Netty issues they may come across.

On Thu, Jun 19, 2014 at 8:53 PM, P. Taylor Goetz <[email protected]> wrote:
> Okay. Keep me posted. I still plan on looking at and testing your patch to
> storm-0mq, but probably won't get to that until early next week.
>
> -Taylor
>
> On Jun 19, 2014, at 7:43 PM, Andrew Montalenti <[email protected]> wrote:
>
> FYI, the issue happened with both zmq and netty transports. We will
> investigate more tomorrow. We think the issue only happens with more than
> one supervisor and multiple workers.
> On Jun 19, 2014 7:32 PM, "P. Taylor Goetz" <[email protected]> wrote:
>
>> Hi Andrew,
>>
>> Thanks for pointing this out. I agree with your point about bit rot.
>>
>> However, we had to remove the 0mq transport due to license
>> incompatibilities with Apache, so any kind of release test suite would have
>> to be maintained outside of Apache since it would likely pull in
>> LGPL-licensed dependencies. So if something like you're suggesting could be
>> accomplished in the storm-0mq project, that would be the best option.
>>
>> I'm open to pull requests, help, contributions, etc. to storm-0mq. It
>> just can't be part of Apache.
>>
>> I'll test out your changes to storm-0mq to see if I can reproduce the
>> issue you're seeing. As Nathan mentioned, any additional information
>> (thread dumps, etc.) you could provide would help.
>>
>> Thanks (and sorry for the inconvenience),
>>
>> Taylor
>>
>>
>> On Jun 19, 2014, at 6:09 PM, Andrew Montalenti <[email protected]>
>> wrote:
>>
>> Another interesting 0.9.2 issue I came across: the IConnection interface
>> has changed, meaning any pluggable transports no longer work without a code
>> change.
>>
>> I implemented changes to storm-0mq to make it compatible with this
>> interface change in my fork here:
>>
>> https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master
>>
>> I tested that and it nominally works in distributed mode with two
>> independent workers in my cluster. I don't know what the performance impact
>> of the interface change is.
>>
>> I get that zmq is no longer part of Storm core, but maintaining a stable
>> interface for pluggable components like this transport is probably
>> something that should be in the release test suite. Otherwise bit rot will
>> take its toll. I am glad to volunteer help with this.
>>
>> My team is now debugging an issue where Storm stops asking our spout for
>> next tuples after a while of running the topology, causing the topology to
>> basically freeze with no errors in the logs. At first blush, it seems like a
>> regression from 0.9.1. But we'll have more detailed info once we isolate
>> some variables soon.
>> On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <[email protected]> wrote:
>>
>>> I built the v0.9.2-incubating rc-3 locally and, once I verified that it
>>> worked for our topology, pushed it into our cluster. So far, so good.
>>>
>>> One thing for the community to be aware of: if you try to upgrade an
>>> existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit
>>> exceptions on nimbus/supervisor startup about stormcode.ser/stormconf.ser.
>>>
>>> The issue is that the new cluster will try to re-submit the topologies
>>> that were already running before the upgrade. These will fail because
>>> Storm's Clojure version has been upgraded from 1.4 -> 1.5, so the
>>> serialization formats & IDs have changed. The same would be true whenever
>>> serial IDs change for any classes that happen to end up in these .ser files
>>> (stormconf.ser & stormcode.ser, as defined in Storm's internal config
>>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/config.clj#L143-L153>
>>> ).
>>>
>>> The solution is to clear out the storm data directories on your worker
>>> nodes/nimbus nodes and restart the cluster.
>>>
>>> I have some open source tooling that submits topologies to the nimbus
>>> using StormSubmitter.
>>> This upgrade also made me realize that, due to the use
>>> of serialized Java files
>>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/utils/Utils.java#L73-L97>,
>>> it is very important that the StormSubmitter class used for submitting and
>>> the running Storm cluster be precisely the same version / classpath. I
>>> describe this more in the GH issue here:
>>>
>>> https://github.com/Parsely/streamparse/issues/27
>>>
>>> I wonder if it's worth considering a less finicky
>>> serialization format within Storm itself. Would that change be welcome as a
>>> pull request?
>>>
>>> It would make it easier to script Storm clusters without worrying about
>>> client/server Storm version mismatches, which I presume was the
>>> original reasoning behind putting Storm functionality behind a Thrift API
>>> anyway. And it would prevent crashed topologies during minor Storm version
>>> upgrades.
>>
>>
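
p.p.s. One practical footnote on the `ethtool -K eth0 sg off` workaround above: ethtool settings do not survive a reboot, so a supervisor node that cycles will come back with the same problem. Here is a minimal sketch of one way to make it stick on Ubuntu 14.04 -- assuming eth0 is the affected interface and is managed by ifupdown via /etc/network/interfaces (interface names and the exact stanza will vary on your boxes):

    # /etc/network/interfaces on each affected supervisor node
    auto eth0
    iface eth0 inet dhcp
        # turn scatter-gather off once the interface is up, so skbs stop
        # needing more ring slots than xen_netfront can handle
        post-up ethtool -K eth0 sg off

A reboot (or `ifdown eth0 && ifup eth0` -- careful if you're SSH'd in over eth0) re-applies it, and `ethtool -k eth0` should then report scatter-gather: off. Treat this as a sketch rather than a blessed fix; the Launchpad bug above is the place to watch for a proper kernel-side change.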

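p.p.p.s. And a footnote for anyone who hits the stormcode.ser/stormconf.ser exceptions described in the Jun 18 mail quoted above. "Clear out the storm data directories" boiled down to roughly the following for us -- purely a sketch, assuming storm.local.dir points at /mnt/storm (check the storm.yaml on your nimbus and worker nodes first; your path will almost certainly differ):

    # on nimbus and every supervisor node, with the Storm daemons stopped
    grep local.dir /path/to/storm/conf/storm.yaml   # confirm where storm.local.dir points
    sudo rm -rf /mnt/storm/*                        # drops the stale serialized topology state
    # then restart nimbus and the supervisors, and re-submit your topologies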