Hey All, I dug into the Netty Client.java and the various versions/tags of the file. It looks like there were two commits that attempted to fix this issue:
https://github.com/nathanmarz/storm/commit/213102b36f890 and then
https://github.com/nathanmarz/storm/commit/c638db0e88e3c56f808c8a76a88f94d7bf1988c4

It looks like the affected method is getSleepTimeMs(). In the 0.9.0.1 tag
(https://github.com/nathanmarz/storm/blob/0.9.0.1/storm-netty/src/jvm/backtype/storm/messaging/netty/Client.java)
the method is:

    private int getSleepTimeMs() {
        int backoff = 1 << retries.get();
        int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
        if (sleepMs > max_sleep_ms)
            sleepMs = max_sleep_ms;
        return sleepMs;
    }

I put together a simple test which demonstrates the method is still broken: it is
still possible to overflow sleepMs and end up with a large negative timeout (for
example, once retries reaches 25, base_sleep_ms * random.nextInt(1 << 25) can
exceed Integer.MAX_VALUE and wrap negative):

    private static int getSleepTimeMs(int retries, int base_sleep_ms, int max_sleep_ms, Random random) {
        int backoff = 1 << retries;
        int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
        if (sleepMs > max_sleep_ms)
            sleepMs = max_sleep_ms;
        return sleepMs;
    }

    public static void main(String[] args) throws Exception {
        Random random = new Random();
        int base_sleep_ms = 100;
        int max_sleep_ms = 1000;
        for (int i = 0; i < 30; i++) {
            System.out.println(getSleepTimeMs(i, base_sleep_ms, max_sleep_ms, random));
        }
    }

To fix the issue, a few of the ints should be converted to longs (a rough sketch of
one way to do that is below the quoted thread). I'll send a pull request in a few.

On Mon, Mar 3, 2014 at 11:17 AM, Drew Goya <[email protected]> wrote:

> Thanks for sharing your experiences guys, we will be heading back to 0mq
> as well. It's a shame, as we really got some nice throughput improvements
> with Netty.
>
>
> On Sun, Mar 2, 2014 at 5:18 PM, Michael Rose <[email protected]> wrote:
>
>> Right now we're seeing slow, off-heap memory leaks; it's unknown (yet)
>> whether these are linked to Netty. When the workers inevitably get OOMed,
>> the topology rarely recovers gracefully, with similar Netty timeouts.
>> Sounds like we'll be heading back to 0mq.
>>
>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>> [email protected]
>>
>>
>> On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <[email protected]> wrote:
>>
>>> We have the same issue, and after attempting a few fixes we switched
>>> back to using 0mq for now.
>>>
>>>
>>> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <[email protected]> wrote:
>>>
>>>> Hey All, I'm running a 0.9.0.1 Storm topology in AWS EC2 and I
>>>> occasionally run into a strange and pretty catastrophic error. One of my
>>>> workers is either overloaded or stuck and gets killed and restarted. This
>>>> usually works fine, but once in a while the whole topology breaks down:
>>>> all the workers are killed and restarted continually. Looking through the
>>>> logs, it looks like some Netty errors on initialization kill the Async
>>>> Loop. The topology is never able to recover; I have to kill it manually
>>>> and relaunch it.
>>>>
>>>> Is this something anyone else has come across? Any tips? Config
>>>> settings I could change?
>>>>
>>>> This is a pastebin of the errors: http://pastebin.com/XXZBsEj1
>>>
>>>
>>> --
>>>
>>> Ce n'est pas une signature
>>
>>
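For what it's worth, here's a rough sketch of what a long-based getSleepTimeMs()
could look like. This is just an illustration under my own assumptions (the
BackoffSketch wrapper class, the hard-coded field values, and the cap on the
shift are mine; it is not the actual pull request). The field names mirror the
ones in Client.java above:

    import java.util.Random;
    import java.util.concurrent.atomic.AtomicInteger;

    public class BackoffSketch {
        // Hypothetical stand-ins for the fields used by Client.java above.
        private final AtomicInteger retries = new AtomicInteger(0);
        private final Random random = new Random();
        private final int base_sleep_ms = 100;
        private final int max_sleep_ms = 1000;

        // Do the backoff arithmetic in long so the multiplication can't wrap
        // around into a negative value.
        private int getSleepTimeMs() {
            // Cap the exponent: Random.nextInt() needs a positive int bound,
            // and 2^30 is already far beyond any sane backoff window.
            int shift = Math.min(retries.get(), 30);
            int backoff = 1 << shift;  // <= 2^30, always positive
            long sleepMs = (long) base_sleep_ms * Math.max(1, random.nextInt(backoff));
            if (sleepMs > max_sleep_ms) {
                sleepMs = max_sleep_ms;
            }
            return (int) sleepMs;  // <= max_sleep_ms, so the cast is safe
        }

        public static void main(String[] args) {
            BackoffSketch sketch = new BackoffSketch();
            for (int i = 0; i < 40; i++) {
                sketch.retries.set(i);
                System.out.println(sketch.getSleepTimeMs()); // never negative now
            }
        }
    }

The long only matters for the multiplication; the shift still has to be capped
anyway because Random.nextInt() takes an int bound, and without the cap
1 << retries itself goes negative once retries hits 31.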
