Pull request sent: https://github.com/apache/incubator-storm/pull/41
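For anyone following along, here is a rough standalone sketch of the kind of change I mean (capping the shift and doing the backoff arithmetic in long so nothing wraps to a negative int). It mirrors the test further down in this thread and is only meant to illustrate the approach, not the exact diff in the pull request; the class and variable names are just for the example:

import java.util.Random;

public class BackoffSketch {

    // Sketch of the long-based fix: cap the exponent and widen the
    // multiplication so neither the shift nor the product can overflow int.
    private static int getSleepTimeMs(int retries, int base_sleep_ms,
                                      int max_sleep_ms, Random random)
    {
        int shift = Math.min(retries, 30);   // 1 << n overflows int for n >= 31
        int backoff = 1 << shift;            // at most 2^30, still positive
        long sleepMs = (long) base_sleep_ms * Math.max(1, random.nextInt(backoff));
        if (sleepMs > max_sleep_ms)
            sleepMs = max_sleep_ms;
        return (int) sleepMs;                // capped at max_sleep_ms, safe to narrow
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Same loop as the original test: every value printed now stays
        // between base_sleep_ms and max_sleep_ms, even for large retry counts.
        for (int i = 0; i < 40; i++) {
            System.out.println(getSleepTimeMs(i, 100, 1000, random));
        }
    }
}

With the shift capped and the product computed as a long, the sleep time can never go negative no matter how many retries have accumulated.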
On Mon, Mar 3, 2014 at 12:03 PM, Drew Goya <[email protected]> wrote:
> Hey All, dug into the netty Client.java and the various versions/tags of the
> file. So it looks like there were two commits which attempted to fix this issue:
>
> https://github.com/nathanmarz/storm/commit/213102b36f890
>
> and then
>
> https://github.com/nathanmarz/storm/commit/c638db0e88e3c56f808c8a76a88f94d7bf1988c4
>
> It looks like the affected method is getSleepTimeMs().
>
> In the 0.9.0 tag (
>
> In the 0.9.0.1 tag (
> https://github.com/nathanmarz/storm/blob/0.9.0.1/storm-netty/src/jvm/backtype/storm/messaging/netty/Client.java)
> the method is:
>
>     private int getSleepTimeMs()
>     {
>         int backoff = 1 << retries.get();
>         int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
>         if (sleepMs > max_sleep_ms)
>             sleepMs = max_sleep_ms;
>         return sleepMs;
>     }
>
> I put together a simple test which demonstrates the method is still broken: it
> is still possible to overflow sleepMs and end up with a large negative timeout:
>
>     private static int getSleepTimeMs(int retries, int base_sleep_ms,
>                                       int max_sleep_ms, Random random)
>     {
>         int backoff = 1 << retries;
>         int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
>         if (sleepMs > max_sleep_ms)
>             sleepMs = max_sleep_ms;
>         return sleepMs;
>     }
>
>     public static void main(String[] args) throws Exception {
>         Random random = new Random();
>         int base_sleep_ms = 100;
>         int max_sleep_ms = 1000;
>         for (int i = 0; i < 30; i++) {
>             System.out.println(getSleepTimeMs(i, base_sleep_ms, max_sleep_ms, random));
>         }
>     }
>
> To fix the issue a few of the integers should be converted to longs. I'll send
> a pull request in a few.
>
> On Mon, Mar 3, 2014 at 11:17 AM, Drew Goya <[email protected]> wrote:
>
>> Thanks for sharing your experiences guys, we will be heading back to 0mq as
>> well. It's a shame, as we really got some nice throughput improvements with
>> Netty.
>>
>> On Sun, Mar 2, 2014 at 5:18 PM, Michael Rose <[email protected]> wrote:
>>
>>> Right now we're having slow, off-heap memory leaks, unknown if these are
>>> linked to Netty (yet). When the workers inevitably get OOMed, the topology
>>> will rarely recover gracefully with similar Netty timeouts. Sounds like
>>> we'll be heading back to 0mq.
>>>
>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>> [email protected]
>>>
>>> On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <[email protected]> wrote:
>>>
>>>> We have the same issue, and after attempting a few fixes we switched back
>>>> to using 0mq for now.
>>>>
>>>> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <[email protected]> wrote:
>>>>
>>>>> Hey All, I'm running a 0.9.0.1 storm topology in AWS EC2 and I
>>>>> occasionally run into a strange and pretty catastrophic error. One of my
>>>>> workers is either overloaded or stuck and gets killed and restarted. This
>>>>> usually works fine, but once in a while the whole topology breaks down:
>>>>> all the workers are killed and restarted continually. Looking through the
>>>>> logs, it looks like some netty errors on initialization kill the Async
>>>>> Loop. The topology is never able to recover; I have to kill it manually
>>>>> and relaunch it.
>>>>>
>>>>> Is this something anyone else has come across? Any tips? Config settings
>>>>> I could change?
>>>>>
>>>>> This is a pastebin of the errors: http://pastebin.com/XXZBsEj1
>>>>
>>>> --
>>>> Ce n'est pas une signature
