Hey All, I dug into the Netty Client.java and the various versions/tags of the file. It looks like there were two commits that attempted to fix this issue:
https://github.com/nathanmarz/storm/commit/213102b36f890 and then
https://github.com/nathanmarz/storm/commit/c638db0e88e3c56f808c8a76a88f94d7bf1988c4

It looks like the affected method is getSleepTimeMs(). In the 0.9.0.1 tag
(https://github.com/nathanmarz/storm/blob/0.9.0.1/storm-netty/src/jvm/backtype/storm/messaging/netty/Client.java)
the method is:

    private int getSleepTimeMs() {
        int backoff = 1 << retries.get();
        int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
        if (sleepMs > max_sleep_ms)
            sleepMs = max_sleep_ms;
        return sleepMs;
    }

I put together a simple test which demonstrates the method is still broken: it is
still possible to overflow sleepMs and end up with a large negative timeout (for
example, once retries reaches 25, base_sleep_ms * random.nextInt(1 << 25) can
exceed Integer.MAX_VALUE and wrap negative):

    private static int getSleepTimeMs(int retries, int base_sleep_ms, int max_sleep_ms, Random random) {
        int backoff = 1 << retries;
        int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
        if (sleepMs > max_sleep_ms)
            sleepMs = max_sleep_ms;
        return sleepMs;
    }

    public static void main(String[] args) throws Exception {
        Random random = new Random();
        int base_sleep_ms = 100;
        int max_sleep_ms = 1000;
        for (int i = 0; i < 30; i++) {
            System.out.println(getSleepTimeMs(i, base_sleep_ms, max_sleep_ms, random));
        }
    }

To fix the issue, a few of the ints should be converted to longs (a rough sketch of
one way to do that is below the quoted thread). I'll send a pull request in a few.

On Mon, Mar 3, 2014 at 11:17 AM, Drew Goya <[email protected]> wrote:

> Thanks for sharing your experiences guys, we will be heading back to 0mq
> as well. It's a shame, as we really got some nice throughput improvements
> with Netty.
>
>
> On Sun, Mar 2, 2014 at 5:18 PM, Michael Rose <[email protected]> wrote:
>
>> Right now we're seeing slow, off-heap memory leaks; it's unknown (yet)
>> whether these are linked to Netty. When the workers inevitably get OOMed,
>> the topology rarely recovers gracefully, with similar Netty timeouts.
>> Sounds like we'll be heading back to 0mq.
>>
>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>> [email protected]
>>
>>
>> On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <[email protected]> wrote:
>>
>>> We have the same issue, and after attempting a few fixes we switched
>>> back to using 0mq for now.
>>>
>>>
>>> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <[email protected]> wrote:
>>>
>>>> Hey All, I'm running a 0.9.0.1 Storm topology in AWS EC2 and I
>>>> occasionally run into a strange and pretty catastrophic error. One of my
>>>> workers is either overloaded or stuck and gets killed and restarted. This
>>>> usually works fine, but once in a while the whole topology breaks down:
>>>> all the workers are killed and restarted continually. Looking through the
>>>> logs, it looks like some Netty errors on initialization kill the Async
>>>> Loop. The topology is never able to recover; I have to kill it manually
>>>> and relaunch it.
>>>>
>>>> Is this something anyone else has come across? Any tips? Config
>>>> settings I could change?
>>>>
>>>> This is a pastebin of the errors: http://pastebin.com/XXZBsEj1
>>>
>>>
>>> --
>>>
>>> Ce n'est pas une signature
>>
>>
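For what it's worth, here's a rough sketch of what a long-based getSleepTimeMs()
could look like. This is just an illustration under my own assumptions (the
BackoffSketch wrapper class, the hard-coded field values, and the cap on the
shift are mine; it is not the actual pull request). The field names mirror the
ones in Client.java above:

    import java.util.Random;
    import java.util.concurrent.atomic.AtomicInteger;

    public class BackoffSketch {
        // Hypothetical stand-ins for the fields used by Client.java above.
        private final AtomicInteger retries = new AtomicInteger(0);
        private final Random random = new Random();
        private final int base_sleep_ms = 100;
        private final int max_sleep_ms = 1000;

        // Do the backoff arithmetic in long so the multiplication can't wrap
        // around into a negative value.
        private int getSleepTimeMs() {
            // Cap the exponent: Random.nextInt() needs a positive int bound,
            // and 2^30 is already far beyond any sane backoff window.
            int shift = Math.min(retries.get(), 30);
            int backoff = 1 << shift;  // <= 2^30, always positive
            long sleepMs = (long) base_sleep_ms * Math.max(1, random.nextInt(backoff));
            if (sleepMs > max_sleep_ms) {
                sleepMs = max_sleep_ms;
            }
            return (int) sleepMs;  // <= max_sleep_ms, so the cast is safe
        }

        public static void main(String[] args) {
            BackoffSketch sketch = new BackoffSketch();
            for (int i = 0; i < 40; i++) {
                sketch.retries.set(i);
                System.out.println(sketch.getSleepTimeMs()); // never negative now
            }
        }
    }

The long only matters for the multiplication; the shift still has to be capped
anyway because Random.nextInt() takes an int bound, and without the cap
1 << retries itself goes negative once retries hits 31.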
