Pull request sent: https://github.com/apache/incubator-storm/pull/41
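For anyone following along, here is a rough standalone sketch of the kind of change I mean (capping the shift and doing the backoff arithmetic in long so nothing wraps to a negative int). It mirrors the test further down in this thread and is only meant to illustrate the approach, not the exact diff in the pull request; the class and variable names are just for the example:

import java.util.Random;

public class BackoffSketch {

    // Sketch of the long-based fix: cap the exponent and widen the
    // multiplication so neither the shift nor the product can overflow int.
    private static int getSleepTimeMs(int retries, int base_sleep_ms,
                                      int max_sleep_ms, Random random)
    {
        int shift = Math.min(retries, 30);   // 1 << n overflows int for n >= 31
        int backoff = 1 << shift;            // at most 2^30, still positive
        long sleepMs = (long) base_sleep_ms * Math.max(1, random.nextInt(backoff));
        if (sleepMs > max_sleep_ms)
            sleepMs = max_sleep_ms;
        return (int) sleepMs;                // capped at max_sleep_ms, safe to narrow
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Same loop as the original test: every value printed now stays
        // between base_sleep_ms and max_sleep_ms, even for large retry counts.
        for (int i = 0; i < 40; i++) {
            System.out.println(getSleepTimeMs(i, 100, 1000, random));
        }
    }
}

With the shift capped and the product computed as a long, the sleep time can never go negative no matter how many retries have accumulated.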
On Mon, Mar 3, 2014 at 12:03 PM, Drew Goya <[email protected]> wrote:
> Hey All, dug into the netty Client.java and the various versions/tags of the
> file. So it looks like there were two commits which attempted to fix this issue:
>
> https://github.com/nathanmarz/storm/commit/213102b36f890
>
> and then
>
> https://github.com/nathanmarz/storm/commit/c638db0e88e3c56f808c8a76a88f94d7bf1988c4
>
> It looks like the affected method is getSleepTimeMs().
>
> In the 0.9.0 tag (
>
> In the 0.9.0.1 tag (
> https://github.com/nathanmarz/storm/blob/0.9.0.1/storm-netty/src/jvm/backtype/storm/messaging/netty/Client.java)
> the method is:
>
>     private int getSleepTimeMs()
>     {
>         int backoff = 1 << retries.get();
>         int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
>         if (sleepMs > max_sleep_ms)
>             sleepMs = max_sleep_ms;
>         return sleepMs;
>     }
>
> I put together a simple test which demonstrates the method is still broken: it
> is still possible to overflow sleepMs and end up with a large negative timeout:
>
>     private static int getSleepTimeMs(int retries, int base_sleep_ms,
>                                       int max_sleep_ms, Random random)
>     {
>         int backoff = 1 << retries;
>         int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
>         if (sleepMs > max_sleep_ms)
>             sleepMs = max_sleep_ms;
>         return sleepMs;
>     }
>
>     public static void main(String[] args) throws Exception {
>         Random random = new Random();
>         int base_sleep_ms = 100;
>         int max_sleep_ms = 1000;
>         for (int i = 0; i < 30; i++) {
>             System.out.println(getSleepTimeMs(i, base_sleep_ms, max_sleep_ms, random));
>         }
>     }
>
> To fix the issue a few of the integers should be converted to longs. I'll send
> a pull request in a few.
>
> On Mon, Mar 3, 2014 at 11:17 AM, Drew Goya <[email protected]> wrote:
>
>> Thanks for sharing your experiences guys, we will be heading back to 0mq as
>> well. It's a shame, as we really got some nice throughput improvements with
>> Netty.
>>
>> On Sun, Mar 2, 2014 at 5:18 PM, Michael Rose <[email protected]> wrote:
>>
>>> Right now we're having slow, off-heap memory leaks, unknown if these are
>>> linked to Netty (yet). When the workers inevitably get OOMed, the topology
>>> will rarely recover gracefully with similar Netty timeouts. Sounds like
>>> we'll be heading back to 0mq.
>>>
>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>> [email protected]
>>>
>>> On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <[email protected]> wrote:
>>>
>>>> We have the same issue, and after attempting a few fixes we switched back
>>>> to using 0mq for now.
>>>>
>>>> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <[email protected]> wrote:
>>>>
>>>>> Hey All, I'm running a 0.9.0.1 storm topology in AWS EC2 and I
>>>>> occasionally run into a strange and pretty catastrophic error. One of my
>>>>> workers is either overloaded or stuck and gets killed and restarted. This
>>>>> usually works fine, but once in a while the whole topology breaks down:
>>>>> all the workers are killed and restarted continually. Looking through the
>>>>> logs, it looks like some netty errors on initialization kill the Async
>>>>> Loop. The topology is never able to recover; I have to kill it manually
>>>>> and relaunch it.
>>>>>
>>>>> Is this something anyone else has come across? Any tips? Config settings
>>>>> I could change?
>>>>>
>>>>> This is a pastebin of the errors: http://pastebin.com/XXZBsEj1
>>>>
>>>> --
>>>> Ce n'est pas une signature
