Hey All, I've been running into network issues on the Amazon network over the last couple of days. This has exposed an unexpected behavior in my storm topology. It looks like Netty exceptions can kill a workers async loop and bring it down. Is this a bug? Is there a configuration setting I can use to change this behavior? I'm seeing exceptions like this:
2014-01-31 16:56:35 STDIO [ERROR] Jan 31, 2014 4:56:35 PM org.jboss.netty.channel.DefaultChannelPipeline WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x2a528847] EXCEPTION: java.net.ConnectException: Connection refused) java.lang.IllegalArgumentException: timeout value is negative at java.lang.Thread.sleep(Native Method) at backtype.storm.messaging.netty.Client.reconnect(Client.java:78) at backtype.storm.messaging.netty.StormClientHandler.exceptionCaught(StormClientHandler.java:108) at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377) at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:525) at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:109) at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:78) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) 2014-01-31 16:56:42 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client. 2014-01-31 16:56:43 c.g.s.c.s.LwesSpout [WARN] Event queue depth: count: 10000, total: 4419, [min, average, max]=[0, 0.44, 41] 2014-01-31 16:56:45 STDIO [ERROR] Jan 31, 2014 4:56:45 PM org.jboss.netty.channel.DefaultChannelPipeline WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x2b41f372] EXCEPTION: java.net.ConnectException: Connection refused) java.lang.IllegalArgumentException: timeout value is negative at java.lang.Thread.sleep(Native Method) at backtype.storm.messaging.netty.Client.reconnect(Client.java:78) at backtype.storm.messaging.netty.StormClientHandler.exceptionCaught(StormClientHandler.java:108) at org.jboss.netty.handler.codec.frame.FrameDecoder.exceptionCaught(FrameDecoder.java:377) at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:525) at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:109) at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:78) at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) 2014-01-31 16:56:45 b.s.util [ERROR] Async loop died! java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:90) ~[storm-core-0.9.0.1.jar:na] at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:61) ~[storm-core-0.9.0.1.jar:na] at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62) ~[storm-core-0.9.0.1.jar:na] at backtype.storm.disruptor$consume_loop_STAR_$fn__2975.invoke(disruptor.clj:74) ~[storm-core-0.9.0.1.jar:na] at backtype.storm.util$async_loop$fn__444.invoke(util.clj:403) ~[storm-core-0.9.0.1.jar:na] at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na] at java.lang.Thread.run(Thread.java:722) [na:1.7.0_07] Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more at backtype.storm.messaging.netty.Client.send(Client.java:109) ~[storm-netty-0.9.0.1.jar:na] at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5867$fn__5868.invoke(worker.clj:304) ~[storm-core-0.9.0.1.jar:na] at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5867.invoke(worker.clj:293) ~[storm-core-0.9.0.1.jar:na] at backtype.storm.disruptor$clojure_handler$reify__2962.onEvent(disruptor.clj:43) ~[storm-core-0.9.0.1.jar:na] at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:87) ~[storm-core-0.9.0.1.jar:na]
