Got to the bottom of the problem and thought I'd share, and maybe gain more insight into why... One of the bolts that used to run in around 35ms was taking 450+ms to process a message. After capturing some timing inside the bolt, the culprit turned out to be collector.reportError(): it was taking 400+ms to report an error, and since a whole batch of messages was going down that route, the average time for this bolt came out at 450+ms. Switching back to logging the error to a file instead of to the UI brought the processing time back into the 35ms range. It may be that running with 1 netty thread is causing this, or is reporting an error just a costly operation in general?
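In case it helps, here is roughly what the measurement and the switch back to file logging look like. This is a stripped-down sketch rather than the actual bolt: the class name TimedErrorBolt, the REPORT_TO_UI flag and the empty processing block are made up for illustration.

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class TimedErrorBolt extends BaseRichBolt {

    private static final Logger LOG = LoggerFactory.getLogger(TimedErrorBolt.class);

    // Hypothetical switch between reporting to the UI and logging to the worker log file.
    private static final boolean REPORT_TO_UI = false;

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // ... actual message processing goes here ...
            collector.ack(tuple);
        } catch (Exception e) {
            if (REPORT_TO_UI) {
                // reportError() publishes the error so it shows up in the Storm UI;
                // timing this call is how the 400+ms was found.
                long start = System.currentTimeMillis();
                collector.reportError(e);
                LOG.debug("reportError took {} ms", System.currentTimeMillis() - start);
            } else {
                // Logging to the worker log file instead avoids that overhead.
                LOG.error("processing failed", e);
            }
            collector.fail(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams in this sketch
    }
}

With REPORT_TO_UI left false the execute() path stays in the ~35ms range; flipping it to true is where the 400+ms per message showed up.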
Thanks,
Tomas

On Wed, Jul 23, 2014 at 2:35 PM, Tomas Mazukna <[email protected]> wrote:

> No, it's on RHEL on real hardware...
> Although we went through a network upgrade a couple of days ago and all
> sorts of things started failing....
>
>
> On Wed, Jul 23, 2014 at 1:43 PM, Andrew Montalenti <[email protected]>
> wrote:
>
>> Tomas,
>>
>> You don't happen to be running Ubuntu 14.04 on a Xen kernel, do you? E.g.
>> on Amazon EC2.
>>
>> I discovered an issue where running Storm across many workers on that OS
>> led to me hitting an annoying network driver bug that would cause timeouts
>> and topology freezes like you are seeing. Check dmesg for odd messages
>> from your network stack. Just a guess.
>>
>> (copied from my reply to another similar thread)
>>
>> -AM
>> On Jul 23, 2014 10:07 AM, "Tomas Mazukna" <[email protected]>
>> wrote:
>>
>>> I am really puzzled why processing stopped in the topology.
>>> It looks like the acking threads all stopped communicating. The only
>>> hint I saw was this netty exception.
>>> Any hints on how to prevent this from happening again?
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Closing Netty Client
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9703
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Waiting for pending batchs to
>>> be sent with
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9703...,
>>> timeout: 600000ms, pendings: 0
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Closing Netty Client
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9700
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Waiting for pending batchs to
>>> be sent with
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9700...,
>>> timeout: 600000ms, pendings: 0
>>>
>>> 2014-07-23 08:56:03 b.s.util [ERROR] Async loop died!
>>>
>>> java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>>>     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>>>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
>>> Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>>>     at backtype.storm.messaging.netty.Client.send(Client.java:194) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     ... 6 common frames omitted
>>>
>>> 2014-07-23 08:56:03 b.s.util [INFO] Halting process: ("Async loop died!")
>>>
>>>
>>> Configuration:
>>>
>>> worker.childopts: "-Xmx2048m -Xss256k -XX:MaxPermSize=256m
>>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>>> -XX:NewSize=128m -XX:CMSInitiatingOccupancyFraction=70
>>> -XX:-CMSConcurrentMTEnabled -Djava.net.preferIPv4Stack=true"
>>>
>>> supervisor.childopts: "-Xmx256m -Djava.net.preferIPv4Stack=true"
>>>
>>> nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
>>>
>>> ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
>>>
>>> nimbus.thrift.threads: 256
>>>
>>>
>>> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>
>>> storm.messaging.netty.server_worker_threads: 1
>>>
>>> storm.messaging.netty.client_worker_threads: 1
>>>
>>> storm.messaging.netty.buffer_size: 5242880
>>>
>>> storm.messaging.netty.max_retries: 100
>>>
>>> storm.messaging.netty.max_wait_ms: 1000
>>>
>>> storm.messaging.netty.min_wait_ms: 100
>>>
>>>
>>> Thanks,
>>> --
>>> Tomas Mazukna
>>> 678-557-3834
>>>
>>
>
>
> --
> Tomas Mazukna
> 678-557-3834
>

--
Tomas Mazukna
678-557-3834
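P.S. For anyone who wants to test the "1 netty thread" theory from the top of this thread: the change I have in mind is bumping these two settings in storm.yaml. A sketch only; 2 is just an arbitrary value to try, not a verified fix, and everything else stays as in the config quoted above.

storm.messaging.netty.server_worker_threads: 2
storm.messaging.netty.client_worker_threads: 2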
