Got to the bottom of the problem and thought I'd share, and maybe gain more insight into why... One of the bolts that used to run in around 35ms was taking 450+ms to process a message. After capturing some timing inside the bolt, the culprit turned out to be collector.reportError(): it was taking 400+ms to report an error, and since a whole batch of messages was going down that route, the average time for this bolt came out at 450+ms. Switching back to logging the error to a file instead of to the UI brought the processing time back into the 35ms range. It may be that running with 1 netty thread is causing this, or is reporting an error just a costly operation in general?
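In case it helps, here is roughly what the measurement and the switch back to file logging look like. This is a stripped-down sketch rather than the actual bolt: the class name TimedErrorBolt, the REPORT_TO_UI flag and the empty processing block are made up for illustration.

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class TimedErrorBolt extends BaseRichBolt {

    private static final Logger LOG = LoggerFactory.getLogger(TimedErrorBolt.class);

    // Hypothetical switch between reporting to the UI and logging to the worker log file.
    private static final boolean REPORT_TO_UI = false;

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // ... actual message processing goes here ...
            collector.ack(tuple);
        } catch (Exception e) {
            if (REPORT_TO_UI) {
                // reportError() publishes the error so it shows up in the Storm UI;
                // timing this call is how the 400+ms was found.
                long start = System.currentTimeMillis();
                collector.reportError(e);
                LOG.debug("reportError took {} ms", System.currentTimeMillis() - start);
            } else {
                // Logging to the worker log file instead avoids that overhead.
                LOG.error("processing failed", e);
            }
            collector.fail(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams in this sketch
    }
}

With REPORT_TO_UI left false the execute() path stays in the ~35ms range; flipping it to true is where the 400+ms per message showed up.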
Thanks,
Tomas

On Wed, Jul 23, 2014 at 2:35 PM, Tomas Mazukna <[email protected]> wrote:

> No, it's on RHEL on real hardware...
> Although we went through a network upgrade a couple of days ago and all
> sorts of things started failing....
>
>
> On Wed, Jul 23, 2014 at 1:43 PM, Andrew Montalenti <[email protected]>
> wrote:
>
>> Tomas,
>>
>> You don't happen to be running Ubuntu 14.04 on a Xen kernel, do you? E.g.
>> on Amazon EC2.
>>
>> I discovered an issue where running Storm across many workers on that OS
>> led to me hitting an annoying network driver bug that would cause timeouts
>> and topology freezes like you are seeing. Check dmesg for odd messages
>> from your network stack. Just a guess.
>>
>> (copied from my reply to another similar thread)
>>
>> -AM
>> On Jul 23, 2014 10:07 AM, "Tomas Mazukna" <[email protected]>
>> wrote:
>>
>>> I am really puzzled why processing stopped in the topology.
>>> It looks like the acking threads all stopped communicating. The only
>>> hint I saw was this netty exception.
>>> Any hints on how to prevent this from happening again?
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Closing Netty Client
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9703
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Waiting for pending batchs to
>>> be sent with
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9703...,
>>> timeout: 600000ms, pendings: 0
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Closing Netty Client
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9700
>>>
>>> 2014-07-23 08:56:03 b.s.m.n.Client [INFO] Waiting for pending batchs to
>>> be sent with
>>> Netty-Client-ndhhadappp3.tsh.mis.mckesson.com/10.48.132.224:9700...,
>>> timeout: 600000ms, pendings: 0
>>>
>>> 2014-07-23 08:56:03 b.s.util [ERROR] Async loop died!
>>>
>>> java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>>>     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.disruptor$consume_loop_STAR_$fn__758.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.util$async_loop$fn__457.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
>>>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]
>>> Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>>>     at backtype.storm.messaging.netty.Client.send(Client.java:194) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927$fn__5928.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5927.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.disruptor$clojure_handler$reify__745.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
>>>     ... 6 common frames omitted
>>>
>>> 2014-07-23 08:56:03 b.s.util [INFO] Halting process: ("Async loop died!")
>>>
>>>
>>> Configuration:
>>>
>>> worker.childopts: "-Xmx2048m -Xss256k -XX:MaxPermSize=256m
>>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>>> -XX:NewSize=128m -XX:CMSInitiatingOccupancyFraction=70
>>> -XX:-CMSConcurrentMTEnabled -Djava.net.preferIPv4Stack=true"
>>>
>>> supervisor.childopts: "-Xmx256m -Djava.net.preferIPv4Stack=true"
>>>
>>> nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
>>>
>>> ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
>>>
>>> nimbus.thrift.threads: 256
>>>
>>>
>>> storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>>>
>>> storm.messaging.netty.server_worker_threads: 1
>>>
>>> storm.messaging.netty.client_worker_threads: 1
>>>
>>> storm.messaging.netty.buffer_size: 5242880
>>>
>>> storm.messaging.netty.max_retries: 100
>>>
>>> storm.messaging.netty.max_wait_ms: 1000
>>>
>>> storm.messaging.netty.min_wait_ms: 100
>>>
>>>
>>> Thanks,
>>> --
>>> Tomas Mazukna
>>> 678-557-3834
>>>
>>
>
>
> --
> Tomas Mazukna
> 678-557-3834
>

--
Tomas Mazukna
678-557-3834
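P.S. For anyone who wants to test the "1 netty thread" theory from the top of this thread: the change I have in mind is bumping these two settings in storm.yaml. A sketch only; 2 is just an arbitrary value to try, not a verified fix, and everything else stays as in the config quoted above.

storm.messaging.netty.server_worker_threads: 2
storm.messaging.netty.client_worker_threads: 2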
