Hi Justin,

Can you share your storm.yaml config file?

Do you have any firewall software running on any of the machines in your 
cluster?
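
In particular, make sure the ports Storm uses are open between all of the 
hosts. Assuming default settings, that is 6627 for the Nimbus Thrift port, 
6700-6703 for the worker slots on each supervisor, and 2181 for Zookeeper. 
The relevant storm.yaml keys look like this (values shown are the defaults):

    nimbus.thrift.port: 6627
    storm.zookeeper.port: 2181
    supervisor.slots.ports:
        - 6700
        - 6701
        - 6702
        - 6703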

- Taylor

On May 7, 2014, at 11:11 AM, Justin Workman <[email protected]> wrote:

> We have spent the better part of 2 weeks now trying to get a pretty basic 
> topology running across multiple nodes. I am sure I am missing something 
> simple, but for the life of me I cannot figure it out.
> 
> Here is the situation: I have 1 nimbus server and 5 supervisor servers, with 
> Zookeeper running on the nimbus server and 2 of the supervisor nodes. These 
> hosts are all virtual machines (4 CPUs, 8 GB RAM) running in an OpenStack 
> deployment. If all of the guests are running on the same physical 
> hypervisor, then the topology starts up just fine and runs without any 
> issues. However, if we spread the guests out over multiple hypervisors (in 
> the same OpenStack cluster), the topology never completely starts up. Things 
> start to run, and some messages are pulled off the spout, but nothing ever 
> makes it all the way through the topology and nothing is ever ack'd.
> 
> In the worker logs we get messages about reconnecting and eventually a 
> "Remote address is not reachable" error, followed by "Async loop died!". 
> This used to result in a NumberFormatException; reducing the Netty retries 
> from 30 to 10 resolved that error, and now we get the following (the Netty 
> settings I changed are shown below, after the stack trace):
> 
> 2014-05-07 09:00:51 b.s.m.n.Client [INFO] Reconnect ... [9]
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:53 b.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>         at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:107) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:78) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:77) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.disruptor$consume_loop_STAR_$fn__1577.invoke(disruptor.clj:89) ~[na:na]
>         at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433) ~[na:na]
>         at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
>         at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
> Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>         at backtype.storm.messaging.netty.Client.send(Client.java:125) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>         at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398$fn__4399.invoke(worker.clj:319) ~[na:na]
>         at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398.invoke(worker.clj:308) ~[na:na]
>         at backtype.storm.disruptor$clojure_handler$reify__1560.onEvent(disruptor.clj:58) ~[na:na]
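> 
> For reference, the Netty transport settings I have been adjusting look 
> something like this in storm.yaml (key names from the 0.9.1 defaults; 
> max_retries is the value I reduced from 30 to 10):
> 
>     storm.messaging.transport: "backtype.storm.messaging.netty.Context"
>     storm.messaging.netty.max_retries: 10
>     storm.messaging.netty.min_wait_ms: 100
>     storm.messaging.netty.max_wait_ms: 1000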
> 
> And in the supervisor logs we see errors about the workers timing out and 
> not starting up all the way, and we also see executor timeouts in the nimbus 
> logs. But we do not see any errors in the Zookeeper logs, and the Zookeeper 
> stats look fine.
> 
> There do not appear to be any real network issues: I can run a continuous 
> flood ping between the hosts, with varying packet sizes, with minimal 
> latency and no dropped packets. I have also tried adding all of the hosts to 
> the local hosts file on each machine, without any difference.
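> 
> Of course, ICMP does not exercise the worker TCP ports themselves; something 
> like the quick check below would confirm those directly (host names are 
> illustrative, the ports assume the default supervisor.slots.ports, and a 
> worker has to be listening for the connect to succeed):
> 
>     # Quick TCP reachability check against the default Storm worker ports.
>     import socket
> 
>     hosts = ["supervisor1", "supervisor2"]  # illustrative host names
>     ports = [6700, 6701, 6702, 6703]        # default supervisor.slots.ports
> 
>     for host in hosts:
>         for port in ports:
>             try:
>                 s = socket.create_connection((host, port), timeout=2)
>                 s.close()
>                 print("%s:%d reachable" % (host, port))
>             except (socket.error, socket.timeout) as e:
>                 print("%s:%d NOT reachable (%s)" % (host, port, e))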
> 
> I have also played with adjusting the different heartbeat timeouts and 
> intervals without any luck (the settings in question are sketched below), 
> and I have also deployed this same setup to a 5 node cluster on physical 
> hardware (24 cores, 64 GB RAM, and a lot of local disks), and we had the 
> same issue. The topology would start, but no data ever made it through the 
> topology.
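> 
> The heartbeat and timeout settings in question are along these lines (key 
> names and values from the 0.9.1 defaults.yaml; these are the defaults I 
> started from before experimenting):
> 
>     supervisor.worker.start.timeout.secs: 120
>     supervisor.worker.timeout.secs: 30
>     nimbus.task.timeout.secs: 30
>     nimbus.supervisor.timeout.secs: 60
>     task.heartbeat.frequency.secs: 3
>     worker.heartbeat.frequency.secs: 1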
> 
> The only way I have ever been able to get the topology to work is under 
> OpenStack when all guests are on the same physical hypervisor. I think I am 
> just missing something very obvious, but I am going in circles at this point 
> and could use some additional suggestions.
> 
> Thanks
> Justin
