We have spent the better part of two weeks now trying to get a pretty basic
topology running across multiple nodes. I am sure I am missing something
simple, but for the life of me I cannot figure it out.

Here is the situation: I have 1 nimbus server and 5 supervisor servers,
with ZooKeeper running on the nimbus server and two of the supervisor nodes.
These hosts are all virtual machines (4 vCPUs, 8 GB RAM each) running in an
OpenStack deployment. If all of the guests are running on the same physical
hypervisor, the topology starts up just fine and runs without any issues.
However, if we spread the guests out over multiple hypervisors (in the same
OpenStack cluster), the topology never completely starts up. Things start to
run and some messages are pulled off the spout, but nothing ever makes it
all the way through the topology and nothing is ever acked.
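
In case the exact layout matters, the storm.yaml on the nodes is essentially
stock; a rough sketch with placeholder hostnames (the real names are
different, and the worker ports are the assumed defaults) looks like this:

nimbus.host: "nimbus-01"          # placeholder name for the nimbus VM
storm.zookeeper.servers:
  - "nimbus-01"                   # ZooKeeper runs on the nimbus node ...
  - "supervisor-01"               # ... and on two of the supervisor nodes
  - "supervisor-02"
storm.local.dir: "/var/storm"     # assumed local state directory
supervisor.slots.ports:           # assumed default worker ports
  - 6700
  - 6701
  - 6702
  - 6703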

In the worker logs we see messages about reconnecting, eventually followed
by a "Remote address is not reachable" warning and an "Async loop died!"
error. This used to end in a NumberFormat exception; reducing the Netty
retries from 30 to 10 resolved the NumberFormat error (the relevant settings
are sketched after the stack trace below), and now we get the following:

2014-05-07 09:00:51 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable.
We will close this client.
2014-05-07 09:00:53 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
        at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:107) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:78) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:77) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.disruptor$consume_loop_STAR_$fn__1577.invoke(disruptor.clj:89) ~[na:na]
        at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433) ~[na:na]
        at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
        at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
        at backtype.storm.messaging.netty.Client.send(Client.java:125) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398$fn__4399.invoke(worker.clj:319) ~[na:na]
        at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398.invoke(worker.clj:308) ~[na:na]
        at backtype.storm.disruptor$clojure_handler$reify__1560.onEvent(disruptor.clj:58) ~[na:na]
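
For reference, the messaging settings after the retry change look roughly
like this (the retry count is the one change described above; the other
values are, as far as I can tell, the stock 0.9.1 defaults and are listed
only for completeness):

storm.messaging.transport: "backtype.storm.messaging.netty.Context"
storm.messaging.netty.max_retries: 10      # reduced from 30 as described above
storm.messaging.netty.min_wait_ms: 100     # believed default
storm.messaging.netty.max_wait_ms: 1000    # believed default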

In the supervisor logs we see errors about the workers timing out and not
starting up all the way, and we also see executor timeouts in the nimbus
logs. We do not see any errors in the ZooKeeper logs, however, and the
ZooKeeper stats look fine.

There do not appear to be any real network issues: I can run a continuous
flood ping between the hosts with varying packet sizes, with minimal latency
and no dropped packets. I have also tried adding all of the hosts to the
local hosts file on each machine, without any difference.
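
For what it is worth, the flood pings only exercise ICMP, while the errors
above are on the workers' Netty connections, so a TCP-level check of the
actual worker ports is the next thing I have in mind. Something along these
lines is what I mean (a sketch only: the hostname is a placeholder, the
ports assume the default supervisor.slots.ports of 6700-6703, and a port
will only answer while a worker is actually bound to it):

import java.net.InetSocketAddress;
import java.net.Socket;

public class WorkerPortCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder hostname; pass the supervisor host to check as the first argument.
        String host = args.length > 0 ? args[0] : "supervisor-01";
        // 6700-6703 are Storm's default supervisor.slots.ports; adjust to match storm.yaml.
        int[] ports = {6700, 6701, 6702, 6703};
        for (int port : ports) {
            Socket socket = new Socket();
            try {
                // Succeeds only if a worker is currently bound to the port and reachable.
                socket.connect(new InetSocketAddress(host, port), 2000);
                System.out.println(host + ":" + port + " reachable");
            } catch (Exception e) {
                System.out.println(host + ":" + port + " NOT reachable: " + e.getMessage());
            } finally {
                socket.close();
            }
        }
    }
}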

I have also played with adjusting the various heartbeat timeouts and
intervals without any luck, and I have deployed this same setup to a 5-node
cluster on physical hardware (24 cores, 64 GB RAM, and a lot of local
disks), and we had the same issue: the topology would start, but no data
ever made it through it.
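
The heartbeat and timeout settings I have been playing with are roughly
these (a sketch only; the exact values varied between attempts, and the
comments show what I believe are the stock defaults):

supervisor.worker.start.timeout.secs: 300   # default 120
supervisor.worker.timeout.secs: 60          # default 30
nimbus.task.launch.secs: 240                # default 120
nimbus.task.timeout.secs: 60                # default 30
nimbus.supervisor.timeout.secs: 120         # default 60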

The only way I have ever been able to get the topology to work is under
OpenStack when all guests are on the same physical hypervisor. I think I am
just missing something very obvious, but I am going in circles at this
point and could use some additional suggestions.

Thanks
Justin
