Hi Justin,

Can you share your storm.yaml config file?
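For reference, the settings that usually matter for a multi-node setup look something like the sketch below (the hostnames are just placeholders, not your actual nodes):

    # storm.yaml sketch - hostnames are placeholders
    storm.zookeeper.servers:
        - "zk-host-1"
        - "zk-host-2"
        - "zk-host-3"
    nimbus.host: "nimbus-host"

    # If nodes resolve each other's names inconsistently, pinning the
    # advertised hostname (or IP) per node can help workers find each other.
    storm.local.hostname: "this-nodes-resolvable-name"

    # Ports the workers bind to; these must be reachable between all
    # supervisor hosts (and from nimbus), i.e. not blocked by a firewall
    # or security group.
    supervisor.slots.ports:
        - 6700
        - 6701
        - 6702
        - 6703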
Do you have any firewall software running on any of the machines in your cluster?

- Taylor

On May 7, 2014, at 11:11 AM, Justin Workman <[email protected]> wrote:

> We have spent the better part of 2 weeks now trying to get a pretty basic
> topology running across multiple nodes. I am sure I am missing something
> simple, but for the life of me I cannot figure it out.
>
> Here is the situation: I have 1 nimbus server and 5 supervisor servers, with
> ZooKeeper running on the nimbus server and 2 of the supervisor nodes. These
> hosts are all virtual machines (4 CPUs, 8 GB RAM) running in an OpenStack
> deployment. If all of the guests are running on the same physical hypervisor,
> then the topology starts up just fine and runs without any issues. However,
> if we spread the guests out over multiple hypervisors (in the same OpenStack
> cluster), the topology never completely starts up. Things start to run and
> some messages are pulled off the spout, but nothing ever makes it all the way
> through the topology and nothing is ever ack'd.
>
> In the worker logs we get messages about reconnecting and eventually a
> "Remote host unreachable" error and "Async Loop Died". This used to result in
> a NumberFormatException; reducing the netty retries from 30 to 10 resolved
> the NumberFormat error, and now we get the following:
>
> 2014-05-07 09:00:51 b.s.m.n.Client [INFO] Reconnect ... [9]
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [9]
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:52 b.s.m.n.Client [INFO] Reconnect ... [10]
> 2014-05-07 09:00:52 b.s.m.n.Client [WARN] Remote address is not reachable. We will close this client.
> 2014-05-07 09:00:53 b.s.util [ERROR] Async loop died!
> java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>     at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:107) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>     at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:78) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>     at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:77) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>     at backtype.storm.disruptor$consume_loop_STAR_$fn__1577.invoke(disruptor.clj:89) ~[na:na]
>     at backtype.storm.util$async_loop$fn__384.invoke(util.clj:433) ~[na:na]
>     at clojure.lang.AFn.run(AFn.java:24) [clojure-1.4.0.jar:na]
>     at java.lang.Thread.run(Thread.java:662) [na:1.6.0_26]
> Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
>     at backtype.storm.messaging.netty.Client.send(Client.java:125) ~[storm-core-0.9.1-incubating.jar:0.9.1-incubating]
>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398$fn__4399.invoke(worker.clj:319) ~[na:na]
>     at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__4398.invoke(worker.clj:308) ~[na:na]
>     at backtype.storm.disruptor$clojure_handler$reify__1560.onEvent(disruptor.clj:58) ~[na:na]
>
> And in the supervisor logs we see errors about the workers timing out and not
> starting up all the way; we also see executor timeouts in the nimbus logs.
> But we do not see any errors in the ZooKeeper logs, and the ZooKeeper stats
> look fine.
>
> There do not appear to be any real network issues. I can run a continuous
> flood ping between the hosts, with varying packet sizes, with minimal latency
> and no dropped packets. I have also attempted to add all hosts to the local
> hosts files on each machine, without any difference.
>
> I have also played with adjusting the different heartbeat timeouts and
> intervals without any luck, and I have also deployed this same setup to a
> 5-node cluster on physical hardware (24 cores, 64 GB RAM, and a lot of local
> disks), and we had the same issue. The topology would start, but no data ever
> made it through the topology.
>
> The only way I have ever been able to get the topology to work is under
> OpenStack when all guests are on the same physical hypervisor. I think I am
> just missing something very obvious, but I am going in circles at this point
> and could use some additional suggestions.
>
> Thanks
> Justin
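For what it's worth, the retry behavior you mention tuning is controlled by the netty transport settings in storm.yaml; a rough sketch (the numbers here are only an example, not a recommendation):

    # Netty transport settings - values are illustrative only
    storm.messaging.transport: "backtype.storm.messaging.netty.Context"
    storm.messaging.netty.max_retries: 10      # the value you said you dropped from 30
    storm.messaging.netty.min_wait_ms: 100
    storm.messaging.netty.max_wait_ms: 1000
    storm.messaging.netty.buffer_size: 5242880 # 5 MB

That said, "Remote address is not reachable" on reconnect usually suggests the workers cannot open connections to each other on the supervisor.slots.ports at all, rather than a problem with the retry settings themselves, which is why I'm asking about firewalls (and, in your case, OpenStack security groups) first.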
