State of Inter-worker communication during the course of a running topology after Connection exceptions

Yashwant Ganti Mon, 08 Jun 2015 12:00:21 -0700

Hello All,

I had a question regarding the state of the inter-worker communication
while a Topology running, after a worker restart. We have observed in our
Topology that a worker gets restarted due to a zookeeper heartbeat timing
out. The supervisor daemon on that node promptly restarts the worker and
presumably connections have to be re-established between this worker and
the others in the Topology to exchange tuples. In my logs, I can see the
following messages after a worker restart -


*connection established from /10.0.21.168:34807
<http://10.0.21.168:34807> to ip-10-0-21-168/10.0.21.168:6705
<http://10.0.21.168:6705>* - This is the log file for the worker that
was restarted.


The other log message (on a different node) is :

ip-10-0-21-168/10.0.21.183:168 is not reachable. We will close this client


So my understanding is that some workers are able to establish
connections to the newly brought up worker while some are not.
Unfortunately for us, depending on when this happens, there are less
or more number of connection failures and Topology throughput either
stays steady or drops drastically. When the Topology throughput drops,
we have to do multiple re-deploys of the Topology to get things
working.


The question I have is: In case of a failure to establish connection
at worker startup, is there no further communication between those two
JVMs? We have 80 workers spread out across 10 nodes and at least two
bolts with Shuffle grouping, which probably requires a decent number
of inter-jvm connections to be established for the Topology to achieve
a significant throughput.


Thanks,

Yash

State of Inter-worker communication during the course of a running topology after Connection exceptions

Reply via email to