Hello All, I had a question regarding the state of the inter-worker communication while a Topology running, after a worker restart. We have observed in our Topology that a worker gets restarted due to a zookeeper heartbeat timing out. The supervisor daemon on that node promptly restarts the worker and presumably connections have to be re-established between this worker and the others in the Topology to exchange tuples. In my logs, I can see the following messages after a worker restart -
*connection established from /10.0.21.168:34807 <http://10.0.21.168:34807> to ip-10-0-21-168/10.0.21.168:6705 <http://10.0.21.168:6705>* - This is the log file for the worker that was restarted. The other log message (on a different node) is : ip-10-0-21-168/10.0.21.183:168 is not reachable. We will close this client So my understanding is that some workers are able to establish connections to the newly brought up worker while some are not. Unfortunately for us, depending on when this happens, there are less or more number of connection failures and Topology throughput either stays steady or drops drastically. When the Topology throughput drops, we have to do multiple re-deploys of the Topology to get things working. The question I have is: In case of a failure to establish connection at worker startup, is there no further communication between those two JVMs? We have 80 workers spread out across 10 nodes and at least two bolts with Shuffle grouping, which probably requires a decent number of inter-jvm connections to be established for the Topology to achieve a significant throughput. Thanks, Yash
