Hi,

This is Magesh, an engineer at Visa Inc. I'm relatively new to the Kafka ecosystem. We are using Kafka 0.9, and during testing in our test environments we noticed that the producer retries requests after a NETWORK_EXCEPTION.
To debug the issue, I enabled TRACE logging and saw that the nodes were being added to the disconnected list, which is why the requests were being retried. From the producer code, these appear to be the only places where a node is marked disconnected:

    try {
        // ... read/write handling for the selected key ...

        /* cancel any defunct sockets */
        if (!key.isValid()) {
            close(channel);
            this.disconnected.add(channel.id());
        }
    } catch (Exception e) {
        String desc = channel.socketDescription();
        if (e instanceof IOException)
            log.debug("Connection with {} disconnected", desc, e);
        else
            log.warn("Unexpected error from {}; closing connection", desc, e);
        close(channel);
        this.disconnected.add(channel.id());
    }

After careful analysis, I didn't find any log entries from the exception block, so the only remaining possibility is that the sockets were becoming defunct. Using netstat, I found that the sockets were being dropped periodically. I'm not sure whether the producer, the broker, or the network layer is causing this, and wanted to check whether there is any recommendation. We are using SASL.

Thanks,
Magesh
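For reference on the `!key.isValid()` branch quoted above: per the Java NIO documentation, a `SelectionKey` stays valid until it is cancelled, its channel is closed, or its selector is closed. The following is a minimal standalone sketch using plain Java NIO, not Kafka code; the class and method names are hypothetical. It registers an unconnected, non-blocking `SocketChannel`, closes the channel locally, and checks the key's validity before and after.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical demo class: not part of Kafka.
public class DefunctKeyDemo {

    // Returns {validBeforeClose, validAfterClose} for a key registered
    // on an unconnected, non-blocking SocketChannel.
    static boolean[] validityBeforeAndAfterClose() {
        try (Selector selector = Selector.open();
             SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false); // register() requires non-blocking mode
            SelectionKey key = channel.register(selector, SelectionKey.OP_CONNECT);
            boolean before = key.isValid(); // freshly registered key is valid
            channel.close();                // closing the channel cancels its keys
            boolean after = key.isValid();  // cancelled key is no longer valid
            return new boolean[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] v = validityBeforeAndAfterClose();
        System.out.println("valid before close: " + v[0] + ", after close: " + v[1]);
    }
}
```

Running `main` prints `valid before close: true, after close: false`. Note that the sketch only shows local cancellation or closing; it does not simulate a remote peer dropping the connection.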