Maybe add this to the description of https://issues.apache.org/jira/browse/KAFKA-1843? I can't find it now, but I think there was another bug where I described a similar problem: in some cases it makes sense to fall back to the list of bootstrap nodes, because you've gotten into a bad state where you can't make any progress without a metadata update but can't connect to any of the nodes you know about. The leastLoadedNode method only considers nodes in the current metadata, so in your example K1 is not considered an option once the producer has seen the metadata update that only includes K2. In KAFKA-1501 I also found another obscure edge case where you can run into this problem if your broker hostnames/ports aren't consistent across restarts. Yours is obviously much more likely to occur, and may not even be that uncommon for producers that are only sending data to one topic.
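To make the failure mode and the bootstrap fallback concrete, here is a rough, self-contained toy model. It is only a sketch: the class and method names (BootstrapFallbackSketch, nodeForMetadataRequest, markUnreachable and so on) are hypothetical and are not the Kafka client API, and "reachable" is reduced to a simple set lookup. It assumes node selection works the way described above, i.e. only nodes in the current metadata are candidates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical names, NOT the real Kafka client classes.
final class BootstrapFallbackSketch {
    private final List<String> bootstrapNodes;   // kept for the client's lifetime
    private List<String> metadataNodes;          // replaced on every metadata update
    private final Set<String> unreachable = new HashSet<>();
    private final Map<String, Integer> inFlight = new HashMap<>();

    BootstrapFallbackSketch(List<String> bootstrapNodes) {
        this.bootstrapNodes = new ArrayList<>(bootstrapNodes);
        this.metadataNodes = new ArrayList<>(bootstrapNodes);
    }

    void updateMetadata(List<String> nodes) { metadataNodes = new ArrayList<>(nodes); }
    void markUnreachable(String node) { unreachable.add(node); }
    void markReachable(String node) { unreachable.remove(node); }
    void requestSent(String node) { inFlight.merge(node, 1, Integer::sum); }

    // Simplified stand-in for the selection described above:
    // only nodes in the current metadata are candidates.
    String leastLoadedNode() { return pick(metadataNodes); }

    // Fallback idea: if no metadata node is usable, try the bootstrap list.
    String nodeForMetadataRequest() {
        String node = leastLoadedNode();
        return node != null ? node : pick(bootstrapNodes);
    }

    // Returns the reachable candidate with the fewest in-flight requests, or null.
    private String pick(List<String> candidates) {
        String best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (String n : candidates) {
            if (unreachable.contains(n)) continue;
            int load = inFlight.getOrDefault(n, 0);
            if (load < bestLoad) { best = n; bestLoad = load; }
        }
        return best;
    }

    public static void main(String[] args) {
        BootstrapFallbackSketch client =
            new BootstrapFallbackSketch(Arrays.asList("K1", "K2"));
        client.markUnreachable("K1");                // K1 is killed
        client.updateMetadata(Arrays.asList("K2"));  // metadata now lists only K2
        client.markReachable("K1");                  // K1 comes back
        client.markUnreachable("K2");                // K2 dies before the next refresh
        System.out.println("leastLoadedNode: " + client.leastLoadedNode());
        System.out.println("with fallback:   " + client.nodeForMetadataRequest());
    }
}
```

In this model the metadata-only selection has no usable candidate after the K1/K2 sequence, while the fallback variant can still pick K1 from the preserved bootstrap list. Whether the real client should keep the bootstrap list around indefinitely is a design question this sketch does not settle.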
If you have logs at debug level, are you seeing this message in between the connection attempts:

    Give up sending metadata request since no node is available

Also, if you let it continue running, does it recover after the metadata.max.age.ms timeout? If so, I think that would definitely confirm the issue and might suggest a fix -- preserve the bootstrap metadata and fall back to choosing a node from it when leastLoadedNode would otherwise return null.

-Ewen

On Mon, Apr 27, 2015 at 5:40 AM, Manikumar Reddy <manikumar.re...@gmail.com> wrote:

> Any comments on this issue?
> On Apr 24, 2015 8:05 PM, "Manikumar Reddy" <ku...@nmsworks.co.in> wrote:
>
> > We are testing new producer on a 2 node cluster.
> > Under some node failure scenarios, producer is not able
> > to update metadata.
> >
> > Steps to reproduce
> > 1. form a 2 node cluster (K1, K2)
> > 2. create a topic with single partition, replication factor = 2
> > 3. start producing data (producer metadata : K1,K2)
> > 2. Kill leader node (say K1)
> > 3. K2 becomes the leader (producer metadata : K2)
> > 4. Bring back K1 and Kill K2 before metadata.max.age.ms
> > 5. K1 becomes the Leader (producer metadata still contains : K2)
> >
> > After this point, producer is not able to update the metadata.
> > producer continuously trying to connect with dead node (K2).
> >
> > This looks like a bug to me. Am I missing anything?
> >
>

--
Thanks,
Ewen