Maybe add this to the description of
https://issues.apache.org/jira/browse/KAFKA-1843 ? I can't find it now, but
I think there was another bug where I described a similar problem -- in
some cases it makes sense to fall back to the list of bootstrap nodes
because you've gotten into a bad state and can't make any progress without
a metadata update but can't connect to any nodes. The leastLoadedNode
method only considers nodes in the current metadata, so in your example K1
is not considered an option after seeing the producer metadata update that
only includes K2. In KAFKA-1501 I also found another obscure edge case
where you can run into this problem if your broker hostnames/ports aren't
consistent across restarts. Yours is obviously much more likely to occur,
and may not even be that uncommon for producers that are only sending data
to one topic.

If you have logs at debug level, are you seeing this message in between the
connection attempts:

Give up sending metadata request since no node is available

Also, if you let it continue running, does it recover after the
metadata.max.age.ms timeout? If so, I think that would definitely confirm
the issue and might suggest a fix -- preserve the bootstrap metadata and
fall back to choosing a node from it when leastLoadedNode would otherwise
return null.
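
To make the suggested fix concrete, here is a rough sketch of the fallback idea. It is not the actual NetworkClient code: the class and method names (BootstrapFallbackSketch, chooseNode, firstReachable) are made up for illustration, brokers are plain strings, and "least loaded" is simplified to "first reachable", whereas the real leastLoadedNode looks at in-flight requests and connection state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in the Kafka client.
public class BootstrapFallbackSketch {

    // Pick a node from the current metadata first; if none is usable,
    // fall back to the preserved bootstrap list instead of returning null.
    static String chooseNode(List<String> currentMetadata,
                             List<String> bootstrap,
                             Set<String> reachable) {
        String node = firstReachable(currentMetadata, reachable);
        if (node != null) {
            return node;
        }
        return firstReachable(bootstrap, reachable);
    }

    // Simplified stand-in for node selection: first candidate we can connect to.
    static String firstReachable(List<String> candidates, Set<String> reachable) {
        for (String candidate : candidates) {
            if (reachable.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // State from the reported scenario: metadata only knows K2, but only K1 is up.
        List<String> bootstrap = Arrays.asList("K1", "K2");
        List<String> currentMetadata = Collections.singletonList("K2");
        Set<String> reachable = new HashSet<>(Collections.singletonList("K1"));

        System.out.println(chooseNode(currentMetadata, bootstrap, reachable));
    }
}
```

With the state from your example (metadata contains only K2, only K1 is alive), selection from the current metadata alone finds nothing, while the fallback gives the client K1 to send the metadata request to.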

-Ewen

On Mon, Apr 27, 2015 at 5:40 AM, Manikumar Reddy <manikumar.re...@gmail.com>
wrote:

> Any comments on this issue?
> On Apr 24, 2015 8:05 PM, "Manikumar Reddy" <ku...@nmsworks.co.in> wrote:
>
> > We are testing new producer on a 2 node cluster.
> > Under some node failure scenarios, producer is not able
> > to update metadata.
> >
> > Steps to reproduce
> > 1. form a 2 node cluster (K1, K2)
> > 2. create a topic with single partition, replication factor = 2
> > 3. start producing data (producer metadata : K1,K2)
> > 4. Kill leader node (say K1)
> > 5. K2 becomes the leader (producer metadata : K2)
> > 6. Bring back K1 and Kill K2 before metadata.max.age.ms
> > 7. K1 becomes the Leader (producer metadata still contains : K2)
> >
> > After this point, producer is not able to update the metadata.
> > producer continuously trying to connect with dead node (K2).
> >
> > This looks like a bug to me. Am I missing anything?
> >
>



-- 
Thanks,
Ewen
