Are you seeing this in practice or is this just a concern about the way the
code currently works? If the broker is actually down and the host is
rejecting connections, the situation you describe shouldn't be a problem.
It's true that the NetworkClient chooses a fixed nodeIndexOffset, but the
expectation is that if we run one iteration of leastLoadedNode and select a
node, we'll try to connect and any failure will be handled by putting that
node into a blackout period during which subsequent calls to
leastLoadedNode will give priority to other options. If your server is
*not* explicitly rejecting connections, then I think it's possible we end
up hanging for a long while just waiting for that connection. If that's
the case (e.g., if you are running on EC2 and it has this behavior -- I
believe the default firewall rules will not kill the connection), this
would be useful to know.
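To make the blackout behavior concrete, here's a rough sketch of that kind of selection logic. The class and method names here are made up for illustration -- this is not Kafka's actual NetworkClient code, just the general idea of skipping nodes that recently failed to connect:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of blackout-aware node selection: a node that just
// failed to connect enters a backoff window, during which other nodes are
// preferred. Names are hypothetical, not Kafka's real API.
public class NodeSelector {
    private final long reconnectBackoffMs;
    private final Map<String, Long> lastFailureMs = new HashMap<>();

    public NodeSelector(long reconnectBackoffMs) {
        this.reconnectBackoffMs = reconnectBackoffMs;
    }

    // Record a failed connection attempt; the node enters its backoff window.
    public void connectionFailed(String node, long nowMs) {
        lastFailureMs.put(node, nowMs);
    }

    private boolean inBackoff(String node, long nowMs) {
        Long failed = lastFailureMs.get(node);
        return failed != null && nowMs - failed < reconnectBackoffMs;
    }

    // Prefer any node that is not currently blacked out; if all are blacked
    // out, fall back to the one that failed longest ago.
    public String select(List<String> nodes, long nowMs) {
        String best = null;
        long bestFailure = Long.MAX_VALUE;
        for (String node : nodes) {
            if (!inBackoff(node, nowMs)) {
                return node;
            }
            long failed = lastFailureMs.get(node);
            if (failed < bestFailure) {
                bestFailure = failed;
                best = node;
            }
        }
        return best;
    }
}
```

The key point is that the blackout only helps if the failed connection attempt actually completes quickly -- which is exactly why a silently-dropped connection is the interesting case.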

A couple of bugs you might want to be aware of:

https://issues.apache.org/jira/browse/KAFKA-1843 is meant to generally
address the fact that there are a lot of states that we could be in, and
the way we handle them (especially with leastLoadedNode), may not work well
in all cases. It's very difficult to be comprehensive here, so if there is
a scenario that is failing for you, the more information you can give
about the state of the system and the producer, the better.

https://issues.apache.org/jira/browse/KAFKA-1842 might also be relevant --
right now we rely on the underlying TCP connection timeouts, but this is
definitely not ideal. They can be quite long by default, and we might want
to consider connections failed much sooner.
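As a sketch of what "much sooner" could look like, an application-level connect timeout avoids waiting on the kernel's SYN retry cycle, which can take minutes by default. This is just an illustration of the idea, not what the client actually does today; the host and port in the test are placeholders:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch: enforce our own connect timeout instead of relying on the OS
// default TCP connect timeout, which can be very long.
public class ConnectWithTimeout {
    public static boolean tryConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // connect(addr, timeout) gives up after timeoutMs instead of
            // waiting for the kernel to abandon the connection attempt.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (SocketTimeoutException e) {
            return false; // declare the connection failed well before the OS would
        } catch (IOException e) {
            return false; // e.g., connection refused
        }
    }
}
```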

I also could have sworn there was a JIRA filed about the fact that the
bootstrap servers are never reused, but I can't find it at the moment --
in some cases, if you have no better option, it would be best to revert
to the original set of bootstrap servers for loading metadata. This can
especially become a problem when you're only producing to one or a small
number of topics and therefore only have metadata for a couple of
servers. If anything happens to those servers too quickly (within the
metadata refresh period), you can get stuck with references only to dead
nodes.
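The fallback I have in mind would look roughly like this -- again, the class and method names are hypothetical, just to show the shape of the fix:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: if every node we currently know about from metadata
// appears dead, revert to the original bootstrap list for the next metadata
// fetch instead of getting stuck.
public class MetadataNodes {
    private final List<String> bootstrap;
    private List<String> fromMetadata = new ArrayList<>();
    private final List<String> deadNodes = new ArrayList<>();

    public MetadataNodes(List<String> bootstrap) {
        this.bootstrap = bootstrap;
    }

    public void updateMetadata(List<String> nodes) {
        fromMetadata = new ArrayList<>(nodes);
    }

    public void markDead(String node) {
        deadNodes.add(node);
    }

    // Candidates for a metadata fetch: live metadata nodes if any remain,
    // otherwise fall back to the bootstrap servers.
    public List<String> candidates() {
        List<String> live = new ArrayList<>(fromMetadata);
        live.removeAll(deadNodes);
        return live.isEmpty() ? bootstrap : live;
    }
}
```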

-Ewen

On Fri, Aug 21, 2015 at 6:56 PM, Kishore Senji <kse...@gmail.com> wrote:

> If one of the brokers we specify in the bootstrap servers list is down,
> there is a chance that the Producer (a brand new instance with no prior
> metadata) will never be able to publish anything to Kafka until that broker
> comes back up. Because the logic for getting the initial metadata is based
> on some random index into the set of bootstrap nodes, if that index happens
> to land on the down node, the Kafka producer keeps trying to get the
> metadata from that node only. It never switches to another node. Without
> metadata, the Producer can never send anything.
>
> The nodeIndexOffset is chosen at the creation of the NetworkClient (and
> this offset is not changed when we fail to get a new connection), so when
> fetching metadata for the first time, there is a possibility that we keep
> trying only the broker that is down.
>
> This can be a problem if a broker goes down and also a Producer is
> restarted or a new instance is brought up. Is this a known issue?
>
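For what it's worth, the failure mode Kishore describes boils down to a cursor into the bootstrap list that is fixed at construction time; advancing it on each failed attempt is what avoids probing the same dead broker forever. A toy sketch of that distinction (names are illustrative, not the actual NetworkClient fields):

```java
import java.util.List;

// Illustrative sketch of the reported failure mode: if the starting offset
// into the bootstrap list is fixed at client creation and never advanced on
// failure, a fresh producer can keep probing the same dead broker. Calling
// advanceOnFailure() after each failed attempt moves past it.
public class BootstrapCursor {
    private final List<String> bootstrapServers;
    private int offset; // fixed at construction in the problematic version

    public BootstrapCursor(List<String> bootstrapServers, int initialOffset) {
        this.bootstrapServers = bootstrapServers;
        this.offset = initialOffset;
    }

    public String current() {
        return bootstrapServers.get(offset % bootstrapServers.size());
    }

    // The fix: move past a node once a connection attempt to it fails.
    public void advanceOnFailure() {
        offset++;
    }
}
```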



-- 
Thanks,
Ewen
