Aggie, I'm not able to reproduce your behavior in 0.10.0.1.

> I did more testing and find the rule (Topic is created with
> "--replication-factor 2 --partitions 1" in following case):
> node 1          node 2
> down(lead)      down (replica)
> down(replica)   up (lead)        producer send fail !!!

When node 2 comes up, the producer is able to connect and send messages to it after the metadata update.

Logs:

[2016-09-27T15:18:17,907] NetworkClient: handleDisconnections(): Node 1 disconnected.
[2016-09-27T15:18:18,007] NetworkClient: initiateConnect(): Initiating connection to node 1 at localhost:9093.
[2016-09-27T15:18:18,008] Selector: pollSelectionKeys(): Connection with localhost/127.0.0.1 disconnected
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_45]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_45]
    at org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:51) ~[kafka-clients-0.10.0.1.jar:?]
    at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:73) ~[kafka-clients-0.10.0.1.jar:?]
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:309) [kafka-clients-0.10.0.1.jar:?]
    at org.apache.kafka.common.network.Selector.poll(Selector.java:283) [kafka-clients-0.10.0.1.jar:?]
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:260) [kafka-clients-0.10.0.1.jar:?]
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:229) [kafka-clients-0.10.0.1.jar:?]
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:134) [kafka-clients-0.10.0.1.jar:?]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
[2016-09-27T15:18:18,008] NetworkClient: handleDisconnections(): Node 1 disconnected.
[2016-09-27T15:18:18,043] NetworkClient: maybeUpdate(): Sending metadata request {topics=[hello]} to node 0
[2016-09-27T15:18:18,052] Metadata: update(): Updated cluster metadata version 4 to Cluster(nodes = [tcltest1.nmsworks.co.in:9092 (id: 0 rack: null)], partitions = [Partition(topic = hello, partition = 0, leader = none, replicas = [0,1,], isr = []])
[2016-09-27T15:18:19,053] NetworkClient: maybeUpdate(): Sending metadata request {topics=[hello]} to node 0
[2016-09-27T15:18:19,056] Metadata: update(): Updated cluster metadata version 5 to Cluster(nodes = [tcltest1.nmsworks.co.in:9092 (id: 0 rack: null)], partitions = [Partition(topic = hello, partition = 0, leader = 0, replicas = [0,1,], isr = [0,]])
[2016-09-27T15:18:19,081] KafkaProducer: main(): Batch : 4 sent
[2016-09-27T15:18:19,182] KafkaProducer: main(): Batch : 5, Sending the record with key : 0

- Kamal

On Mon, Sep 26, 2016 at 8:53 AM, FEI Aggie <aggie....@alcatel-lucent.com> wrote:

> Kamal,
> Thanks for your response. I tried testing with metadata.max.age.ms
> reduced to 10s, but the behavior not changed, and producer still can't find
> the live broker.
>
> I did more testing and find the rule (Topic is created with
> "--replication-factor 2 --partitions 1" in following case):
> node 1          node 2
> down(lead)      down (replica)
> down(replica)   up (lead)        producer send fail !!!
>
>
> down(lead)      down (replica)
> up (lead)       down (replica)   producer send ok !!!
>
> If the only node with original lead partition up, everything is fine.
> If the only node with original replica partition up, producer can't
> connect to broker alive (always try to connect to the original lead broker,
> node 1 in my case).
>
> Kafka can't recover for this situation? Anyone has clue for this?
>
> Thanks!
> Aggie
>
> -----Original Message-----
> From: Kamal C [mailto:kamaltar...@gmail.com]
> Sent: Saturday, September 24, 2016 1:37 PM
> To: users@kafka.apache.org
> Subject: Re: producer can't push msg sometimes with 1 broker recoved
>
> Reduce the metadata refresh interval 'metadata.max.age.ms' from 5 min to
> your desired time interval.
> This may reduce the time window of non-availability broker.
>
> -- Kamal
>
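
For reference, the metadata.max.age.ms setting discussed in this thread is an ordinary producer property. Below is a minimal sketch of how it might be set to the 10s value mentioned above. The class name, method name, broker list (localhost:9092 and localhost:9093), and String serializers are assumptions for illustration and are not taken from either poster's code.

```java
import java.util.Properties;

// Hypothetical helper, not from the thread: builds producer settings with a
// metadata refresh interval shorter than the 5-minute default.
class ProducerConfigSketch {

    static Properties buildProps(String bootstrapServers, long metadataMaxAgeMs) {
        Properties props = new Properties();
        // Assumed broker list; list every broker so bootstrap does not depend on one node.
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Force a metadata refresh at least this often, even without leadership-change errors.
        props.setProperty("metadata.max.age.ms", Long.toString(metadataMaxAgeMs));
        // Assumed serializers; any serializer classes on the classpath would do.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        // 10 seconds, the value tried in the thread.
        Properties props = buildProps("localhost:9092,localhost:9093", 10_000L);
        // With kafka-clients on the classpath, these would be passed to
        // new KafkaProducer<>(props); here they are only printed.
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Note that bootstrap.servers is used for the initial connection; after that the producer follows the broker list and partition leaders reported in metadata responses, which is why the refresh interval matters in the scenario above.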