Hi all! We are running Kafka in a 3 node setup with Kafka and Zookeeper on each node. The topics have 1 partition and 2 replicas, like:

Topic:someTopic PartitionCount:1 ReplicationFactor:2 Configs:retention.ms=600000
    Topic: someTopic Partition: 0 Leader: 2 Replicas: 2,0 Isr: 2,0

We use the following settings.

Consumer settings:
fetch.min.bytes=1
enable.auto.commit=true
max.partition.fetch.bytes=1073741824

Producer settings:
metadata.fetch.timeout.ms=1000

If we stop Kafka and Zookeeper on one node with 'kill -9', Kafka detects within seconds that the leader is missing and switches the leader to the other replica, and consumers continue to receive messages.

If we instead bring down the network for the same node with 'ifdown eth0' (which breaks the connection to both Kafka and Zookeeper on that node), Kafka seems to have problems detecting that the broker is missing, and it takes up to 2 minutes until any more messages can be consumed on the affected topics.

The following log can be seen on the consumer:

[2017-05-04 15:44:26,916] WARN Auto offset commit failed for group console-consumer-75510: Commit offsets failed with retriable exception. You should retry committing offsets. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

and on the producer:

May 04 15:44:18: 15:44:18.420 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.435 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.440 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.442 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: 15:44:18.444 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'someTopic' failed
May 04 15:44:18: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
May 04 15:44:18: org.apache.kafka.common.errors.TimeoutException: Batch containing 31 record(s) expired due to timeout while requesting metadata from brokers for someTopic-0
May 04 15:44:18: 15:44:18.446 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'Heartbeat.Heartbeat' failed
May 04 15:44:18: org.apache.kafka.common.errors.TimeoutException: Batch containing 31 record(s) expired due to timeout while requesting metadata from brokers for someTopic-0
May 04 15:44:18: 15:44:18.448 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'Heartbeat.Heartbeat' failed
May 04 15:44:18: org.apache.kafka.common.errors.TimeoutException: Batch containing 31 record(s) expired due to timeout while requesting metadata from brokers for someTopic-0
May 04 15:44:18: 15:44:18.449 [kafka-producer-network-thread | producer-2] ERROR - app Publishing to topic 'Heartbeat.Heartbeat' failed

... and it will continue to print those for a while.
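
In case it helps to reproduce this, here is roughly how the client settings listed above could be written as Java properties. This is only a sketch: the class name ClientSettings and its two methods are made up for illustration, and everything not listed in this message (bootstrap servers, serializers, group id and so on) is left out, so these objects are not complete client configurations on their own.

```java
import java.util.Properties;

// Illustrative holder class; the name ClientSettings is hypothetical.
final class ClientSettings {

    // Consumer settings exactly as listed in this message.
    static Properties consumerSettings() {
        Properties props = new Properties();
        props.setProperty("fetch.min.bytes", "1");
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("max.partition.fetch.bytes", "1073741824"); // 1 GiB
        return props;
    }

    // Producer setting exactly as listed in this message.
    static Properties producerSettings() {
        Properties props = new Properties();
        props.setProperty("metadata.fetch.timeout.ms", "1000"); // 1 second
        return props;
    }

    public static void main(String[] args) {
        // Print both sets so they can be compared against the values above.
        System.out.println("consumer: " + consumerSettings());
        System.out.println("producer: " + producerSettings());
    }
}
```

These are the only overrides quoted in this message; any setting not shown here is not covered by the sketch.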