Hi all, I'm testing a setup where I have 3 zookeeper hosts and 3 kafka brokers (version 1.0.0), using the kafka-producer-perf-test.sh script.
It seems that in certain circumstances, sending records is not retried after a timeout. I'm not sure what is wrong... >From the documentation of the request.timeout.ms producer config, I understand that the request should be retried after a timeout: "If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted". However it seems that it is not (always) the case, when overloading the brokers (or the network) using kafka-producer-perf-test.sh, I see that kind of output: org.apache.kafka.common.errors.TimeoutException: Expiring 184 record(s) for test-123: 40011 ms has passed since batch creation plus linger time I don't understand the above log, the duration printed correspond to request.timeout.ms, why isn't it retried (retries is set to MAX_INT)? I also see a couple of those, but it's not for the same partitions as the messages above: WARN [Producer clientId=producer-1] Got error produce response with correlation id 3000 on topic-partition test-92, retrying (2147483646 attempts left). Error: REQUEST_TIMED_OUT (org.apache.kafka.clients.producer.internals.Sender) For reference, here's the command line used: ./kafka-producer-perf-test.sh --topic test --producer.config producer.properties --payload-file ../LICENSE --num-records 1000000 --throughput 10000 --print-metrics and the content of producer.properties (the purpose is maximum reliability, inspired by https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/): bootstrap.servers=10.0.17.157:9092,10.0.17.205:9092,10.0.30.24:9092 compression.type = none enable.idempotence = true max.in.flight.requests.per.connection = 1 request.timeout.ms = 40000 acks = all retries = 2147483647 retry.backoff.ms = 1000 a few of the corresponding metrics: producer-metrics:record-error-total:{client-id=producer-1} : 2958.000 producer-metrics:record-retry-total:{client-id=producer-1} : 68.000 producer-metrics:record-send-total:{client-id=producer-1} : 997110.000 producer-node-metrics:request-total:{client-id=producer-1, node-id=node--1} : 89.000 producer-node-metrics:request-total:{client-id=producer-1, node-id=node--2} : 24.000 producer-node-metrics:request-total:{client-id=producer-1, node-id=node--3} : 24.000 producer-node-metrics:request-total:{client-id=producer-1, node-id=node-1} : 42859475.000 producer-node-metrics:request-total:{client-id=producer-1, node-id=node-2} : 50855563.000 producer-node-metrics:request-total:{client-id=producer-1, node-id=node-3} : 41940906.000 producer-topic-metrics:record-error-total:{client-id=producer-1, topic=test} : 2958.000 producer-topic-metrics:record-retry-total:{client-id=producer-1, topic=test} : 68.000 producer-topic-metrics:record-send-total:{client-id=producer-1, topic=test} : 997110.000 In the metrics above, I get the node-[1-3], but what are the node--[1-3]? Thanks