Hi all,

I'm testing a setup where I have 3 zookeeper hosts and 3 kafka brokers
(version 1.0.0), using the kafka-producer-perf-test.sh script.

It seems that in certain circumstances, sending records is not retried
after a timeout. I'm not sure what is wrong...

From the documentation of the request.timeout.ms producer config, I
understand that the request should be retried after a timeout: "If the
response is not received before the timeout elapses the client will
resend the request if necessary or fail the request if retries are
exhausted".

However, that does not (always) seem to be the case. When overloading
the brokers (or the network) using kafka-producer-perf-test.sh, I see
output like this:

org.apache.kafka.common.errors.TimeoutException: Expiring 184
record(s) for test-123: 40011 ms has passed since batch creation plus
linger time

I don't understand the above log. The duration printed corresponds to
request.timeout.ms, so why isn't the send retried (retries is set to
MAX_INT)?

I also see a few warnings like the following, though not for the same
partitions as the messages above:

WARN [Producer clientId=producer-1] Got error produce response with
correlation id 3000 on topic-partition test-92, retrying (2147483646
attempts left). Error: REQUEST_TIMED_OUT
(org.apache.kafka.clients.producer.internals.Sender)

For reference, here's the command line used:

./kafka-producer-perf-test.sh --topic test --producer.config
producer.properties --payload-file ../LICENSE --num-records 1000000
--throughput 10000 --print-metrics


and the content of producer.properties (the purpose is maximum
reliability, inspired by
https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/):

bootstrap.servers=10.0.17.157:9092,10.0.17.205:9092,10.0.30.24:9092
compression.type = none
enable.idempotence = true
max.in.flight.requests.per.connection = 1
request.timeout.ms = 40000
acks = all
retries = 2147483647
retry.backoff.ms = 1000
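
As a quick check of the values above, here is a minimal Python sketch that
parses the properties and confirms that retries equals Java's
Integer.MAX_VALUE, and that the "2147483646 attempts left" in the warning
above is exactly one less. The parse_properties helper is hypothetical and
only handles simple key=value lines, not the full Java properties syntax.

```python
# The producer.properties content from this post, copied verbatim.
PROPS = """\
bootstrap.servers=10.0.17.157:9092,10.0.17.205:9092,10.0.30.24:9092
compression.type = none
enable.idempotence = true
max.in.flight.requests.per.connection = 1
request.timeout.ms = 40000
acks = all
retries = 2147483647
retry.backoff.ms = 1000
"""


def parse_properties(text):
    """Hypothetical helper: parse simple key=value lines.

    Skips blank lines and '#' comments. This is a simplification and does
    not cover the full Java properties format.
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


config = parse_properties(PROPS)
JAVA_INT_MAX = 2**31 - 1  # Integer.MAX_VALUE in Java

print(int(config["retries"]) == JAVA_INT_MAX)   # True
print(int(config["retries"]) - 2147483646)      # 1 (one attempt already used)
```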

A few of the corresponding metrics:

producer-metrics:record-error-total:{client-id=producer-1} : 2958.000
producer-metrics:record-retry-total:{client-id=producer-1} : 68.000
producer-metrics:record-send-total:{client-id=producer-1} : 997110.000
producer-node-metrics:request-total:{client-id=producer-1, node-id=node--1} : 89.000
producer-node-metrics:request-total:{client-id=producer-1, node-id=node--2} : 24.000
producer-node-metrics:request-total:{client-id=producer-1, node-id=node--3} : 24.000
producer-node-metrics:request-total:{client-id=producer-1, node-id=node-1} : 42859475.000
producer-node-metrics:request-total:{client-id=producer-1, node-id=node-2} : 50855563.000
producer-node-metrics:request-total:{client-id=producer-1, node-id=node-3} : 41940906.000
producer-topic-metrics:record-error-total:{client-id=producer-1, topic=test} : 2958.000
producer-topic-metrics:record-retry-total:{client-id=producer-1, topic=test} : 68.000
producer-topic-metrics:record-send-total:{client-id=producer-1, topic=test} : 997110.000
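
A quick sanity check on the record totals, as a minimal Python sketch. The
values are copied from the metrics output; the interpretation in the
comments (that record-send-total also counts re-sent records) is an
assumption, not something confirmed against the client code.

```python
# Values copied from the --print-metrics output in this post.
metrics = {
    "record-send-total": 997110,
    "record-error-total": 2958,
    "record-retry-total": 68,
}
num_records = 1_000_000  # --num-records on the command line

# Assumption: record-send-total counts every send attempt, retries included,
# so subtracting the retries leaves the number of distinct records sent.
accounted = (
    metrics["record-send-total"]
    + metrics["record-error-total"]
    - metrics["record-retry-total"]
)
error_ratio = metrics["record-error-total"] / num_records

print(accounted)             # 1000000
print(f"{error_ratio:.4%}")  # 0.2958%
```

Under that assumption the three counters add up exactly to the 1000000
records requested, which would be consistent with the 2958 failed records
being dropped without a retry.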

In the metrics above, I understand what node-[1-3] are, but what are node--[1-3]?

Thanks
