Problem Description
When a consumer created with librdkafka receives messages from Kafka, message latency spikes intermittently. Most messages are received within about 10 ms, but occasionally the difference between the receive time and the timestamp in the message body exceeds 1 second.

Environment Information

Software Versions
librdkafka version: 2.11.0
Operating System: CentOS 7.6
Kafka version: 3.6.2 (ZooKeeper mode deployment)

Kafka Cluster
Number of nodes: 3
Server configuration: 64 vCPU, 128 GB RAM
Network: Gigabit network, all nodes connected to the same switch, low network latency
Disk: HDD RAID1

Topic Configuration
Test topic (test):
  Partitions: 1
  Replicas: 2
  message.timestamp.type=LogAppendTime
  min.insync.replicas=1
Load topics (testA, testB, testC, testD):
  Each topic: 128 partitions, 2 replicas
  Total message rate: 80,000 messages/second (20,000 messages/second per topic)
  Message size: 500 bytes per message

Consumer Configuration (librdkafka)
fetch.wait.max.ms: 10 (the issue also occurs with 500, so I changed it to 10)
All other settings are librdkafka defaults

Reproduction Steps
1. Create four load topics (testA, testB, testC, testD), each with 128 partitions and 2 replicas.
2. Deploy test programs that send 20,000 messages per second (500 bytes each) to each of the four load topics.
3. Create the test topic (test) with the configuration listed above.
4. Use a test program to send 1 message per second (100 bytes) to the test topic.
5. Create a consumer that subscribes to the test topic.
6. Have the consumer print the receive time and the timestamp in the message.
7. Observe that most messages are received within about 10 ms, but occasionally a message is delayed by more than 1 second.

Key Observations
The test consumer runs on the partition leader node of the test topic, which rules out clock differences between nodes.
The intermittent latency occurs under high load (80k msg/s).
The low-throughput topic (1 msg/s) experiences delays while the high-throughput topics run in the background.
The latency is intermittent, not continuous.

I used tcpdump to capture network packets and observed that the consumer sends fetch requests frequently and that Kafka's fetch responses come back quickly. However, several request/response round trips pass before the message is actually received, and this is the main source of the delay. Perhaps this is not an issue with librdkafka.

I set log.cleaner.threads=4 and num.replica.fetchers=4 on Kafka, but the problem persists.

After upgrading Kafka to version 4.0 with a KRaft deployment and following the same test steps, the delay still occurs, but much less often, and the delays are around 500 ms.

Does anyone have suggestions for new directions to troubleshoot this issue?

杜杰
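Since the test topic uses message.timestamp.type=LogAppendTime, one possible way to narrow this down is to split each delay into a producer-to-broker part (body timestamp to broker append time) and a broker-to-consumer part (append time to receive time). The sketch below is only an illustration under assumptions: the `Sample` record, its field names, and the 1000 ms threshold are hypothetical, and it assumes the consumer also logs the record timestamp (librdkafka exposes it through `rd_kafka_message_timestamp()`). The append and receive clocks are on the same node in this setup, but the producer's clock may not be, so the first part is only as accurate as the clock sync.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Sample:
    """One logged message (hypothetical record layout, all values in ms)."""
    body_ts_ms: int    # timestamp the producer wrote into the message body
    append_ts_ms: int  # record timestamp set by the broker (LogAppendTime)
    recv_ts_ms: int    # consumer wall clock when the message was received


def split_latency(s: Sample) -> Tuple[int, int]:
    """Return (producer-to-broker ms, broker-to-consumer ms) for one sample."""
    return s.append_ts_ms - s.body_ts_ms, s.recv_ts_ms - s.append_ts_ms


def find_outliers(
    samples: Iterable[Sample], threshold_ms: int = 1000
) -> List[Tuple[Sample, int, int, str]]:
    """Collect samples whose end-to-end delay exceeds threshold_ms.

    Each result names the side that contributed most of the delay:
    "produce" (before the broker appended it) or "fetch" (after).
    """
    outliers = []
    for s in samples:
        produce_ms, fetch_ms = split_latency(s)
        if produce_ms + fetch_ms > threshold_ms:
            dominant = "produce" if produce_ms >= fetch_ms else "fetch"
            outliers.append((s, produce_ms, fetch_ms, dominant))
    return outliers


if __name__ == "__main__":
    # Made-up numbers, only to show the call shape.
    log = [
        Sample(1000, 1004, 1010),
        Sample(2000, 3200, 3205),
        Sample(3000, 3003, 4100),
    ]
    for sample, produce_ms, fetch_ms, dominant in find_outliers(log):
        print(sample.body_ts_ms, produce_ms, fetch_ms, dominant)
```

If the slow messages are dominated by the "produce" side, the delay happens before the broker appends the message and the consumer fetch path is probably not the place to look; if they are dominated by the "fetch" side, the message was already in the log while the fetch round trips seen in tcpdump were still returning without it.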
