I’ve been investigating some possible network performance issues we’re having with our Kafka brokers, and noticed that traffic sent between brokers tends to show frequent bursts of very small packets:
16:09:52.299863 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 127908:127925, ack 4143, win 32488, length 17
16:09:52.299870 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 127925:127943, ack 4143, win 32488, length 18
16:09:52.299876 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 127943:127967, ack 4143, win 32488, length 24
16:09:52.299889 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 127967:127985, ack 4143, win 32488, length 18
16:09:52.299892 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 127985:127999, ack 4143, win 32488, length 14
16:09:52.299895 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 127999:128017, ack 4143, win 32488, length 18
16:09:52.299897 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 128017:128031, ack 4143, win 32488, length 14
16:09:52.299900 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: Flags [P.], seq 128031:128049, ack 4143, win 32488, length 18
16:09:52.300612 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39400: Flags [P.], seq 279162:279178, ack 6700, win 32488, length 16
16:09:52.300645 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39400: Flags [P.], seq 279178:279189, ack 6700, win 32488, length 11
16:09:52.300655 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39400: Flags [P.], seq 279189:279207, ack 6700, win 32488, length 18

I don’t know if this in itself is really an issue, but I thought I’d check with the group to see. The MTU on the interfaces is set to 9001, and regular consumers don’t get the same bursts of small push packets.
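
In case it helps anyone compare their own captures, here is a minimal sketch that tallies payload lengths per connection from tcpdump text output like the capture above. It is not part of our tooling: the function name `summarize_lengths` and the regular expression are illustrative assumptions, and the pattern only handles lines in the format shown here.

```python
import re
from collections import defaultdict

# Captures the source endpoint, destination endpoint, and the trailing
# "length N" field of a tcpdump text line in the format shown above.
LINE_RE = re.compile(r"IP (\S+) > (\S+): .*\blength (\d+)")


def summarize_lengths(lines):
    """Return {(src, dst): [payload lengths]} for every line that matches."""
    flows = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            src, dst, length = match.group(1), match.group(2), int(match.group(3))
            flows[(src, dst)].append(length)
    return dict(flows)


# Three lines copied from the capture above, used as sample input.
sample = [
    "16:09:52.299863 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: "
    "Flags [P.], seq 127908:127925, ack 4143, win 32488, length 17",
    "16:09:52.299870 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39399: "
    "Flags [P.], seq 127925:127943, ack 4143, win 32488, length 18",
    "16:09:52.300612 IP stream02.chartbeat.net.9092 > stream03.chartbeat.net.39400: "
    "Flags [P.], seq 279162:279178, ack 6700, win 32488, length 16",
]

for (src, dst), lengths in summarize_lengths(sample).items():
    # Report packet count and the smallest/largest payload per connection.
    print(src, "->", dst, "packets:", len(lengths), "min:", min(lengths), "max:", max(lengths))
```

Running the same function over a capture of regular consumer traffic would give a like-for-like comparison of payload sizes.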
Our replica config is:

replica.lag.time.max.ms=10000
replica.lag.max.messages=4000
replica.socket.timeout.ms=301000
replica.socket.receive.buffer.bytes=641024
replica.fetch.max.bytes=10241024
replica.fetch.wait.max.ms=500
replica.fetch.min.bytes=1
num.replica.fetchers=16

Any thoughts on whether or not this is an issue, and if so how we should correct it? I’m wondering about the replica.fetch.*.bytes settings; it’s unclear to me from the docs what those do exactly.

Thanks,
Wes
