[ https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379515#comment-17379515 ]
Haruki Okada edited comment on KAFKA-9648 at 7/13/21, 2:05 AM: --------------------------------------------------------------- Hi. We operate a Kafka cluster in our company which has 130+ brokers, 1.3M+ total socket server connections and 25K+ partitions. We faced very similar issue to https://issues.apache.org/jira/browse/KAFKA-9211 (i.e. producer slowing down mysteriously with small TCP packets) recently, and we figured out the cause. We hope our knowledge could help. h2. Environment: * Kafka broker version: 2.4.1 (but we suppose the version doesn't matter) * Kafka broker OS: CentOS7 (kernel 3.10.X) h2. Phenomenon: * Restart a broker process, and execute preferred leader election after the broker became in-sync * Some producers's node-request-latency to the broker got insanely higher than usual ** However no such high produce response latency is observed on broker-side metrics * As the result, producer batches couldn't be sent out in sufficient pace, then caused batch expiration h2. Analysis: * We observed TCP SYN cookies metric was increased at incidental time, with following dmesg message: ** {code:java} TCP: request_sock_TCP: Possible SYN flooding on port 22991. Sending cookies. Check SNMP counters.{code} * So we also suspected the phenomenon is due to `wscale` drop (as like described in this issue), but we doubt it because: ** Even with TCP SYN cookies, `wscale` should be available as long as TCP timestamp is enabled. (refs: [https://blog.cloudflare.com/syn-packet-handling-in-the-wild/]) ** In our environment, TCP timestamp is enabled. ** Also, generally, `wscale` (window scaling factor) is used for extending window beyond 65535 (max window size in TCP spec) on large-network round trip environment such as internet *** Our typical produce request size is smaller than that *** So it's hard to imagine that `wscale` drop causes such significant request-latency degradation * After several attempts to reproduce, we found out that receiver (i.e. broker in this context)'s `wscale` is inconsistent between producer and broker at incidental time. ** receiver's `wscale` advertised from broker -> producer along with SYN+ACK packet: 7 *** {code:java} 17:32:05.161213 IP BROKER.HOST.XXXX > CLIENT.YYYY: Flags [S.], seq 29292755, ack 17469019, win 28960, options [mss 1460,sackOK,TS val 25589601 ecr 9055245,nop,wscale 1], length 0{code} * ** *** (seq numbers are substituted with random value) ** actual receiver's `wscale` after established: 1 *** {code:java} [user@BROKER ~]$ /sbin/ss -e -i -t | less ... ESTAB 0 0 BROKER.XXXX CLIENT.YYYY timer:(keepalive,21min,0) uid:503 ino:15143860 sk:ffff9ba25dc4f440 <-> ts sack cubic wscale:7,7 rto:201 rtt:0.179/0.006 ato:140 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:97054 bytes_received:18950537 segs_out:15094 segs_in:13941 send 647.2Mbps lastsnd:217 lastrcv:17 lastack:217 pacing_rate 1288.0Mbps rcv_rtt:1.875 rcv_space:29200{code} * ** *** `wscale:7,7` means that broker's receiver window scaling factor is 7 * Okay, then this inconsistency could explain the phenomenon as below: ** Premise: When `wscale` is enabled, TCP window size is calculated by `window_size * 2^wscale` ** When broker calculates advertised window size, it's bit-shifted to the right by `wscale` (== 7) *** [https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_output.c#L290] ** On the other hand, producer multiplies advertised window size by 1, which is conveyed through SYN+ACK ** As the result, window size became 64 times smaller than expected ** Then producer splits TCP packets to much smaller size than usual (possibly under MSS) *** TCP acks are delayed due to the conditions are not met ([https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_input.c#L4760]) * Last remaining question is "why such wscale inconsistency happened?" ** Read through the kernel source code, then we found that there's an issue in the logic to calculate wscale if TCP connection is established through syncookies *** It's fixed in this commit: [https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417] h2. Solution: * 1. Upgrade linux kernel to at least v5.10 (which the patch is committed) * 2. Disable SYN cookies ** i.e. setting `net.ipv4.tcp_syncookies` kernel parameter to 0 ** With SYN cookies disabled, some SYN's are dropped on same situations but typically it's not be a serious problem thanks to SYN retries. Clients should reconnect soon. * 3. Adjust backlog size (as this ticket suggests) ** Even disabling SYN cookies work thanks to SYN retries, it will cause certain delay in TCP establishment if reconnect happened. ** So generally it's preferred to adjust backlog size as well, to avoid SYN drops in the first place. I think we should add new config to adjust backlog size, since preferred backlog size depends on environment so it's hard to decide globally-appropriate fixed value. If you don't mind, I'm ready to submit a patch to add a new config to adjust backlog size. What do you think? Thanks, was (Author: ocadaruma): Hi. We operate a Kafka cluster in our company which has 130+ brokers, 1.3M+ total socket server connections and 25K+ partitions. We faced very similar issue to https://issues.apache.org/jira/browse/KAFKA-9211 (i.e. producer slowing down mysteriously with small TCP packets) recently, and we figured out the cause. We hope our knowledge could help. h2. Environment: * Kafka broker version: 2.4.1 (but we suppose the version doesn't matter) * Kafka broker OS: CentOS7 (kernel 3.10.X) h2. Phenomenon: * Restart a broker process, and execute preferred leader election after the broker became in-sync * Some producers's node-request-latency to the broker got insanely higher than usual ** However no such high produce response latency is observed on broker-side metrics * As the result, producer batches couldn't be sent out in sufficient pace, then caused batch expiration h2. Analysis: * We observed TCP SYN cookies metric was increased at incidental time, with following dmesg message: ** {code:java} TCP: request_sock_TCP: Possible SYN flooding on port 22991. Sending cookies. Check SNMP counters.{code} * So we also suspected the phenomenon is due to `wscale` drop (as like described in this issue), but we doubt it because: ** Even with TCP SYN cookies, `wscale` should be available as long as TCP timestamp is enabled. (refs: [https://blog.cloudflare.com/syn-packet-handling-in-the-wild/]) ** In our environment, TCP timestamp is enabled. ** Also, generally, `wscale` (window scaling factor) is used for extending window beyond 65535 (max window size in TCP spec) on large-network round trip environment such as internet *** Our typical produce request size is smaller than that *** So it's hard to imagine that `wscale` drop causes such significant request-latency degradation * After several attempts to reproduce, we found out that receiver (i.e. broker in this context)'s `wscale` is inconsistent between producer and broker at incidental time. ** receiver's `wscale` advertised from broker -> producer along with SYN+ACK packet: 7 *** {code:java} 17:32:05.161213 IP BROKER.HOST.XXXX > CLIENT.YYYY: Flags [S.], seq 29292755, ack 17469019, win 28960, options [mss 1460,sackOK,TS val 25589601 ecr 9055245,nop,wscale 1], length 0{code} *** (seq numbers are substituted with random value) ** actual receiver's `wscale` after established: 1 *** {code:java} [user@BROKER ~]$ /sbin/ss -e -i -t | less ... ESTAB 0 0 BROKER.XXXX CLIENT.YYYY timer:(keepalive,21min,0) uid:503 ino:15143860 sk:ffff9ba25dc4f440 <-> ts sack cubic wscale:7,7 rto:201 rtt:0.179/0.006 ato:140 mss:1448 rcvmss:1448 advmss:1448 cwnd:10 bytes_acked:97054 bytes_received:18950537 segs_out:15094 segs_in:13941 send 647.2Mbps lastsnd:217 lastrcv:17 lastack:217 pacing_rate 1288.0Mbps rcv_rtt:1.875 rcv_space:29200{code} *** `wscale:7,7` means that broker's receiver window scaling factor is 7 * Okay, then this inconsistency could explain the phenomenon as below: ** Premise: When `wscale` is enabled, TCP window size is calculated by `window_size * 2^wscale` ** When broker calculates advertised window size, it's bit-shifted to the right by `wscale` (== 7) *** [https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_output.c#L290] ** On the other hand, producer multiplies advertised window size by 1, which is conveyed through SYN+ACK ** As the result, window size became 64 times smaller than expected ** Then producer splits TCP packets to much smaller size than usual (possibly under MSS) *** TCP acks are delayed due to the conditions are not met ([https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_input.c#L4760]) * Last remaining question is "why such wscale inconsistency happened?" ** Read through the kernel source code, then we found that there's an issue in the logic to calculate wscale if TCP connection is established through syncookies *** It's fixed in this commit: [https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417] h2. Solution: * 1. Upgrade linux kernel to at least v5.10 (which the patch is committed) * 2. Disable SYN cookies ** i.e. setting `net.ipv4.tcp_syncookies` kernel parameter to 0 ** With SYN cookies disabled, some SYN's are dropped on same situations but typically it's not be a serious problem thanks to SYN retries. Clients should reconnect soon. * 3. Adjust backlog size (as this ticket suggests) ** Even disabling SYN cookies work thanks to SYN retries, it will cause certain delay in TCP establishment if reconnect happened. ** So generally it's preferable to adjust backlog size as well, to avoid SYN drops in the first place. I think we should add new config to adjust backlog size, since preferred backlog size depends on environment so it's hard to decide globally-appropriate fixed value. If you don't mind, I'm ready to submit a patch to add a new config to adjust backlog size. What do you think? Thanks, > kafka server should resize backlog when create serversocket > ----------------------------------------------------------- > > Key: KAFKA-9648 > URL: https://issues.apache.org/jira/browse/KAFKA-9648 > Project: Kafka > Issue Type: Improvement > Components: core > Affects Versions: 0.10.0.1 > Reporter: li xiangyuan > Priority: Minor > > I have describe a mystery problem > (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka > server will trigger tcp Congestion Control in some condition. finally we > found the root cause. > when kafka server restart for any reason and then execute preferred replica > leader, lots of replica leader will give back to it & trigger cluster > metadata update. then all clients will establish connection to this server. > at the monment many tcp estable request are waiting in the tcp sync queue , > and then to accept queue. > kafka create serversocket in SocketServer.scala > > {code:java} > serverChannel.socket.bind(socketAddress);{code} > this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) > will decide the queue length.beacues kafka haven't set ,it is default value > 50. > if this queue is full, and tcp_syncookies = 0, then new connection request > will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie > mechanism. this mechanism could allow linux handle more tcp sync request, but > it would lose many tcp external parameter, include "wscale", the one that > allow tcp connection to send much more bytes per tcp package. because > syncookie triggerd, wscale has lost, and this tcp connection will handle > network very slow, forever,until this connection is closed and establish > another tcp connection. > so after a preferred repilca executed, lots of new tcp connection will > establish without set wscale,and many network traffic to this server will > have a very slow speed. > i'm not sure whether new linux version have resolved this problem, but kafka > also should set backlog a larger value. we now have modify this to 512, seems > everything is ok. > -- This message was sent by Atlassian Jira (v8.3.4#803005)