[jira] [Comment Edited] (KAFKA-9648) kafka server should resize backlog when create serversocket

Haruki Okada (Jira) Mon, 12 Jul 2021 19:06:05 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379515#comment-17379515
 ]


Haruki Okada edited comment on KAFKA-9648 at 7/13/21, 2:05 AM:
---------------------------------------------------------------

Hi.

We operate a Kafka cluster in our company which has 130+ brokers, 1.3M+ total 
socket server connections and 25K+ partitions.
 We faced very similar issue to 
https://issues.apache.org/jira/browse/KAFKA-9211 (i.e. producer slowing down 
mysteriously with small TCP packets) recently, and we figured out the cause.
 We hope our knowledge could help.
h2. Environment:
 * Kafka broker version: 2.4.1 (but we suppose the version doesn't matter)
 * Kafka broker OS: CentOS7 (kernel 3.10.X)

h2. Phenomenon:
 * Restart a broker process, and execute preferred leader election after the 
broker became in-sync
 * Some producers's node-request-latency to the broker got insanely higher than 
usual
 ** However no such high produce response latency is observed on broker-side 
metrics
 * As the result, producer batches couldn't be sent out in sufficient pace, 
then caused batch expiration

h2. Analysis:
 * We observed TCP SYN cookies metric was increased at incidental time, with 
following dmesg message:
 ** 
{code:java}
TCP: request_sock_TCP: Possible SYN flooding on port 22991. Sending cookies.  
Check SNMP counters.{code}

 * So we also suspected the phenomenon is due to `wscale` drop (as like 
described in this issue), but we doubt it because:
 ** Even with TCP SYN cookies, `wscale` should be available as long as TCP 
timestamp is enabled. (refs: 
[https://blog.cloudflare.com/syn-packet-handling-in-the-wild/])
 ** In our environment, TCP timestamp is enabled.
 ** Also, generally, `wscale` (window scaling factor) is used for extending 
window beyond 65535 (max window size in TCP spec) on large-network round trip 
environment such as internet
 *** Our typical produce request size is smaller than that
 *** So it's hard to imagine that `wscale` drop causes such significant 
request-latency degradation
 * After several attempts to reproduce, we found out that receiver (i.e. broker 
in this context)'s `wscale` is inconsistent between producer and broker at 
incidental time.
 ** receiver's `wscale` advertised from broker -> producer along with SYN+ACK 
packet: 7
 *** 
{code:java}
17:32:05.161213 IP BROKER.HOST.XXXX > CLIENT.YYYY: Flags [S.], seq 29292755, 
ack 17469019, win 28960, options [mss 1460,sackOK,TS val 25589601 ecr 
9055245,nop,wscale 1], length 0{code}

 * 
 ** 
 *** (seq numbers are substituted with random value)
 ** actual receiver's `wscale` after established: 1
 *** 
{code:java}
[user@BROKER ~]$ /sbin/ss -e -i -t | less
...
ESTAB      0      0      BROKER.XXXX                CLIENT.YYYY                 
timer:(keepalive,21min,0) uid:503 ino:15143860 sk:ffff9ba25dc4f440 <->         
ts sack cubic wscale:7,7 rto:201 rtt:0.179/0.006 ato:140 mss:1448 rcvmss:1448 
advmss:1448 cwnd:10 bytes_acked:97054 bytes_received:18950537 segs_out:15094 
segs_in:13941 send 647.2Mbps lastsnd:217 lastrcv:17 lastack:217 pacing_rate 
1288.0Mbps rcv_rtt:1.875 rcv_space:29200{code}

 * 
 ** 
 *** `wscale:7,7` means that broker's receiver window scaling factor is 7
 * Okay, then this inconsistency could explain the phenomenon as below:
 ** Premise: When `wscale` is enabled, TCP window size is calculated by 
`window_size * 2^wscale`
 ** When broker calculates advertised window size, it's bit-shifted to the 
right by `wscale` (== 7)
 *** [https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_output.c#L290]
 ** On the other hand, producer multiplies advertised window size by 1, which 
is conveyed through SYN+ACK
 ** As the result, window size became 64 times smaller than expected
 ** Then producer splits TCP packets to much smaller size than usual (possibly 
under MSS)
 *** TCP acks are delayed due to the conditions are not met 
([https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_input.c#L4760])
 * Last remaining question is "why such wscale inconsistency happened?"
 ** Read through the kernel source code, then we found that there's an issue in 
the logic to calculate wscale if TCP connection is established through 
syncookies
 *** It's fixed in this commit: 
[https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417]

h2. Solution:
 * 1. Upgrade linux kernel to at least v5.10 (which the patch is committed)
 * 2. Disable SYN cookies
 ** i.e. setting `net.ipv4.tcp_syncookies` kernel parameter to 0
 ** With SYN cookies disabled, some SYN's are dropped on same situations but 
typically it's not be a serious problem thanks to SYN retries. Clients should 
reconnect soon.
 * 3. Adjust backlog size (as this ticket suggests)
 ** Even disabling SYN cookies work thanks to SYN retries, it will cause 
certain delay in TCP establishment if reconnect happened.
 ** So generally it's preferred to adjust backlog size as well, to avoid SYN 
drops in the first place.

 

I think we should add new config to adjust backlog size, since preferred 
backlog size depends on environment so it's hard to decide globally-appropriate 
fixed value.

If you don't mind, I'm ready to submit a patch to add a new config to adjust 
backlog size.

What do you think?

 

Thanks,


was (Author: ocadaruma):
Hi.


 We operate a Kafka cluster in our company which has 130+ brokers, 1.3M+ total 
socket server connections and 25K+ partitions.
 We faced very similar issue to 
https://issues.apache.org/jira/browse/KAFKA-9211 (i.e. producer slowing down 
mysteriously with small TCP packets) recently, and we figured out the cause.
 We hope our knowledge could help.
h2. Environment:
 * Kafka broker version: 2.4.1 (but we suppose the version doesn't matter)
 * Kafka broker OS: CentOS7 (kernel 3.10.X)

h2. Phenomenon:
 * Restart a broker process, and execute preferred leader election after the 
broker became in-sync
 * Some producers's node-request-latency to the broker got insanely higher than 
usual
 ** However no such high produce response latency is observed on broker-side 
metrics
 * As the result, producer batches couldn't be sent out in sufficient pace, 
then caused batch expiration

h2. Analysis:
 * We observed TCP SYN cookies metric was increased at incidental time, with 
following dmesg message:
 ** 
{code:java}
TCP: request_sock_TCP: Possible SYN flooding on port 22991. Sending cookies.  
Check SNMP counters.{code}

 * So we also suspected the phenomenon is due to `wscale` drop (as like 
described in this issue), but we doubt it because:
 ** Even with TCP SYN cookies, `wscale` should be available as long as TCP 
timestamp is enabled. (refs: 
[https://blog.cloudflare.com/syn-packet-handling-in-the-wild/])
 ** In our environment, TCP timestamp is enabled.
 ** Also, generally, `wscale` (window scaling factor) is used for extending 
window beyond 65535 (max window size in TCP spec) on large-network round trip 
environment such as internet
 *** Our typical produce request size is smaller than that
 *** So it's hard to imagine that `wscale` drop causes such significant 
request-latency degradation
 * After several attempts to reproduce, we found out that receiver (i.e. broker 
in this context)'s `wscale` is inconsistent between producer and broker at 
incidental time.
 ** receiver's `wscale` advertised from broker -> producer along with SYN+ACK 
packet: 7
 *** 
{code:java}
17:32:05.161213 IP BROKER.HOST.XXXX > CLIENT.YYYY: Flags [S.], seq 29292755, 
ack 17469019, win 28960, options [mss 1460,sackOK,TS val 25589601 ecr 
9055245,nop,wscale 1], length 0{code}

 *** (seq numbers are substituted with random value)
 ** actual receiver's `wscale` after established: 1
 *** 
{code:java}
[user@BROKER ~]$ /sbin/ss -e -i -t | less
...
ESTAB      0      0      BROKER.XXXX                CLIENT.YYYY                 
timer:(keepalive,21min,0) uid:503 ino:15143860 sk:ffff9ba25dc4f440 <->         
ts sack cubic wscale:7,7 rto:201 rtt:0.179/0.006 ato:140 mss:1448 rcvmss:1448 
advmss:1448 cwnd:10 bytes_acked:97054 bytes_received:18950537 segs_out:15094 
segs_in:13941 send 647.2Mbps lastsnd:217 lastrcv:17 lastack:217 pacing_rate 
1288.0Mbps rcv_rtt:1.875 rcv_space:29200{code}

 *** `wscale:7,7` means that broker's receiver window scaling factor is 7
 * Okay, then this inconsistency could explain the phenomenon as below:
 ** Premise: When `wscale` is enabled, TCP window size is calculated by 
`window_size * 2^wscale`
 ** When broker calculates advertised window size, it's bit-shifted to the 
right by `wscale` (== 7)
 *** [https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_output.c#L290]
 ** On the other hand, producer multiplies advertised window size by 1, which 
is conveyed through SYN+ACK
 ** As the result, window size became 64 times smaller than expected
 ** Then producer splits TCP packets to much smaller size than usual (possibly 
under MSS)
 *** TCP acks are delayed due to the conditions are not met 
([https://github.com/torvalds/linux/blob/v3.10/net/ipv4/tcp_input.c#L4760])
 * Last remaining question is "why such wscale inconsistency happened?"
 ** Read through the kernel source code, then we found that there's an issue in 
the logic to calculate wscale if TCP connection is established through 
syncookies
 *** It's fixed in this commit: 
[https://github.com/torvalds/linux/commit/909172a149749242990a6e64cb55d55460d4e417]

h2. Solution:
 * 1. Upgrade linux kernel to at least v5.10 (which the patch is committed)
 * 2. Disable SYN cookies
 ** i.e. setting `net.ipv4.tcp_syncookies` kernel parameter to 0
 ** With SYN cookies disabled, some SYN's are dropped on same situations but 
typically it's not be a serious problem thanks to SYN retries. Clients should 
reconnect soon.
 * 3. Adjust backlog size (as this ticket suggests)
 ** Even disabling SYN cookies work thanks to SYN retries, it will cause 
certain delay in TCP establishment if reconnect happened.
 ** So generally it's preferable to adjust backlog size as well, to avoid SYN 
drops in the first place.

 

I think we should add new config to adjust backlog size, since preferred 
backlog size depends on environment so it's hard to decide globally-appropriate 
fixed value.

If you don't mind, I'm ready to submit a patch to add a new config to adjust 
backlog size.

What do you think?

 

Thanks,

> kafka server should resize backlog when create serversocket
> -----------------------------------------------------------
>
>                 Key: KAFKA-9648
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9648
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.10.0.1
>            Reporter: li xiangyuan
>            Priority: Minor
>
> I have describe a mystery problem 
> (https://issues.apache.org/jira/browse/KAFKA-9211). This issue I found kafka 
> server will trigger tcp Congestion Control in some condition. finally we 
> found the root cause.
> when kafka server restart for any reason and then execute preferred replica 
> leader, lots of replica leader will give back to it & trigger cluster 
> metadata update. then all clients will establish connection to this server. 
> at the monment many tcp estable request are waiting in the tcp sync queue , 
> and then to accept queue. 
> kafka create serversocket in SocketServer.scala 
>  
> {code:java}
> serverChannel.socket.bind(socketAddress);{code}
> this method has second parameter "backlog", min(backlog,tcp_max_syn_backlog) 
> will decide the queue length.beacues kafka haven't set ,it is default value 
> 50.
> if this queue is full, and tcp_syncookies = 0, then new connection request 
> will be rejected. If tcp_syncookies=1, it will trigger the tcp synccookie 
> mechanism. this mechanism could allow linux handle more tcp sync request, but 
> it would lose many tcp external parameter, include "wscale", the one that 
> allow tcp connection to send much more bytes per tcp package. because 
> syncookie triggerd, wscale has lost, and this tcp connection will handle 
> network very slow, forever,until this connection is closed and establish 
> another tcp connection.
> so after a preferred repilca executed, lots of new tcp connection will 
> establish without set wscale,and many network traffic to this server will 
> have a very slow speed.
> i'm not sure whether new linux version have resolved this problem, but kafka 
> also should set backlog a larger value. we now have modify this to 512, seems 
> everything is ok.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (KAFKA-9648) kafka server should resize backlog when create serversocket

Reply via email to