Re: SWS for rcvbuf < MTU
From: Alex Sidorenko <[EMAIL PROTECTED]> Date: Mon, 2 Apr 2007 16:01:13 -0400 > our customer has tested your patch thoroughly and everything's working fine. > Please apply the patch if this is not done yet. Thanks for following through on this Alex, I've been waiting patiently for your test results :-)) I'll apply and push the patch now. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
On March 14, 2007 12:18:56 pm Alex Sidorenko wrote: > On March 13, 2007 03:01:50 pm John Heffner wrote: > > Sorry for the long delay in response, I've been on vacation. I'm okay > > with your patch, and I can't think of any real problem with it, except > > that the behavior is non-standard. Then again, Linux acking in general > > is non-standard, which has created the bug in the first place. :) The > > only thing I can think where it might still ack too often is if > > free_space frequently drops just below full_space/2 for a bit then rises > > above full_space/2. > > > > I've also attached a corrected version of my earlier patch that I think > > solves the problem you noted. > > Hello John, > > I think your corrected version should avoid the problem I mentioned, but I > will still ask the customer to test the new version. It will be interesting > to compare traffic patterns with those observed with stock kernel and my > earlier patch. > > It might take some time to do proper testing, I'll inform you as soon as > there are any results. Hello John, our customer has tested your patch thoroughly and everything's working fine. Please apply the patch if this is not done yet. Thanks, Alex -- -- Alexandre Sidorenko email: [EMAIL PROTECTED] Global Solutions Engineering: Unix Networking Hewlett-Packard (Canada) -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
On March 13, 2007 03:01:50 pm John Heffner wrote: > Sorry for the long delay in response, I've been on vacation. I'm okay > with your patch, and I can't think of any real problem with it, except > that the behavior is non-standard. Then again, Linux acking in general > is non-standard, which has created the bug in the first place. :) The > only thing I can think where it might still ack too often is if > free_space frequently drops just below full_space/2 for a bit then rises > above full_space/2. > > I've also attached a corrected version of my earlier patch that I think > solves the problem you noted. Hello John, I think your corrected version should avoid the problem I mentioned, but I will still ask the customer to test the new version. It will be interesting to compare traffic patterns with those observed with stock kernel and my earlier patch. It might take some time to do proper testing, I'll inform you as soon as there are any results. Thanks, Alex -- -- Alexandre Sidorenko email: [EMAIL PROTECTED] Global Solutions Engineering: Unix Networking Hewlett-Packard (Canada) -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
Alex Sidorenko wrote: Here are the values from live kernel (obtained with 'crash') when the host was in SWS state: full_space=708 full_space/2=354 free_space=393 window=76 In this case the test from my original fix, (window < full_space/2), succeeds. But John's test free_space > window + full_space/2 393 430 does not. So I suspect that the new fix will not always work. From tcpdump traces we can see that both hosts exchange with 76-byte packets for a long time. From customer's application log we see that it continues to read 76-byte chunks per each read() call - even though more than that is available in the receive buffer. Technically it's OK for read() to return even after reading one byte, so if sk->receive_queue contains multiple 76-byte skbuffs we may return after processing just one skbuff (but we we don't understand the details of why this happens on customer's system). Are there any particular reasons why you want to postpone window update until free_space becomes > window + full_space/2 and not as soon as free_space > full_space/2? As the only real-life occurance of SWS shows free_space oscillating slightly above full_space/2, I created the fix specifically to match this phenomena as seen on customer's host. We reach the modified section only when (free_space > full_space/2) so it should be OK to update the window at this point if mss==full_space. So yes, we can test John's fix on customer's host but I doubt it will work for the reasons mentioned above, in brief: 'window = free_space' instead of 'window=full_space/2' is OK, but the test 'free_space > window + full_space/2' is not for the specific pattern customer sees on his hosts. Sorry for the long delay in response, I've been on vacation. I'm okay with your patch, and I can't think of any real problem with it, except that the behavior is non-standard. Then again, Linux acking in general is non-standard, which has created the bug in the first place. :) The only thing I can think where it might still ack too often is if free_space frequently drops just below full_space/2 for a bit then rises above full_space/2. I've also attached a corrected version of my earlier patch that I think solves the problem you noted. Thanks, -John Do full receiver-side SWS avoidance when rcvbuf < mss. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- commit f4333661026621e15549fb75b37be785e4a1c443 tree 30d46b64ea19634875fdd4656d33f76db526a313 parent 562aa1d4c6a874373f9a48ac184f662fbbb06a04 author John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400 committer John Heffner <[EMAIL PROTECTED]> Tue, 13 Mar 2007 14:17:03 -0400 net/ipv4/tcp_output.c |9 - 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index dc15113..e621a63 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1605,8 +1605,15 @@ u32 __tcp_select_window(struct sock *sk) * We also don't do any window rounding when the free space * is too small. */ - if (window <= free_space - mss || window > free_space) + if (window <= free_space - mss || window > free_space) { window = (free_space/mss)*mss; + } else if (mss == full_space) { + /* Do full receive-side SWS avoidance +* when rcvbuf <= mss */ + window = tcp_receive_window(tp); + if (free_space > window + full_space/2) + window = free_space; + } } return window;
Re: SWS for rcvbuf < MTU
On March 3, 2007 06:40:12 pm John Heffner wrote: > David Miller wrote: > > From: John Heffner <[EMAIL PROTECTED]> > > Date: Fri, 02 Mar 2007 16:16:39 -0500 > > > >> Please don't apply the patch I sent. I've been thinking about this a > >> bit harder, and it may not fix this particular problem. (Hard to say > >> without knowing exactly what it is.) As the comment above > >> __tcp_select_window() states, we do not do full receive-side SWS > >> avoidance because of header prediction. > >> > >> Alex, you're right I missed that special zero-window case. I'm still > >> not quite sure I'm completely happy with this patch. I'd like to think > >> about this a little bit harder... > > > > Ok > > Alright, I've thought about it a bit more, and I think the patch I sent > should work. Alex, any opinion? Any way you can test this out? Here are the values from live kernel (obtained with 'crash') when the host was in SWS state: full_space=708 full_space/2=354 free_space=393 window=76 In this case the test from my original fix, (window < full_space/2), succeeds. But John's test free_space > window + full_space/2 393 430 does not. So I suspect that the new fix will not always work. From tcpdump traces we can see that both hosts exchange with 76-byte packets for a long time. From customer's application log we see that it continues to read 76-byte chunks per each read() call - even though more than that is available in the receive buffer. Technically it's OK for read() to return even after reading one byte, so if sk->receive_queue contains multiple 76-byte skbuffs we may return after processing just one skbuff (but we we don't understand the details of why this happens on customer's system). Are there any particular reasons why you want to postpone window update until free_space becomes > window + full_space/2 and not as soon as free_space > full_space/2? As the only real-life occurance of SWS shows free_space oscillating slightly above full_space/2, I created the fix specifically to match this phenomena as seen on customer's host. We reach the modified section only when (free_space > full_space/2) so it should be OK to update the window at this point if mss==full_space. So yes, we can test John's fix on customer's host but I doubt it will work for the reasons mentioned above, in brief: 'window = free_space' instead of 'window=full_space/2' is OK, but the test 'free_space > window + full_space/2' is not for the specific pattern customer sees on his hosts. Thanks, Alex -- -- Alexandre Sidorenko email: [EMAIL PROTECTED] Global Solutions Engineering: Unix Networking Hewlett-Packard (Canada) -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
David Miller wrote: From: John Heffner <[EMAIL PROTECTED]> Date: Fri, 02 Mar 2007 16:16:39 -0500 Please don't apply the patch I sent. I've been thinking about this a bit harder, and it may not fix this particular problem. (Hard to say without knowing exactly what it is.) As the comment above __tcp_select_window() states, we do not do full receive-side SWS avoidance because of header prediction. Alex, you're right I missed that special zero-window case. I'm still not quite sure I'm completely happy with this patch. I'd like to think about this a little bit harder... Ok Alright, I've thought about it a bit more, and I think the patch I sent should work. Alex, any opinion? Any way you can test this out? Thanks, -John - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
From: John Heffner <[EMAIL PROTECTED]> Date: Fri, 02 Mar 2007 16:16:39 -0500 > Please don't apply the patch I sent. I've been thinking about this a > bit harder, and it may not fix this particular problem. (Hard to say > without knowing exactly what it is.) As the comment above > __tcp_select_window() states, we do not do full receive-side SWS > avoidance because of header prediction. > > Alex, you're right I missed that special zero-window case. I'm still > not quite sure I'm completely happy with this patch. I'd like to think > about this a little bit harder... Ok - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
David Miller wrote: From: Alex Sidorenko <[EMAIL PROTECTED]> Date: Fri, 2 Mar 2007 15:21:58 -0500 they told us that they use small rcvbuf to throttle bandwidth for this application. I explained it would be better to use TC for this purpose. They agreed and will probably redesign their application in the future, but they cannot do it right now. For the same reason they have to use the old 2.4.20 for a while - in big companies the important production software cannot be changed quickly. The fix I suggested is trivial and should have no impact the case of rcvfbuf>mtu, so I think it makes sense to include it in upstream kernel. I have no objection to the fix, especially John's version. I was just curious about the app, thanks for the info :) Please don't apply the patch I sent. I've been thinking about this a bit harder, and it may not fix this particular problem. (Hard to say without knowing exactly what it is.) As the comment above __tcp_select_window() states, we do not do full receive-side SWS avoidance because of header prediction. Alex, you're right I missed that special zero-window case. I'm still not quite sure I'm completely happy with this patch. I'd like to think about this a little bit harder... Thanks, -John - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
From: Alex Sidorenko <[EMAIL PROTECTED]> Date: Fri, 2 Mar 2007 15:21:58 -0500 > they told us that they use small rcvbuf to throttle bandwidth for this > application. I explained it would be better to use TC for this purpose. They > agreed and will probably redesign their application in the future, but they > cannot do it right now. For the same reason they have to use the old 2.4.20 > for a while - in big companies the important production software cannot be > changed quickly. > > The fix I suggested is trivial and should have no impact the case of > rcvfbuf>mtu, so I think it makes sense to include it in upstream kernel. I have no objection to the fix, especially John's version. I was just curious about the app, thanks for the info :) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
On March 2, 2007 01:54:45 pm John Heffner wrote: > Alex Sidorenko wrote: > [snip] > > > --- net/ipv4/tcp_output.c.orig Wed May 3 20:40:43 2006 > > +++ net/ipv4/tcp_output.c Tue Jan 30 14:24:56 2007 > > @@ -641,6 +641,7 @@ > > * Note, we don't "adjust" for TIMESTAMP or SACK option bytes. > > * Regular options like TIMESTAMP are taken into account. > > */ > > +static const char *SWS_id_string="@#SWS-fix-2"; > > u32 __tcp_select_window(struct sock *sk) > > { > > struct tcp_opt *tp = &sk->tp_pinfo.af_tcp; > > @@ -682,6 +683,9 @@ > > window = tp->rcv_wnd; > > if (window <= free_space - mss || window > free_space) > > window = (free_space/mss)*mss; > > +/* A fix for small rcvbuf [EMAIL PROTECTED] */ > > + else if (mss == full_space && window < full_space/2) > > + window = full_space/2; > > > > return window; > > } > > Good analysis of the problem, but the patch does not look quite right. > In particular, you can't ever announce a zero window. :) Hi John, in case when (free_space < full_space/2) we do not reach the modified code and we will return zero: if (free_space < full_space/2) { icsk->icsk_ack.quick = 0; if (tcp_memory_pressure) tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U*tp->advmss); if (free_space < mss) return 0; } Here is how windows look with the fixed kernel (from customer's test): 20:59:45.320758 Node1.logical.40171 > 11.0.0.1.39909: win = 708 20:59:45.322758 Node1.logical.40171 > 11.0.0.1.39909: win = 288 20:59:45.714567 Node1.logical.40171 > 11.0.0.1.39909: win = 354 20:59:45.717110 Node1.logical.40171 > 11.0.0.1.39909: win = 0 20:59:45.719110 Node1.logical.40171 > 11.0.0.1.39909: win = 708 ... Regards, Alex > I think this attached patch does the correct SWS avoidance. > > Thanks, >-John -- -- Alexandre Sidorenko email: [EMAIL PROTECTED] Global Solutions Engineering: Unix Networking Hewlett-Packard (Canada) -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
On March 2, 2007 02:25:42 pm David Miller wrote: > From: Alex Sidorenko <[EMAIL PROTECTED]> > Date: Fri, 2 Mar 2007 11:28:28 -0500 > > > Customer has confirmed that this resolves the problem and decreases > > CPU usage by his custom application - even when there is no SWS. > > There is rarely ever a reason to set explicit socket receive > buffer sizes, since the kernel dynamically sizes them based > upon how the connection is used. > > Why do they set it so low? > > It is just as easy to fix their performance bug by simply removing > SO_RCVBUF setting in the application. Hi David, they told us that they use small rcvbuf to throttle bandwidth for this application. I explained it would be better to use TC for this purpose. They agreed and will probably redesign their application in the future, but they cannot do it right now. For the same reason they have to use the old 2.4.20 for a while - in big companies the important production software cannot be changed quickly. The fix I suggested is trivial and should have no impact the case of rcvfbuf>mtu, so I think it makes sense to include it in upstream kernel. Regards, Alex -- -- Alexandre Sidorenko email: [EMAIL PROTECTED] Global Solutions Engineering: Unix Networking Hewlett-Packard (Canada) -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
From: Alex Sidorenko <[EMAIL PROTECTED]> Date: Fri, 2 Mar 2007 11:28:28 -0500 > Customer has confirmed that this resolves the problem and decreases > CPU usage by his custom application - even when there is no SWS. There is rarely ever a reason to set explicit socket receive buffer sizes, since the kernel dynamically sizes them based upon how the connection is used. Why do they set it so low? It is just as easy to fix their performance bug by simply removing SO_RCVBUF setting in the application. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SWS for rcvbuf < MTU
Alex Sidorenko wrote: [snip] --- net/ipv4/tcp_output.c.orig Wed May 3 20:40:43 2006 +++ net/ipv4/tcp_output.c Tue Jan 30 14:24:56 2007 @@ -641,6 +641,7 @@ * Note, we don't "adjust" for TIMESTAMP or SACK option bytes. * Regular options like TIMESTAMP are taken into account. */ +static const char *SWS_id_string="@#SWS-fix-2"; u32 __tcp_select_window(struct sock *sk) { struct tcp_opt *tp = &sk->tp_pinfo.af_tcp; @@ -682,6 +683,9 @@ window = tp->rcv_wnd; if (window <= free_space - mss || window > free_space) window = (free_space/mss)*mss; +/* A fix for small rcvbuf [EMAIL PROTECTED] */ + else if (mss == full_space && window < full_space/2) + window = full_space/2; return window; } Good analysis of the problem, but the patch does not look quite right. In particular, you can't ever announce a zero window. :) I think this attached patch does the correct SWS avoidance. Thanks, -John Do receiver-side SWS avoidance for rcvbuf < MSS. Signed-off-by: John Heffner <[EMAIL PROTECTED]> --- commit 38d33181c93a28cf7fb2f9f3377305a04636c054 tree 503f8a9de6e78694bae9fc2eb1c9dd5d26a0b5ed parent 562aa1d4c6a874373f9a48ac184f662fbbb06a04 author John Heffner <[EMAIL PROTECTED]> Fri, 02 Mar 2007 13:47:44 -0500 committer John Heffner <[EMAIL PROTECTED]> Fri, 02 Mar 2007 13:47:44 -0500 net/ipv4/tcp_output.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index dc15113..688b955 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1607,6 +1607,9 @@ u32 __tcp_select_window(struct sock *sk) */ if (window <= free_space - mss || window > free_space) window = (free_space/mss)*mss; + else if (mss == full_space && +free_space > window + full_space/2) + window = free_space; } return window;
SWS for rcvbuf < MTU
Hello, this is a rare corner case met by one of HP partners on 2.4.20 on IA64. Inspecting the sources of the latest 2.6.20.1 (net/ipv4/tcp_output.c) we can see that the bug is still there. Here is a description of the bug and the suggested fix. The problem occurs when the remote host (not necessarily Linux - in our case it was Solaris) does not implement SWS avoidance on sender side. If Linux connection socket has rcvbuf mtu. But if we use small rcvbuf (set by SO_RCVBUF), we can go into SWS mode. Let us for simplicity look only at the case when we don't have WS enabled. If we have free_space above full_space/2, we reach the following section: /* Don't do rounding if we are using window scaling, since the * scaled window will not line up with the MSS boundary anyway. */ window = tp->rcv_wnd; if (tp->rx_opt.rcv_wscale) { } else { /* Get the largest window that is a nice multiple of mss. * Window clamp already applied above. * If our current window offering is within 1 mss of the * free space we just keep it. This prevents the divide * and multiply from happening most of the time. * We also don't do any window rounding when the free space * is too small. */ (1) if (window <= free_space - mss || window > free_space) window = (free_space/mss)*mss; } return window; What happens if we have a small tp->rcv_wnd and rcvbuf <= mss? In this case condition (1) is almost always false and as a result we'll return unmodified 'window' set to tp->rcv_wnd. If tp->rcv_wnd is small, it can be reused over and over again. For the case rcvbuf <= mss __tcp_select_window() returns: 0 if we have free_space < full_space/2OK mss if rcvbuf is empty OK tp->rcv_wnd in other case Bad If there is no SWS avoidance on sender side, we can see Linux advertising the same small rcv_wnd over and over again. The problem here is that we never advertise one-half the receiver's buffer space as described e.g. in "TCP/IP Illustrated" by Stevens (v.1, Chapter 22.3): "The normal algorithm is for the receiver not to advertise a larger window than it is currently advertising (which can be 0) until the window can be increased by either one full-sized segment (i.e. the MSS being received) or by one-half the receiver's buffer space, whichever is smaller" ^^ The fix. We have not been able to reproduce the problem inside HP as it is unclear what conditions are needed to bring system into SWS mode (this needs very special event timing). HP customer was seeing it every 2-3 days while running a custom application (Solaris<->Linux) that was running with low priority on a busy host running other custom applications with SCHED_RR. After going into SWS mode, his application stayed in it until restarted. We provided to customer a fix for 2.4.20 only (used by customer in production) by adding another test and returning rcvbuf/2 when needed: --- net/ipv4/tcp_output.c.orig Wed May 3 20:40:43 2006 +++ net/ipv4/tcp_output.c Tue Jan 30 14:24:56 2007 @@ -641,6 +641,7 @@ * Note, we don't "adjust" for TIMESTAMP or SACK option bytes. * Regular options like TIMESTAMP are taken into account. */ +static const char *SWS_id_string="@#SWS-fix-2"; u32 __tcp_select_window(struct sock *sk) { struct tcp_opt *tp = &sk->tp_pinfo.af_tcp; @@ -682,6 +683,9 @@ window = tp->rcv_wnd; if (window <= free_space - mss || window > free_space) window = (free_space/mss)*mss; +/* A fix for small rcvbuf [EMAIL PROTECTED] */ + else if (mss == full_space && window < full_space/2) + window = full_space/2; return window; } Customer has confirmed that this resolves the problem and decreases CPU usage by his custom application - even when there is no SWS. This is a rare corner case and most users will never meet it. But as the fix is trivial, I think it makes sense to include it in upstream sources. Regards, Alex -- -- Alexandre Sidorenko email: [EMAIL PROTECTED] Global Solutions Engineering: Unix Networking Hewlett-Packard (Canada) -- - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html