On 11/6/15, 5:39 PM, "Hans Petter Selasky" <h...@selasky.org> wrote:
>On 11/06/15 22:56, Cui, Cheng wrote:
>> On Nov 6, 2015, at 4:16 PM, Hans Petter Selasky <h...@selasky.org> wrote:
>>
>>> On 11/06/15 21:46, Cui, Cheng wrote:
>>>> Hello Hans,
>>>>
>>>> Sorry if my previous email does not reach you because of a bad
>>>> subject.
>>>>
>>>> This is Cheng Cui. I am reading the CURRENT FreeBSD code in
>>>> tcp_output.c, and find this question regarding your change in revision
>>>> 271946.
>>>>
>>>> https://svnweb.freebsd.org/base/head/sys/netinet/tcp_output.c?r1=271946&r2=271945&pathrev=271946
>>>>
>>>> trim data "len" under TSO:
>>>>
>>>>         /*
>>>>          * Prevent the last segment from being
>>>>          * fractional unless the send sockbuf can be
>>>>          * emptied:
>>>>          */
>>>>         max_len = (tp->t_maxopd - optlen);
>>>>         if ((off + len) < sbavail(&so->so_snd)) {    <==
>>>>                 moff = len % max_len;
>>>>                 if (moff != 0) {
>>>>                         len -= moff;
>>>>                         sendalot = 1;
>>>>                 }
>>>>         }
>>>>
>>>> Is there a specific reason that it should skip trimming the data
>>>> "len" under the condition of "(off + len) == sbavail(&so->so_snd)" in
>>>> TSO?
>>>> Because I am wondering if we can trim the data "len" directly without
>>>> checking the "(off + len)" condition.
>>>
>>> Hi Cheng,
>>>
>>> I believe the reason is to avoid looping one more time outputting a
>>> single packet containing the remainder of the available data, with
>>> regard to max_len.
>>>
>>> How did you envision the removal of this check would influence the
>>> generated packet sequence?
>>>
>>> --HPS
>>>
>> Hi Hans,
>>
>> I may be wrong but my assumption is that the remainder of the available
>> data may be larger than one single packet.
>>
>> Suppose max_len==1500, sb_acc==3001, off==2, and (off+len)==3001. In
>> this case, the current code will not trim the "len"
>> and let it go directly to the NIC. I think it skips the Nagle's
>> algorithm.
>> As len==2999, the last packet is 1499,
>> it is supposed to be held until all outstanding data are ACKed, but it
>> has been sent out.
>
>Hi Cheng,
>
>That is correct. Nagle's algorithm is not active when "(off+len) ==
>sb_acc". Anyhow, the check for "(off+len) == sb_acc" does not go away.
>It has to be put before sendalot = 1 to avoid sending the so-called
>"small packet" in the next iteration. Possibly you will need to add a
>check for TCP nodelay being active, which disable Nagle's algorithm.
>Have you done any tests removing this check?
>
>--HPS

Hi Hans,

Sorry for the delay in continuing this discussion. I ran some tests and
collected trace files using iperf and tcpdump. I did not find anything
wrong with Nagle's algorithm. But I found that the remainder chunk of data
can be larger than a single packet, which makes the NIC send an extra
fractional packet when the send buffer size meets a certain condition.

Here is my test. The iperf command I chose pushes 5793 bytes of data into
the 7240-byte write buffer, set with the "-l" and "-w" options. I tested
the performance of this TCP connection on a pair of FreeBSD 10.2 nodes
(s1 and r1) with a switch in between. Both nodes have TSO and delayed ACK
enabled.

root@s1:~ # ping -c 3 r1
PING r1-link1 (10.1.2.3): 56 data bytes
64 bytes from 10.1.2.3: icmp_seq=0 ttl=64 time=0.154 ms
64 bytes from 10.1.2.3: icmp_seq=1 ttl=64 time=0.144 ms
64 bytes from 10.1.2.3: icmp_seq=2 ttl=64 time=0.142 ms

--- r1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.142/0.147/0.154/0.005 ms

root@r1:~ # ping -c 3 s1
PING s1-link1 (10.1.2.2): 56 data bytes
64 bytes from 10.1.2.2: icmp_seq=0 ttl=64 time=0.163 ms
64 bytes from 10.1.2.2: icmp_seq=1 ttl=64 time=0.145 ms
64 bytes from 10.1.2.2: icmp_seq=2 ttl=64 time=0.143 ms

--- s1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.143/0.150/0.163/0.009 ms

iperf -s    <== iperf command@receiver
iperf -c 10.1.2.3 -l 5793 -w 5793 -n 10M -m -f B    <== iperf command@sender

------------------------------------------------------------
Client connecting to 10.1.2.3, TCP port 5001
TCP window size: 7240 Byte (WARNING: requested 5793 Byte)
------------------------------------------------------------
[  3] local 10.1.2.2 port 16338 connected with 10.1.2.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 0.5 sec  10491123 Bytes  22615589 Bytes/sec
[  3] MSS size 1448 bytes (MTU 1500 bytes, ethernet)

I sent 10 MBytes of data, and collected the packet trace from both nodes
by tcpdump. I did this test twice to confirm the result can be reproduced.

From the trace files of both nodes before my code change, I see a lot of
fractional packets. See the attached trace files in
"before_code_change.zip".
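
To illustrate why a 5793-byte write can end in a fractional segment, here
is a minimal sketch of the trim step in isolation. It assumes max_len
equals the 1448-byte MSS that iperf reports, which I have not confirmed
from the kernel state. The names tso_trim, struct trim_result and
keep_check are made up for this sketch and are not kernel code; the rest
of tcp_output() is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct trim_result {
	long len;	/* bytes handed to the NIC in this pass */
	bool sendalot;	/* true if another output pass is needed */
};

/*
 * Model of the quoted TSO trim step only.
 * keep_check == true:  trim only when data remains in the sockbuf
 *                      after this chunk, i.e. (off + len) < sbavail.
 * keep_check == false: always trim to a multiple of max_len
 *                      (the condition commented out).
 */
static struct trim_result
tso_trim(long off, long len, long sbavail, long max_len, bool keep_check)
{
	struct trim_result r = { len, false };

	assert(max_len > 0 && len >= 0);
	if (!keep_check || (off + len) < sbavail) {
		long moff = len % max_len;	/* size of the fractional tail */

		if (moff != 0) {
			r.len -= moff;		/* send only full segments now */
			r.sendalot = true;	/* the tail waits for the next pass */
		}
	}
	return (r);
}
```

Since 5793 == 4 * 1448 + 1, tso_trim(0, 5793, 5793, 1448, true) leaves len
at 5793, so the TSO chunk ends in a 1-byte segment. With keep_check false
the same call returns len == 5792 with sendalot set, and the last byte is
left for the next pass. The earlier example (max_len==1500, off==2,
len==2999, sockbuf 3001) behaves the same way: 2999 untrimmed with the
check, 1500 without it.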

Then I made my code change in the 10.2 source by commenting out the data
trim condition below:

        /*
         * Prevent the last segment from being
         * fractional unless the send sockbuf can be
         * emptied:
         */
        max_len = (tp->t_maxopd - optlen);
//      if ((off + len) < so->so_snd.sb_cc) {
                moff = len % max_len;
                if (moff != 0) {
                        len -= moff;
                        sendalot = 1;
                }
//      }

I ran the same iperf test and gathered trace files. I did not find many
fractional packets this time. See the attached trace files in
"after_code_change.zip".

Comparing the receiver traces, I see the receiver got 7251 packets in each
of the two tests, instead of 9060 packets before the change. That is a
saving of about 20% on the wire.

Comparing the sender traces, I see the sender's TSO handled 2185 packets
and 1839 packets in the two tests, instead of 4498 packets and 4473
packets before the change. That is a saving of more than 40% (roughly 51%
to 59%) in the handling of TSO chunks.

There may be other conditions I did not cover, but I think the current
data trim in TSO can be improved by removing the above condition.

Trace files before/after code change are attached.