Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Sun, 2018-02-18 at 22:49 +0100, Oleksandr Natalenko wrote:
> Hi.
>
> On neděle 18. února 2018 22:04:27 CET Eric Dumazet wrote:
> > I was able to take a look today, and I believe this is the time to
> > switch TCP to GSO being always on.
> >
> > As a bonus, we get a speed boost for cubic as well.
> >
> > Today's high-BDP links and recent TCP improvements (rtx queue as
> > rb-tree, SACK coalescing, TCP pacing...) were all developed/tested/
> > maintained with GSO/TSO being the norm.
> >
> > Can you please test the following patch?
>
> Yes, results below:
>
> BBR+fq:
> sg on:  6.02 Gbits/sec
> sg off: 1.33 Gbits/sec
>
> BBR+pfifo_fast:
> sg on:  4.13 Gbits/sec
> sg off: 1.34 Gbits/sec
>
> BBR+fq_codel:
> sg on:  4.16 Gbits/sec
> sg off: 1.35 Gbits/sec
>
> Reno+fq:
> sg on:  6.44 Gbits/sec
> sg off: 1.39 Gbits/sec
>
> Reno+pfifo_fast:
> sg on:  6.36 Gbits/sec
> sg off: 1.39 Gbits/sec
>
> Reno+fq_codel:
> sg on:  6.41 Gbits/sec
> sg off: 1.38 Gbits/sec
>
> While BBR still suffers when fq is not used, disabling sg no longer
> brings a drastic throughput drop. So, looks good to me, eh?

Indeed :)

Here are my results on a 40Gbit link (mlx4):

BBR+fq:
sg on:  26 Gbits/sec
sg off: 15.7 Gbits/sec (was 2.3 Gbit before the patch)

BBR+pfifo_fast:
sg on:  24.2 Gbits/sec
sg off: 14.9 Gbits/sec (was 0.66 Gbit before the patch !!!)

BBR+fq_codel:
sg on:  24.4 Gbits/sec
sg off: 15 Gbits/sec (was 0.66 Gbit before the patch !!!)

Reno+fq:
sg on:  20 Gbits/sec
sg off: 15.7 Gbits/sec (was 6 Gbit)

Reno+pfifo_fast:
sg on:  25.7 Gbits/sec
sg off: 15.5 Gbits/sec (was 7 Gbit)

Reno+fq_codel:
sg on:  25.8 Gbits/sec
sg off: 16 Gbits/sec (was 7 Gbit)

Definitely worth it ;)

Thanks !
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On neděle 18. února 2018 22:04:27 CET Eric Dumazet wrote:
> I was able to take a look today, and I believe this is the time to
> switch TCP to GSO being always on.
>
> As a bonus, we get a speed boost for cubic as well.
>
> Today's high-BDP links and recent TCP improvements (rtx queue as rb-tree,
> SACK coalescing, TCP pacing...) were all developed/tested/maintained with
> GSO/TSO being the norm.
>
> Can you please test the following patch?

Yes, results below:

BBR+fq:
sg on:  6.02 Gbits/sec
sg off: 1.33 Gbits/sec

BBR+pfifo_fast:
sg on:  4.13 Gbits/sec
sg off: 1.34 Gbits/sec

BBR+fq_codel:
sg on:  4.16 Gbits/sec
sg off: 1.35 Gbits/sec

Reno+fq:
sg on:  6.44 Gbits/sec
sg off: 1.39 Gbits/sec

Reno+pfifo_fast:
sg on:  6.36 Gbits/sec
sg off: 1.39 Gbits/sec

Reno+fq_codel:
sg on:  6.41 Gbits/sec
sg off: 1.38 Gbits/sec

While BBR still suffers when fq is not used, disabling sg no longer brings
a drastic throughput drop. So, looks good to me, eh?

> Note that some cleanups can be done later in the TCP stack, removing lots
> of legacy stuff.
>
> Also, TCP internal pacing could eventually benefit from something similar
> to this fq patch, although there is no hurry.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77

Feel free to ping me if you have something else to test then ;).

> Of course, you have to consider why SG was disabled on your device;
> this looks very pessimistic.

Dunno why that happens, but I've managed to just enable it automatically on
interface up.

Thanks.

Oleksandr
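For anyone wanting to do the same, one common way to flip a NIC feature
automatically on interface up is a udev rule. The rule below is only an
illustrative guess, not something confirmed in the thread; the interface
name matches Oleksandr's earlier mails and the ethtool path assumes an
Arch-like layout:

# cat /etc/udev/rules.d/70-enable-sg.rules
ACTION=="add", SUBSYSTEM=="net", KERNEL=="enp3s0", \
    RUN+="/usr/bin/ethtool -K enp3s0 sg on"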
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Sun, 2018-02-18 at 13:04 -0800, Eric Dumazet wrote:
>
> Can you please test the following patch?
>
> Note that some cleanups can be done later in the TCP stack, removing lots
> of legacy stuff.
>
> Also, TCP internal pacing could eventually benefit from something similar
> to this fq patch, although there is no hurry.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77
>
> Of course, you have to consider why SG was disabled on your device;
> this looks very pessimistic.
>
> Thanks !
>
>  include/net/sock.h  | 1 +
>  net/core/sock.c     | 2 +-
>  net/ipv4/tcp_ipv4.c | 1 +
>  net/ipv6/tcp_ipv6.c | 1 +
>  4 files changed, 4 insertions(+), 1 deletion(-)

Also note that the patch only deals with active connections. My official
patch will also take care of passive ones, of course.
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Sat, 2018-02-17 at 10:52 -0800, Eric Dumazet wrote:
>
> This must be some race condition in the code I added in TCP for
> self-pacing, when a short timeout is programmed.
>
> Disabling SG means TCP cooks 1-MSS packets.
>
> I will take a look, probably after the (long) week-end: Tuesday.

I was able to take a look today, and I believe this is the time to switch
TCP to GSO being always on.

As a bonus, we get a speed boost for cubic as well.

Today's high-BDP links and recent TCP improvements (rtx queue as rb-tree,
SACK coalescing, TCP pacing...) were all developed/tested/maintained with
GSO/TSO being the norm.

Can you please test the following patch?

Note that some cleanups can be done later in the TCP stack, removing lots
of legacy stuff.

Also, TCP internal pacing could eventually benefit from something similar
to this fq patch, although there is no hurry.
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=fefa569a9d4bc4b7758c0fddd75bb0382c95da77

Of course, you have to consider why SG was disabled on your device; this
looks very pessimistic.

Thanks !

 include/net/sock.h  | 1 +
 net/core/sock.c     | 2 +-
 net/ipv4/tcp_ipv4.c | 1 +
 net/ipv6/tcp_ipv6.c | 1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 169c92afcafa3d548f8238e91606b87c187559f4..df4ac691870ff9f779f1782ded58140eb4d3a961 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -417,6 +417,7 @@ struct sock {
 	struct page_frag	sk_frag;
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
+	netdev_features_t	sk_route_forced_caps;
 	int			sk_gso_type;
 	unsigned int		sk_gso_max_size;
 	gfp_t			sk_allocation;
diff --git a/net/core/sock.c b/net/core/sock.c
index c501499a04fe973e80e18655b306d762d348ff44..b084acb3b3b96791663b731788a392041148416c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1773,7 +1773,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
 	u32 max_segs = 1;
 
 	sk_dst_set(sk, dst);
-	sk->sk_route_caps = dst->dev->features;
+	sk->sk_route_caps = dst->dev->features | sk->sk_route_forced_caps;
 	if (sk->sk_route_caps & NETIF_F_GSO)
 		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
 	sk->sk_route_caps &= ~sk->sk_route_nocaps;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f8ad397e285e9b8db0b04f8abc30a42f22294ef9..eaf1e30fc5af879442f5f33ed4bd69f89dff8cfb 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -233,6 +233,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	}
 
 	/* OK, now commit destination to socket. */
 	sk->sk_gso_type = SKB_GSO_TCPV4;
+	sk->sk_route_forced_caps = NETIF_F_GSO;
 	sk_setup_caps(sk, &rt->dst);
 	rt = NULL;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 412139f4eccd96923daaea064cd9fb8be13f5916..4a461e8e2d654aa341d525a0df609a294c2040df 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -269,6 +269,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 		inet->inet_rcv_saddr = LOOPBACK4_IPV6;
 
 	sk->sk_gso_type = SKB_GSO_TCPV6;
+	sk->sk_route_forced_caps = NETIF_F_GSO;
 	ip6_dst_store(sk, dst, NULL, NULL);
 
 	icsk->icsk_ext_hdr_len = 0;
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Sat, 2018-02-17 at 11:01 +0100, Oleksandr Natalenko wrote:
> Hi.
>
> On pátek 16. února 2018 23:59:52 CET Eric Dumazet wrote:
> > Well, no effect here on e1000e (1 Gbit) at least
> >
> > # ethtool -K eth3 sg off
> > Actual changes:
> > scatter-gather: off
> > tx-scatter-gather: off
> > tcp-segmentation-offload: off
> > tx-tcp-segmentation: off [requested on]
> > tx-tcp6-segmentation: off [requested on]
> > generic-segmentation-offload: off [requested on]
> >
> > # tc qd replace dev eth3 root pfifo_fast
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> > 941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> > 941
> > # tc qd replace dev eth3 root fq
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> > 941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> > 941
> > # tc qd replace dev eth3 root fq_codel
> > # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> > 941
> > # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> > 941
> > #
>
> That really looks strange to me. I'm able to reproduce the effect caused
> by disabling scatter-gather even on the VM (using iperf3, as usual):

This must be some race condition in the code I added in TCP for
self-pacing, when a short timeout is programmed.

Disabling SG means TCP cooks 1-MSS packets.

I will take a look, probably after the (long) week-end: Tuesday.

Thanks !
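A quick way to see the 1-MSS behaviour Eric describes (interface name and
port below are assumptions, not taken from the thread): with sg off, every
data segment in a capture is capped at one MSS — length 1448 on a 1500-MTU
link with timestamps — while with sg/TSO on, multi-MSS lengths such as 2896
or 4344 show up, exactly as in the traces later in this thread.

# ethtool -k eth0 | egrep 'scatter-gather|segmentation'
# tcpdump -n -i eth0 -c 20 port 5201 | egrep -o 'length [0-9]+'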
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On pátek 16. února 2018 23:59:52 CET Eric Dumazet wrote:
> Well, no effect here on e1000e (1 Gbit) at least
>
> # ethtool -K eth3 sg off
> Actual changes:
> scatter-gather: off
> tx-scatter-gather: off
> tcp-segmentation-offload: off
> tx-tcp-segmentation: off [requested on]
> tx-tcp6-segmentation: off [requested on]
> generic-segmentation-offload: off [requested on]
>
> # tc qd replace dev eth3 root pfifo_fast
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> # tc qd replace dev eth3 root fq_codel
> # ./super_netperf 1 -H 7.7.7.84 -- -K cubic
> 941
> # ./super_netperf 1 -H 7.7.7.84 -- -K bbr
> 941
> #

That really looks strange to me. I'm able to reproduce the effect caused by
disabling scatter-gather even on the VM (using iperf3, as usual):

BBR+fq_codel:
sg on:  4.23 Gbits/sec
sg off: 121 Mbits/sec

BBR+fq:
sg on:  6.38 Gbits/sec
sg off: 437 Mbits/sec

Reno+fq_codel:
sg on:  6.74 Gbits/sec
sg off: 1.37 Gbits/sec

Reno+fq:
sg on:  6.53 Gbits/sec
sg off: 1.19 Gbits/sec

Regardless of which congestion algorithm and qdisc are in use, the
throughput drops, but when BBR is in use, especially with something non-fq,
it drops the most.

Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On pátek 16. února 2018 23:50:35 CET Eric Dumazet wrote:
> /* snip */
> If you use
>
> tcptrace -R test_s2c.pcap
> xplot.org d2c_rtt.xpl
>
> then you'll see plenty of suspect 40 ms rtt samples.

That's odd, especially given how uniform they look.

> It looks like the receiver misses wakeups for some reason,
> and only the TCP delayed ACK timer is helping.
>
> So it does not look like a sender side issue to me.

To make things even more complicated, I've disabled sg on the server,
leaving it enabled on the client:

client to server flow: 935 Mbits/sec
server to client flow: 72.5 Mbits/sec

So still, to me it looks like a sender issue. No?
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Fri, Feb 16, 2018 at 2:50 PM, Oleksandr Natalenko wrote:
> Hi.
>
> On pátek 16. února 2018 21:54:05 CET Eric Dumazet wrote:
>> /* snip */
>> Something fishy really :
>> /* snip */
>> Not only does the receiver suddenly add a 25 ms delay, but also note
>> that it acknowledges all prior segments (ack 112945) with a wrong ecr
>> value (2327043753) instead of 2327043759
>> /* snip */
>
> Eric has encouraged me to look closer at what's there in ethtool, and
> I've just had some free time to play with it. I've found out that
> enabling scatter-gather (ethtool -K enp3s0 sg on; it is disabled by
> default on both hosts) brings the throughput back to normal even with
> BBR+fq_codel.
>
> Why? What's the deal BBR has with sg?

Well, no effect here on e1000e (1 Gbit) at least

# ethtool -K eth3 sg off
Actual changes:
scatter-gather: off
tx-scatter-gather: off
tcp-segmentation-offload: off
tx-tcp-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
generic-segmentation-offload: off [requested on]

# tc qd replace dev eth3 root pfifo_fast
# ./super_netperf 1 -H 7.7.7.84 -- -K cubic
941
# ./super_netperf 1 -H 7.7.7.84 -- -K bbr
941
# tc qd replace dev eth3 root fq
# ./super_netperf 1 -H 7.7.7.84 -- -K cubic
941
# ./super_netperf 1 -H 7.7.7.84 -- -K bbr
941
# tc qd replace dev eth3 root fq_codel
# ./super_netperf 1 -H 7.7.7.84 -- -K cubic
941
# ./super_netperf 1 -H 7.7.7.84 -- -K bbr
941
#
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On pátek 16. února 2018 21:54:05 CET Eric Dumazet wrote:
> /* snip */
> Something fishy really :
> /* snip */
> Not only does the receiver suddenly add a 25 ms delay, but also note
> that it acknowledges all prior segments (ack 112945) with a wrong ecr
> value (2327043753) instead of 2327043759
> /* snip */

Eric has encouraged me to look closer at what's there in ethtool, and I've
just had some free time to play with it. I've found out that enabling
scatter-gather (ethtool -K enp3s0 sg on; it is disabled by default on both
hosts) brings the throughput back to normal even with BBR+fq_codel.

Why? What's the deal BBR has with sg?

Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Fri, 2018-02-16 at 12:54 -0800, Eric Dumazet wrote:
> On Fri, Feb 16, 2018 at 9:25 AM, Oleksandr Natalenko wrote:
> > Hi.
> >
> > On pátek 16. února 2018 17:33:48 CET Neal Cardwell wrote:
> > > Thanks for the detailed report! Yes, this sounds like an issue in
> > > BBR. We have not run into this one in our team, but we will try to
> > > work with you to fix this.
> > >
> > > Would you be able to take a sender-side tcpdump trace of the slow BBR
> > > transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only
> > > would be fine. Maybe something like:
> > >
> > > tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT
> >
> > So, going on with two real HW hosts. They are both running the latest
> > stock Arch Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y,
> > CONFIG_HZ=1000) and are interconnected with a 1 Gbps link (via a
> > switch, if that matters). Using iperf3, running each test for 20
> > seconds.
> >
> > Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:
> >
> > Client to server: 112 Mbits/sec
> > Server to client: 96.1 Mbits/sec
> >
> > Having BBR+fq on both hosts:
> >
> > Client to server: 347 Mbits/sec
> > Server to client: 397 Mbits/sec
> >
> > Having YeAH+fq on both hosts:
> > [1] https://natalenko.name/myfiles/bbr/
>
> Something fishy really :
>
> 09:18:31.449903 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.], seq 76745:79641, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 2896
> 09:18:31.449916 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 79641, win 1011, options [nop,nop,TS val 3190508870 ecr 2327043753], length 0
> 09:18:31.449925 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 79641:83985, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 4344
> 09:18:31.449936 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 83985, win 987, options [nop,nop,TS val 3190508870 ecr 2327043753], length 0
> 09:18:31.450112 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 83985:86881, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 2896
> 09:18:31.450124 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 86881, win 971, options [nop,nop,TS val 3190508871 ecr 2327043753], length 0
> 09:18:31.450299 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 86881:91225, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 4344
> 09:18:31.450313 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 91225, win 947, options [nop,nop,TS val 3190508871 ecr 2327043753], length 0
> 09:18:31.450491 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.], seq 91225:92673, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 1448
> 09:18:31.450505 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 92673:94121, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508871], length 1448
> 09:18:31.450511 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.], seq 94121:95569, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 1448
> 09:18:31.450720 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 95569:101361, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 5792
> 09:18:31.450932 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 101361:105705, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 4344
> 09:18:31.451132 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 105705:110049, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 4344
> 09:18:31.451342 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 110049:111497, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 1448
> 09:18:31.455841 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 111497:112945, ack 38, win 227, options [nop,nop,TS val 2327043759 ecr 3190508871], length 1448
>
> Not only does the receiver suddenly add a 25 ms delay, but also note that
> it acknowledges all prior segments (ack 112945) with a wrong ecr value
> (2327043753) instead of 2327043759

If you use

tcptrace -R test_s2c.pcap
xplot.org d2c_rtt.xpl

then you'll see plenty of suspect 40 ms rtt samples.

It looks like the receiver misses wakeups for some reason, and only the TCP
delayed ACK timer is helping.

So it does not look like a sender side issue to me.
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Fri, Feb 16, 2018 at 9:25 AM, Oleksandr Natalenko wrote:
> Hi.
>
> On pátek 16. února 2018 17:33:48 CET Neal Cardwell wrote:
>> Thanks for the detailed report! Yes, this sounds like an issue in BBR.
>> We have not run into this one in our team, but we will try to work with
>> you to fix this.
>>
>> Would you be able to take a sender-side tcpdump trace of the slow BBR
>> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would
>> be fine. Maybe something like:
>>
>> tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT
>
> So, going on with two real HW hosts. They are both running the latest
> stock Arch Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000)
> and are interconnected with a 1 Gbps link (via a switch, if that
> matters). Using iperf3, running each test for 20 seconds.
>
> Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:
>
> Client to server: 112 Mbits/sec
> Server to client: 96.1 Mbits/sec
>
> Having BBR+fq on both hosts:
>
> Client to server: 347 Mbits/sec
> Server to client: 397 Mbits/sec
>
> Having YeAH+fq on both hosts:
> [1] https://natalenko.name/myfiles/bbr/

Something fishy really :

09:18:31.449903 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.], seq 76745:79641, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 2896
09:18:31.449916 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 79641, win 1011, options [nop,nop,TS val 3190508870 ecr 2327043753], length 0
09:18:31.449925 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 79641:83985, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 4344
09:18:31.449936 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 83985, win 987, options [nop,nop,TS val 3190508870 ecr 2327043753], length 0
09:18:31.450112 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 83985:86881, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 2896
09:18:31.450124 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 86881, win 971, options [nop,nop,TS val 3190508871 ecr 2327043753], length 0
09:18:31.450299 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 86881:91225, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 4344
09:18:31.450313 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 91225, win 947, options [nop,nop,TS val 3190508871 ecr 2327043753], length 0
09:18:31.450491 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.], seq 91225:92673, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508870], length 1448
09:18:31.450505 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 92673:94121, ack 38, win 227, options [nop,nop,TS val 2327043753 ecr 3190508871], length 1448
09:18:31.450511 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [P.], seq 94121:95569, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 1448
09:18:31.450720 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 95569:101361, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 5792
09:18:31.450932 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 101361:105705, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 4344
09:18:31.451132 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 105705:110049, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 4344
09:18:31.451342 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 110049:111497, ack 38, win 227, options [nop,nop,TS val 2327043754 ecr 3190508871], length 1448
09:18:31.455841 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 111497:112945, ack 38, win 227, options [nop,nop,TS val 2327043759 ecr 3190508871], length 1448

Not only does the receiver suddenly add a 25 ms delay, but also note that it
acknowledges all prior segments (ack 112945) with a wrong ecr value
(2327043753) instead of 2327043759:

09:18:31.482238 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 112945, win , options [nop,nop,TS val 3190508903 ecr 2327043753], length 0
09:18:31.482704 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 112945:114393, ack 38, win 227, options [nop,nop,TS val 2327043786 ecr 3190508903], length 1448
09:18:31.482734 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 114393, win 1134, options [nop,nop,TS val 3190508903 ecr 2327043786], length 0
09:18:31.482802 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 114393:117289, ack 38, win 227, options [nop,nop,TS val 2327043786 ecr 3190508903], length 2896
09:18:31.482813 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags [.], ack 117289, win 1179, options [nop,nop,TS val 3190508903 ecr 2327043786], length 0
09:18:31.483138 IP 172.29.28.1.5201 > 172.29.28.55.14936: Flags [.], seq 117289:120185, ack 38, win 227, options [nop,nop,TS val 2327043786 ecr 3190508903], length 2896
09:18:31.483158 IP 172.29.28.55.14936 > 172.29.28.1.5201: Flags
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On pátek 16. února 2018 18:56:12 CET Holger Hoffstätte wrote:
> There is simply no reason why you shouldn't get approx. line rate
> (~920+-ish Mbit) over wired 1GBit Ethernet; even my broken 10-year-old
> Core2Duo laptop can do that. Can you boot with spectre_v2=off and try
> "the simplest case" with the defaults cubic/pfifo_fast? spectre_v2 has a
> terrible performance impact esp. on small/older processors.

I've just tried. No visible difference.

> When I last benchmarked full PREEMPT with 4.9.x it was similarly bad and
> also had a noticeable network throughput impact even on my i7.
>
> Also congratulations for being the only other person I know who ever
> tried YeAH. :-)

Well, according to the git log on tcp_yeah.c and the Reported-by tag, I was
not the only one there ;).

Regards,
Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On 02/16/18 18:25, Oleksandr Natalenko wrote:
> So, going on with two real HW hosts. They are both running the latest
> stock Arch Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000)
> and are interconnected with a 1 Gbps link (via a switch, if that
> matters). Using iperf3, running each test for 20 seconds.
>
> Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:
>
> Client to server: 112 Mbits/sec
> Server to client: 96.1 Mbits/sec
>
> Having BBR+fq on both hosts:
>
> Client to server: 347 Mbits/sec
> Server to client: 397 Mbits/sec
>
> Having YeAH+fq on both hosts:
>
> Client to server: 928 Mbits/sec
> Server to client: 711 Mbits/sec
>
> (when the server generates traffic, the throughput is a little bit lower,
> as you can see, but I assume that's because I have a low-power Silvermont
> CPU there, while the client has an Ivy Bridge beast)

There is simply no reason why you shouldn't get approx. line rate
(~920+-ish Mbit) over wired 1GBit Ethernet; even my broken 10-year-old
Core2Duo laptop can do that. Can you boot with spectre_v2=off and try "the
simplest case" with the defaults cubic/pfifo_fast? spectre_v2 has a
terrible performance impact esp. on small/older processors.

When I last benchmarked full PREEMPT with 4.9.x it was similarly bad and
also had a noticeable network throughput impact even on my i7.

Also congratulations for being the only other person I know who ever tried
YeAH. :-)

cheers
Holger
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On pátek 16. února 2018 17:25:58 CET Eric Dumazet wrote:
> The way TCP pacing works, it defaults to internal pacing using a hint
> stored in the socket.
>
> If you change the qdisc while the flow is alive, the result could be
> unexpected.

I don't change the qdisc while the flow is alive. Either the VM is
completely restarted, or iperf3 is restarted on both sides.

> (The TCP socket remembers that one FQ was supposed to handle the pacing.)
>
> What results do you have if you use standard pfifo_fast ?

Almost the same as with fq_codel (see my previous email with numbers).

> I am asking because TCP pacing relies on high-resolution timers, and
> that might be weak on your VM.

Also, I've switched to measuring things on real HW only (also see the
previous email with numbers).

Thanks.

Regards,
Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On pátek 16. února 2018 17:26:11 CET Holger Hoffstätte wrote:
> These are very odd configurations. :)
> Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might simply
> have too much overhead.

Since the pacing is based on hrtimers, should HZ matter at all? Even if so,
a poor 1 Gbps link shouldn't drop below 100 Mbps, for sure.

> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Okay, got it.

> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
> /* snip */

Yes, and that's strange :/. And that's why I'm wondering what I am missing,
since things cannot be *that* bad.

> /* snip */
> Please note that BBR was developed to address the case of WAN transfers
> (or more precisely high BDP paths) which often suffer from TCP throughput
> collapse due to single packet loss events. While it might "work" in other
> scenarios as well, strictly speaking delay-based anything is increasingly
> less likely to work when there is no meaningful notion of delay - such
> as on a LAN. (yes, this is very simplified..)
>
> The BBR mailing list has several nice reports why the current BBR
> implementation (dubbed v1) has a few - sometimes severe - problems.
> These are being addressed as we speak.
>
> (let me know if you want some of those tech reports by email. :)

Well, yes, please, why not :).

> /* snip */
> I'm not sure testing the old version without builtin pacing is going to
> help matters in finding the actual problem. :)
> Several people have reported severe performance regressions with 4.15.x,
> maybe that's related. Can you test latest 4.14.x?

I observed this on v4.14 too but didn't pay much attention until I realised
that things look definitely wrong.

> Out of curiosity, what is the expected use case for BBR here?

Nothing special, I just assumed it could be set as a default for both WAN
and LAN usage.

Regards,
Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi.

On pátek 16. února 2018 17:33:48 CET Neal Cardwell wrote:
> Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> have not run into this one in our team, but we will try to work with you
> to fix this.
>
> Would you be able to take a sender-side tcpdump trace of the slow BBR
> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would
> be fine. Maybe something like:
>
> tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT

So, going on with two real HW hosts. They are both running the latest stock
Arch Linux kernel (4.15.3-1-ARCH, CONFIG_PREEMPT=y, CONFIG_HZ=1000) and are
interconnected with a 1 Gbps link (via a switch, if that matters). Using
iperf3, running each test for 20 seconds.

Having BBR+fq_codel (or pfifo_fast, same result) on both hosts:

Client to server: 112 Mbits/sec
Server to client: 96.1 Mbits/sec

Having BBR+fq on both hosts:

Client to server: 347 Mbits/sec
Server to client: 397 Mbits/sec

Having YeAH+fq on both hosts:

Client to server: 928 Mbits/sec
Server to client: 711 Mbits/sec

(when the server generates traffic, the throughput is a little bit lower,
as you can see, but I assume that's because I have a low-power Silvermont
CPU there, while the client has an Ivy Bridge beast)

Now, to tcpdump. I've captured it twice, for the client-to-server flow
(c2s) and for the server-to-client flow (s2c), while using BBR +
pfifo_fast:

# tcpdump -w test_XXX.pcap -c100 -s 100 -i enp2s0 port 5201

I've uploaded both files here [1].

Thanks.

Oleksandr

[1] https://natalenko.name/myfiles/bbr/
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On 02/16/18 17:56, Neal Cardwell wrote:
> On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte wrote:
>>
>> BBR in general will run with lower cwnd than e.g. Cubic or others.
>> That's a feature and necessary for WAN transfers.
>
> Please note that there's no general rule about whether BBR will run
> with a lower or higher cwnd than CUBIC, Reno, or other loss-based
> congestion control algorithms. Whether BBR's cwnd will be lower or
> higher depends on the BDP of the path, the amount of buffering in the
> bottleneck, and the number of flows. BBR tries to match the amount of
> in-flight data to the BDP based on the available bandwidth and the
> two-way propagation delay. This will usually produce an amount of data
> in flight that is smaller than CUBIC/Reno (yielding lower latency) if
> the path has deep buffers (bufferbloat), but can be larger than
> CUBIC/Reno (yielding higher throughput) if the buffers are shallow and
> the traffic is suffering burst losses.

In all my tests I've never seen it larger, but OK. Thanks for the
explanation. :)

On second reading, the "necessary for WAN transfers" was phrased a bit
unfortunately, but it likely doesn't matter for Oleksandr's case anyway..

(snip)

>> Something seems really wrong with your setup. I get completely
>> expected throughput on wired 1Gb between two hosts:
>>
>> Connecting to host tux, port 5201
>> [  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    0    204 KBytes
>> [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
>> [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
>> [...]
>>
>> Running it locally gives the more or less expected results as well:
>>
>> Connecting to host ragnarok, port 5201
>> [  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
>> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>> [  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec    0    512 KBytes
>> [  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec    0    512 KBytes
>> [  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec    0    512 KBytes
>> [...]
>>
>> Both hosts running 4.14.x with bbr and fq_codel (default qdisc
>> everywhere).
>
> Can you please clarify if this is over bare metal or between VM
> guests? It sounds like Oleksandr's initial report was between KVM VMs,
> so the virtualization may be an ingredient here.

These are real hosts, not VMs, wired by 1Gbit Ethernet (home office).
Like Eric said, it's probably weird HZ, a slow host, an iffy high-res timer
(bad for both fq and fq_codel), overhead of retpoline in a VM, or whatnot.

cheers
Holger
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi!

On pátek 16. února 2018 17:45:56 CET Neal Cardwell wrote:
> Eric raises a good question: bare metal vs VMs.
>
> Oleksandr, your first email mentioned KVM VMs and virtio NICs. Your
> second e-mail did not seem to mention if those results were for bare
> metal or a VM scenario: can you please clarify the details on your
> second set of tests?

Ugh, so many emails at once… I'll answer them one by one, if you don't
mind :).

Both the first and the second set of tests were performed on 2 KVM VMs, but
from now on I'll test everything using real HW only, to exclude the
potential influence of virtualisation. Also, as I've already pointed out,
on real HW the difference is even bigger (~10 times).

Now I'm going to answer your other emails, including with the actual
results from the real HW and the tcpdump output, as requested.

Thanks!

Regards,
Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Fri, Feb 16, 2018 at 11:26 AM, Holger Hoffstätte wrote:
>
> BBR in general will run with lower cwnd than e.g. Cubic or others.
> That's a feature and necessary for WAN transfers.

Please note that there's no general rule about whether BBR will run with a
lower or higher cwnd than CUBIC, Reno, or other loss-based congestion
control algorithms. Whether BBR's cwnd will be lower or higher depends on
the BDP of the path, the amount of buffering in the bottleneck, and the
number of flows. BBR tries to match the amount of in-flight data to the BDP
based on the available bandwidth and the two-way propagation delay. This
will usually produce an amount of data in flight that is smaller than
CUBIC/Reno (yielding lower latency) if the path has deep buffers
(bufferbloat), but can be larger than CUBIC/Reno (yielding higher
throughput) if the buffers are shallow and the traffic is suffering burst
losses.

> >>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the
> >>> throughput to ~100 Mbps (verifiable not only by iperf3, but also by
> >>> scp while transferring some files between hosts).
>
> Something seems really wrong with your setup. I get completely
> expected throughput on wired 1Gb between two hosts:
>
> Connecting to host tux, port 5201
> [  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    0    204 KBytes
> [  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
> [  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
> [...]
>
> Running it locally gives the more or less expected results as well:
>
> Connecting to host ragnarok, port 5201
> [  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec    0    512 KBytes
> [  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec    0    512 KBytes
> [  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec    0    512 KBytes
> [...]
>
> Both hosts running 4.14.x with bbr and fq_codel (default qdisc
> everywhere).

Can you please clarify if this is over bare metal or between VM guests? It
sounds like Oleksandr's initial report was between KVM VMs, so the
virtualization may be an ingredient here.

thanks,
neal
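To put a rough number on "match the amount of in-flight data to the BDP"
(illustrative figures, not measurements from this thread):

    BDP = bottleneck bandwidth x round-trip propagation delay

    1 Gbit/s x 0.5 ms  ~  62 KB   (a typical switched LAN)
    1 Gbit/s x 40 ms   ~   5 MB   (a WAN-ish path)

So on a short-RTT LAN, BBR's target of roughly one BDP in flight (with its
cwnd capped near 2x the estimated BDP in the v1 implementation) can
legitimately sit far below the cwnd a loss-based algorithm would grow to,
without that by itself indicating a problem.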
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Fri, Feb 16, 2018 at 11:43 AM, Eric Dumazet wrote:
>
> On Fri, Feb 16, 2018 at 8:33 AM, Neal Cardwell wrote:
> > Oleksandr,
> >
> > Thanks for the detailed report! Yes, this sounds like an issue in BBR.
> > We have not run into this one in our team, but we will try to work with
> > you to fix this.
> >
> > Would you be able to take a sender-side tcpdump trace of the slow BBR
> > transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only
> > would be fine. Maybe something like:
> >
> > tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT
> >
> > Thanks!
> > neal
>
> On baremetal and using the latest net tree, I get pretty normal results
> at least, on a 40Gbit NIC,

Eric raises a good question: bare metal vs VMs.

Oleksandr, your first email mentioned KVM VMs and virtio NICs. Your second
e-mail did not seem to mention if those results were for bare metal or a VM
scenario: can you please clarify the details on your second set of tests?

Thanks!
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Fri, Feb 16, 2018 at 8:33 AM, Neal Cardwell wrote:
> Oleksandr,
>
> Thanks for the detailed report! Yes, this sounds like an issue in BBR. We
> have not run into this one in our team, but we will try to work with you
> to fix this.
>
> Would you be able to take a sender-side tcpdump trace of the slow BBR
> transfer ("v4.13 + BBR + fq_codel == Not OK")? Packet headers only would
> be fine. Maybe something like:
>
> tcpdump -w /tmp/test.pcap -c100 -s 100 -i eth0 port $PORT
>
> Thanks!
> neal

On baremetal and using the latest net tree, I get pretty normal results at
least, on a 40Gbit NIC, with pfifo_fast, fq and fq_codel.

# tc qd replace dev eth0 root fq
# ./super_netperf 1 -H lpaa24 -- -K cubic
25627
# ./super_netperf 1 -H lpaa24 -- -K bbr
25897
# tc qd replace dev eth0 root fq_codel
# ./super_netperf 1 -H lpaa24 -- -K cubic
22246
# ./super_netperf 1 -H lpaa24 -- -K bbr
25228
# tc qd replace dev eth0 root pfifo_fast
# ./super_netperf 1 -H lpaa24 -- -K cubic
25454
# ./super_netperf 1 -H lpaa24 -- -K bbr
25508
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On 02/16/18 16:15, Oleksandr Natalenko wrote:
> Hi, David, Eric, Neal et al.
>
> On čtvrtek 15. února 2018 21:42:26 CET Oleksandr Natalenko wrote:
>> I've faced an issue with limited TCP bandwidth between my laptop and a
>> server in my 1 Gbps LAN while using BBR as the congestion control
>> mechanism. To verify my observations, I've set up 2 KVM VMs with the
>> following parameters:
>>
>> 1) Linux v4.15.3
>> 2) virtio NICs
>> 3) 128 MiB of RAM
>> 4) 2 vCPUs
>> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

These are very odd configurations. :)
Non-preempt/100 might well be too slow, whereas PREEMPT/1000 might simply
have too much overhead.

>> The VMs are interconnected via a host bridge (-netdev bridge). I was
>> running iperf3 in the default and reverse mode. Here are the results:
>>
>> 1) BBR on both VMs
>>
>> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
>> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 2) Reno on both VMs
>>
>> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
>> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>>
>> 3) Reno on client, BBR on server
>>
>> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
>> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 4) BBR on client, Reno on server
>>
>> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
>> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>>
>> So, as you may see, when BBR is in use, the upload rate is bad and cwnd
>> is low.

BBR in general will run with lower cwnd than e.g. Cubic or others.
That's a feature and necessary for WAN transfers.

>> If using real HW (1 Gbps LAN, laptop and server), BBR limits the
>> throughput to ~100 Mbps (verifiable not only by iperf3, but also by scp
>> while transferring some files between hosts).

Something seems really wrong with your setup. I get completely expected
throughput on wired 1Gb between two hosts:

Connecting to host tux, port 5201
[  5] local 192.168.100.223 port 48718 connected to 192.168.100.222 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec    0    204 KBytes
[  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
[  5]   2.00-3.00   sec   112 MBytes   941 Mbits/sec    0    204 KBytes
[...]

Running it locally gives the more or less expected results as well:

Connecting to host ragnarok, port 5201
[  5] local 192.168.100.223 port 54090 connected to 192.168.100.223 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  8.09 GBytes  69.5 Gbits/sec    0    512 KBytes
[  5]   1.00-2.00   sec  8.14 GBytes  69.9 Gbits/sec    0    512 KBytes
[  5]   2.00-3.00   sec  8.43 GBytes  72.4 Gbits/sec    0    512 KBytes
[...]

Both hosts running 4.14.x with bbr and fq_codel (default qdisc everywhere).

In the past I only used BBR briefly for testing, since at 1Gb speeds on my
LAN it was actually slightly slower than Cubic (some of those bugs were
recently addressed) and made no difference otherwise, even for uploads -
which are capped by my 50/10 DSL anyway.

Please note that BBR was developed to address the case of WAN transfers
(or more precisely high BDP paths) which often suffer from TCP throughput
collapse due to single packet loss events. While it might "work" in other
scenarios as well, strictly speaking delay-based anything is increasingly
less likely to work when there is no meaningful notion of delay - such
as on a LAN. (yes, this is very simplified..)

The BBR mailing list has several nice reports why the current BBR
implementation (dubbed v1) has a few - sometimes severe - problems.
These are being addressed as we speak.

(let me know if you want some of those tech reports by email. :)

>> Also, I've tried to use YeAH instead of Reno, and it gives me the same
>> results as Reno (IOW, YeAH works fine too).
>>
>> Questions:
>>
>> 1) is this expected?
>> 2) or am I missing some extra BBR tuneable?

No, it should work out of the box.

>> 3) if it is not a regression (I don't have any previous data to compare
>> with), how can I fix this?
>> 4) if it is a bug in BBR, what else should I provide or check for a
>> proper investigation?
>
> I've played with BBR a little bit more and managed to narrow the issue
> down to the changes between v4.12 and v4.13. Here are my observations:
>
> v4.12 + BBR + fq_codel == OK
> v4.12 + BBR + fq == OK
> v4.13 + BBR + fq_codel == Not OK
> v4.13 + BBR + fq == OK
>
> I think this has something to do with the internal TCP implementation of
> pacing that was introduced in v4.13 (commit 218af599fa63) specifically to
> allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the
> throughput is high and saturates the link, but if another qdisc is in
> use, for instance fq_codel, the throughput drops. Just to be sure, I've
> also tried pfifo_fast instead of fq_codel, with the same outcome
> resulting in the low throughput.

I'm not sure testing the old version without builtin pacing is going to
help matters in finding the actual problem. :)
Several people have reported severe performance regressions with 4.15.x,
maybe that's related. Can you test latest 4.14.x?

Out of curiosity, what is the expected use case for BBR here?

cheers
Holger
Re: TCP and BBR: reproducibly low cwnd and bandwidth
On Fri, Feb 16, 2018 at 7:15 AM, Oleksandr Natalenko wrote:
> Hi, David, Eric, Neal et al.
>
> On čtvrtek 15. února 2018 21:42:26 CET Oleksandr Natalenko wrote:
>> I've faced an issue with limited TCP bandwidth between my laptop and a
>> server in my 1 Gbps LAN while using BBR as the congestion control
>> mechanism. To verify my observations, I've set up 2 KVM VMs with the
>> following parameters:
>>
>> 1) Linux v4.15.3
>> 2) virtio NICs
>> 3) 128 MiB of RAM
>> 4) 2 vCPUs
>> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
>>
>> The VMs are interconnected via a host bridge (-netdev bridge). I was
>> running iperf3 in the default and reverse mode. Here are the results:
>>
>> 1) BBR on both VMs
>>
>> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
>> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 2) Reno on both VMs
>>
>> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
>> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>>
>> 3) Reno on client, BBR on server
>>
>> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
>> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>>
>> 4) BBR on client, Reno on server
>>
>> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
>> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>>
>> So, as you may see, when BBR is in use, the upload rate is bad and cwnd
>> is low. If using real HW (1 Gbps LAN, laptop and server), BBR limits the
>> throughput to ~100 Mbps (verifiable not only by iperf3, but also by scp
>> while transferring some files between hosts).
>>
>> Also, I've tried to use YeAH instead of Reno, and it gives me the same
>> results as Reno (IOW, YeAH works fine too).
>>
>> Questions:
>>
>> 1) is this expected?
>> 2) or am I missing some extra BBR tuneable?
>> 3) if it is not a regression (I don't have any previous data to compare
>> with), how can I fix this?
>> 4) if it is a bug in BBR, what else should I provide or check for a
>> proper investigation?
>
> I've played with BBR a little bit more and managed to narrow the issue
> down to the changes between v4.12 and v4.13. Here are my observations:
>
> v4.12 + BBR + fq_codel == OK
> v4.12 + BBR + fq == OK
> v4.13 + BBR + fq_codel == Not OK
> v4.13 + BBR + fq == OK
>
> I think this has something to do with the internal TCP implementation of
> pacing that was introduced in v4.13 (commit 218af599fa63) specifically to
> allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the
> throughput is high and saturates the link, but if another qdisc is in
> use, for instance fq_codel, the throughput drops. Just to be sure, I've
> also tried pfifo_fast instead of fq_codel, with the same outcome
> resulting in the low throughput.
>
> Unfortunately, I do not know if this is something expected or should be
> considered a regression. Thus, asking for advice.
>
> Ideas?

The way TCP pacing works, it defaults to internal pacing using a hint
stored in the socket.

If you change the qdisc while the flow is alive, the result could be
unexpected.

(The TCP socket remembers that one FQ was supposed to handle the pacing.)

What results do you have if you use standard pfifo_fast ?

I am asking because TCP pacing relies on high-resolution timers, and that
might be weak on your VM.
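For readers digging into the code: as far as I can reconstruct from commit
218af599fa63 (v4.13), the hint Eric mentions is the sk_pacing_status field.
The sketch below paraphrases the kernel sources of that era and may not
match later trees exactly; fq stamps the socket the first time it schedules
it, and TCP only arms its own pacing hrtimer when no fq qdisc has claimed
the flow:

/* include/net/sock.h (v4.13-era, simplified sketch) */
enum sk_pacing {
	SK_PACING_NONE,		/* no pacing requested for this socket */
	SK_PACING_NEEDED,	/* TCP must space out writes with its hrtimer */
	SK_PACING_FQ,		/* an fq qdisc has taken over pacing duties */
};

/* net/ipv4/tcp_output.c (simplified sketch) */
static bool tcp_needs_internal_pacing(const struct sock *sk)
{
	/* internal pacing engages only while no fq qdisc owns the flow */
	return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
}

This also illustrates Eric's remark about swapping qdiscs mid-flow: once a
socket is marked SK_PACING_FQ, replacing fq with fq_codel leaves the flow
with neither fq pacing nor TCP's internal hrtimer pacing.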
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Let's CC the BBR folks at Google, and remove the ones that probably have no
idea.

On Thu, 2018-02-15 at 21:42 +0100, Oleksandr Natalenko wrote:
> Hello.
>
> I've faced an issue with limited TCP bandwidth between my laptop and a
> server in my 1 Gbps LAN while using BBR as the congestion control
> mechanism. To verify my observations, I've set up 2 KVM VMs with the
> following parameters:
>
> 1) Linux v4.15.3
> 2) virtio NICs
> 3) 128 MiB of RAM
> 4) 2 vCPUs
> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
>
> The VMs are interconnected via a host bridge (-netdev bridge). I was
> running iperf3 in the default and reverse mode. Here are the results:
>
> 1) BBR on both VMs
>
> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>
> 2) Reno on both VMs
>
> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>
> 3) Reno on client, BBR on server
>
> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>
> 4) BBR on client, Reno on server
>
> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>
> So, as you may see, when BBR is in use, the upload rate is bad and cwnd
> is low. If using real HW (1 Gbps LAN, laptop and server), BBR limits the
> throughput to ~100 Mbps (verifiable not only by iperf3, but also by scp
> while transferring some files between hosts).
>
> Also, I've tried to use YeAH instead of Reno, and it gives me the same
> results as Reno (IOW, YeAH works fine too).
>
> Questions:
>
> 1) is this expected?
> 2) or am I missing some extra BBR tuneable?
> 3) if it is not a regression (I don't have any previous data to compare
> with), how can I fix this?
> 4) if it is a bug in BBR, what else should I provide or check for a
> proper investigation?
>
> Thanks.
>
> Regards,
> Oleksandr
Re: TCP and BBR: reproducibly low cwnd and bandwidth
Hi, David, Eric, Neal et al.

On čtvrtek 15. února 2018 21:42:26 CET Oleksandr Natalenko wrote:
> I've faced an issue with limited TCP bandwidth between my laptop and a
> server in my 1 Gbps LAN while using BBR as the congestion control
> mechanism. To verify my observations, I've set up 2 KVM VMs with the
> following parameters:
>
> 1) Linux v4.15.3
> 2) virtio NICs
> 3) 128 MiB of RAM
> 4) 2 vCPUs
> 5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz
>
> The VMs are interconnected via a host bridge (-netdev bridge). I was
> running iperf3 in the default and reverse mode. Here are the results:
>
> 1) BBR on both VMs
>
> upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
> download: 3.39 Gbits/sec, cwnd ~ 320 KBytes
>
> 2) Reno on both VMs
>
> upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
> download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)
>
> 3) Reno on client, BBR on server
>
> upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
> download: 3.45 Gbits/sec, cwnd ~ 320 KBytes
>
> 4) BBR on client, Reno on server
>
> upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
> download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)
>
> So, as you may see, when BBR is in use, the upload rate is bad and cwnd
> is low. If using real HW (1 Gbps LAN, laptop and server), BBR limits the
> throughput to ~100 Mbps (verifiable not only by iperf3, but also by scp
> while transferring some files between hosts).
>
> Also, I've tried to use YeAH instead of Reno, and it gives me the same
> results as Reno (IOW, YeAH works fine too).
>
> Questions:
>
> 1) is this expected?
> 2) or am I missing some extra BBR tuneable?
> 3) if it is not a regression (I don't have any previous data to compare
> with), how can I fix this?
> 4) if it is a bug in BBR, what else should I provide or check for a
> proper investigation?

I've played with BBR a little bit more and managed to narrow the issue down
to the changes between v4.12 and v4.13. Here are my observations:

v4.12 + BBR + fq_codel == OK
v4.12 + BBR + fq       == OK
v4.13 + BBR + fq_codel == Not OK
v4.13 + BBR + fq       == OK

I think this has something to do with the internal TCP implementation of
pacing that was introduced in v4.13 (commit 218af599fa63) specifically to
allow using BBR together with non-fq qdiscs. Once BBR relies on fq, the
throughput is high and saturates the link, but if another qdisc is in use,
for instance fq_codel, the throughput drops. Just to be sure, I've also
tried pfifo_fast instead of fq_codel, with the same outcome resulting in
the low throughput.

Unfortunately, I do not know if this is something expected or should be
considered a regression. Thus, asking for advice.

Ideas?

Thanks.

Regards,
Oleksandr
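For anyone wanting to reproduce one cell of the matrix above, the per-run
switches look roughly like this (interface and server names are
placeholders; the exact commands are an assumption, not copied from the
thread):

# sysctl -w net.ipv4.tcp_congestion_control=bbr
# tc qdisc replace dev eth0 root fq_codel
# iperf3 -c <server> -t 20

Repeating this for each kernel/congestion-control/qdisc combination yields
the OK / Not OK table.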
TCP and BBR: reproducibly low cwnd and bandwidth
Hello.

I've faced an issue with limited TCP bandwidth between my laptop and a
server in my 1 Gbps LAN while using BBR as the congestion control
mechanism. To verify my observations, I've set up 2 KVM VMs with the
following parameters:

1) Linux v4.15.3
2) virtio NICs
3) 128 MiB of RAM
4) 2 vCPUs
5) tested on both non-PREEMPT/100 Hz and PREEMPT/1000 Hz

The VMs are interconnected via a host bridge (-netdev bridge). I was
running iperf3 in the default and reverse mode (see the invocation sketch
below). Here are the results:

1) BBR on both VMs

upload: 3.42 Gbits/sec, cwnd ~ 320 KBytes
download: 3.39 Gbits/sec, cwnd ~ 320 KBytes

2) Reno on both VMs

upload: 5.50 Gbits/sec, cwnd = 976 KBytes (constant)
download: 5.22 Gbits/sec, cwnd = 1.20 MBytes (constant)

3) Reno on client, BBR on server

upload: 5.29 Gbits/sec, cwnd = 952 KBytes (constant)
download: 3.45 Gbits/sec, cwnd ~ 320 KBytes

4) BBR on client, Reno on server

upload: 3.36 Gbits/sec, cwnd ~ 370 KBytes
download: 5.21 Gbits/sec, cwnd = 887 KBytes (constant)

So, as you may see, when BBR is in use, the upload rate is bad and cwnd is
low. If using real HW (1 Gbps LAN, laptop and server), BBR limits the
throughput to ~100 Mbps (verifiable not only by iperf3, but also by scp
while transferring some files between hosts).

Also, I've tried to use YeAH instead of Reno, and it gives me the same
results as Reno (IOW, YeAH works fine too).

Questions:

1) is this expected?
2) or am I missing some extra BBR tuneable?
3) if it is not a regression (I don't have any previous data to compare
with), how can I fix this?
4) if it is a bug in BBR, what else should I provide or check for a proper
investigation?

Thanks.

Regards,
Oleksandr
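The "default and reverse mode" runs above map to iperf3 invocations like
these (the host address is a placeholder; -R makes the server send, which
is what the "download" rows measure):

# iperf3 -s                        (on the server)
# iperf3 -c 192.0.2.1 -t 20        (client sends: the "upload" rows)
# iperf3 -c 192.0.2.1 -t 20 -R     (server sends: the "download" rows)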