Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP
On 6/30/18, 5:26 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" wrote:

On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>   1MB RPCs     2.6ms       5.5ms       43ms       208ms
>  10KB RPCs     1.1ms       1.3ms       53ms       212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
> 1) RTOs caused by the receiver sending a dup ACK and not ACKing
>    the last (and only) packet sent.
> 2) Delaying ACKs when the sender has a cwnd of 1, so everything
>    pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs: not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR-marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>              dctcp 99%  dctcp 99.9%
>   1MB RPCs     4.8ms       6.5ms
>  10KB RPCs     143us       184us
>
> Note that while the 1MB RPCs' tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these
> patches):
>
>    Host A sends the last packets of the request
>    Host B receives them, and the last packet is marked with congestion (CE)
>    Host B sends ACKs for packets not marked with congestion
>    Host B sends data packet with reply and ACK for packet marked with
>      congestion (TCP flag ECE)
>    Host A receives ACKs with no ECE flag
>    Host A receives data packet with ACK for the last packet of request
>      and which has TCP ECE bit set
>    Host A sends 1st data packet of the next request with TCP flag CWR
>    Host B receives the packet (as seen in tcpdump at B), no CE flag
>    Host B sends a dup ACK that also has the TCP ECE flag
>    Host A RTO timer fires!
>    Host A sends the next packet
>    Host A receives an ACK for everything it has sent (i.e. Host B
>      did receive 1st packet of request)
>    Host A sends more packets…
>
> [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

Thanks, Larry! I suspect there is a broader problem with "DCTCP-style
precise ECE ACKs" that this patch series does not address.

AFAICT the DCTCP "historical" ACKs for ECE precision, generated in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the
assumptions of the pre-existing delayed ACK state machine. They violate
those assumptions by rewinding tp->rcv_nxt backwards and then calling
tcp_send_ack(sk). But the existing code path to send an ACK assumes that
any ACK transmission always sends an ACK that accounts for all received
data, so that after sending an ACK we can cancel the delayed ACK timer.
So it seems with DCTCP we can have buggy sequences where the DCTCP
historical ACK causes us to forget that we need to schedule a delayed
ACK (which will force the sender to RTO, as shown in the trace):

tcp_event_data_recv()
  inet_csk_schedule_ack()   // remember that we need an ACK
  tcp_ecn_check_ce()
    tcp_ca_event()          // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
      dctcp_cwnd_event()
        dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
          if (... && ca->delayed_ack_reserved)
            tp->rcv_nxt = ca->prior_rcv_nxt;
            tcp_send_ack(sk);   // send an ACK, but for old data!
              tcp_transmit_skb()
                tcp_event_ack_sent()
                  inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
                  // forget that we need a delayed ACK!

AFAICT the second patch in this series, to force an immediate ACK on any
packet with a CWR, papers over this issue of the forgotten delayed ACK
by forcing an immediate ACK in __tcp_ack_snd_check(). This ensures that
at least for CWR packets the forgotten delayed ACK does not matter. But
for packets without CWR I believe the buggy "forgotten delayed ACK"
sequence above is still possible, and could still plague DCTCP with high
tail latencies in some cases.
Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
On Fri, Jun 29, 2018 at 3:52 PM Ilpo Järvinen wrote:
>
> > +.005 < . 1:1(0) ack 2001 win 257
>
> Why did the receiver send a cumulative ACK only for 2001?

Sorry, you are right, Ilpo. Upon further reflection, the packetdrill
scenario I posted is not a realistic one, and I agree we should not
worry about it.

As I mentioned, I ran your patch through all our team's TCP packetdrill
tests, and it passes all of the tests. One of our tests needed updating,
because if there is a non-SACK connection with a spurious RTO due to a
delayed flight of ACKs then the FRTO undo now happens one ACK later
(when we get an ACK that doesn't cover a retransmit). But that seems
fine to me.

I also cooked the new packetdrill test below to explicitly cover this
case you are addressing (please let me know if you have an alternate
suggestion).

Tested-by: Neal Cardwell
Acked-by: Neal Cardwell

Thanks!

neal

---

 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792
+0 > S. 0:0(0) ack 1
+.02 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4

// Send 3 packets. First is really lost. And the dupacks
// for the data packets that arrived at the receiver are slow in arriving.
+0 write(4, ..., 3000) = 3000
+0 > P. 1:3001(3000) ack 1

// RTO and retransmit head. This fills a real loss.
+.22 > . 1:1001(1000) ack 1

// Dupacks for packets 2 and 3 arrive.
+.02 < . 1:1(0) ack 1 win 257
+0 < . 1:1(0) ack 1 win 257

// The cumulative ACK for all the data arrives. We do not undo, because
// this is a non-SACK connection, and retransmitted data was ACKed.
// It's good that there's no FRTO undo, since a packet was really lost.
// Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss
// until something beyond high_seq is ACKed.
+.005 < . 1:1(0) ack 3001 win 257
+0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
+0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }%
+0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives
On 6/30/18, 11:23 AM, "Neal Cardwell" wrote:

On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote:
>
> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives it was sometimes delayed, adding up to 40ms to
> the latency.
>
> This patch ensures that CWR-marked data packets arriving will be acked
> immediately.
>
> Signed-off-by: Lawrence Brakmo
> ---

Thanks, Larry. Ensuring that CWR-marked data packets arriving will be
acked immediately sounds like a good goal to me. I am wondering if we
can have a simpler fix.

The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR bit
in ecn_flags, which disables the code in __tcp_ecn_check_ce() that would
have otherwise used the tcp_enter_quickack_mode() mechanism to force an
ACK:

__tcp_ecn_check_ce():
        ...
        case INET_ECN_CE:
                if (tcp_ca_needs_ecn(sk))
                        tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
                        // -> dctcp_ce_state_0_to_1()
                        //    -> tp->ecn_flags |= TCP_ECN_DEMAND_CWR;

                if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
                        /* Better not delay acks, sender can have a very low cwnd */
                        tcp_enter_quickack_mode(sk, 1);
                        tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
                }
                tp->ecn_flags |= TCP_ECN_SEEN;
                break;

So it seems like the bug here may be that dctcp_ce_state_0_to_1() is
setting the TCP_ECN_DEMAND_CWR bit in ecn_flags, when really it should
let its caller, __tcp_ecn_check_ce(), set TCP_ECN_DEMAND_CWR, in which
case the code would already properly force an immediate ACK.
So, what if we use this fix instead (not compiled, not tested):

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..4fecd2824edb 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)

 	ca->prior_rcv_nxt = tp->rcv_nxt;
 	ca->ce_state = 1;
-
-	tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 }

 static void dctcp_ce_state_1_to_0(struct sock *sk)

What do you think?

neal

I see two issues. One is that entering quickack mode, as you mentioned,
does not ensure that it will still be on when the CWR arrives. The
second issue is that the problem occurs right after the receiver sends a
small reply, which results in entering pingpong mode right before the
sender starts the new request by sending just one packet (i.e. it forces
a delayed ack).

I compiled and tested your patch. Both 99 and 99.9 percentile latencies
are around 40ms. Looking at the packet traces shows that some CWR-marked
packets are not being acked immediately (delayed by 40ms).

Larry
Re: pull-request: bpf 2018-07-01
From: Daniel Borkmann
Date: Sun, 1 Jul 2018 01:56:37 +0200

> The following pull-request contains BPF updates for your *net* tree.
>
> The main changes are:
 ...
> Please consider pulling these changes from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.
Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>   1MB RPCs     2.6ms       5.5ms       43ms       208ms
>  10KB RPCs     1.1ms       1.3ms       53ms       212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
> 1) RTOs caused by the receiver sending a dup ACK and not ACKing
>    the last (and only) packet sent.
> 2) Delaying ACKs when the sender has a cwnd of 1, so everything
>    pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs: not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch ensures that an ACK is sent immediately when a
> CWR-marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>              dctcp 99%  dctcp 99.9%
>   1MB RPCs     4.8ms       6.5ms
>  10KB RPCs     143us       184us
>
> Note that while the 1MB RPCs' tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these
> patches):
>
>    Host A sends the last packets of the request
>    Host B receives them, and the last packet is marked with congestion (CE)
>    Host B sends ACKs for packets not marked with congestion
>    Host B sends data packet with reply and ACK for packet marked with
>      congestion (TCP flag ECE)
>    Host A receives ACKs with no ECE flag
>    Host A receives data packet with ACK for the last packet of request
>      and which has TCP ECE bit set
>    Host A sends 1st data packet of the next request with TCP flag CWR
>    Host B receives the packet (as seen in tcpdump at B), no CE flag
>    Host B sends a dup ACK that also has the TCP ECE flag
>    Host A RTO timer fires!
>    Host A sends the next packet
>    Host A receives an ACK for everything it has sent (i.e. Host B
>      did receive 1st packet of request)
>    Host A sends more packets…
>
> [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

Thanks, Larry! I suspect there is a broader problem with "DCTCP-style
precise ECE ACKs" that this patch series does not address.

AFAICT the DCTCP "historical" ACKs for ECE precision, generated in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the
assumptions of the pre-existing delayed ACK state machine. They violate
those assumptions by rewinding tp->rcv_nxt backwards and then calling
tcp_send_ack(sk). But the existing code path to send an ACK assumes that
any ACK transmission always sends an ACK that accounts for all received
data, so that after sending an ACK we can cancel the delayed ACK timer.
So it seems with DCTCP we can have buggy sequences where the DCTCP
historical ACK causes us to forget that we need to schedule a delayed
ACK (which will force the sender to RTO, as shown in the trace):

tcp_event_data_recv()
  inet_csk_schedule_ack()   // remember that we need an ACK
  tcp_ecn_check_ce()
    tcp_ca_event()          // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
      dctcp_cwnd_event()
        dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
          if (... && ca->delayed_ack_reserved)
            tp->rcv_nxt = ca->prior_rcv_nxt;
            tcp_send_ack(sk);   // send an ACK, but for old data!
              tcp_transmit_skb()
                tcp_event_ack_sent()
                  inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
                  // forget that we need a delayed ACK!

AFAICT the second patch in this series, to force an immediate ACK on any
packet with a CWR, papers over this issue of the forgotten delayed ACK
by forcing an immediate ACK in __tcp_ack_snd_check(). This ensures that
at least for CWR packets the forgotten delayed ACK does not matter.

But for packets without CWR I believe the buggy "forgotten delayed ACK"
sequence above is still possible, and could still plague DCTCP with high
tail latencies in some cases. I suspect that after your CWR-forces-ACK
patch it may not be jumping out in performance test results because (a)
if there is no CWR involved then usually the cwnd is high enough that
the missing delayed ACK does not cause a stall, and (b) if there is a
CWR involved then your patch forces an immediate ACK. But AFAICT that
forgotten delayed ACK bug is still there.

What do folks think?

neal
Re: [bpf PATCH v5 0/4] BPF fixes for sockhash
On 06/30/2018 03:17 PM, John Fastabend wrote:
> This addresses two syzbot issues that lead to identifying (by Eric and
> Wei) a class of bugs where we don't correctly check for IPv4/v6
> sockets and their associated state. The second issue was a locking
> omission in sockhash.
>
> The first patch addresses IPv6 socks, fixing an error where sockhash
> would overwrite the prot pointer with the IPv4 prot. To fix this we
> build a solution similar to the TLS ULP. We continue to allow socks in
> all states, not just ESTABLISHED, in this patch set, because as Martin
> points out there should be no issue with this on the sockmap ULP since
> we don't use the ctx in this code. Once multiple ULPs coexist we may
> need to revisit this. However, we can do that in *next trees.
>
> The other issue syzbot found was that the tcp_close() handler missed
> locking the hash bucket lock, which could result in corrupting the
> sockhash bucket list if delete and close ran at the same time. Also,
> the smap_list_remove() routine was not working correctly at all. This
> was not caught in my testing because in general my tests (to date at
> least; let's add some more robust selftests in bpf-next) do things in
> the "expected" order: create map, add socks, delete socks, then tear
> down maps. The tests we have that do the ops out of this order were
> only working on single maps, not multi-maps, so we never saw the
> issue. Thanks syzbot. The fix is to restructure the tcp_close() lock
> handling and fix the obvious bug in smap_list_remove().
>
> Finally, during review I noticed the release handler was omitted from
> the upstream code (patch 4) due to an incorrect merge conflict fix
> when I ported the code to the latest bpf-next before submitting. This
> would leave references to the map around if the user never closes the
> map.
>
> v3: rework patches, dropping the ESTABLISHED check and adding rcu
>     annotation along with the smap_list_remove fix
>
> v4: missed one more case where maps was being accessed without the
>     sk_callback_lock, spotted by Martin as well
>
> v5: changed to use a specific lock for maps and reduced the callback
>     lock so that it is only used to guard sk callbacks. I think this
>     makes the logic a bit cleaner and avoids confusion over what each
>     lock is doing.
>
> Also big thanks to Martin for his thorough review; he caught at least
> one case where I missed a call_rcu().

Applied it to bpf, thanks John!
pull-request: bpf 2018-07-01
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) A bpf_fib_lookup() helper fix to change the API before freeze to
   return an encoding of the FIB lookup result and return the nexthop
   device index in the params struct (instead of the device index as
   return code that we had before), from David.

2) Various BPF JIT fixes to address syzkaller fallout, that is, do not
   reject progs when set_memory_*() fails since it could still be RO.
   Also, the arm32 JIT was not using the bpf_jit_binary_lock_ro() API
   which was an issue, and a memory leak in the s390 JIT was found
   during review, from Daniel.

3) Multiple fixes for sockmap/hash to address most of the syzkaller
   triggered bugs. Usage with IPv6 was crashing, a GPF in
   bpf_tcp_close(), a missing sock_map_release() routine to hook up to
   callbacks, and a fix for an omitted bucket lock in sock_close(),
   from John.

4) Two bpftool fixes to remove a duplicated error message on program
   load, and another one to close the libbpf object after program
   load. One additional fix for the nfp driver's BPF offload to avoid
   stopping offload completely if replace of a program failed, from
   Jakub.

5) A couple of BPF selftest fixes that bail out in some of the test
   scripts if the user does not have the right privileges, from
   Jeffrin.

6) Fixes in test_bpf for s390 when CONFIG_BPF_JIT_ALWAYS_ON is set
   where we need to set the flag that some of the test cases are
   expected to fail, from Kleber.

7) Fix to detangle the BPF_LIRC_MODE2 dependency from
   CONFIG_CGROUP_BPF since it has no relation to it, and lirc2 users
   often have configs without cgroups enabled and thus would not be
   able to use it, from Sean.

8) Fix a selftest failure in sockmap by removing a useless setrlimit()
   call that would set a too low limit where at the same time we are
   already including bpf_rlimit.h that does the job, from Yonghong.

9) Fix BPF selftest config with missing NET_SCHED, from Anders.
Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

The following changes since commit 3739a21e0ef6ac06f46bd38e81daa95e8cb462bc:

  selftests: net: add tcp_inq to gitignore (2018-06-21 15:02:32 +0900)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

for you to fetch changes up to bf2b866a2fe2d74558fe4b7bdf63a4bc0afbdf70:

  Merge branch 'bpf-sockmap-fixes' (2018-07-01 01:21:33 +0200)

----------------------------------------------------------------
Alexei Starovoitov (1):
      Merge branch 'bpf-fixes'

Anders Roxell (1):
      selftests: bpf: add missing NET_SCHED to config

Daniel Borkmann (5):
      Merge branch 'bpf-bpftool-fixes'
      bpf, arm32: fix to use bpf_jit_binary_lock_ro api
      bpf, s390: fix potential memleak when later bpf_jit_prog fails
      bpf: undo prog rejection on read-only lock failure
      Merge branch 'bpf-sockmap-fixes'

David Ahern (1):
      bpf: Change bpf_fib_lookup to return lookup status

Jakub Kicinski (3):
      tools: bpftool: remove duplicated error message on prog load
      tools: bpftool: remember to close the libbpf object after prog load
      nfp: bpf: don't stop offload if replace failed

Jeffrin Jose T (3):
      selftests: bpf: notification about privilege required to run test_kmod.sh testing script
      selftests: bpf: notification about privilege required to run test_lirc_mode2.sh testing script
      selftests: bpf: notification about privilege required to run test_lwt_seg6local.sh testing script

John Fastabend (4):
      bpf: sockmap, fix crash when ipv6 sock is added
      bpf: sockmap, fix smap_list_map_remove when psock is in many maps
      bpf: sockhash fix omitted bucket lock in sock_close
      bpf: sockhash, add release routine

Kleber Sacilotto de Souza (1):
      test_bpf: flag tests that cannot be jited on s390

Sean Young (1):
      bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF

Yonghong Song (1):
      tools/bpf: fix test_sockmap failure

 arch/arm/net/bpf_jit_32.c                     |   2 +-
 arch/s390/net/bpf_jit_comp.c                  |   1 +
 drivers/media/rc/bpf-lirc.c                   |  14 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.c |   6 +-
 include/linux/bpf-cgroup.h                    |  26 +++
 include/linux/bpf.h                           |   8 +
 include/linux/bpf_lirc.h                      |   5 +-
 include/linux/filter.h                        |  56 +
 include/uapi/linux/bpf.h                      |  28 ++-
 kernel/bpf/cgroup.c                           |  54 +
 kernel/bpf/core.c                             |  30 +--
 kernel/bpf/sockmap.c                          | 254 --
 kernel/bpf/syscall.c                          |  99 ++---
 lib/test_bpf.c
Re: [PATCH bpf] bpf: hash_map: decrement counter on error
On 06/29/2018 02:48 PM, Mauricio Vasquez B wrote:
> Decrement the number of elements in the map in case the allocation
> of a new node fails.
>
> Signed-off-by: Mauricio Vasquez B

Thanks for the fix, Mauricio! Could you reply with a Fixes: tag in
order to track the commit originally introducing this bug?

Thanks,
Daniel
Re: [PATCH bpf-next 0/8] tools: bpf: updates to bpftool and libbpf
On 06/28/2018 11:41 PM, Jakub Kicinski wrote:
> Hi!
>
> Set of random updates to bpftool and libbpf. I'm preparing for
> extending bpftool prog load, but there is a good number of
> improvements that can be made before the bpf -> bpf-next merge,
> helping to keep the later patch set to a manageable size as well.
>
> The first patch is a bpftool build speed improvement. Next, missing
> program types are added to libbpf's program type detection by section
> name. The ability to load programs from the '.text' section is
> restored when the ELF file doesn't contain any pseudo calls.
>
> In bpftool I remove my Author comments as an unnecessary sign of
> vanity. Last but not least, a missing option is added to bash
> completions, and processing of options in bash completions is
> improved.

Applied to bpf-next, thanks Jakub!
[PATCH net-next] r8169: remove old PHY reset hack
This hack (affecting the non-PCIe models only) was introduced in 2004
to deal with link negotiation failures in 1GBit mode. Based on a
comment in the r8169 vendor driver I assume the issue affects RTL8169sb
in combination with particular 1GBit switch models.

Resetting the PHY every 10s and hoping that one fine day we will make
it to establish the link seems to be very hacky to me. I'd say: if
1GBit doesn't work reliably in a user's environment then the user
should remove 1GBit from the advertised modes, e.g. by using

  ethtool -s advertise <10/100 modes>

If the issue affects one chip version only, and that with most link
partners, then we could also think of removing 1GBit from the
advertised modes for this chip version in the driver.

Signed-off-by: Heiner Kallweit
---
 drivers/net/ethernet/realtek/r8169.c | 57 +---
 1 file changed, 1 insertion(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 72a7778b..b0a06902 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -80,7 +80,6 @@ static const int multicast_filter_limit = 32;
 #define R8169_RX_RING_BYTES	(NUM_RX_DESC * sizeof(struct RxDesc))

 #define RTL8169_TX_TIMEOUT	(6*HZ)
-#define RTL8169_PHY_TIMEOUT	(10*HZ)

 /* write/read MMIO register */
 #define RTL_W8(tp, reg, val8)	writeb((val8), tp->mmio_addr + (reg))
@@ -703,7 +702,6 @@ enum rtl_flag {
 	RTL_FLAG_TASK_ENABLED,
 	RTL_FLAG_TASK_SLOW_PENDING,
 	RTL_FLAG_TASK_RESET_PENDING,
-	RTL_FLAG_TASK_PHY_PENDING,
 	RTL_FLAG_MAX
 };
@@ -731,7 +729,6 @@ struct rtl8169_private {
 	dma_addr_t RxPhyAddr;
 	void *Rx_databuff[NUM_RX_DESC];	/* Rx data buffers */
 	struct ring_info tx_skb[NUM_TX_DESC];	/* Tx data buffers */
-	struct timer_list timer;
 	u16 cp_cmd;

 	u16 event_slow;
@@ -1788,20 +1785,7 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 static int rtl8169_set_speed(struct net_device *dev, u8 autoneg,
 			     u16 speed, u8 duplex, u32 advertising)
 {
-	struct rtl8169_private *tp = netdev_priv(dev);
-	int ret;
-
-	ret = rtl8169_set_speed_xmii(dev, autoneg, speed, duplex, advertising);
-	if (ret < 0)
-		goto out;
-
-	if (netif_running(dev) && (autoneg == AUTONEG_ENABLE) &&
-	    (advertising & ADVERTISED_1000baseT_Full) &&
-	    !pci_is_pcie(tp->pci_dev)) {
-		mod_timer(&tp->timer, jiffies + RTL8169_PHY_TIMEOUT);
-	}
-out:
-	return ret;
+	return rtl8169_set_speed_xmii(dev, autoneg, speed, duplex, advertising);
 }

 static netdev_features_t rtl8169_fix_features(struct net_device *dev,
@@ -1888,8 +1872,6 @@ static int rtl8169_set_link_ksettings(struct net_device *dev,
 				      cmd->link_modes.advertising))
 		return -EINVAL;

-	del_timer_sync(&tp->timer);
-
 	rtl_lock_work(tp);
 	rc = rtl8169_set_speed(dev, cmd->base.autoneg, cmd->base.speed,
 			       cmd->base.duplex, advertising);
@@ -4293,44 +4275,12 @@ static void rtl_hw_phy_config(struct net_device *dev)
 	}
 }

-static void rtl_phy_work(struct rtl8169_private *tp)
-{
-	struct timer_list *timer = &tp->timer;
-	unsigned long timeout = RTL8169_PHY_TIMEOUT;
-
-	if (rtl8169_xmii_reset_pending(tp)) {
-		/*
-		 * A busy loop could burn quite a few cycles on nowadays CPU.
-		 * Let's delay the execution of the timer for a few ticks.
-		 */
-		timeout = HZ/10;
-		goto out_mod_timer;
-	}
-
-	if (rtl8169_xmii_link_ok(tp))
-		return;
-
-	netif_dbg(tp, link, tp->dev, "PHY reset until link up\n");
-
-	rtl8169_xmii_reset_enable(tp);
-
-out_mod_timer:
-	mod_timer(timer, jiffies + timeout);
-}
-
 static void rtl_schedule_task(struct rtl8169_private *tp, enum rtl_flag flag)
 {
 	if (!test_and_set_bit(flag, tp->wk.flags))
 		schedule_work(&tp->wk.work);
 }

-static void rtl8169_phy_timer(struct timer_list *t)
-{
-	struct rtl8169_private *tp = from_timer(tp, t, timer);
-
-	rtl_schedule_task(tp, RTL_FLAG_TASK_PHY_PENDING);
-}
-
 DECLARE_RTL_COND(rtl_phy_reset_cond)
 {
 	return rtl8169_xmii_reset_pending(tp);
@@ -6909,7 +6859,6 @@ static void rtl_task(struct work_struct *work)
 		/* XXX - keep rtl_slow_event_work() as first element. */
 		{ RTL_FLAG_TASK_SLOW_PENDING,	rtl_slow_event_work },
 		{ RTL_FLAG_TASK_RESET_PENDING,	rtl_reset_work },
-		{ RTL_FLAG_TASK_PHY_PENDING,	rtl_phy_work }
 	};
 	struct rtl8169_private *tp =
 		container_of(work, struct rtl8169_private, wk.work);
@@ -6982,8 +6931,6 @@ static void rtl8169_down(struct net_device *dev)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);

-	del_timer
Re: Anyone know if strongswan works with vrf?
On 6/29/18 4:10 PM, Ben Greear wrote:
> Hello,
>
> We're trying to create lots of strongswan VPN tunnels on network
> devices bound to different VRFs. We are using Fedora-24 on the client
> side, with a 4.16.15+ kernel and an updated 'ip' package, etc.
>
> So far, no luck getting it to work.
>
> Any idea if this is supported or not?

The kernel-side xfrm code does work; it's been a couple of years since
I tried it. As I recall, strongswan needs an update.

Looking at the 'ip xfrm' based scripts, you need to add 'sel dev ${VRF}'
to the state. e.g.:

VRF="sel dev blue"

ip xfrm state add src ${REMIP} dst ${MYIP} \
    proto esp spi 0x02122b77 reqid 0 mode tunnel \
    replay-window 4 replay-oseq 0x4 \
    auth-trunc 'hmac(md5)' 0xd94fcfea65fddf21dc6e0d24a0253508 96 \
    enc 'cbc(des3_ede)' 0xfc46c20f8048be9725930ff3fb07ac2a91f0347dffeacf62 \
    ${VRF}
Re: [net-next 01/12] net/mlx5e: Add UDP GSO support
On 06/30/18 19:06, Boris Pismenny wrote:
On 06/30/18 01:19, Willem de Bruijn wrote:
On Fri, Jun 29, 2018 at 2:24 AM Saeed Mahameed wrote:

From: Boris Pismenny

This patch enables UDP GSO support. We enable this by using two WQEs:
the first is a UDP LSO WQE for all segments with equal length, and the
second is for the last segment in case it has different length. Due to
HW limitation, before sending, we must adjust the packet length fields.

We measure performance between two Intel(R) Xeon(R) CPU E5-2643 v2
@3.50GHz machines connected back-to-back with Connectx4-Lx (40Gbps)
NICs. We compare single stream UDP, UDP GSO and UDP GSO with offload.

Performance:

                | MSS (bytes) | Throughput (Gbps) | CPU utilization (%)
UDP GSO offload | 1472        | 35.6              | 8%
UDP GSO         | 1472        | 25.5              | 17%
UDP             | 1472        | 10.2              | 17%
UDP GSO offload | 1024        | 35.6              | 8%
UDP GSO         | 1024        | 19.2              | 17%
UDP             | 1024        | 5.7               | 17%
UDP GSO offload | 512         | 33.8              | 16%
UDP GSO         | 512         | 10.4              | 17%
UDP             | 512         | 3.5               | 17%

Very nice results :)

Thanks. We expect to have 100Gbps results by netdev0x12.

+static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
+					   struct sk_buff *nskb,
+					   int remaining)
+{
+	int bytes_needed = remaining, remaining_headlen, remaining_page_offset;
+	int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+	int payload_len = remaining + sizeof(struct udphdr);
+	int k = 0, i, j;
+
+	skb_copy_bits(skb, 0, nskb->data, headlen);
+	nskb->dev = skb->dev;
+	skb_reset_mac_header(nskb);
+	skb_set_network_header(nskb, skb_network_offset(skb));
+	skb_set_transport_header(nskb, skb_transport_offset(skb));
+	skb_set_tail_pointer(nskb, headlen);
+
+	/* How many frags do we need? */
+	for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
+		bytes_needed -= skb_frag_size(&skb_shinfo(skb)->frags[i]);
+		k++;
+		if (bytes_needed <= 0)
+			break;
+	}
+
+	/* Fill the first frag and split it if necessary */
+	j = skb_shinfo(skb)->nr_frags - k;
+	remaining_page_offset = -bytes_needed;
+	skb_fill_page_desc(nskb, 0,
+			   skb_shinfo(skb)->frags[j].page.p,
+			   skb_shinfo(skb)->frags[j].page_offset + remaining_page_offset,
+			   skb_shinfo(skb)->frags[j].size - remaining_page_offset);
+
+	skb_frag_ref(skb, j);
+
+	/* Fill the rest of the frags */
+	for (i = 1; i < k; i++) {
+		j = skb_shinfo(skb)->nr_frags - k + i;
+
+		skb_fill_page_desc(nskb, i,
+				   skb_shinfo(skb)->frags[j].page.p,
+				   skb_shinfo(skb)->frags[j].page_offset,
+				   skb_shinfo(skb)->frags[j].size);
+		skb_frag_ref(skb, j);
+	}
+	skb_shinfo(nskb)->nr_frags = k;
+
+	remaining_headlen = remaining - skb->data_len;
+
+	/* headlen contains remaining data? */
+	if (remaining_headlen > 0)
+		skb_copy_bits(skb, skb->len - remaining, nskb->data + headlen,
+			      remaining_headlen);
+	nskb->len = remaining + headlen;
+	nskb->data_len = payload_len - sizeof(struct udphdr) +
+			 max_t(int, 0, remaining_headlen);
+	nskb->protocol = skb->protocol;
+	if (nskb->protocol == htons(ETH_P_IP)) {
+		ip_hdr(nskb)->id = htons(ntohs(ip_hdr(nskb)->id) +
+					 skb_shinfo(skb)->gso_segs);
+		ip_hdr(nskb)->tot_len =
+			htons(payload_len + sizeof(struct iphdr));
+	} else {
+		ipv6_hdr(nskb)->payload_len = htons(payload_len);
+	}
+	udp_hdr(nskb)->len = htons(payload_len);
+	skb_shinfo(nskb)->gso_size = 0;
+	nskb->ip_summed = skb->ip_summed;
+	nskb->csum_start = skb->csum_start;
+	nskb->csum_offset = skb->csum_offset;
+	nskb->queue_mapping = skb->queue_mapping;
+}
+
+/* might send skbs and update wqe and pi */
+struct sk_buff *mlx5e_udp_gso_handle_tx_skb(struct net_device *netdev,
+					    struct mlx5e_txqsq *sq,
+					    struct sk_buff *skb,
+					    struct mlx5e_tx_wqe **wqe,
+					    u16 *pi)
+{
+	int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
+	int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+	int remaining = (skb->len - headlen) % skb_shinfo(skb)->gso_size;
+	struct sk_buff *nskb;
+
+	if (skb->protocol == htons(ETH_P_IP))
+		ip_hdr(skb)->tot_le
Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote:
>
> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives it was sometimes delayed, adding up to 40ms to
> the latency.
>
> This patch ensures that CWR-marked data packets arriving will be acked
> immediately.
>
> Signed-off-by: Lawrence Brakmo
> ---

Thanks, Larry. Ensuring that CWR-marked data packets arriving will be
acked immediately sounds like a good goal to me. I am wondering if we
can have a simpler fix.

The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR bit
in ecn_flags, which disables the code in __tcp_ecn_check_ce() that would
have otherwise used the tcp_enter_quickack_mode() mechanism to force an
ACK:

__tcp_ecn_check_ce():
        ...
        case INET_ECN_CE:
                if (tcp_ca_needs_ecn(sk))
                        tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
                        // -> dctcp_ce_state_0_to_1()
                        //    -> tp->ecn_flags |= TCP_ECN_DEMAND_CWR;

                if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
                        /* Better not delay acks, sender can have a very low cwnd */
                        tcp_enter_quickack_mode(sk, 1);
                        tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
                }
                tp->ecn_flags |= TCP_ECN_SEEN;
                break;

So it seems like the bug here may be that dctcp_ce_state_0_to_1() is
setting the TCP_ECN_DEMAND_CWR bit in ecn_flags, when really it should
let its caller, __tcp_ecn_check_ce(), set TCP_ECN_DEMAND_CWR, in which
case the code would already properly force an immediate ACK.
So, what if we use this fix instead (not compiled, not tested):

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..4fecd2824edb 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)
 
 	ca->prior_rcv_nxt = tp->rcv_nxt;
 	ca->ce_state = 1;
-
-	tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 }
 
 static void dctcp_ce_state_1_to_0(struct sock *sk)

What do you think?

neal
[PATCH net-next 2/5] net: gemini: Improve connection prints
Instead of just specifying that a PHY is connected at some speed, also specify which PHY. This is helpful on systems with several PHYs.

Signed-off-by: Linus Walleij
---
 drivers/net/ethernet/cortina/gemini.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 8fc31723f700..b49ed8964026 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -300,23 +300,26 @@ static void gmac_speed_set(struct net_device *netdev)
 		status.bits.speed = GMAC_SPEED_1000;
 		if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
 			status.bits.mii_rmii = GMAC_PHY_RGMII_1000;
-		netdev_info(netdev, "connect to RGMII @ 1Gbit\n");
+		netdev_info(netdev, "connect %s to RGMII @ 1Gbit\n",
+			    phydev_name(phydev));
 		break;
 	case 100:
 		status.bits.speed = GMAC_SPEED_100;
 		if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
 			status.bits.mii_rmii = GMAC_PHY_RGMII_100_10;
-		netdev_info(netdev, "connect to RGMII @ 100 Mbit\n");
+		netdev_info(netdev, "connect %s to RGMII @ 100 Mbit\n",
+			    phydev_name(phydev));
 		break;
 	case 10:
 		status.bits.speed = GMAC_SPEED_10;
 		if (phydev->interface == PHY_INTERFACE_MODE_RGMII)
 			status.bits.mii_rmii = GMAC_PHY_RGMII_100_10;
-		netdev_info(netdev, "connect to RGMII @ 10 Mbit\n");
+		netdev_info(netdev, "connect %s to RGMII @ 10 Mbit\n",
+			    phydev_name(phydev));
 		break;
 	default:
-		netdev_warn(netdev, "Not supported PHY speed (%d)\n",
-			    phydev->speed);
+		netdev_warn(netdev, "Unsupported PHY speed (%d) on %s\n",
+			    phydev->speed, phydev_name(phydev));
 	}
 
 	if (phydev->duplex == DUPLEX_FULL) {
-- 
2.17.1
[PATCH net-next 4/5] net: gemini: Move main init to port
The initialization sequence for the ethernet, setting up interrupt routing and such things, needs to be done after both the ports are clocked and reset. Before this the config will not "take". Move the initialization to the port probe function and keep track of init status in the state.

Signed-off-by: Linus Walleij
---
 drivers/net/ethernet/cortina/gemini.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 8d192fcd51c8..79324bbfd768 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -146,6 +146,7 @@ struct gemini_ethernet {
 	void __iomem *base;
 	struct gemini_ethernet_port *port0;
 	struct gemini_ethernet_port *port1;
+	bool initialized;
 
 	spinlock_t irq_lock; /* Locks IRQ-related registers */
 	unsigned int freeq_order;
@@ -2301,6 +2302,14 @@ static void gemini_port_remove(struct gemini_ethernet_port *port)
 
 static void gemini_ethernet_init(struct gemini_ethernet *geth)
 {
+	/* Only do this once both ports are online */
+	if (geth->initialized)
+		return;
+	if (geth->port0 && geth->port1)
+		geth->initialized = true;
+	else
+		return;
+
 	writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_0_REG);
 	writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_1_REG);
 	writel(0, geth->base + GLOBAL_INTERRUPT_ENABLE_2_REG);
@@ -2447,6 +2456,10 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev)
 		geth->port0 = port;
 	else
 		geth->port1 = port;
+
+	/* This will just be done once both ports are up and reset */
+	gemini_ethernet_init(geth);
+
 	platform_set_drvdata(pdev, port);
 
 	/* Set up and register the netdev */
@@ -2564,7 +2577,6 @@ static int gemini_ethernet_probe(struct platform_device *pdev)
 	spin_lock_init(&geth->irq_lock);
 	spin_lock_init(&geth->freeq_lock);
 
-	gemini_ethernet_init(geth);
 	/* The children will use this */
 	platform_set_drvdata(pdev, geth);
 
@@ -2577,8 +2589,8 @@ static int gemini_ethernet_remove(struct platform_device *pdev)
 {
 	struct gemini_ethernet *geth = platform_get_drvdata(pdev);
 
-	gemini_ethernet_init(geth);
 	geth_cleanup_freeq(geth);
+	geth->initialized = false;
 
 	return 0;
 }
-- 
2.17.1
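The probe-ordering guard this patch adds can be sketched in isolation: the global init runs exactly once, and only after both ports have probed, regardless of which port probes last. A userspace sketch with invented names (`init_count` stands in for the real register writes):

```c
/* Reduced model of the "init only once both ports probed" guard.
 * Fields and names are illustrative, not the driver's real struct. */
struct geth_state {
	int port0_present;
	int port1_present;
	int initialized;
	int init_count;	/* counts how often the real init would run */
};

static void geth_global_init(struct geth_state *g)
{
	/* Only run once, and only after both ports have probed */
	if (g->initialized)
		return;
	if (!(g->port0_present && g->port1_present))
		return;
	g->initialized = 1;
	g->init_count++;	/* stands in for the register setup */
}
```

Calling this from each port's probe path makes the ordering irrelevant: the first call with both ports present does the work, every other call is a no-op.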
[PATCH net-next 3/5] net: gemini: Allow multiple ports to instantiate
The code was not tested with two ports actually in use at the same time. (I blame this on lack of actual hardware using that feature.) Now after locating a system using both ports, add the necessary fix to make both ports come up.

Signed-off-by: Linus Walleij
---
 drivers/net/ethernet/cortina/gemini.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index b49ed8964026..8d192fcd51c8 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -1787,7 +1787,10 @@ static int gmac_open(struct net_device *netdev)
 	phy_start(netdev->phydev);
 
 	err = geth_resize_freeq(port);
-	if (err) {
+	/* It's fine if it's just busy, the other port has set up
+	 * the freeq in that case.
+	 */
+	if (err && (err != -EBUSY)) {
 		netdev_err(netdev, "could not resize freeq\n");
 		goto err_stop_phy;
 	}
-- 
2.17.1
[PATCH net-next 5/5] net: gemini: Indicate that we can handle jumboframes
The hardware supposedly handles frames up to 10236 bytes and implements .ndo_change_mtu(), so accept 10236 minus the ethernet header for a VLAN tagged frame on the netdevices.

Signed-off-by: Linus Walleij
---
 drivers/net/ethernet/cortina/gemini.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 79324bbfd768..ae475393e4ac 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -2473,6 +2473,11 @@ static int gemini_ethernet_port_probe(struct platform_device *pdev)
 	netdev->hw_features = GMAC_OFFLOAD_FEATURES;
 	netdev->features |= GMAC_OFFLOAD_FEATURES | NETIF_F_GRO;
 
+	/* We can handle jumbo frames up to 10236 bytes so, let's accept
+	 * payloads of 10236 bytes minus VLAN and ethernet header
+	 */
+	netdev->min_mtu = 256;
+	netdev->max_mtu = 10236 - VLAN_ETH_HLEN;
 
 	port->freeq_refill = 0;
 	netif_napi_add(netdev, &port->napi, gmac_napi_poll,
-- 
2.17.1
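For reference, VLAN_ETH_HLEN in the kernel is ETH_HLEN (14) plus VLAN_HLEN (4), so the resulting max_mtu works out to 10218. A userspace sketch of the arithmetic (the function name is illustrative, not from the driver):

```c
/* Sketch of the max_mtu arithmetic from the patch. ETH_HLEN (14),
 * VLAN_HLEN (4) and VLAN_ETH_HLEN (18) mirror the kernel's values;
 * gemini_max_mtu() is an invented name for illustration. */
#define ETH_HLEN      14	/* dst MAC + src MAC + ethertype */
#define VLAN_HLEN      4	/* 802.1Q tag */
#define VLAN_ETH_HLEN (ETH_HLEN + VLAN_HLEN)

static int gemini_max_mtu(int max_frame_len)
{
	/* MTU counts L3 payload only, so subtract the largest L2 header */
	return max_frame_len - VLAN_ETH_HLEN;
}
```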
[PATCH net-next 1/5] net: gemini: Look up L3 maxlen from table
The code to calculate the hardware register enumerator for the maximum L3 length isn't entirely simple to read. Use the existing defines and rewrite the function into a table look-up. Signed-off-by: Linus Walleij --- drivers/net/ethernet/cortina/gemini.c | 61 --- 1 file changed, 46 insertions(+), 15 deletions(-) diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c index 6d7404f66f84..8fc31723f700 100644 --- a/drivers/net/ethernet/cortina/gemini.c +++ b/drivers/net/ethernet/cortina/gemini.c @@ -401,26 +401,57 @@ static int gmac_setup_phy(struct net_device *netdev) return 0; } -static int gmac_pick_rx_max_len(int max_l3_len) -{ - /* index = CONFIG_MAXLEN_XXX values */ - static const int max_len[8] = { - 1536, 1518, 1522, 1542, - 9212, 10236, 1518, 1518 - }; - int i, n = 5; +/* The maximum frame length is not logically enumerated in the + * hardware, so we do a table lookup to find the applicable max + * frame length. + */ +struct gmac_max_framelen { + unsigned int max_l3_len; + u8 val; +}; - max_l3_len += ETH_HLEN + VLAN_HLEN; +static const struct gmac_max_framelen gmac_maxlens[] = { + { + .max_l3_len = 1518, + .val = CONFIG0_MAXLEN_1518, + }, + { + .max_l3_len = 1522, + .val = CONFIG0_MAXLEN_1522, + }, + { + .max_l3_len = 1536, + .val = CONFIG0_MAXLEN_1536, + }, + { + .max_l3_len = 1542, + .val = CONFIG0_MAXLEN_1542, + }, + { + .max_l3_len = 9212, + .val = CONFIG0_MAXLEN_9k, + }, + { + .max_l3_len = 10236, + .val = CONFIG0_MAXLEN_10k, + }, +}; + +static int gmac_pick_rx_max_len(unsigned int max_l3_len) +{ + const struct gmac_max_framelen *maxlen; + int maxtot; + int i; - if (max_l3_len > max_len[n]) - return -1; + maxtot = max_l3_len + ETH_HLEN + VLAN_HLEN; - for (i = 0; i < 5; i++) { - if (max_len[i] >= max_l3_len && max_len[i] < max_len[n]) - n = i; + for (i = 0; i < ARRAY_SIZE(gmac_maxlens); i++) { + maxlen = &gmac_maxlens[i]; + if (maxtot <= maxlen->max_l3_len) + return maxlen->val; } - return n; + return -1; } static int 
gmac_init(struct net_device *netdev) -- 2.17.1
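The lookup the patch introduces boils down to: add the L2 header overhead to the L3 length, then return the first (smallest) table entry that fits. A compilable userspace sketch with invented stand-in register values (the real driver uses the CONFIG0_MAXLEN_* defines):

```c
/* Minimal userspace sketch of the new table lookup.  The .val numbers
 * are invented stand-ins for the hardware register enumerators. */
#define ETH_HLEN  14
#define VLAN_HLEN  4

struct max_framelen {
	unsigned int max_frame_len;	/* total frame bytes the setting allows */
	int val;			/* stand-in register value */
};

static const struct max_framelen maxlens[] = {
	{ 1518, 1 }, { 1522, 2 }, { 1536, 3 },
	{ 1542, 4 }, { 9212, 5 }, { 10236, 6 },
};

static int pick_rx_max_len(unsigned int max_l3_len)
{
	unsigned int maxtot = max_l3_len + ETH_HLEN + VLAN_HLEN;
	unsigned int i;

	/* Table is sorted ascending: the first entry that fits wins */
	for (i = 0; i < sizeof(maxlens) / sizeof(maxlens[0]); i++)
		if (maxtot <= maxlens[i].max_frame_len)
			return maxlens[i].val;
	return -1;	/* frame too large for any hardware setting */
}
```

Because the table is sorted, "first entry that fits" is automatically the tightest setting, which is what the original loop computed with extra bookkeeping.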
Re: [net-next 01/12] net/mlx5e: Add UDP GSO support
On 06/30/18 01:19, Willem de Bruijn wrote:
> On Fri, Jun 29, 2018 at 2:24 AM Saeed Mahameed wrote:
>> From: Boris Pismenny
>>
>> This patch enables UDP GSO support. We enable this by using two WQEs:
>> the first is a UDP LSO WQE for all segments with equal length, and the
>> second is for the last segment in case it has different length.
>> Due to HW limitation, before sending, we must adjust the packet length fields.
>>
>> We measure performance between two Intel(R) Xeon(R) CPU E5-2643 v2 @3.50GHz
>> machines connected back-to-back with Connectx4-Lx (40Gbps) NICs.
>> We compare single stream UDP, UDP GSO and UDP GSO with offload.
>>
>> Performance:
>>                 | MSS (bytes) | Throughput (Gbps) | CPU utilization (%)
>> UDP GSO offload | 1472        | 35.6              | 8%
>> UDP GSO         | 1472        | 25.5              | 17%
>> UDP             | 1472        | 10.2              | 17%
>> UDP GSO offload | 1024        | 35.6              | 8%
>> UDP GSO         | 1024        | 19.2              | 17%
>> UDP             | 1024        | 5.7               | 17%
>> UDP GSO offload | 512         | 33.8              | 16%
>> UDP GSO         | 512         | 10.4              | 17%
>> UDP             | 512         | 3.5               | 17%
>
> Very nice results :)

Thanks. We expect to have 100Gbps results by netdev0x12.

+static void mlx5e_udp_gso_prepare_last_skb(struct sk_buff *skb,
+					   struct sk_buff *nskb,
+					   int remaining)
+{
+	int bytes_needed = remaining, remaining_headlen, remaining_page_offset;
+	int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+	int payload_len = remaining + sizeof(struct udphdr);
+	int k = 0, i, j;
+
+	skb_copy_bits(skb, 0, nskb->data, headlen);
+	nskb->dev = skb->dev;
+	skb_reset_mac_header(nskb);
+	skb_set_network_header(nskb, skb_network_offset(skb));
+	skb_set_transport_header(nskb, skb_transport_offset(skb));
+	skb_set_tail_pointer(nskb, headlen);
+
+	/* How many frags do we need? */
+	for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
+		bytes_needed -= skb_frag_size(&skb_shinfo(skb)->frags[i]);
+		k++;
+		if (bytes_needed <= 0)
+			break;
+	}
+
+	/* Fill the first frag and split it if necessary */
+	j = skb_shinfo(skb)->nr_frags - k;
+	remaining_page_offset = -bytes_needed;
+	skb_fill_page_desc(nskb, 0,
+			   skb_shinfo(skb)->frags[j].page.p,
+			   skb_shinfo(skb)->frags[j].page_offset + remaining_page_offset,
+			   skb_shinfo(skb)->frags[j].size - remaining_page_offset);
+
+	skb_frag_ref(skb, j);
+
+	/* Fill the rest of the frags */
+	for (i = 1; i < k; i++) {
+		j = skb_shinfo(skb)->nr_frags - k + i;
+
+		skb_fill_page_desc(nskb, i,
+				   skb_shinfo(skb)->frags[j].page.p,
+				   skb_shinfo(skb)->frags[j].page_offset,
+				   skb_shinfo(skb)->frags[j].size);
+		skb_frag_ref(skb, j);
+	}
+	skb_shinfo(nskb)->nr_frags = k;
+
+	remaining_headlen = remaining - skb->data_len;
+
+	/* headlen contains remaining data? */
+	if (remaining_headlen > 0)
+		skb_copy_bits(skb, skb->len - remaining, nskb->data + headlen,
+			      remaining_headlen);
+	nskb->len = remaining + headlen;
+	nskb->data_len = payload_len - sizeof(struct udphdr) +
+			 max_t(int, 0, remaining_headlen);
+	nskb->protocol = skb->protocol;
+	if (nskb->protocol == htons(ETH_P_IP)) {
+		ip_hdr(nskb)->id = htons(ntohs(ip_hdr(nskb)->id) +
+					 skb_shinfo(skb)->gso_segs);
+		ip_hdr(nskb)->tot_len =
+			htons(payload_len + sizeof(struct iphdr));
+	} else {
+		ipv6_hdr(nskb)->payload_len = htons(payload_len);
+	}
+	udp_hdr(nskb)->len = htons(payload_len);
+	skb_shinfo(nskb)->gso_size = 0;
+	nskb->ip_summed = skb->ip_summed;
+	nskb->csum_start = skb->csum_start;
+	nskb->csum_offset = skb->csum_offset;
+	nskb->queue_mapping = skb->queue_mapping;
+}
+
+/* might send skbs and update wqe and pi */
+struct sk_buff *mlx5e_udp_gso_handle_tx_skb(struct net_device *netdev,
+					    struct mlx5e_txqsq *sq,
+					    struct sk_buff *skb,
+					    struct mlx5e_tx_wqe **wqe,
+					    u16 *pi)
+{
+	int payload_len = skb_shinfo(skb)->gso_size + sizeof(struct udphdr);
+	int headlen = skb_transport_offset(skb) + sizeof(struct udphdr);
+	int remaining = (skb->len - headlen) % skb_shinfo(s
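The tail-frag accounting in mlx5e_udp_gso_prepare_last_skb() can be isolated into a small sketch: walk the frags from the end until `remaining` bytes are covered; the leftover of `bytes_needed` (negated) is the offset into the first reused frag. Names here are illustrative, not the mlx5e API:

```c
/* Userspace model of the tail-frag walk: given the frag sizes of the
 * original skb and how many payload bytes belong to the last segment,
 * compute how many tail frags (k) are needed and how far into the
 * first of those frags the last segment's data starts. */
struct frag_split {
	int k;			/* number of tail frags used */
	int first_frag_offset;	/* byte offset into the first used frag */
};

static struct frag_split split_tail_frags(const int *frag_sizes,
					  int nr_frags, int remaining)
{
	struct frag_split s = { 0, 0 };
	int bytes_needed = remaining;
	int i;

	for (i = nr_frags - 1; i >= 0; i--) {
		bytes_needed -= frag_sizes[i];
		s.k++;
		if (bytes_needed <= 0)
			break;
	}
	/* bytes_needed is <= 0 here; its magnitude is how far into the
	 * first reused frag the last segment's data starts. */
	s.first_frag_offset = -bytes_needed;
	return s;
}
```

For example, with frags of 4096, 4096 and 1000 bytes and 1500 bytes remaining, the last segment needs the 1000-byte frag plus the tail 500 bytes of the middle frag, i.e. k = 2 starting at offset 3596.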
[PATCH net] net: fix use-after-free in GRO with ESP
Since the addition of GRO for ESP, gro_receive can consume the skb and return -EINPROGRESS. In that case, the lower layer GRO handler cannot touch the skb anymore. Commit 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") converted some of the gro_receive handlers that can lead to ESP's gro_receive so that they wouldn't access the skb when -EINPROGRESS is returned, but missed other spots, mainly in tunneling protocols. This patch finishes the conversion to using skb_gro_flush_final(), and adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and GUE. Fixes: 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") Signed-off-by: Sabrina Dubroca Reviewed-by: Stefano Brivio --- drivers/net/geneve.c | 2 +- drivers/net/vxlan.c | 4 +--- include/linux/netdevice.h | 20 net/8021q/vlan.c | 2 +- net/ipv4/fou.c| 4 +--- net/ipv4/gre_offload.c| 2 +- net/ipv4/udp_offload.c| 2 +- 7 files changed, 26 insertions(+), 10 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 750eaa53bf0c..ada33c2d9ac2 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -476,7 +476,7 @@ static struct sk_buff **geneve_gro_receive(struct sock *sk, out_unlock: rcu_read_unlock(); out: - NAPI_GRO_CB(skb)->flush |= flush; + skb_gro_flush_final(skb, pp, flush); return pp; } diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index aee0e60471f1..f6bb1d54d4bd 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -623,9 +623,7 @@ static struct sk_buff **vxlan_gro_receive(struct sock *sk, flush = 0; out: - skb_gro_remcsum_cleanup(skb, &grc); - skb->remcsum_offload = 0; - NAPI_GRO_CB(skb)->flush |= flush; + skb_gro_flush_final_remcsum(skb, pp, flush, &grc); return pp; } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 3ec9850c7936..3d0cc0b5cec2 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2789,11 +2789,31 @@ static inline void skb_gro_flush_final(struct sk_buff *skb, struct sk_buff **pp, if (PTR_ERR(pp) 
!= -EINPROGRESS) NAPI_GRO_CB(skb)->flush |= flush; } +static inline void skb_gro_flush_final_remcsum(struct sk_buff *skb, + struct sk_buff **pp, + int flush, + struct gro_remcsum *grc) +{ + if (PTR_ERR(pp) != -EINPROGRESS) { + NAPI_GRO_CB(skb)->flush |= flush; + skb_gro_remcsum_cleanup(skb, grc); + skb->remcsum_offload = 0; + } +} #else static inline void skb_gro_flush_final(struct sk_buff *skb, struct sk_buff **pp, int flush) { NAPI_GRO_CB(skb)->flush |= flush; } +static inline void skb_gro_flush_final_remcsum(struct sk_buff *skb, + struct sk_buff **pp, + int flush, + struct gro_remcsum *grc) +{ + NAPI_GRO_CB(skb)->flush |= flush; + skb_gro_remcsum_cleanup(skb, grc); + skb->remcsum_offload = 0; +} #endif static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev, diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c index 73a65789271b..8ccee3d01822 100644 --- a/net/8021q/vlan.c +++ b/net/8021q/vlan.c @@ -693,7 +693,7 @@ static struct sk_buff **vlan_gro_receive(struct sk_buff **head, out_unlock: rcu_read_unlock(); out: - NAPI_GRO_CB(skb)->flush |= flush; + skb_gro_flush_final(skb, pp, flush); return pp; } diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c index 1540db65241a..c9ec1603666b 100644 --- a/net/ipv4/fou.c +++ b/net/ipv4/fou.c @@ -448,9 +448,7 @@ static struct sk_buff **gue_gro_receive(struct sock *sk, out_unlock: rcu_read_unlock(); out: - NAPI_GRO_CB(skb)->flush |= flush; - skb_gro_remcsum_cleanup(skb, &grc); - skb->remcsum_offload = 0; + skb_gro_flush_final_remcsum(skb, pp, flush, &grc); return pp; } diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c index 1859c473b21a..6a7d980105f6 100644 --- a/net/ipv4/gre_offload.c +++ b/net/ipv4/gre_offload.c @@ -223,7 +223,7 @@ static struct sk_buff **gre_gro_receive(struct sk_buff **head, out_unlock: rcu_read_unlock(); out: - NAPI_GRO_CB(skb)->flush |= flush; + skb_gro_flush_final(skb, pp, flush); return pp; } diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index 
92dc9e5a7ff3..69c54540d5b4 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -394,7 +394,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb, out_unlock: rcu_read_unlock(); out: - NAPI_GRO_CB(skb)->flush |= flush; + skb_gro_flush_final(skb, pp, flush); return pp; } EXPORT_SYMBOL(udp_
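The invariant behind this fix can be modeled in a few lines: if the inner gro_receive handler returned ERR_PTR(-EINPROGRESS), the skb has been consumed and must not be touched, so the flush update has to be guarded. A userspace sketch with simplified stand-in types (not the kernel's skb or NAPI_GRO_CB):

```c
#include <errno.h>	/* for EINPROGRESS */

/* Reduced model of skb_gro_flush_final(): only touch the skb's GRO
 * state when the lower handler did NOT consume it. */
struct fake_skb {
	int flush;	/* stands in for NAPI_GRO_CB(skb)->flush */
};

static void gro_flush_final(struct fake_skb *skb, long pp_err, int flush)
{
	if (pp_err != -EINPROGRESS)
		skb->flush |= flush;	/* safe: skb is still ours */
	/* else: skb was consumed (e.g. by ESP), leave it alone */
}
```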
[bpf PATCH 1/2] bpf: sockmap, error path can not release psock in multi-map case
The current code, in the error path of sock_hash_ctx_update_elem, checks if the sock has a psock in the user data and if so decrements the reference count of the psock. However, if the error happens early in the error path we may have never incremented the psock reference count and if the psock exists because the sock is in another map then we may inadvertently decrement the reference count. Fix this by making the error path only call smap_release_sock if the error happens after the increment. Reported-by: syzbot+d464d2c20c717ef5a...@syzkaller.appspotmail.com Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend --- 0 files changed diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index 4fc2cb1..63fb047 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -1896,7 +1896,7 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map, e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN); if (!e) { err = -ENOMEM; - goto out_progs; + goto out_free; } } @@ -2324,7 +2324,12 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops, if (err) goto err; - /* bpf_map_update_elem() can be called in_irq() */ + psock = smap_psock_sk(sock); + if (unlikely(!psock)) { + err = -EINVAL; + goto err; + } + raw_spin_lock_bh(&b->lock); l_old = lookup_elem_raw(head, hash, key, key_size); if (l_old && map_flags == BPF_NOEXIST) { @@ -2342,12 +2347,6 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops, goto bucket_err; } - psock = smap_psock_sk(sock); - if (unlikely(!psock)) { - err = -EINVAL; - goto bucket_err; - } - rcu_assign_pointer(e->hash_link, l_new); rcu_assign_pointer(e->htab, container_of(map, struct bpf_htab, map)); @@ -2370,12 +2369,10 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops, raw_spin_unlock_bh(&b->lock); return 0; bucket_err: + smap_release_sock(psock, sock); raw_spin_unlock_bh(&b->lock); err: kfree(e); - psock = smap_psock_sk(sock); - if (psock) - 
smap_release_sock(psock, sock); return err; }
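The rule the fix enforces can be reduced to a small sketch: an error path may only drop a reference that this call path actually took. Names are illustrative, not the sockmap API:

```c
/* Reduced model of the refcount rule: failures before the increment
 * must not release, failures after it must release exactly once. */
struct fake_psock {
	int refcnt;
};

static int update_elem(struct fake_psock *p, int fail_before_ref,
		       int fail_after_ref)
{
	if (fail_before_ref)
		return -1;	/* no ref taken: nothing to undo */
	p->refcnt++;		/* reference taken here */
	if (fail_after_ref) {
		p->refcnt--;	/* undo only our own reference */
		return -1;
	}
	return 0;
}
```

With the old code, the early-failure path behaved like the late one and decremented a reference it never took, stealing it from whatever other map held the psock.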
[bpf PATCH 0/2] sockmap, syzbot fix error path and RCU fix
This applies on top of "BPF fixes for sockhash". I just didn't want to confuse that series yet again by re-ordering/adding these patches in it.

I missed fixing the error path in the sockhash code to align with supporting socks in multiple maps. Simply checking if the psock is present does not mean we can decrement the reference count, because it could be part of another map. Fix this by cleaning up the error path so this situation does not happen.

As far as I know this should be the last fix to the fallout from relaxing the single map restriction. Sorry about the multiple fixes, but these patches were all written before the initial submission, then converted, and I missed this detail. But at least we caught these early in the net cycle. Will continue reviewing/testing however to see if we catch anything else.

Also we need one more series to check ESTABLISH state as Eric noted. That will be sent out shortly, just going over the patches once more. The ESTABLISH/unhash fix is also needed in kTLS.

---

John Fastabend (2):
      bpf: sockmap, error path can not release psock in multi-map case
      bpf: sockmap, hash table is RCU so readers do not need locks

 0 files changed

-- 
Signature
[bpf PATCH 2/2] bpf: sockmap, hash table is RCU so readers do not need locks
This removes locking from readers of the RCU hash table. It's not necessary.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Signed-off-by: John Fastabend
---
 0 files changed

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 63fb047..12ac10a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2451,10 +2451,8 @@ struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
-	raw_spin_lock_bh(&b->lock);
 	l = lookup_elem_raw(head, hash, key, key_size);
 	sk = l ? l->sk : NULL;
-	raw_spin_unlock_bh(&b->lock);
 
 	return sk;
 }
[bpf PATCH v5 2/4] bpf: sockmap, fix smap_list_map_remove when psock is in many maps
If a hashmap is free'd with open socks it removes the reference to the hash entry from the psock. If that is the last reference to the psock then it will also be free'd by the reference counting logic. However the current logic that removes the hash reference from the list of references is broken. In smap_list_remove() we first check if the sockmap entry matches and then check if the hashmap entry matches. But, the sockmap entry will always match because it's NULL in this case, which causes the first entry to be removed from the list. If this is always the "right" entry (because the user adds/removes entries in order) then everything is OK, but otherwise a subsequent bpf_tcp_close() may reference a free'd object.

To fix this create two list handlers, one for sockmap and one for sockhash.

Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com
Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau
Signed-off-by: John Fastabend
---
 kernel/bpf/sockmap.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d7fd17a..5adff4b 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1602,17 +1602,27 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void smap_list_remove(struct smap_psock *psock,
-			     struct sock **entry,
-			     struct htab_elem *hash_link)
+static void smap_list_map_remove(struct smap_psock *psock,
+				 struct sock **entry)
 {
 	struct smap_psock_map_entry *e, *tmp;
 
 	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
-		if (e->entry == entry || e->hash_link == hash_link) {
+		if (e->entry == entry)
 			list_del(&e->list);
-			break;
-		}
+	}
+}
+
+static void smap_list_hash_remove(struct smap_psock *psock,
+				  struct htab_elem *hash_link)
+{
+	struct smap_psock_map_entry *e, *tmp;
+
+	list_for_each_entry_safe(e, tmp, &psock->maps, list) {
+		struct htab_elem *c = e->hash_link;
+
+		if (c == hash_link)
+			list_del(&e->list);
 	}
 }
 
@@ -1647,7 +1657,7 @@ static void sock_map_free(struct bpf_map *map)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, &stab->sock_map[i], NULL);
+			smap_list_map_remove(psock, &stab->sock_map[i]);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);
@@ -1706,7 +1716,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
 	if (psock->bpf_parse)
 		smap_stop_sock(psock, sock);
-	smap_list_remove(psock, &stab->sock_map[k], NULL);
+	smap_list_map_remove(psock, &stab->sock_map[k]);
 	smap_release_sock(psock, sock);
 out:
 	write_unlock_bh(&sock->sk_callback_lock);
@@ -1908,7 +1918,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		struct smap_psock *opsock = smap_psock_sk(osock);
 
 		write_lock_bh(&osock->sk_callback_lock);
-		smap_list_remove(opsock, &stab->sock_map[i], NULL);
+		smap_list_map_remove(opsock, &stab->sock_map[i]);
 		smap_release_sock(opsock, osock);
 		write_unlock_bh(&osock->sk_callback_lock);
 	}
@@ -2124,7 +2134,7 @@ static void sock_hash_free(struct bpf_map *map)
 		 * (psock) to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, NULL, l);
+			smap_list_hash_remove(psock, l);
 			smap_release_sock(psock, sock);
 		}
 		write_unlock_bh(&sock->sk_callback_lock);
@@ -2304,7 +2314,7 @@ static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		psock = smap_psock_sk(l_old->sk);
 
 		hlist_del_rcu(&l_old->hash_node);
-		smap_list_remove(psock, NULL, l_old);
+		smap_list_hash_remove(psock, l_old);
 		smap_release_sock(psock, l_old->sk);
 		free_htab_elem(htab, l_old);
 	}
@@ -2372,7 +2382,7 @@ static int sock_hash_delete_elem(struct bpf_map *map, void *key)
 		 * to be null and queued for garbage collection.
 		 */
 		if (likely(psock)) {
-			smap_list_remove(psock, NULL, l);
+			smap_list_hash_remove(psock, l);
 			smap_release_sock(psock, sock);
 		}
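Why the combined predicate was broken can be shown in a reduced model: a sockhash-style caller passes a NULL sockmap slot, so the `e->entry == entry` half of the old test matches any list entry whose slot is also NULL, regardless of which hash element it actually references. A userspace sketch with invented names:

```c
/* Reduced model of the smap_list_remove() predicates.  map_slot and
 * hash_link stand in for the psock map-entry fields; exactly one of
 * them is non-NULL depending on the map type. */
struct list_ref {
	void *map_slot;		/* sockmap reference, or NULL */
	void *hash_link;	/* sockhash reference, or NULL */
};

/* old, broken combined predicate */
static int old_match(const struct list_ref *e, void *slot, void *hash_link)
{
	return e->map_slot == slot || e->hash_link == hash_link;
}

/* fixed: one predicate per map type */
static int hash_match(const struct list_ref *e, void *hash_link)
{
	return e->hash_link == hash_link;
}
```

With two hash references on the list, an old-style hash removal (slot argument NULL) matches the first entry it visits, which may be the wrong one; the split predicate only matches the intended element.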
[bpf PATCH v5 3/4] bpf: sockhash fix omitted bucket lock in sock_close
First, the sk_callback_lock was being used to protect both the sock callback hooks and the psock->maps list. This got overly convoluted after the addition of sockhash (in sockmap it made some sense because maps and callbacks were tightly coupled), so let's split out a specific lock for maps and only use the callback lock for its intended purpose. This fixes a couple of cases where we missed using the maps lock when it was in fact needed. Also this makes it easier to follow the code because now we can put the locking closer to the actual code it's serializing.

Next, in sock_hash_delete_elem() the pattern was as follows,

  sock_hash_delete_elem()
     [...]
     spin_lock(bucket_lock)
     l = lookup_elem_raw()
     if (l)
        hlist_del_rcu()
        write_lock(sk_callback_lock)
         .... destroy psock ...
        write_unlock(sk_callback_lock)
     spin_unlock(bucket_lock)

The ordering is necessary because we only know the {p}sock after dereferencing the hash table, which we can't do unless we have the bucket lock held. Once we have the bucket lock and the psock element, it is deleted from the hashmap to ensure any other path doing a lookup will fail. Finally, the refcnt is decremented and, if zero, the psock is destroyed.

In parallel with the above (or free'ing the map) a tcp close event may trigger tcp_close(), which at the moment omits the bucket lock altogether (oops!), where the flow looks like this,

  bpf_tcp_close()
     [...]
     write_lock(sk_callback_lock)
     for each psock->maps // list of maps this sock is part of
         hlist_del_rcu(ref_hash_node);
         .... destroy psock ...
     write_unlock(sk_callback_lock)

Obviously, and demonstrated by syzbot, this is broken because we can have multiple threads deleting entries via hlist_del_rcu(). To fix this we might be tempted to wrap the hlist operation in a bucket lock, but that would create a lock inversion problem. In summary, to follow locking rules, the psock's maps list needs the sk_callback_lock (after this patch, maps_lock) but we need the bucket lock to do the hlist_del_rcu.
To resolve the lock inversion problem, pop the head of the maps list repeatedly and remove the reference until no more are left. If a delete happens in parallel from the BPF API that is OK as well, because it will do a similar action: look up the element in the map/hash, delete it from the map/hash, and dec the refcnt. We check for this case before doing a destroy on the psock to ensure we don't have two threads tearing down a psock. The new logic is as follows,

  bpf_tcp_close()
     e = psock_map_pop(psock->maps) // done with map lock
     bucket_lock()                  // lock hash list bucket
     l = lookup_elem_raw(head, hash, key, key_size);
     if (l) {
        // only get here if elmnt was not already removed
        hlist_del_rcu()
        ... destroy psock ...
     }
     bucket_unlock()

And finally, for all the above to work, add missing locking around map operations per above. Then add RCU annotations and use rcu_dereference/rcu_assign_pointer to manage values relying on RCU so that the object is not free'd from sock_hash_free() while it is being referenced in bpf_tcp_close().
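The pop-head pattern above can be modeled single-threaded: detach one entry at a time (in the kernel, under psock->maps_lock) and handle each on its own, so the bucket lock is never nested inside the list traversal. A reduced sketch with invented names:

```c
/* Single-threaded model of the teardown loop: instead of iterating
 * the maps list while holding one big lock (which would invert with
 * the bucket lock), repeatedly detach the head and process entries
 * one by one.  The list and "locks" are reduced to plain code. */
struct map_ref {
	struct map_ref *next;
};

static struct map_ref *psock_map_pop(struct map_ref **head)
{
	/* in the kernel this runs under psock->maps_lock */
	struct map_ref *e = *head;

	if (e)
		*head = e->next;
	return e;
}

static int teardown_all(struct map_ref **head)
{
	struct map_ref *e;
	int n = 0;

	while ((e = psock_map_pop(head)) != 0)
		n++;	/* here: take bucket lock, re-lookup, delete */
	return n;
}
```

Because each entry is detached before its bucket is locked, a concurrent delete from the BPF API simply finds the element already gone at the re-lookup step and backs off.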
Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com Fixes: 81110384441a ("bpf: sockmap, add hash map support") Signed-off-by: John Fastabend --- kernel/bpf/sockmap.c | 145 +- 1 file changed, 96 insertions(+), 49 deletions(-) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index 5adff4b..60b6e8b 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -72,6 +72,7 @@ struct bpf_htab { u32 n_buckets; u32 elem_size; struct bpf_sock_progs progs; + struct rcu_head rcu; }; struct htab_elem { @@ -89,8 +90,8 @@ enum smap_psock_state { struct smap_psock_map_entry { struct list_head list; struct sock **entry; - struct htab_elem *hash_link; - struct bpf_htab *htab; + struct htab_elem __rcu *hash_link; + struct bpf_htab __rcu *htab; }; struct smap_psock { @@ -120,6 +121,7 @@ struct smap_psock { struct bpf_prog *bpf_parse; struct bpf_prog *bpf_verdict; struct list_head maps; + spinlock_t maps_lock; /* Back reference used when sock callback trigger sockmap operations */ struct sock *sock; @@ -258,16 +260,54 @@ static void bpf_tcp_release(struct sock *sk) rcu_read_unlock(); } +static struct htab_elem *lookup_elem_raw(struct hlist_head *head, +u32 hash, void *key, u32 key_size) +{ + struct htab_elem *l; + + hlist_for_each_entry_rcu(l, head, hash_node) { + if (l->hash == hash && !memcmp(&l->key, key, key_size)) + return l; + } + + return NULL; +} + +static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash) +{ + return &htab->buckets[hash & (htab->n_buckets - 1)]; +} + +static inline struct hlist_head *se
[bpf PATCH v5 4/4] bpf: sockhash, add release routine
Add the map_release_uref pointer to the hashmap ops. This was dropped when the original sockhash code was ported into bpf-next before the initial commit.

Fixes: 81110384441a ("bpf: sockmap, add hash map support")
Acked-by: Martin KaFai Lau
Signed-off-by: John Fastabend
---
 kernel/bpf/sockmap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 60b6e8b..4fc2cb1 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2478,6 +2478,7 @@ struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
 	.map_get_next_key = sock_hash_get_next_key,
 	.map_update_elem = sock_hash_update_elem,
 	.map_delete_elem = sock_hash_delete_elem,
+	.map_release_uref = sock_map_release,
 };
 
 BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
[bpf PATCH v5 0/4] BPF fixes for sockhash
This addresses two syzbot issues that led to identifying (by Eric and Wei) a class of bugs where we don't correctly check for IPv4/v6 sockets and their associated state. The second issue was a locking omission in sockhash.

The first patch addresses IPv6 socks and fixes an error where sockhash would overwrite the prot pointer with the IPv4 prot. To fix this we build a solution similar to the TLS ULP. We continue to allow socks in all states, not just ESTABLISH, in this patch set, because as Martin points out there should be no issue with this on the sockmap ULP, since we don't use the ctx in this code. Once multiple ULPs coexist we may need to revisit this. However, we can do this in *next trees.

The other issue syzbot found is that the tcp_close() handler missed locking the hash bucket lock, which could result in corrupting the sockhash bucket list if delete and close ran at the same time. And also the smap_list_remove() routine was not working correctly at all. This was not caught in my testing because in general my tests (to date at least; let's add some more robust selftests in bpf-next) do things in the "expected" order: create map, add socks, delete socks, then tear down maps. The tests we have that do the ops out of this order were only working on single maps, not multi-maps, so we never saw the issue. Thanks syzbot. The fix is to restructure the tcp_close() lock handling. And fix the obvious bug in smap_list_remove().

Finally, during review I noticed the release handler was omitted from the upstream code (patch 4) due to an incorrect merge conflict fix when I ported the code to the latest bpf-next before submitting. This would leave references to the map around if the user never closes the map.

v3: rework patches, dropping ESTABLISH check and adding rcu annotation along with the smap_list_remove fix

v4: missed one more case where maps was being accessed without the sk_callback_lock, spotted by Martin as well.
v5: changed to use a specific lock for maps and reduced the callback lock so that it is only used to guard sk callbacks. I think this makes the logic a bit cleaner and avoids confusion over what each lock is doing. Also big thanks to Martin for his thorough review; he caught at least one case where I missed a call_rcu().

---

John Fastabend (4):
      bpf: sockmap, fix crash when ipv6 sock is added
      bpf: sockmap, fix smap_list_map_remove when psock is in many maps
      bpf: sockhash fix omitted bucket lock in sock_close
      bpf: sockhash, add release routine

 kernel/bpf/sockmap.c | 236 +++---
 1 file changed, 166 insertions(+), 70 deletions(-)

--
Signature
[bpf PATCH v5 1/4] bpf: sockmap, fix crash when ipv6 sock is added
This fixes a crash where we assign tcp_prot to IPv6 sockets instead of tcpv6_prot. Previously we overwrote the sk->prot field with tcp_prot even in the AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot are used. Tested with 'netserver -6' and 'netperf -H [IPv6]' as well as 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously crashing case here. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: syzbot+5c063698bdbfac19f...@syzkaller.appspotmail.com Acked-by: Martin KaFai Lau Signed-off-by: John Fastabend Signed-off-by: Wei Wang --- kernel/bpf/sockmap.c | 58 +- 1 file changed, 48 insertions(+), 10 deletions(-) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index 52a91d8..d7fd17a 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -140,6 +140,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, static int bpf_tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size); static int bpf_tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size, int flags); +static void bpf_tcp_close(struct sock *sk, long timeout); static inline struct smap_psock *smap_psock_sk(const struct sock *sk) { @@ -161,7 +162,42 @@ static bool bpf_tcp_stream_read(const struct sock *sk) return !empty; } -static struct proto tcp_bpf_proto; +enum { + SOCKMAP_IPV4, + SOCKMAP_IPV6, + SOCKMAP_NUM_PROTS, +}; + +enum { + SOCKMAP_BASE, + SOCKMAP_TX, + SOCKMAP_NUM_CONFIGS, +}; + +static struct proto *saved_tcpv6_prot __read_mostly; +static DEFINE_SPINLOCK(tcpv6_prot_lock); +static struct proto bpf_tcp_prots[SOCKMAP_NUM_PROTS][SOCKMAP_NUM_CONFIGS]; +static void build_protos(struct proto prot[SOCKMAP_NUM_CONFIGS], +struct proto *base) +{ + prot[SOCKMAP_BASE] = *base; + prot[SOCKMAP_BASE].close= bpf_tcp_close; + prot[SOCKMAP_BASE].recvmsg = bpf_tcp_recvmsg; + prot[SOCKMAP_BASE].stream_memory_read = bpf_tcp_stream_read; + + prot[SOCKMAP_TX]= prot[SOCKMAP_BASE]; + prot[SOCKMAP_TX].sendmsg= 
bpf_tcp_sendmsg; + prot[SOCKMAP_TX].sendpage = bpf_tcp_sendpage; +} + +static void update_sk_prot(struct sock *sk, struct smap_psock *psock) +{ + int family = sk->sk_family == AF_INET6 ? SOCKMAP_IPV6 : SOCKMAP_IPV4; + int conf = psock->bpf_tx_msg ? SOCKMAP_TX : SOCKMAP_BASE; + + sk->sk_prot = &bpf_tcp_prots[family][conf]; +} + static int bpf_tcp_init(struct sock *sk) { struct smap_psock *psock; @@ -181,14 +217,17 @@ static int bpf_tcp_init(struct sock *sk) psock->save_close = sk->sk_prot->close; psock->sk_proto = sk->sk_prot; - if (psock->bpf_tx_msg) { - tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg; - tcp_bpf_proto.sendpage = bpf_tcp_sendpage; - tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg; - tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read; + /* Build IPv6 sockmap whenever the address of tcpv6_prot changes */ + if (sk->sk_family == AF_INET6 && + unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) { + spin_lock_bh(&tcpv6_prot_lock); + if (likely(sk->sk_prot != saved_tcpv6_prot)) { + build_protos(bpf_tcp_prots[SOCKMAP_IPV6], sk->sk_prot); + smp_store_release(&saved_tcpv6_prot, sk->sk_prot); + } + spin_unlock_bh(&tcpv6_prot_lock); } - - sk->sk_prot = &tcp_bpf_proto; + update_sk_prot(sk, psock); rcu_read_unlock(); return 0; } @@ -,8 +1150,7 @@ static void bpf_tcp_msg_add(struct smap_psock *psock, static int bpf_tcp_ulp_register(void) { - tcp_bpf_proto = tcp_prot; - tcp_bpf_proto.close = bpf_tcp_close; + build_protos(bpf_tcp_prots[SOCKMAP_IPV4], &tcp_prot); /* Once BPF TX ULP is registered it is never unregistered. It * will be in the ULP list for the lifetime of the system. Doing * duplicate registers is not a problem.
Re: [PATCH net] net: fib_rules: bring back rule_exists to match rule during add
From: Roopa Prabhu
Date: Fri, 29 Jun 2018 14:32:15 -0700

> From: Roopa Prabhu
>
> After commit f9d4b0c1e969 ("fib_rules: move common handling of newrule
> delrule msgs into fib_nl2rule"), rule_exists got replaced by rule_find
> for existing rule lookup in both the add and del paths. While this
> is good for the delete path, it solves a few problems but opens up
> a few invalid key matches in the add path.
>
> $ ip -4 rule add table main tos 10 fwmark 1
> $ ip -4 rule add table main tos 10
> RTNETLINK answers: File exists
>
> The problem here is that rule_find does not check whether the key masks in
> the new and old rule are the same and hence ends up matching a more
> specific rule. Rule key masks cannot be easily compared today without
> an elaborate if-else block. It's best to introduce key masks for easier
> and more accurate rule comparison in the future. Until then, for fear of
> regressions, this patch re-introduces the older, looser rule_exists during add.
> It also fixes both rule_exists and rule_find to cover missing attributes.
>
> Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs
> into fib_nl2rule")
> Signed-off-by: Roopa Prabhu

Applied, thanks for resolving all of these issues.
Re: [PATCH net-next 00/13] mlxsw: Add resource scale tests
From: Petr Machata Date: Sat, 30 Jun 2018 02:44:23 +0200 > There are a number of tests that check features of the Linux networking > stack. By running them on suitable interfaces, one can exercise the > mlxsw offloading code. However none of these tests attempts to push > mlxsw to the limits supported by the ASIC. > > As an additional wrinkle, the "limits supported by the ASIC" themselves > may not be a set of fixed numbers, but rather depend on a profile that > determines how the ASIC resources are allocated for different purposes. > > This patchset introduces several tests that verify capability of mlxsw > to offload amounts of routes, flower rules, and mirroring sessions that > match predicted ASIC capacity, at different configuration profiles. > Additionally they verify that amounts exceeding the predicted capacity > can *not* be offloaded. ... Good stuff, series applied, thanks!
Re: [PATCH net-next 4/5] net/sched: flower: Dump the ethertype encapsulated in vlan
Hi Jianbo, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Jianbo-Liu/net-flow_dissector-Save-vlan-ethertype-from-headers/20180630-180158 reproduce: # apt-get install sparse make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__ sparse warnings: (new ones prefixed by >>) net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32 net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32 net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32 net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32 net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32 net/sched/cls_flower.c:544:15: sparse: cast to restricted __be32 net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32 net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32 net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32 net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32 net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32 net/sched/cls_flower.c:545:16: sparse: cast to restricted __be32 include/linux/slab.h:631:13: sparse: undefined identifier '__builtin_mul_overflow' include/linux/slab.h:631:13: sparse: not a function >> net/sched/cls_flower.c:1317:70: sparse: incorrect type in argument 3 >> (different base types) @@expected unsigned short [unsigned] [usertype] >> value @@got short [unsigned] [usertype] value @@ net/sched/cls_flower.c:1317:70:expected unsigned short [unsigned] [usertype] value net/sched/cls_flower.c:1317:70:got restricted __be16 [usertype] n_proto include/linux/slab.h:631:13: sparse: call with no type! 
vim +1317 net/sched/cls_flower.c 1264 1265 static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh, 1266 struct sk_buff *skb, struct tcmsg *t) 1267 { 1268 struct cls_fl_filter *f = fh; 1269 struct nlattr *nest; 1270 struct fl_flow_key *key, *mask; 1271 1272 if (!f) 1273 return skb->len; 1274 1275 t->tcm_handle = f->handle; 1276 1277 nest = nla_nest_start(skb, TCA_OPTIONS); 1278 if (!nest) 1279 goto nla_put_failure; 1280 1281 if (f->res.classid && 1282 nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid)) 1283 goto nla_put_failure; 1284 1285 key = &f->key; 1286 mask = &f->mask->key; 1287 1288 if (mask->indev_ifindex) { 1289 struct net_device *dev; 1290 1291 dev = __dev_get_by_index(net, key->indev_ifindex); 1292 if (dev && nla_put_string(skb, TCA_FLOWER_INDEV, dev->name)) 1293 goto nla_put_failure; 1294 } 1295 1296 if (!tc_skip_hw(f->flags)) 1297 fl_hw_update_stats(tp, f); 1298 1299 if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST, 1300 mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK, 1301 sizeof(key->eth.dst)) || 1302 fl_dump_key_val(skb, key->eth.src, TCA_FLOWER_KEY_ETH_SRC, 1303 mask->eth.src, TCA_FLOWER_KEY_ETH_SRC_MASK, 1304 sizeof(key->eth.src)) || 1305 fl_dump_key_val(skb, &key->basic.n_proto, TCA_FLOWER_KEY_ETH_TYPE, 1306 &mask->basic.n_proto, TCA_FLOWER_UNSPEC, 1307 sizeof(key->basic.n_proto))) 1308 goto nla_put_failure; 1309 1310 if (fl_dump_key_mpls(skb, &key->mpls, &mask->mpls)) 1311 goto nla_put_failure; 1312 1313 if (fl_dump_key_vlan(skb, &key->vlan, &mask->vlan)) 1314 goto nla_put_failure; 1315 1316 if (mask->vlan.vlan_tpid && > 1317 nla_put_u16(skb, TCA_FLOWER_KEY_VLAN_ETH_TYPE, > key->basic.n_proto)) 1318 goto nla_put_failure; 1319 1320 if ((key->basic.n_proto == htons(ETH_P_IP) || 1321 key->basic.n_proto == htons(ETH_P_IPV6)) && 1322 (fl_dump_key_val(skb, &key->basic.ip_proto, TCA_FLOWER_KEY_IP_PROTO, 1323 &mask->basic.ip_proto, TCA_FLOWER_UNSPEC, 1324 sizeof(key->basic.ip_proto)) || 1325 fl_dump_key_ip(skb, &key->ip, 
&mask->ip))) 1326 goto nla_put_failure; 1327 1328 if (key->cont
Re: [PATCH net-next 0/9] nfp: flower updates and netconsole
From: Jakub Kicinski Date: Fri, 29 Jun 2018 17:04:33 -0700 > This set contains assorted updates to driver base and flower. > First patch is a follow up to a fix to calculating counters which > went into net. For ethtool counters we should also make sure > they are visible even after ring reconfiguration. Next patch > is a safety measure in case we are running on a machine with a > broken BIOS we should fail the probe when risk of incorrect > operation is too high. The next two patches add netpoll support > and make use of napi_consume_skb(). Last we populate bus info > on all representors. > > Pieter adds support for offload of the checksum action in flower. > > John follows up to another fix he's done in net, we set TTL > values on tunnels to stack's default, now Johns does a full > route lookup to get a more precise information, he populates > ToS field as well. Last but not least he follows up on Jiri's > request to enable LAG offload in case the team driver is used > and then hash function is unknown. Series applied, thanks.
Re: [Patch net] net: use dev_change_tx_queue_len() for SIOCSIFTXQLEN
From: Cong Wang Date: Fri, 29 Jun 2018 13:42:48 -0700 > As noticed by Eric, we need to switch to the helper > dev_change_tx_queue_len() for the SIOCSIFTXQLEN call path too, > otherwise we still miss dev_qdisc_change_tx_queue_len(). > > Fixes: 6a643ddb5624 ("net: introduce helper dev_change_tx_queue_len()") > Reported-by: Eric Dumazet > Signed-off-by: Cong Wang Applied.
Re: [PATCH net-next] strparser: Call skb_unclone conditionally
From: Vakul Garg Date: Sat, 30 Jun 2018 00:45:55 +0530 > Calling skb_unclone() is expensive as it triggers a memcpy operation. > Instead of calling skb_unclone() unconditionally, call it only when skb > has a shared frag_list. This improves tls rx throughput significantly. > > Signed-off-by: Vakul Garg > Suggested-by: Boris Pismenny Applied.
Re: [PATCH net 0/5] s390/qeth: fixes 2018-06-29
From: Julian Wiedmann Date: Fri, 29 Jun 2018 19:45:49 +0200 > please apply a few qeth fixes for -net and your 4.17 stable queue. > > Patches 1-3 fix several issues wrt MAC address management that were > introduced during the 4.17 cycle. > Patch 4 tackles a long-standing issue with busy multi-connection workloads > on devices in af_iucv mode. > Patch 5 makes sure to re-enable all active HW offloads, after a card was > previously set offline and thus lost its HW context. Series applied and queued up for -stable.
Re: [PATCH net] alx: take rtnl before calling __alx_open from resume
From: Sabrina Dubroca Date: Fri, 29 Jun 2018 17:51:26 +0200 > The __alx_open function can be called from ndo_open, which is called > under RTNL, or from alx_resume, which isn't. Since commit d768319cd427, > we're calling the netif_set_real_num_{tx,rx}_queues functions, which > need to be called under RTNL. > > This is similar to commit 0c2cc02e571a ("igb: Move the calls to set the > Tx and Rx queues into igb_open"). > > Fixes: d768319cd427 ("alx: enable multiple tx queues") > Signed-off-by: Sabrina Dubroca Applied and queued up for -stable.
Re: [PATCH net] sfc: correctly initialise filter rwsem for farch
From: Bert Kenward Date: Fri, 29 Jun 2018 16:29:28 +0100 > Fixes: fc7a6c287ff3 ("sfc: use a semaphore to lock farch filters too") > Suggested-by: Joseph Korty > Signed-off-by: Bert Kenward Applied and queued up for -stable.
Re: [PATCH net-next v2 1/1] tc-testing: initial version of tunnel_key unit tests
From: Keara Leibovitz Date: Fri, 29 Jun 2018 10:47:31 -0400 > Create unittests for the tc tunnel_key action. > > v2: > For the tests expecting failures, added non-zero exit codes in the > teardowns. This prevents those tests from failing if the act_tunnel_key > module is unloaded. > > Signed-off-by: Keara Leibovitz Applied.
Re: [PATCH] r8169: remove TBI 1000BaseX support
From: Heiner Kallweit Date: Fri, 29 Jun 2018 08:07:04 +0200 > The very first version of RTL8169 from 2002 (and only this one) has > support for a TBI 1000BaseX fiber interface. The TBI support in the > driver makes switching to phylib tricky, so best would be to get > rid of it. I found no report from anybody using a device with RTL8169 > and fiber interface, also the vendor driver doesn't support this mode > (any longer). > So remove TBI support and bail out with a message if a card with > activated TBI is detected. If there really should be any user of it > out there, we could add a stripped-down version of the driver > supporting chip version 01 and TBI only (and maybe move it to > staging). > > Signed-off-by: Heiner Kallweit Ok, I guess this is reasonable. Applied, but it will be a mini-disaster if someone is actually making use of this fiber support for whatever reason.
Re: [PATCH net] net/ipv6: Fix updates to prefix route
From: dsah...@kernel.org Date: Thu, 28 Jun 2018 13:36:55 -0700 > From: David Ahern > > Sowmini reported that a recent commit broke prefix routes for linklocal > addresses. The newly added modify_prefix_route is attempting to add a > new prefix route when the ifp priority does not match the route metric > however the check needs to account for the default priority. In addition, > the route add fails because the route already exists, and then the delete > removes the one that exists. Flip the order to do the delete first. > > Fixes: 8308f3ff1753 ("net/ipv6: Add support for specifying metric of > connected routes") > Reported-by: Sowmini Varadhan > Tested-by: Sowmini Varadhan > Signed-off-by: David Ahern Applied, thanks David.
Re: [PATCH net-next] net: phy: realtek: add support for RTL8211
From: Heiner Kallweit Date: Thu, 28 Jun 2018 20:46:45 +0200 > In preparation of adding phylib support to the r8169 driver we need > PHY drivers for all chip-internal PHY types. Fortunately almost all > of them are either supported by the Realtek PHY driver already or work > with the genphy driver. > Still missing is support for the PHY of RTL8169s, it requires a quirk > to properly support 100Mbit-fixed mode. The quirk was copied from > r8169 driver which copied it from the vendor driver. > Based on the PHYID the internal PHY seems to be a RTL8211. > > Signed-off-by: Heiner Kallweit Applied.
Re: [PATCH net-next 00/10] pnetid and SMC-D support
From: Ursula Braun Date: Thu, 28 Jun 2018 19:05:03 +0200 > SMC requires a configured pnet table to map Ethernet interfaces to > RoCE adapter ports. For s390 there exists hardware support to group > such devices. The first three patches cover the s390 pnetid support, > enabling SMC-R usage on s390 without configuring an extra pnet table. > > SMC currently requires RoCE adapters, and uses RDMA-techniques > implemented with IB-verbs. But s390 offers another method for > intra-CEC Shared Memory communication. The following seven patches > implement a solution to run SMC traffic based on intra-CEC DMA, > called SMC-D. Series applied.
Re: [PATCH net-next] r8169: use standard debug output functions
From: Heiner Kallweit Date: Thu, 28 Jun 2018 20:36:15 +0200 > I see no need to define a private debug output symbol, let's use the > standard debug output functions instead. In this context also remove > the deprecated PFX define. > > The one assertion is wrong IMO anyway, this code path is used also > by chip version 01. > > Signed-off-by: Heiner Kallweit Applied.
Re: [PATCH net-next 0/4] Fixes for running mirror-to-gretap tests on veth
From: Petr Machata Date: Thu, 28 Jun 2018 18:56:15 +0200 > The forwarding selftests infrastructure makes it possible to run the > individual tests on a purely software netdevices. Names of interfaces to > run the test with can be passed as command line arguments to a test. > lib.sh then creates veth pairs backing the interfaces if none exist in > the system. > > However, the tests need to recognize that they might be run on a soft > device. Many mirror-to-gretap tests are buggy in this regard. This patch > set aims to fix the problems in running mirror-to-gretap tests on veth > devices. > > In patch #1, a service function is split out of setup_wait(). > In patch #2, installing a trap is made optional. > In patch #3, tc filters in several tests are tweaked to work with veth. > In patch #4, the logic for waiting for neighbor is fixed for veth. Series applied, thanks.
Re: [patch net-next v2 0/9] net: sched: introduce chain templates support with offloading to mlxsw
Sat, Jun 30, 2018 at 12:18:02AM CEST, xiyou.wangc...@gmail.com wrote:
>On Fri, Jun 29, 2018 at 10:06 AM Samudrala, Sridhar wrote:
>>
>> So instead of introducing 'chaintemplate' object in the kernel, can't we
>> add 'chain' object in the kernel that takes the 'template' as an attribute?
>
>This is exactly what I mean above. Making the chain a standalone object
>in kernel would benefit:
>
>1. Align with 'tc chain' in iproute2, add/del an object is natural
>
>2. Template is an attribute of this object when creating it:
># tc chain add template
># tc chain add ... # non-template chain

Okay. So that would allow creating either a plain chain or a "chain with
template". Once that is done, there would be no means to manipulate the
template; one can only remove the chain.

What about refcounting? I think it would make sense for this implicit
chain addition to take one reference. That means if later on the last
filter is removed, the chain would stay there until the user removes it
by hand.

Okay. Sounds good to me. Will do. Thanks!

>
>3. Easier for sharing by qdiscs:
># tc chain add X block Y ...
># tc filter add ... chain X block Y ...
># tc qdisc add dev eth0 block Y ...
>
>The current 'ingress_block 22 ingress' syntax is ugly.
[PATCH iproute2/net-next] tc: flower: Add support for QinQ
To support matching on both outer and inner vlan headers, we add new cvlan_id/cvlan_prio/cvlan_ethtype for inner vlan header. Example: # tc filter add dev eth0 protocol 802.1ad parent : \ flower vlan_id 1000 vlan_ethtype 802.1q \ cvlan_id 100 cvlan_ethtype ipv4 \ action vlan pop \ action vlan pop \ action mirred egress redirect dev eth1 # tc filter show dev eth0 ingress filter protocol 802.1ad pref 1 flower chain 0 filter protocol 802.1ad pref 1 flower chain 0 handle 0x1 vlan_id 1000 vlan_ethtype 802.1Q cvlan_id 100 cvlan_ethtype ip eth_type ipv4 in_hw Signed-off-by: Jianbo Liu Acked-by: Jiri Pirko --- include/uapi/linux/pkt_cls.h | 4 ++ man/man8/tc-flower.8 | 23 ++ tc/f_flower.c| 103 ++- 3 files changed, 118 insertions(+), 12 deletions(-) diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index be05e66..e7f7c52 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -468,6 +468,10 @@ enum { TCA_FLOWER_KEY_IP_TTL, /* u8 */ TCA_FLOWER_KEY_IP_TTL_MASK, /* u8 */ + TCA_FLOWER_KEY_CVLAN_ID,/* be16 */ + TCA_FLOWER_KEY_CVLAN_PRIO, /* u8 */ + TCA_FLOWER_KEY_CVLAN_ETH_TYPE, /* be16 */ + __TCA_FLOWER_MAX, }; diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8 index a561443..c684652 100644 --- a/man/man8/tc-flower.8 +++ b/man/man8/tc-flower.8 @@ -32,6 +32,12 @@ flower \- flow based traffic control filter .IR PRIORITY " | " .BR vlan_ethtype " { " ipv4 " | " ipv6 " | " .IR ETH_TYPE " } | " +.B cvlan_id +.IR VID " | " +.B cvlan_prio +.IR PRIORITY " | " +.BR cvlan_ethtype " { " ipv4 " | " ipv6 " | " +.IR ETH_TYPE " } | " .B mpls_label .IR LABEL " | " .B mpls_tc @@ -132,6 +138,23 @@ Match on layer three protocol. .I VLAN_ETH_TYPE may be either .BR ipv4 ", " ipv6 +or an unsigned 16bit value in hexadecimal format. To match on QinQ packet, it must be 802.1Q or 802.1AD. +.TP +.BI cvlan_id " VID" +Match on QinQ inner vlan tag id. +.I VID +is an unsigned 12bit value in decimal format. 
+.TP +.BI cvlan_prio " PRIORITY" +Match on QinQ inner vlan tag priority. +.I PRIORITY +is an unsigned 3bit value in decimal format. +.TP +.BI cvlan_ethtype " VLAN_ETH_TYPE" +Match on QinQ layer three protocol. +.I VLAN_ETH_TYPE +may be either +.BR ipv4 ", " ipv6 or an unsigned 16bit value in hexadecimal format. .TP .BI mpls_label " LABEL" diff --git a/tc/f_flower.c b/tc/f_flower.c index ba8eb66..71f2bc1 100644 --- a/tc/f_flower.c +++ b/tc/f_flower.c @@ -50,6 +50,9 @@ static void explain(void) " vlan_id VID |\n" " vlan_prio PRIORITY |\n" " vlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] |\n" + " cvlan_id VID |\n" + " cvlan_prio PRIORITY |\n" + " cvlan_ethtype [ ipv4 | ipv6 | ETH-TYPE ] |\n" " dst_mac MASKED-LLADDR |\n" " src_mac MASKED-LLADDR |\n" " ip_proto [tcp | udp | sctp | icmp | icmpv6 | IP-PROTO ] |\n" @@ -128,15 +131,21 @@ err: return err; } +static bool eth_type_vlan(__be16 ethertype) +{ + return ethertype == htons(ETH_P_8021Q) || + ethertype == htons(ETH_P_8021AD); +} + static int flower_parse_vlan_eth_type(char *str, __be16 eth_type, int type, __be16 *p_vlan_eth_type, struct nlmsghdr *n) { __be16 vlan_eth_type; - if (eth_type != htons(ETH_P_8021Q)) { - fprintf(stderr, - "Can't set \"vlan_ethtype\" if ethertype isn't 802.1Q\n"); + if (!eth_type_vlan(eth_type)) { + fprintf(stderr, "Can't set \"%s\" if ethertype isn't 802.1Q or 802.1AD\n", + type == TCA_FLOWER_KEY_VLAN_ETH_TYPE ? 
"vlan_ethtype" : "cvlan_ethtype"); return -1; } @@ -586,6 +595,7 @@ static int flower_parse_opt(struct filter_util *qu, char *handle, struct rtattr *tail; __be16 eth_type = TC_H_MIN(t->tcm_info); __be16 vlan_ethtype = 0; + __be16 cvlan_ethtype = 0; __u8 ip_proto = 0xff; __u32 flags = 0; __u32 mtf = 0; @@ -661,9 +671,8 @@ static int flower_parse_opt(struct filter_util *qu, char *handle, __u16 vid; NEXT_ARG(); - if (eth_type != htons(ETH_P_8021Q)) { - fprintf(stderr, - "Can't set \"vlan_id\" if ethertype isn't 802.1Q\n"); + if (!eth_type_vlan(eth_type)) { + fprintf(stderr, "Can't set \"vlan_id\" if ethertype is
[PATCH net-next 4/5] net/sched: flower: Dump the ethertype encapsulated in vlan
Currently the encapsulated ethertype is not dumped, as it is the same as the TCA_FLOWER_KEY_ETH_TYPE key value. But that makes the dump inconsistent with the input, so dump it with TCA_FLOWER_KEY_VLAN_ETH_TYPE.

Signed-off-by: Jianbo Liu
Acked-by: Jiri Pirko
---
 net/sched/cls_flower.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e6774af..30f3e18 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1255,6 +1255,10 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
 	if (fl_dump_key_vlan(skb, &key->vlan, &mask->vlan))
 		goto nla_put_failure;
 
+	if (mask->vlan.vlan_tpid &&
+	    nla_put_u16(skb, TCA_FLOWER_KEY_VLAN_ETH_TYPE, key->basic.n_proto))
+		goto nla_put_failure;
+
 	if ((key->basic.n_proto == htons(ETH_P_IP) ||
 	     key->basic.n_proto == htons(ETH_P_IPV6)) &&
 	    (fl_dump_key_val(skb, &key->basic.ip_proto, TCA_FLOWER_KEY_IP_PROTO,
-- 
2.9.5
[PATCH net-next 2/5] net/sched: flower: Add support for matching on vlan ethertype
Since the flow dissector now stores the vlan ethertype, tc flower can match on it. This is in preparation for supporting QinQ.

Signed-off-by: Jianbo Liu
Acked-by: Jiri Pirko
---
 net/sched/cls_flower.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 4e74508..e6774af 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -483,6 +483,7 @@ static int fl_set_key_mpls(struct nlattr **tb,
 }
 
 static void fl_set_key_vlan(struct nlattr **tb,
+			    __be16 ethertype,
 			    struct flow_dissector_key_vlan *key_val,
 			    struct flow_dissector_key_vlan *key_mask)
 {
@@ -499,6 +500,8 @@ static void fl_set_key_vlan(struct nlattr **tb,
 			VLAN_PRIORITY_MASK;
 		key_mask->vlan_priority = VLAN_PRIORITY_MASK;
 	}
+	key_val->vlan_tpid = ethertype;
+	key_mask->vlan_tpid = cpu_to_be16(~0);
 }
 
 static void fl_set_key_flag(u32 flower_key, u32 flower_mask,
@@ -575,8 +578,8 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 
 	if (tb[TCA_FLOWER_KEY_ETH_TYPE]) {
 		ethertype = nla_get_be16(tb[TCA_FLOWER_KEY_ETH_TYPE]);
-		if (ethertype == htons(ETH_P_8021Q)) {
-			fl_set_key_vlan(tb, &key->vlan, &mask->vlan);
+		if (eth_type_vlan(ethertype)) {
+			fl_set_key_vlan(tb, ethertype, &key->vlan, &mask->vlan);
 			fl_set_key_val(tb, &key->basic.n_proto,
 				       TCA_FLOWER_KEY_VLAN_ETH_TYPE,
 				       &mask->basic.n_proto, TCA_FLOWER_UNSPEC,
-- 
2.9.5
[PATCH net-next 5/5] net/sched: flower: Add support for matching on QinQ vlan headers
With support for dissecting both QinQ inner and outer vlan headers, users can now add rules to match on QinQ vlan headers. Signed-off-by: Jianbo Liu Acked-by: Jiri Pirko --- include/uapi/linux/pkt_cls.h | 4 +++ net/sched/cls_flower.c | 65 ++-- 2 files changed, 55 insertions(+), 14 deletions(-) diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index 84e4c1d..c4262d9 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -469,6 +469,10 @@ enum { TCA_FLOWER_KEY_IP_TTL, /* u8 */ TCA_FLOWER_KEY_IP_TTL_MASK, /* u8 */ + TCA_FLOWER_KEY_CVLAN_ID,/* be16 */ + TCA_FLOWER_KEY_CVLAN_PRIO, /* u8 */ + TCA_FLOWER_KEY_CVLAN_ETH_TYPE, /* be16 */ + __TCA_FLOWER_MAX, }; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 30f3e18..e3ecffd 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -35,6 +35,7 @@ struct fl_flow_key { struct flow_dissector_key_basic basic; struct flow_dissector_key_eth_addrs eth; struct flow_dissector_key_vlan vlan; + struct flow_dissector_key_vlan cvlan; union { struct flow_dissector_key_ipv4_addrs ipv4; struct flow_dissector_key_ipv6_addrs ipv6; @@ -432,6 +433,9 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = { [TCA_FLOWER_KEY_IP_TOS_MASK]= { .type = NLA_U8 }, [TCA_FLOWER_KEY_IP_TTL] = { .type = NLA_U8 }, [TCA_FLOWER_KEY_IP_TTL_MASK]= { .type = NLA_U8 }, + [TCA_FLOWER_KEY_CVLAN_ID] = { .type = NLA_U16 }, + [TCA_FLOWER_KEY_CVLAN_PRIO] = { .type = NLA_U8 }, + [TCA_FLOWER_KEY_CVLAN_ETH_TYPE] = { .type = NLA_U16 }, }; static void fl_set_key_val(struct nlattr **tb, @@ -484,19 +488,20 @@ static int fl_set_key_mpls(struct nlattr **tb, static void fl_set_key_vlan(struct nlattr **tb, __be16 ethertype, + int vlan_id_key, int vlan_prio_key, struct flow_dissector_key_vlan *key_val, struct flow_dissector_key_vlan *key_mask) { #define VLAN_PRIORITY_MASK 0x7 - if (tb[TCA_FLOWER_KEY_VLAN_ID]) { + if (tb[vlan_id_key]) { key_val->vlan_id = - nla_get_u16(tb[TCA_FLOWER_KEY_VLAN_ID]) &
VLAN_VID_MASK; + nla_get_u16(tb[vlan_id_key]) & VLAN_VID_MASK; key_mask->vlan_id = VLAN_VID_MASK; } - if (tb[TCA_FLOWER_KEY_VLAN_PRIO]) { + if (tb[vlan_prio_key]) { key_val->vlan_priority = - nla_get_u8(tb[TCA_FLOWER_KEY_VLAN_PRIO]) & + nla_get_u8(tb[vlan_prio_key]) & VLAN_PRIORITY_MASK; key_mask->vlan_priority = VLAN_PRIORITY_MASK; } @@ -579,11 +584,25 @@ static int fl_set_key(struct net *net, struct nlattr **tb, ethertype = nla_get_be16(tb[TCA_FLOWER_KEY_ETH_TYPE]); if (eth_type_vlan(ethertype)) { - fl_set_key_vlan(tb, ethertype, &key->vlan, &mask->vlan); - fl_set_key_val(tb, &key->basic.n_proto, - TCA_FLOWER_KEY_VLAN_ETH_TYPE, - &mask->basic.n_proto, TCA_FLOWER_UNSPEC, - sizeof(key->basic.n_proto)); + fl_set_key_vlan(tb, ethertype, TCA_FLOWER_KEY_VLAN_ID, + TCA_FLOWER_KEY_VLAN_PRIO, &key->vlan, + &mask->vlan); + + ethertype = nla_get_be16(tb[TCA_FLOWER_KEY_VLAN_ETH_TYPE]); + if (eth_type_vlan(ethertype)) { + fl_set_key_vlan(tb, ethertype, + TCA_FLOWER_KEY_CVLAN_ID, + TCA_FLOWER_KEY_CVLAN_PRIO, + &key->cvlan, &mask->cvlan); + fl_set_key_val(tb, &key->basic.n_proto, + TCA_FLOWER_KEY_CVLAN_ETH_TYPE, + &mask->basic.n_proto, + TCA_FLOWER_UNSPEC, + sizeof(key->basic.n_proto)); + } else { + key->basic.n_proto = ethertype; + mask->basic.n_proto = cpu_to_be16(~0); + } } else { key->basic.n_proto = ethertype; mask->basic.n_proto = cpu_to_be16(~0); @@ -809,6 +828,8 @@ static void fl_init_dissector(struct fl_flow_mask *mask) FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt, FLO
[PATCH net-next 3/5] net/flow_dissector: Add support for QinQ dissection
Dissect the QinQ packets to get both outer and inner vlan information, then store to the extended flow keys. Signed-off-by: Jianbo Liu Acked-by: Jiri Pirko --- include/net/flow_dissector.h | 2 ++ net/core/flow_dissector.c| 32 +--- 2 files changed, 19 insertions(+), 15 deletions(-) diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index 8f89968..c644067 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -206,6 +206,7 @@ enum flow_dissector_key_id { FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */ FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */ FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */ + FLOW_DISSECTOR_KEY_CVLAN, /* struct flow_dissector_key_flow_vlan */ FLOW_DISSECTOR_KEY_MAX, }; @@ -237,6 +238,7 @@ struct flow_keys { struct flow_dissector_key_basic basic; struct flow_dissector_key_tags tags; struct flow_dissector_key_vlan vlan; + struct flow_dissector_key_vlan cvlan; struct flow_dissector_key_keyid keyid; struct flow_dissector_key_ports ports; struct flow_dissector_key_addrs addrs; diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index e068ccf..09360ae 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -589,7 +589,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb, struct flow_dissector_key_tags *key_tags; struct flow_dissector_key_vlan *key_vlan; enum flow_dissect_ret fdret; - bool skip_vlan = false; + enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX; int num_hdrs = 0; u8 ip_proto = 0; bool ret; @@ -748,15 +748,14 @@ bool __skb_flow_dissect(const struct sk_buff *skb, } case htons(ETH_P_8021AD): case htons(ETH_P_8021Q): { - const struct vlan_hdr *vlan; + const struct vlan_hdr *vlan = NULL; struct vlan_hdr _vlan; - bool vlan_tag_present = skb && skb_vlan_tag_present(skb); __be16 saved_vlan_tpid = proto; - if (vlan_tag_present) + if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX && + skb && skb_vlan_tag_present(skb)) { 
proto = skb->protocol; - - if (!vlan_tag_present || eth_type_vlan(skb->protocol)) { + } else { vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan); if (!vlan) { @@ -766,20 +765,23 @@ bool __skb_flow_dissect(const struct sk_buff *skb, proto = vlan->h_vlan_encapsulated_proto; nhoff += sizeof(*vlan); - if (skip_vlan) { - fdret = FLOW_DISSECT_RET_PROTO_AGAIN; - break; - } } - skip_vlan = true; - if (dissector_uses_key(flow_dissector, - FLOW_DISSECTOR_KEY_VLAN)) { + if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX) { + dissector_vlan = FLOW_DISSECTOR_KEY_VLAN; + } else if (dissector_vlan == FLOW_DISSECTOR_KEY_VLAN) { + dissector_vlan = FLOW_DISSECTOR_KEY_CVLAN; + } else { + fdret = FLOW_DISSECT_RET_PROTO_AGAIN; + break; + } + + if (dissector_uses_key(flow_dissector, dissector_vlan)) { key_vlan = skb_flow_dissector_target(flow_dissector, - FLOW_DISSECTOR_KEY_VLAN, +dissector_vlan, target_container); - if (vlan_tag_present) { + if (!vlan) { key_vlan->vlan_id = skb_vlan_tag_get_id(skb); key_vlan->vlan_priority = (skb_vlan_tag_get_prio(skb) >> VLAN_PRIO_SHIFT); -- 2.9.5
[PATCH net-next 0/5] Introduce matching on double vlan/QinQ headers for TC flower
Currently TC flower supports only one vlan tag; it doesn't match on both outer and inner vlan headers for QinQ. To do this, we add support for extracting both outer and inner vlan headers in the flow dissector, and TC flower then matches on that information. We also plan to extend the TC command to support this feature. We add new cvlan_id/cvlan_prio/cvlan_ethtype keywords for the inner vlan header. The existing vlan_id/vlan_prio/vlan_ethtype are for the outer vlan header, and vlan_ethtype must be 802.1q or 802.1ad. Example command and output:

# tc filter add dev ens1f1 parent : protocol 802.1ad pref 33 \
    flower vlan_id 1000 vlan_ethtype 802.1q \
    cvlan_id 100 cvlan_ethtype ipv4 \
    action vlan pop \
    action vlan pop \
    action mirred egress redirect dev ens1f1_0

# tc filter show dev ens1f1 ingress
filter protocol 802.1ad pref 33 flower chain 0
filter protocol 802.1ad pref 33 flower chain 0 handle 0x1
  vlan_id 1000
  vlan_ethtype 802.1Q
  cvlan_id 100
  cvlan_ethtype ip
  eth_type ipv4
  in_hw
...

Jianbo Liu (5):
  net/flow_dissector: Save vlan ethertype from headers
  net/sched: flower: Add support for matching on vlan ethertype
  net/flow_dissector: Add support for QinQ dissection
  net/sched: flower: Dump the ethertype encapsulated in vlan
  net/sched: flower: Add support for matching on QinQ vlan headers

 include/net/flow_dissector.h | 4 ++-
 include/uapi/linux/pkt_cls.h | 4 +++
 net/core/flow_dissector.c    | 34 +++--
 net/sched/cls_flower.c       | 70
 4 files changed, 83 insertions(+), 29 deletions(-)
--
2.9.5
[PATCH net-next 1/5] net/flow_dissector: Save vlan ethertype from headers
Change vlan dissector key to save vlan tpid to support both 802.1Q and 802.1AD ethertype. Signed-off-by: Jianbo Liu Acked-by: Jiri Pirko --- include/net/flow_dissector.h | 2 +- net/core/flow_dissector.c| 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index adc24df5..8f89968 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -47,7 +47,7 @@ struct flow_dissector_key_tags { struct flow_dissector_key_vlan { u16 vlan_id:12, vlan_priority:3; - u16 padding; + __be16 vlan_tpid; }; struct flow_dissector_key_mpls { diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 4fc1e84..e068ccf 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -751,6 +751,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb, const struct vlan_hdr *vlan; struct vlan_hdr _vlan; bool vlan_tag_present = skb && skb_vlan_tag_present(skb); + __be16 saved_vlan_tpid = proto; if (vlan_tag_present) proto = skb->protocol; @@ -789,6 +790,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb, (ntohs(vlan->h_vlan_TCI) & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT; } + key_vlan->vlan_tpid = saved_vlan_tpid; } fdret = FLOW_DISSECT_RET_PROTO_AGAIN; -- 2.9.5
Re: [PATCH net 1/1] bnx2x: Fix receiving tx-timeout in error or recovery state.
From: Sudarsana Reddy Kalluru Date: Thu, 28 Jun 2018 04:52:15 -0700 > Driver performs the internal reload when it receives tx-timeout event from > the OS. Internal reload might fail in some scenarios e.g., fatal HW issues. > In such cases OS still see the link, which would result in undesirable > functionalities such as re-generation of tx-timeouts. > The patch addresses this issue by indicating the link-down to OS when > tx-timeout is detected, and keeping the link in down state till the > internal reload is successful. > > Please consider applying it to 'net' branch. > > Signed-off-by: Sudarsana Reddy Kalluru > Signed-off-by: Ariel Elior Applied.
Re: [PATCH net] cnic: tidy up a size calculation
From: Dan Carpenter Date: Thu, 28 Jun 2018 12:31:25 +0300 > Static checkers complain that id_tbl->table points to longs and 4 bytes > is smaller than sizeof(long). But since the other side is dividing by > 32 instead of sizeof(long), the current code works fine. > > Anyway, it's more conventional to use the BITS_TO_LONGS() macro when > we're allocating a bitmap. > > Signed-off-by: Dan Carpenter Applied.
Re: [PATCH net] atm: iphase: fix a 64 bit bug
From: Dan Carpenter Date: Thu, 28 Jun 2018 12:24:42 +0300 > The code assumes that there are 4 bytes in a pointer and it doesn't > allocate enough memory. > > Signed-off-by: Dan Carpenter Applied.
Re: [PATCH net] tcp: fix Fast Open key endianness
From: Yuchung Cheng Date: Wed, 27 Jun 2018 16:04:48 -0700 > The Fast Open key could be stored in different endianness depending on the CPU. > Previously, hosts of different endianness in a server farm using > the same key config (sysctl value) would produce different cookies. > This patch fixes it by always storing the key as little endian, keeping the > same API for LE hosts. > > Reported-by: Daniele Iamartino > Signed-off-by: Yuchung Cheng > Signed-off-by: Eric Dumazet > Signed-off-by: Neal Cardwell Applied and queued up for -stable.
Re: [PATCH net-next] net: stmmac: Add support for CBS QDISC
From: Jose Abreu Date: Wed, 27 Jun 2018 13:43:20 +0100 > This adds support for CBS reconfiguration using the TC application. > > A new callback was added to TC ops struct and another one to DMA ops to > reconfigure the channel mode. > > Tested in GMAC5.10. > > Signed-off-by: Jose Abreu Applied.
[net-next PATCH v6 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map
Signed-off-by: Amritha Nambiar --- Documentation/ABI/testing/sysfs-class-net-queues | 11 Documentation/networking/scaling.txt | 61 ++ 2 files changed, 61 insertions(+), 11 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues index 0c0df91..978b763 100644 --- a/Documentation/ABI/testing/sysfs-class-net-queues +++ b/Documentation/ABI/testing/sysfs-class-net-queues @@ -42,6 +42,17 @@ Description: network device transmit queue. Possible vaules depend on the number of available CPU(s) in the system. +What: /sys/class//queues/tx-/xps_rxqs +Date: June 2018 +KernelVersion: 4.18.0 +Contact: netdev@vger.kernel.org +Description: + Mask of the receive queue(s) currently enabled to participate + into the Transmit Packet Steering packet processing flow for this + network device transmit queue. Possible values depend on the + number of available receive queue(s) in the network device. + Default is disabled. + What: /sys/class//queues/tx-/byte_queue_limits/hold_time Date: November 2011 KernelVersion: 3.3 diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt index f55639d..b7056a8 100644 --- a/Documentation/networking/scaling.txt +++ b/Documentation/networking/scaling.txt @@ -366,8 +366,13 @@ XPS: Transmit Packet Steering Transmit Packet Steering is a mechanism for intelligently selecting which transmit queue to use when transmitting a packet on a multi-queue -device. To accomplish this, a mapping from CPU to hardware queue(s) is -recorded. The goal of this mapping is usually to assign queues +device. This can be accomplished by recording two kinds of maps, either +a mapping of CPU to hardware queue(s) or a mapping of receive queue(s) +to hardware transmit queue(s). + +1. XPS using CPUs map + +The goal of this mapping is usually to assign queues exclusively to a subset of CPUs, where the transmit completions for these queues are processed on a CPU within this set. 
This choice provides two benefits. First, contention on the device queue lock is @@ -377,15 +382,40 @@ transmit queue). Secondly, cache miss rate on transmit completion is reduced, in particular for data cache lines that hold the sk_buff structures. -XPS is configured per transmit queue by setting a bitmap of CPUs that -may use that queue to transmit. The reverse mapping, from CPUs to -transmit queues, is computed and maintained for each network device. -When transmitting the first packet in a flow, the function -get_xps_queue() is called to select a queue. This function uses the ID -of the running CPU as a key into the CPU-to-queue lookup table. If the +2. XPS using receive queues map + +This mapping is used to pick transmit queue based on the receive +queue(s) map configuration set by the administrator. A set of receive +queues can be mapped to a set of transmit queues (many:many), although +the common use case is a 1:1 mapping. This will enable sending packets +on the same queue associations for transmit and receive. This is useful for +busy polling multi-threaded workloads where there are challenges in +associating a given CPU to a given application thread. The application +threads are not pinned to CPUs and each thread handles packets +received on a single queue. The receive queue number is cached in the +socket for the connection. In this model, sending the packets on the same +transmit queue corresponding to the associated receive queue has benefits +in keeping the CPU overhead low. Transmit completion work is locked into +the same queue-association that a given application is polling on. This +avoids the overhead of triggering an interrupt on another CPU. When the +application cleans up the packets during the busy poll, transmit completion +may be processed along with it in the same thread context and so result in +reduced latency. + +XPS is configured per transmit queue by setting a bitmap of +CPUs/receive-queues that may use that queue to transmit. 
The reverse +mapping, from CPUs to transmit queues or from receive-queues to transmit +queues, is computed and maintained for each network device. When +transmitting the first packet in a flow, the function get_xps_queue() is +called to select a queue. This function uses the ID of the receive queue +for the socket connection for a match in the receive queue-to-transmit queue +lookup table. Alternatively, this function can also use the ID of the +running CPU as a key into the CPU-to-queue lookup table. If the ID matches a single queue, that is used for transmission. If multiple queues match, one is selected by using the flow hash to compute an index -into the set. +into the set. When selecting the transmit queue based on receive queue(s) +map, the transmit device is not validated against the rece
[net-next PATCH v6 5/7] net: Enable Tx queue selection based on Rx queues
This patch adds support to pick Tx queue based on the Rx queue(s) map configuration set by the admin through the sysfs attribute for each Tx queue. If the user configuration for receive queue(s) map does not apply, then the Tx queue selection falls back to CPU(s) map based selection and finally to hashing. Signed-off-by: Amritha Nambiar --- include/net/sock.h | 10 net/core/dev.c | 62 ++-- 2 files changed, 55 insertions(+), 17 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 2b097cc..2ed99bf 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1730,6 +1730,16 @@ static inline void sk_rx_queue_clear(struct sock *sk) #endif } +#ifdef CONFIG_XPS +static inline int sk_rx_queue_get(const struct sock *sk) +{ + if (sk && sk->sk_rx_queue_mapping != NO_QUEUE_MAPPING) + return sk->sk_rx_queue_mapping; + + return -1; +} +#endif + static inline void sk_set_socket(struct sock *sk, struct socket *sock) { sk_tx_queue_clear(sk); diff --git a/net/core/dev.c b/net/core/dev.c index 43b5575..08d58e0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3459,35 +3459,63 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) } #endif /* CONFIG_NET_EGRESS */ -static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) +#ifdef CONFIG_XPS +static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, + struct xps_dev_maps *dev_maps, unsigned int tci) +{ + struct xps_map *map; + int queue_index = -1; + + if (dev->num_tc) { + tci *= dev->num_tc; + tci += netdev_get_prio_tc_map(dev, skb->priority); + } + + map = rcu_dereference(dev_maps->attr_map[tci]); + if (map) { + if (map->len == 1) + queue_index = map->queues[0]; + else + queue_index = map->queues[reciprocal_scale( + skb_get_hash(skb), map->len)]; + if (unlikely(queue_index >= dev->real_num_tx_queues)) + queue_index = -1; + } + return queue_index; +} +#endif + +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb) { #ifdef CONFIG_XPS struct 
xps_dev_maps *dev_maps; - struct xps_map *map; + struct sock *sk = skb->sk; int queue_index = -1; if (!static_key_false(&xps_needed)) return -1; rcu_read_lock(); - dev_maps = rcu_dereference(dev->xps_cpus_map); + if (!static_key_false(&xps_rxqs_needed)) + goto get_cpus_map; + + dev_maps = rcu_dereference(dev->xps_rxqs_map); if (dev_maps) { - unsigned int tci = skb->sender_cpu - 1; + int tci = sk_rx_queue_get(sk); - if (dev->num_tc) { - tci *= dev->num_tc; - tci += netdev_get_prio_tc_map(dev, skb->priority); - } + if (tci >= 0 && tci < dev->num_rx_queues) + queue_index = __get_xps_queue_idx(dev, skb, dev_maps, + tci); + } - map = rcu_dereference(dev_maps->attr_map[tci]); - if (map) { - if (map->len == 1) - queue_index = map->queues[0]; - else - queue_index = map->queues[reciprocal_scale(skb_get_hash(skb), - map->len)]; - if (unlikely(queue_index >= dev->real_num_tx_queues)) - queue_index = -1; +get_cpus_map: + if (queue_index < 0) { + dev_maps = rcu_dereference(dev->xps_cpus_map); + if (dev_maps) { + unsigned int tci = skb->sender_cpu - 1; + + queue_index = __get_xps_queue_idx(dev, skb, dev_maps, + tci); } } rcu_read_unlock();
[net-next PATCH v6 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue
Extend transmit queue sysfs attribute to configure Rx queue(s) map per Tx queue. By default no receive queues are configured for the Tx queue. - /sys/class/net/eth0/queues/tx-*/xps_rxqs Signed-off-by: Amritha Nambiar --- net/core/net-sysfs.c | 83 ++ 1 file changed, 83 insertions(+) diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index b39987c..f25ac5f 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1283,6 +1283,88 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue, static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init = __ATTR_RW(xps_cpus); + +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf) +{ + struct net_device *dev = queue->dev; + struct xps_dev_maps *dev_maps; + unsigned long *mask, index; + int j, len, num_tc = 1, tc = 0; + + index = get_netdev_queue_index(queue); + + if (dev->num_tc) { + num_tc = dev->num_tc; + tc = netdev_txq_to_tc(dev, index); + if (tc < 0) + return -EINVAL; + } + mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long), + GFP_KERNEL); + if (!mask) + return -ENOMEM; + + rcu_read_lock(); + dev_maps = rcu_dereference(dev->xps_rxqs_map); + if (!dev_maps) + goto out_no_maps; + + for (j = -1; j = netif_attrmask_next(j, NULL, dev->num_rx_queues), +j < dev->num_rx_queues;) { + int i, tci = j * num_tc + tc; + struct xps_map *map; + + map = rcu_dereference(dev_maps->attr_map[tci]); + if (!map) + continue; + + for (i = map->len; i--;) { + if (map->queues[i] == index) { + set_bit(j, mask); + break; + } + } + } +out_no_maps: + rcu_read_unlock(); + + len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues); + kfree(mask); + + return len < PAGE_SIZE ? 
len : -EINVAL; +} + +static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf, + size_t len) +{ + struct net_device *dev = queue->dev; + struct net *net = dev_net(dev); + unsigned long *mask, index; + int err; + + if (!ns_capable(net->user_ns, CAP_NET_ADMIN)) + return -EPERM; + + mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long), + GFP_KERNEL); + if (!mask) + return -ENOMEM; + + index = get_netdev_queue_index(queue); + + err = bitmap_parse(buf, len, mask, dev->num_rx_queues); + if (err) { + kfree(mask); + return err; + } + + err = __netif_set_xps_queue(dev, mask, index, true); + kfree(mask); + return err ? : len; +} + +static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init + = __ATTR_RW(xps_rxqs); #endif /* CONFIG_XPS */ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = { @@ -1290,6 +1372,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = { &queue_traffic_class.attr, #ifdef CONFIG_XPS &xps_cpus_attribute.attr, + &xps_rxqs_attribute.attr, &queue_tx_maxrate.attr, #endif NULL
[net-next PATCH v6 0/7] Symmetric queue selection using XPS for Rx queues
This patch series implements support for Tx queue selection based on Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue using sysfs attribute. If the user configuration for Rx queues does not apply, then the Tx queue selection falls back to XPS using CPUs and finally to hashing. XPS is refactored to support Tx queue selection based on either the CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be enabled. By default no receive queues are configured for the Tx queue. - /sys/class/net//queues/tx-*/xps_rxqs A set of receive queues can be mapped to a set of transmit queues (many:many), although the common use case is a 1:1 mapping. This will enable sending packets on the same Tx-Rx queue association as this is useful for busy polling multi-threaded workloads where it is not possible to pin the threads to a CPU. This is a rework of Sridhar's patch for symmetric queueing via socket option: https://www.spinics.net/lists/netdev/msg453106.html Testing Hints: Kernel: Linux 4.17.0-rc7+ Interface: driver: ixgbe version: 5.1.0-k firmware-version: 0x00015e0b Configuration: ethtool -L $iface combined 16 ethtool -C $iface rx-usecs 1000 sysctl net.core.busy_poll=1000 ATR disabled: ethtool -K $iface ntuple on Workload: Modified memcached that changes the thread selection policy to be based on the incoming rx-queue of a connection using SO_INCOMING_NAPI_ID socket option. The default is round-robin. 
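The 1:1 rxqs_map setup used in the testing hints can be sketched as follows; the interface name and queue count are assumptions matching the `ethtool -L $iface combined 16` configuration above:

```shell
# Map each rx queue 1:1 onto the tx queue with the same index by
# writing a single-bit hex mask to each tx queue's xps_rxqs attribute.
iface=ens1f1   # assumed interface name
for i in $(seq 0 15); do
    printf '%x' $((1 << i)) > /sys/class/net/$iface/queues/tx-$i/xps_rxqs
done
```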
Default: no rxqs_map configured
Symmetric queues: rxqs_map enabled for all queues, 1:1 mapped to Tx queue

System:
 Architecture: x86_64
 CPU(s): 72
 Model name: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

16 threads, 400K requests/sec
=============================
                        Default      Symmetric queues
RTT min/avg/max (usec)  4/51/2215    2/30/5163
intr/sec                26655        18606
contextswitch/sec       5145         4044
insn per cycle          0.43         0.72
cache-misses            6.919        4.310   (% of all cache refs)
L1-dcache-load-misses   4.49         3.29    (% of all L1-dcache hits)
LLC-load-misses         13.26        8.96    (% of all LL-cache hits)

32 threads, 400K requests/sec
=============================
                        Default      Symmetric queues
RTT min/avg/max (usec)  10/112/5562  9/46/4637
intr/sec                30456        27666
contextswitch/sec       7552         5133
insn per cycle          0.41         0.49
cache-misses            9.357        2.769   (% of all cache refs)
L1-dcache-load-misses   4.09         3.98    (% of all L1-dcache hits)
LLC-load-misses         12.96        3.96    (% of all LL-cache hits)

16 threads, 800K requests/sec
=============================
                        Default      Symmetric queues
RTT min/avg/max (usec)  5/151/4989   9/69/2611
intr/sec                35686        22907
contextswitch/sec       25522        12281
insn per cycle          0.67         0.74
cache-misses            8.652        6.38    (% of all cache refs)
L1-dcache-load-misses   3.19         2.86    (% of all L1-dcache hits)
LLC-load-misses         16.53        11.99   (% of all LL-cache hits)

32 threads, 800K requests/sec
=============================
                        Default      Symmetric queues
RTT min/avg/max (usec)  6/163/6152   8/88/4209
intr/sec                47079        26548
contextswitch/sec       42190        39168
insn per cycle          0.45         0.54
cache-misses            8.798
[net-next PATCH v6 1/7] net: Refactor XPS for CPUs and Rx queues
Refactor XPS code to support Tx queue selection based on CPU(s) map or Rx queue(s) map. Signed-off-by: Amritha Nambiar --- include/linux/cpumask.h | 11 ++ include/linux/netdevice.h | 98 - net/core/dev.c| 211 ++--- net/core/net-sysfs.c |4 - 4 files changed, 244 insertions(+), 80 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index bf53d89..57f20a0 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask; #define cpu_active(cpu)((cpu) == 0) #endif -/* verify cpu argument to cpumask_* operators */ -static inline unsigned int cpumask_check(unsigned int cpu) +static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits) { #ifdef CONFIG_DEBUG_PER_CPU_MAPS - WARN_ON_ONCE(cpu >= nr_cpumask_bits); + WARN_ON_ONCE(cpu >= bits); #endif /* CONFIG_DEBUG_PER_CPU_MAPS */ +} + +/* verify cpu argument to cpumask_* operators */ +static inline unsigned int cpumask_check(unsigned int cpu) +{ + cpu_max_bits_warn(cpu, nr_cpumask_bits); return cpu; } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c6b377a..8bf8d61 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -731,10 +731,15 @@ struct xps_map { */ struct xps_dev_maps { struct rcu_head rcu; - struct xps_map __rcu *cpu_map[0]; + struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */ }; -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ + +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *))) + +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\ + (_rxqs * (_tcs) * sizeof(struct xps_map *))) + #endif /* CONFIG_XPS */ #define TC_MAX_QUEUE 16 @@ -1910,7 +1915,8 @@ struct net_device { int watchdog_timeo; #ifdef CONFIG_XPS - struct xps_dev_maps __rcu *xps_maps; + struct xps_dev_maps __rcu *xps_cpus_map; + struct xps_dev_maps __rcu *xps_rxqs_map; #endif #ifdef 
CONFIG_NET_CLS_ACT struct mini_Qdisc __rcu *miniq_egress; @@ -3259,6 +3265,92 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index) #ifdef CONFIG_XPS int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, u16 index); +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask, + u16 index, bool is_rxqs_map); + +/** + * netif_attr_test_mask - Test a CPU or Rx queue set in a mask + * @j: CPU/Rx queue index + * @mask: bitmask of all cpus/rx queues + * @nr_bits: number of bits in the bitmask + * + * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues. + */ +static inline bool netif_attr_test_mask(unsigned long j, + const unsigned long *mask, + unsigned int nr_bits) +{ + cpu_max_bits_warn(j, nr_bits); + return test_bit(j, mask); +} + +/** + * netif_attr_test_online - Test for online CPU/Rx queue + * @j: CPU/Rx queue index + * @online_mask: bitmask for CPUs/Rx queues that are online + * @nr_bits: number of bits in the bitmask + * + * Returns true if a CPU/Rx queue is online. + */ +static inline bool netif_attr_test_online(unsigned long j, + const unsigned long *online_mask, + unsigned int nr_bits) +{ + cpu_max_bits_warn(j, nr_bits); + + if (online_mask) + return test_bit(j, online_mask); + + return (j < nr_bits); +} + +/** + * netif_attrmask_next - get the next CPU/Rx queue in a cpu/Rx queues mask + * @n: CPU/Rx queue index + * @srcp: the cpumask/Rx queue mask pointer + * @nr_bits: number of bits in the bitmask + * + * Returns >= nr_bits if no further CPUs/Rx queues set. + */ +static inline unsigned int netif_attrmask_next(int n, const unsigned long *srcp, + unsigned int nr_bits) +{ + /* -1 is a legal arg here. 
*/ + if (n != -1) + cpu_max_bits_warn(n, nr_bits); + + if (srcp) + return find_next_bit(srcp, nr_bits, n + 1); + + return n + 1; +} + +/** + * netif_attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p + * @n: CPU/Rx queue index + * @src1p: the first CPUs/Rx queues mask pointer + * @src2p: the second CPUs/Rx queues mask pointer + * @nr_bits: number of bits in the bitmask + * + * Returns >= nr_bits if no further CPUs/Rx queues set in both. + */ +static inline int netif_attrmask_next_and(int n, const unsigned long *src1p, + const unsigned long *src2p, +
[net-next PATCH v6 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
Change 'skc_tx_queue_mapping' field in sock_common structure from 'int' to 'unsigned short' type with ~0 indicating unset and other positive queue values being set. This will accommodate adding a new 'unsigned short' field in sock_common in the next patch for rx_queue_mapping. Signed-off-by: Amritha Nambiar --- include/net/sock.h | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index b3b7541..37b09c8 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -214,7 +214,7 @@ struct sock_common { struct hlist_node skc_node; struct hlist_nulls_node skc_nulls_node; }; - int skc_tx_queue_mapping; + unsigned short skc_tx_queue_mapping; union { int skc_incoming_cpu; u32 skc_rcv_wnd; @@ -1681,17 +1681,25 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb, static inline void sk_tx_queue_set(struct sock *sk, int tx_queue) { + /* sk_tx_queue_mapping accept only upto a 16-bit value */ + if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX)) + return; sk->sk_tx_queue_mapping = tx_queue; } +#define NO_QUEUE_MAPPING USHRT_MAX + static inline void sk_tx_queue_clear(struct sock *sk) { - sk->sk_tx_queue_mapping = -1; + sk->sk_tx_queue_mapping = NO_QUEUE_MAPPING; } static inline int sk_tx_queue_get(const struct sock *sk) { - return sk ? sk->sk_tx_queue_mapping : -1; + if (sk && sk->sk_tx_queue_mapping != NO_QUEUE_MAPPING) + return sk->sk_tx_queue_mapping; + + return -1; } static inline void sk_set_socket(struct sock *sk, struct socket *sock)
[net-next PATCH v6 4/7] net: Record receive queue number for a connection
This patch adds a new field to sock_common 'skc_rx_queue_mapping' which holds the receive queue number for the connection. The Rx queue is marked in tcp_finish_connect() to allow a client app to do SO_INCOMING_NAPI_ID after a connect() call to get the right queue association for a socket. Rx queue is also marked in tcp_conn_request() to allow syn-ack to go on the right tx-queue associated with the queue on which syn is received. Signed-off-by: Amritha Nambiar Signed-off-by: Sridhar Samudrala --- include/net/busy_poll.h |1 + include/net/sock.h | 28 net/core/sock.c |2 ++ net/ipv4/tcp_input.c|3 +++ 4 files changed, 34 insertions(+) diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h index c518743..9e36fda6 100644 --- a/include/net/busy_poll.h +++ b/include/net/busy_poll.h @@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb) #ifdef CONFIG_NET_RX_BUSY_POLL sk->sk_napi_id = skb->napi_id; #endif + sk_rx_queue_set(sk, skb); } /* variant used for unconnected sockets */ diff --git a/include/net/sock.h b/include/net/sock.h index 37b09c8..2b097cc 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair; * @skc_node: main hash linkage for various protocol lookup tables * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol * @skc_tx_queue_mapping: tx queue number for this connection + * @skc_rx_queue_mapping: rx queue number for this connection * @skc_flags: place holder for sk_flags * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE, * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings @@ -215,6 +216,9 @@ struct sock_common { struct hlist_nulls_node skc_nulls_node; }; unsigned short skc_tx_queue_mapping; +#ifdef CONFIG_XPS + unsigned short skc_rx_queue_mapping; +#endif union { int skc_incoming_cpu; u32 skc_rcv_wnd; @@ -326,6 +330,9 @@ struct sock { #define sk_nulls_node __sk_common.skc_nulls_node #define sk_refcnt __sk_common.skc_refcnt #define 
sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping +#ifdef CONFIG_XPS +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping +#endif #define sk_dontcopy_begin __sk_common.skc_dontcopy_begin #define sk_dontcopy_end__sk_common.skc_dontcopy_end @@ -1702,6 +1709,27 @@ static inline int sk_tx_queue_get(const struct sock *sk) return -1; } +static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb) +{ +#ifdef CONFIG_XPS + if (skb_rx_queue_recorded(skb)) { + u16 rx_queue = skb_get_rx_queue(skb); + + if (WARN_ON_ONCE(rx_queue == NO_QUEUE_MAPPING)) + return; + + sk->sk_rx_queue_mapping = rx_queue; + } +#endif +} + +static inline void sk_rx_queue_clear(struct sock *sk) +{ +#ifdef CONFIG_XPS + sk->sk_rx_queue_mapping = NO_QUEUE_MAPPING; +#endif +} + static inline void sk_set_socket(struct sock *sk, struct socket *sock) { sk_tx_queue_clear(sk); diff --git a/net/core/sock.c b/net/core/sock.c index bcc4182..dac6d78 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2818,6 +2818,8 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_pacing_rate = ~0U; sk->sk_pacing_shift = 10; sk->sk_incoming_cpu = -1; + + sk_rx_queue_clear(sk); /* * Before updating sk_refcnt, we must commit prior changes to memory * (Documentation/RCU/rculist_nulls.txt for details) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9c5b341..b3b5aef 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -78,6 +78,7 @@ #include #include #include +#include int sysctl_tcp_max_orphans __read_mostly = NR_FILE; @@ -5588,6 +5589,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb) if (skb) { icsk->icsk_af_ops->sk_rx_dst_set(sk, skb); security_inet_conn_established(sk, skb); + sk_mark_napi_id(sk, skb); } tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB); @@ -6416,6 +6418,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, tcp_rsk(req)->snt_isn = isn; tcp_rsk(req)->txhash = net_tx_rndhash(); tcp_openreq_init_rwin(req, sk, dst); 
+ sk_rx_queue_set(req_to_sk(req), skb); if (!want_cookie) { tcp_reqsk_record_syn(sk, req, skb); fastopen_sk = tcp_try_fastopen(sk, skb, req, &foc, dst);
[net-next PATCH v6 2/7] net: Use static_key for XPS maps
Use static_key for XPS maps to reduce the cost of extra map checks,
similar to how it is used for RPS and RFS. This includes a static_key
'xps_needed' for XPS and another 'xps_rxqs_needed' for XPS using the Rx
queues map.

Signed-off-by: Amritha Nambiar
---
 net/core/dev.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7105955..43b5575 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
+struct static_key xps_needed __read_mostly;
+EXPORT_SYMBOL(xps_needed);
+struct static_key xps_rxqs_needed __read_mostly;
+EXPORT_SYMBOL(xps_rxqs_needed);
 static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)		\
 	rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
@@ -2168,14 +2172,18 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 	struct xps_dev_maps *dev_maps;
 	unsigned int nr_ids;
 
-	mutex_lock(&xps_map_mutex);
+	if (!static_key_false(&xps_needed))
+		return;
 
-	dev_maps = xmap_dereference(dev->xps_rxqs_map);
-	if (dev_maps) {
-		nr_ids = dev->num_rx_queues;
-		clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
-			       count, true);
+	mutex_lock(&xps_map_mutex);
+	if (static_key_false(&xps_rxqs_needed)) {
+		dev_maps = xmap_dereference(dev->xps_rxqs_map);
+		if (dev_maps) {
+			nr_ids = dev->num_rx_queues;
+			clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
+				       offset, count, true);
+		}
 	}
 
 	dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2189,6 +2197,10 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 			       false);
 
 out_no_maps:
+	if (static_key_enabled(&xps_rxqs_needed))
+		static_key_slow_dec(&xps_rxqs_needed);
+
+	static_key_slow_dec(&xps_needed);
 	mutex_unlock(&xps_map_mutex);
 }
 
@@ -2297,6 +2309,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
 	if (!new_dev_maps)
 		goto out_no_new_maps;
 
+	static_key_slow_inc(&xps_needed);
+	if (is_rxqs_map)
+		static_key_slow_inc(&xps_rxqs_needed);
+
 	for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 	     j < nr_ids;) {
 		/* copy maps belonging to foreign traffic classes */
@@ -3450,6 +3466,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 	struct xps_map *map;
 	int queue_index = -1;
 
+	if (!static_key_false(&xps_needed))
+		return -1;
+
 	rcu_read_lock();
 	dev_maps = rcu_dereference(dev->xps_cpus_map);
 	if (dev_maps) {
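The shape of the patch — bump a global key when the first map is installed, check it cheaply on the transmit fast path, drop it on reset — can be sketched in user space. static_key is a kernel facility that patches the branch itself at runtime; the plain atomics below are only a rough analog showing the inc/dec discipline, not the performance trick, and all names here mirror the patch hypothetically rather than being real kernel code.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int xps_needed;      /* > 0 while any XPS map is configured */
static atomic_int xps_rxqs_needed; /* > 0 while any Rx-queues map is configured */

/* analog of __netif_set_xps_queue() taking the keys */
static void set_map(int is_rxqs_map)
{
	atomic_fetch_add(&xps_needed, 1);
	if (is_rxqs_map)
		atomic_fetch_add(&xps_rxqs_needed, 1);
}

/* analog of netif_reset_xps_queues() releasing the keys */
static void reset_maps(int had_rxqs_map)
{
	if (atomic_load(&xps_needed) == 0)
		return; /* fast exit: nothing was ever configured */
	if (had_rxqs_map)
		atomic_fetch_sub(&xps_rxqs_needed, 1);
	atomic_fetch_sub(&xps_needed, 1);
}

/* analog of get_xps_queue() bailing out before any map lookup */
static int get_queue(void)
{
	if (atomic_load(&xps_needed) == 0)
		return -1;
	return 0; /* stand-in for the real map lookup */
}
```

The point of the ordering in the patch is the same as here: the fast path pays only one cheap check until the feature is actually in use.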
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
From: Ka-Cheong Poon
Date: Wed, 27 Jun 2018 03:23:27 -0700

> This patch changes the internal representation of an IP address to use
> struct in6_addr. An IPv4 address is stored as an IPv4-mapped address.
> All the functions which take an IP address as argument are also
> changed to use struct in6_addr. But the RDS socket layer is not
> modified, so it still does not accept IPv6 addresses from an
> application. And the RDS layer neither accepts nor initiates IPv6
> connections.
>
> v2: Fixed sparse warnings.
>
> Signed-off-by: Ka-Cheong Poon

I really don't like this.

An IPv4-mapped IPv6 address is not the same as an IPv4 address. You are
effectively preventing the use of IPv6 connections using IPv4-mapped
addresses.

Also, assuming the sockaddr type based upon its size is wrong. You have
to check the family field; then you can decide to interpret the rest of
the sockaddr in one way or another and also validate its length.
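The validation order DaveM describes — family first, then the length that family implies — can be sketched like this. This is a hypothetical user-space helper, not the RDS code; the name `check_addr` and the error choices are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Decide how to interpret a sockaddr from its family field, and only
 * then validate that the caller passed enough bytes for that family. */
static int check_addr(const struct sockaddr *sa, socklen_t len)
{
	switch (sa->sa_family) {
	case AF_INET:
		if (len < sizeof(struct sockaddr_in))
			return -EINVAL;
		return 0;
	case AF_INET6:
		if (len < sizeof(struct sockaddr_in6))
			return -EINVAL;
		return 0;
	default:
		return -EAFNOSUPPORT;
	}
}
```

Doing it the other way around — guessing AF_INET6 because `len == sizeof(struct sockaddr_in6)` — misclassifies any short-but-padded IPv4 sockaddr and silently accepts garbage families.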
Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
On 06/29/2018 07:01 PM, Jakub Kicinski wrote:
> On Fri, 29 Jun 2018 09:04:15 +0200, Daniel Borkmann wrote:
>> On 06/28/2018 06:54 PM, Jakub Kicinski wrote:
>>> On Thu, 28 Jun 2018 09:42:06 +0200, Jiri Benc wrote:
>>>> On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:
>>>>> Looks good to me, and yes in the BPF case a mask like
>>>>> TUNNEL_OPTIONS_PRESENT is the right approach since this is opaque
>>>>> info and solely defined by the BPF prog that is using the generic
>>>>> helper.
>>>>
>>>> Wouldn't it make sense to introduce some safeguards here (in a
>>>> backward compatible way, of course)? It's easy to mistakenly set
>>>> data for a different tunnel type in a BPF program and then be
>>>> surprised by the result. It might help users if such usage was
>>>> detected by the kernel, one way or another.
>>>
>>> Well, that's how it works today ;)
>>
>> Well, it was designed like that on purpose, to be i) agnostic of the
>> underlying device, ii) to not clutter the BPF API with tens of
>> different APIs effectively doing the same thing, and at the same time
>> to avoid adding protocol specifics. E.g. at least the core bits of
>> bpf_skb_{set,get}_tunnel_key() will work whether I use vxlan or geneve
>> underneath (we are actually using it this way) and I could use things
>> like tun_id to encode custom meta data from BPF for either of them,
>> depending on the flavor picked by the orchestration system. For the
>> tunnel options in bpf_skb_{set,get}_tunnel_opt() it's similar,
>> although here there needs to be awareness of the underlying dev
>> depending on whether you encode data into e.g. gbp or tlvs, etc.
>> However, the downside right now I can see with a patch like below is
>> that:
>>
>> i) People might still just keep using the 'TUNNEL_OPTIONS_PRESENT
>> path' since it's available and backwards compatible with
>> current/older kernels, ii) we cut bits away from size over time for
>> each new tunnel proto added in future that would support tunnel
>> options, iii) that extension is one-sided (at least below) and the
>> same would be needed in the getter part, and iv) there needs to be a
>> way for the case when folks add new tunnel options where we don't
>> need to worry that we forget updating BPF_F_TUN_* each time,
>> otherwise this will easily slip through and again people will just
>> rely on using the TUNNEL_OPTIONS_PRESENT catchall. Given the latter
>> and in particular point i), I wouldn't think it's worth the pain; the
>> APIs were added to BPF in v4.6, so this would buy them 2 more years
>> wrt kernel compatibility with the same functionality level. And point
>> v), I just noticed the patch is actually buggy: size is ARG_CONST_SIZE
>> and the verifier will attempt to check whether the buffer passed in
>> argument 2 is valid or not, so using flags here in the upper bits
>> would let verification fail; you'd really have to make a new helper
>> just for this.
>
> Ah, indeed. I'd rather avoid a new helper; if we reuse an old one
> people can always write a program like:
>
> 	err = helper(all_flags);
> 	if (err == -EINVAL)
> 		err = helper(fallback_flags);
>
> With a new helper the program will not even load on old kernels :(
>
> Could we add the new flags to bpf_skb_set_tunnel_key(), maybe? It is a
> bit ugly because only options care about the flags and in theory
> bpf_skb_set_tunnel_key() doesn't have to be called before
> bpf_skb_set_tunnel_opt() ...
Right, though it's also ugly splitting things this way over two helpers
when this is _specifically_ only for one of them (unless you need it for
hw offload where you need to signal the tunnel key _and_ options in BPF
itself, but from what I see that's not the case here). For the sake of a
'safeguard' only, I bet users won't do such a fallback for the reasons
above but will keep using the TUNNEL_OPTIONS_PRESENT route as is, since
it's simpler and has been available for a long time already. Thus I
would prefer to leave it as is in such a case: the ship sailed long ago
on this, usage of this extra check is not really obvious either, so I
don't think it's worth it and the 'value-add' is marginal.

Thanks,
Daniel
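The try-new-flags-then-fall-back pattern Jakub sketches above can be modeled concretely. The helper below is a stand-in for a real BPF helper such as bpf_skb_set_tunnel_opt(), with the "old kernel rejects unknown flag bits with -EINVAL" behavior simulated by a toggle; `HYPOTHETICAL_NEW_FLAG` and the helper name are invented for illustration only.

```c
#include <assert.h>

#define TUNNEL_OPTIONS_PRESENT	0x1 /* understood by old and new kernels */
#define HYPOTHETICAL_NEW_FLAG	0x2 /* only a "new kernel" knows this bit */

static int kernel_supports_new_flag; /* 0 simulates an old kernel */

/* Stand-in for a kernel helper: reject any flag bit it doesn't know. */
static int helper(unsigned int flags)
{
	unsigned int known = TUNNEL_OPTIONS_PRESENT;

	if (kernel_supports_new_flag)
		known |= HYPOTHETICAL_NEW_FLAG;
	if (flags & ~known)
		return -22; /* -EINVAL: unknown bits set */
	return 0;
}

/* The program-side fallback: try the richer flags first, retry with the
 * conservative set if the running kernel rejects them. */
static int set_opt_with_fallback(void)
{
	int err = helper(TUNNEL_OPTIONS_PRESENT | HYPOTHETICAL_NEW_FLAG);

	if (err == -22)
		err = helper(TUNNEL_OPTIONS_PRESENT);
	return err;
}
```

This is exactly why reusing an existing helper degrades gracefully, while a brand-new helper would make the program fail to load on older kernels outright.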
Re: [PATCH net-next 3/5] sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams
Hi,

2018-06-28 15:40 GMT+09:00 Xin Long :
> On Tue, Jun 26, 2018 at 8:02 PM, 吉藤英明
> wrote:
>> 2018-06-26 13:33 GMT+09:00 Xin Long :
>>> On Tue, Jun 26, 2018 at 12:31 AM, Marcelo Ricardo Leitner
>>> wrote:

Hi,

On Tue, Jun 26, 2018 at 01:12:00AM +0900, 吉藤英明 wrote:
> Hi,
>
> 2018-06-25 22:03 GMT+09:00 Marcelo Ricardo Leitner :
> > On Mon, Jun 25, 2018 at 07:28:47AM -0400, Neil Horman wrote:
> >> On Mon, Jun 25, 2018 at 04:31:26PM +0900, David Miller wrote:
> >> > From: Xin Long
> >> > Date: Mon, 25 Jun 2018 10:14:35 +0800
> >> >
> >> > > struct sctp_paddrparams {
> >> > > @@ -773,6 +775,8 @@ struct sctp_paddrparams {
> >> > > 	__u32		spp_pathmtu;
> >> > > 	__u32		spp_sackdelay;
> >> > > 	__u32		spp_flags;
> >> > > +	__u32		spp_ipv6_flowlabel;
> >> > > +	__u8		spp_dscp;
> >> > > } __attribute__((packed, aligned(4)));
> >> >
> >> > I don't think you can change the size of this structure like this.
> >> >
> >> > This check in sctp_setsockopt_peer_addr_params():
> >> >
> >> > 	if (optlen != sizeof(struct sctp_paddrparams))
> >> > 		return -EINVAL;
> >> >
> >> > is going to trigger in old kernels when executing programs
> >> > built against the new struct definition.
> >
> > That will happen, yes, but do we really care about being future-proof
> > here? I mean: if we also update such check(s) to support dealing with
> > smaller-than-supported structs, newer kernels will be able to run
> > programs built against the old struct, and the new one; while building
> > using newer headers and running on an older kernel may fool the
> > application in other ways too (like enabling support for something
> > that is available on the newer kernel and that is not present in the
> > older one).
>
> We should not break existing apps.
> We still accept apps of the pre-2.4 era without sin6_scope_id
> (e.g., net/ipv6/af_inet6.c:inet6_bind()).

Yes. That's what I tried to say.
That is supporting an old app built with old kernel headers and running
on a newer kernel, and not the other way around (an app built with
fresh headers and running on an old kernel).

>>> To make it, I will update the check like:
>>>
>>> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>>> index 1df5d07..c949d8c 100644
>>> --- a/net/sctp/socket.c
>>> +++ b/net/sctp/socket.c
>>> @@ -2715,13 +2715,18 @@ static int sctp_setsockopt_peer_addr_params(struct sock *sk,
>>> 	struct sctp_sock	*sp = sctp_sk(sk);
>>> 	int error;
>>> 	int hb_change, pmtud_change, sackdelay_change;
>>> +	int plen = sizeof(params);
>>> +	int old_plen = plen - sizeof(u32) * 2;
>>
>> if (optlen < offsetof(struct sctp_paddrparams, spp_ipv6_flowlabel))
>> maybe?

Hi, yoshfuji,
offsetof() is better. thank you.

>>
>>> -	if (optlen != sizeof(struct sctp_paddrparams))
>>> +	if (optlen != plen && optlen != old_plen)
>>> 		return -EINVAL;
>>>
>>> 	if (copy_from_user(&params, optval, optlen))
>>> 		return -EFAULT;
>>>
>>> +	if (optlen == old_plen)
>>> +		params.spp_flags &= ~(SPP_DSCP | SPP_IPV6_FLOWLABEL);
>>
>> I think we should return -EINVAL if size is not the new one.

Sorry, if we returned -EINVAL when size is the old one,
how can we guarantee an old app built with old kernel
headers and running on a newer kernel works well?
Or did you mean:

	if ((params.spp_flags & (SPP_DSCP | SPP_IPV6_FLOWLABEL)) &&
	    optlen != plen)
		return EINVAL;

Yes, I meant this (it should be -EINVAL though).

>
>>
>> --yoshfuji
>>
>>> +
>>> 	/* Validate flags and value parameters. */
>>> 	hb_change	= params.spp_flags & SPP_HB;
>>> 	pmtud_change	= params.spp_flags & SPP_PMTUD;
>>> @@ -5591,10 +5596,13 @@ static int sctp_getsockopt_peer_addr_params(struct sock *sk, int len,
>>> 	struct sctp_transport *trans = NULL;
>>> 	struct sctp_association *asoc = NULL;
>>> 	struct sctp_sock	*sp = sctp_sk(sk);
>>> +	int plen = sizeof(params);
>>> +	int old_plen = plen - sizeof(u32) * 2;
>>>
>>> -	if (len < sizeof(struct sctp_paddrparams))
>>> +	if (len < old_plen)
>>> 		return -EINVAL;
>>> -	len = sizeof(struct sctp_paddrparams);
>>> +
>>> +	len = len >= plen ? plen : old_plen;
>>> 	if (copy_from_user(&params, optval, len))
>>> 		return -EFAULT;
>>>
>>> does it look ok to you?
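The size-compat rule the thread converges on — accept either the old or the new struct size, and reject the new flags unless the caller passed the new (full) struct — can be modeled in user space. The struct, flag values, and function below are simplified stand-ins, not the actual sctp code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPP_IPV6_FLOWLABEL	0x100 /* hypothetical flag values */
#define SPP_DSCP		0x200

struct paddrparams {
	uint32_t spp_flags;
	/* ... pre-existing fields would sit here ... */
	uint32_t spp_ipv6_flowlabel;	/* new in this series */
	uint8_t  spp_dscp;		/* new in this series */
} __attribute__((packed, aligned(4)));

/* Model of the setsockopt path: old binaries pass the short struct,
 * new binaries pass the full one; flags gating the new fields are only
 * honored with the full size (yoshfuji's suggestion). */
static int setsockopt_params(const void *optval, size_t optlen)
{
	struct paddrparams p;
	size_t old_len = offsetof(struct paddrparams, spp_ipv6_flowlabel);

	if (optlen != sizeof(p) && optlen != old_len)
		return -22;		/* -EINVAL */
	memset(&p, 0, sizeof(p));
	memcpy(&p, optval, optlen);	/* copy_from_user() stand-in */
	if ((p.spp_flags & (SPP_DSCP | SPP_IPV6_FLOWLABEL)) &&
	    optlen != sizeof(p))
		return -22;
	return 0;
}
```

An old app compiled against old headers keeps working unchanged, while a new app setting SPP_DSCP or SPP_IPV6_FLOWLABEL must supply the full struct.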
Re: [net-next PATCH v5 4/7] net: Record receive queue number for a connection
On 6/29/2018 6:06 AM, David Miller wrote:
> From: Amritha Nambiar
> Date: Wed, 27 Jun 2018 15:31:34 -0700
>
>> @@ -1702,6 +1709,13 @@ static inline int sk_tx_queue_get(const struct sock *sk)
>> 	return -1;
>> }
>>
>> +static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
>> +{
>> +#ifdef CONFIG_XPS
>> +	sk->sk_rx_queue_mapping = skb_get_rx_queue(skb);
>> +#endif
>> +}
>
> Maybe add a >= NO_QUEUE_MAPPING check like you did for
> skc_tx_queue_mapping.

Will fix.
Re: [net-next PATCH v5 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short
On 6/29/2018 6:05 AM, David Miller wrote:
> From: Amritha Nambiar
> Date: Wed, 27 Jun 2018 15:31:28 -0700
>
>> @@ -1681,17 +1681,25 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
>>
>> static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>> {
>> +	/* sk_tx_queue_mapping accepts only up to a 16-bit value */
>> +	if (WARN_ON_ONCE((unsigned short)tx_queue > USHRT_MAX))
>> +		return;
>> 	sk->sk_tx_queue_mapping = tx_queue;
>> }
>>
>> +#define NO_QUEUE_MAPPING	USHRT_MAX
>
> I think you need to check ">= USHRT_MAX" since USHRT_MAX is how you
> indicate no queue.

Agree, I'll fix this.
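DaveM's comment points at two problems with the posted check: a value already truncated to unsigned short can never compare greater than USHRT_MAX, so the WARN can never fire; and USHRT_MAX itself is reserved as the NO_QUEUE_MAPPING sentinel, so a valid queue number must be strictly below it. A small self-contained sketch (not the kernel code; function names here are illustrative):

```c
#include <assert.h>
#include <limits.h>

#define NO_QUEUE_MAPPING	USHRT_MAX

/* The posted check: truncation happens before the comparison, so the
 * promoted unsigned short can never exceed USHRT_MAX and everything
 * passes, including the sentinel value itself. */
static int tx_queue_valid_buggy(int tx_queue)
{
	return !((unsigned short)tx_queue > USHRT_MAX);
}

/* The requested check: compare before truncating, and exclude the
 * sentinel with ">=", i.e. only values strictly below USHRT_MAX fit. */
static int tx_queue_valid_fixed(int tx_queue)
{
	return (unsigned int)tx_queue < NO_QUEUE_MAPPING;
}
```

The buggy variant silently accepts 65536, which truncates to 0 and would map the socket to the wrong queue; the fixed variant rejects it up front.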
Re: [net-next PATCH v5 1/7] net: Refactor XPS for CPUs and Rx queues
On 6/29/2018 5:59 AM, David Miller wrote:
> From: Amritha Nambiar
> Date: Wed, 27 Jun 2018 15:31:18 -0700
>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index c6b377a..3790ac9 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
> ...
>> +static inline bool attr_test_online(unsigned long j,
>> +				    const unsigned long *online_mask,
>> +				    unsigned int nr_bits)
>
> This is a networking header file, so for names you are putting into
> the global namespace please give them some kind of prefix that
> indicates it is about networking.

Sure, will prefix these with 'netif_'.